Choose your index carefully
When designing a relational database schema, just adding an index on every column that might be in involved in a where-clause might not be enough. It might be downright wrong.
Suppose we have a simple table in a SQL database used to store measurements (for example temperature) taken at certain locations. We'd start out with a database table similar to this:
CREATE TABLE measurement (
id bigserial NOT NULL,
LogDate timestamp NOT NULL,
Location varchar(255) NOT NULL,
Value double precision NOT NULL,
PRIMARY KEY(id)
);
We expect that this table might eventually contain millions of rows, and that we might want to locate rows based on the date of the measurement or where it was logged. So we create two indexes on the relevant rows:
CREATE INDEX measurement_logdate on measurement(LogDate); CREATE INDEX measurement_location on measurement(Location);
The application turns out to be a little simpler than we expected and it turns out that the only queries we use are a simple INSERT statement append data to the measurement table, and SELECT queries similar to the following:
SELECT LogDate,Value FROM measurement WHERE
Location='Gobabis'
AND LogDate >= '2009-10-01 00:00:00' AND LOGDATE < '2009-10-02 00:00:00'
Initially it seems to work well and the system goes into production. But as the number of measurements approaches a hundred million it slows down noticeably.
Analysis reveals that when you run the above query, the database first consults measurement_logdate to find rows matching the time range we're interested in, then scans measurement_location to find rows matching the location, and finally it calculates the intersection between the aforementioned sets of rows, retrieves the rows and returns the result.
This is bad for at least two reasons. The set of rows matching the date criteria might be a very big set, and a large portion of the measurement_logdate index is scanned only to be ignored later because it involves locations other than the one we want.
My first attempt at improving this situation was to use the partial index feature in Postgresql. Creating a simple index per location quadrupled the speed:
CREATE INDEX partial_gobabis on measurement(Location) WHERE Location='Gobabis';
I wasn't particularly fond of this solution, because it involved a potentially large number of indexes, and I'm sure that deciding which index to use will introduce unnecessary overhead in the planner. And then it dawned on me that a multi-column index might just do the same job:
DROP INDEX partial_gobabis; CREATE INDEX measurement_location_logdate on measurement(Location, LogDate);
According to the Postgresql documentation, you should order the columns in the index placing the columns on which equality constraints will be applied first. This index also quadruples the original speed.
When deciding what indexes your tables need, you should also consider that when new data is appended to the table, the index needs to be updated, and this is generally a log(n) operation. You therefore don't want to maintain any indexes that do not add value. In our application, I found that measurements are always queried in the context of a specific location, never alone on the date or the location, so that my original two indexes only waste time when data is appended. I therefore dropped the original indexes and kept only the new compound index.
And if you found yourself wondering where Gobabis is, it is a small town in Namibia close to where I grew up.






