Whenever you are considering creation of multicolumn index, there is a question what should be the order of columns in this index. I'll try to analyze various cases of this situation.
We need to consider two separate issues:
- what operators are you going to use
- What is selectivity of the fields
First, let's consider that you want to use inequality operators (<, >, <= or &gr;=), and you're using both columns in your query. In this case, you can generally assume both conditions use index.
Some test data:
$ CREATE TABLE t1 ( a int4, b int4 ); CREATE TABLE $ INSERT INTO t1 (a,b) SELECT random()*100000, random()*100000 FROM generate_series(1,1000000); INSERT 0 1000000 $ CREATE INDEX i1 ON t1 (a,b); CREATE INDEX $ CREATE TABLE t2 ( a int4, b int4 ); CREATE TABLE $ INSERT INTO t2 (a,b) SELECT random()*100000, random()*100000 FROM generate_series(1,1000000); INSERT 0 1000000 $ CREATE INDEX i2 ON t2 (b,a); CREATE INDEX
And let's see – simple select that should return now much rows:
$ EXPLAIN analyze SELECT * FROM t1 WHERE a < 1000 AND b > 99000; QUERY PLAN ----------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan ON t1 (cost=424.96..1074.04 ROWS=182 width=8) (actual TIME=1.673..2.879 ROWS=205 loops=1) Recheck Cond: ((a < 1000) AND (b > 99000)) -> Bitmap INDEX Scan ON i1 (cost=0.00..424.92 ROWS=182 width=0) (actual TIME=1.624..1.624 ROWS=205 loops=1) INDEX Cond: ((a < 1000) AND (b > 99000)) Total runtime: 2.940 ms (5 ROWS)
And with index the other way around:
$ EXPLAIN analyze SELECT * FROM t2 WHERE a < 1000 AND b > 99000; QUERY PLAN ---------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan ON t2 (cost=229.76..592.85 ROWS=103 width=8) (actual TIME=1.095..1.768 ROWS=95 loops=1) Recheck Cond: ((b > 99000) AND (a < 1000)) -> Bitmap INDEX Scan ON i2 (cost=0.00..229.74 ROWS=103 width=0) (actual TIME=1.056..1.056 ROWS=95 loops=1) INDEX Cond: ((b > 99000) AND (a < 1000)) Total runtime: 1.824 ms (5 ROWS)
Now, what about a mix of equality and inequality?
First, with equality being used on first column in index:
$ EXPLAIN analyze SELECT * FROM t1 WHERE a = 123 AND b >= 99000; QUERY PLAN ------------------------------------------------------------------------------------------------------------ INDEX ONLY Scan USING i1 ON t1 (cost=0.43..8.45 ROWS=1 width=8) (actual TIME=0.027..0.027 ROWS=0 loops=1) INDEX Cond: ((a = 123) AND (b >= 99000)) Heap Fetches: 0 Total runtime: 0.083 ms (4 ROWS)
and what about the case where inequality is first?
$ EXPLAIN analyze SELECT * FROM t2 WHERE a = 123 AND b >= 99000; QUERY PLAN -------------------------------------------------------------------------------------------------------------- INDEX ONLY Scan USING i2 ON t2 (cost=0.42..233.75 ROWS=1 width=8) (actual TIME=0.831..1.005 ROWS=1 loops=1) INDEX Cond: ((b >= 99000) AND (a = 123)) Heap Fetches: 1 Total runtime: 1.049 ms (4 ROWS)
Hurray – it also works. But one thing – the second case is more expensive. To show it, let's make a special case:
$ CREATE TABLE t3 (a int4, b int4); CREATE TABLE $ INSERT INTO t3 SELECT i, j FROM generate_series(1,10000) i, generate_series(1,100) j; INSERT 0 1000000 $ CREATE INDEX i3 ON t3 (a, b); CREATE INDEX $ CREATE TABLE t4 (a int4, b int4); CREATE TABLE $ INSERT INTO t4 SELECT i, j FROM generate_series(1,10000) i, generate_series(1,100) j; INSERT 0 1000000 $ CREATE INDEX i4 ON t4 (b, a); CREATE INDEX
As you can see column a has pretty good selectivity, while column b much lower.
Now, when I'll first use equality, and then inequality:
EXPLAIN analyze SELECT * FROM t3 WHERE a = 123 AND b < 50; QUERY PLAN --------------------------------------------------------------------------------------------------------------- INDEX ONLY Scan USING i3 ON t3 (cost=0.42..91.64 ROWS=48 width=8) (actual TIME=0.064..0.084 ROWS=49 loops=1) INDEX Cond: ((a = 123) AND (b < 50)) Heap Fetches: 49 Total runtime: 0.141 ms (4 ROWS)
If it's the other way around:
EXPLAIN analyze SELECT * FROM t4 WHERE a = 123 AND b < 50; QUERY PLAN ------------------------------------------------------------------------------------------------------------------- INDEX ONLY Scan USING i4 ON t4 (cost=0.42..10434.23 ROWS=48 width=8) (actual TIME=0.077..32.651 ROWS=49 loops=1) INDEX Cond: ((b < 50) AND (a = 123)) Heap Fetches: 49 Total runtime: 32.676 ms (4 ROWS)
Of course one can say – OK, you used equality for very selective, and inequality for not really selective column. What would happen if the conditions were swapped?
Let's see:
$ EXPLAIN analyze SELECT * FROM t3 WHERE a < 100 AND b = 80; QUERY PLAN ----------------------------------------------------------------------------------------------------------------- INDEX ONLY Scan USING i3 ON t3 (cost=0.42..402.17 ROWS=107 width=8) (actual TIME=0.045..2.292 ROWS=99 loops=1) INDEX Cond: ((a < 100) AND (b = 80)) Heap Fetches: 99 Total runtime: 2.353 ms (4 ROWS) $ EXPLAIN analyze SELECT * FROM t4 WHERE a < 100 AND b = 80; QUERY PLAN ------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan ON t4 (cost=5.32..318.02 ROWS=87 width=8) (actual TIME=0.094..0.283 ROWS=99 loops=1) Recheck Cond: ((b = 80) AND (a < 100)) -> Bitmap INDEX Scan ON i4 (cost=0.00..5.29 ROWS=87 width=0) (actual TIME=0.067..0.067 ROWS=99 loops=1) INDEX Cond: ((b = 80) AND (a < 100)) Total runtime: 0.344 ms (5 ROWS)
This shows, in my opinion that in mixed cases (one field using =, and the other some inequality), it's better to have index first on field used for equality.
Which leaves us with relatively simple case – what about case where we have both columns compared using equality, just their selectivity is widely different.
And let's make it really very different. Like:
$ CREATE TABLE t5 (a int4, b int4); CREATE TABLE $ INSERT INTO t5 SELECT i, j FROM generate_series(1,5000000) i, generate_series(1,2) j; INSERT 0 10000000 $ CREATE INDEX i5 ON t5 (a, b); CREATE INDEX $ CREATE TABLE t6 (a int4, b int4); CREATE TABLE $ INSERT INTO t6 SELECT i, j FROM generate_series(1,5000000) i, generate_series(1,2) j; INSERT 0 10000000 $ CREATE INDEX i6 ON t6 (b, a); CREATE INDEX
Another set of tables. In both, a has 500,000 different values, while b has just two.
How are the speeds now?
EXPLAIN analyze SELECT * FROM t5 WHERE a = 2314 AND b = 1; QUERY PLAN ------------------------------------------------------------------------------------------------------------ INDEX ONLY Scan USING i5 ON t5 (cost=0.43..8.46 ROWS=1 width=8) (actual TIME=0.009..0.009 ROWS=1 loops=1) INDEX Cond: ((a = 2314) AND (b = 1)) Heap Fetches: 1 Total runtime: 0.028 ms (4 ROWS) EXPLAIN analyze SELECT * FROM t6 WHERE a = 2314 AND b = 1; QUERY PLAN ------------------------------------------------------------------------------------------------------------ INDEX ONLY Scan USING i6 ON t6 (cost=0.43..8.46 ROWS=1 width=8) (actual TIME=0.010..0.010 ROWS=1 loops=1) INDEX Cond: ((b = 1) AND (a = 2314)) Heap Fetches: 1 Total runtime: 0.028 ms (4 ROWS)
(No, I didn't edit the results – I made script which called both explains 3 times in a row, and stored output of last one)
Looks like there is no difference. Of course – it might be that my sample size is too small, but still – 5 million rows in an index is pretty good test case.
So, to wrap it up:
- in case you're using a = ? and b = ? – there is no time difference between index on (a,b) and (b,a)
- in case you're using a = ? and b </> ? – index on (a,b) is better
- In case you're using a </>, b </> – both searches should be done using index, but, given results from t3/t4 test above – I would put first the column that has better selectivity
Hope it helps.
What about if both (a,b) and (b,a) indexes exist? Will PostgreSQL use both in a single query or prefer to scan just one index or is this choice dependent on index scan cost?
@Capt.:
PostgreSQL has the capability to use two indexes in single query, but I don’t see any scenario where using them both would be beneficial (over single index usage). As for test – sure, I can make it – just tell me what data distribution you’re curious about, because I just don’t see it.
I am curious about what choices the planner will make when faced with the above distributions and both (a,b) and (b,a) indexes are available and how this choice is affected by index scan cost.
Brief: order the columns in the index accordingly to the selectivity (Better goes to the left).
Pretty nice hack Depesz! Hope see you at FOSDEM.
@Emanuel:
generally your brief is OK, unless your less-selective column is used for “=”, and the more selective from >/>=/<=.
@Capt.:
Well, it’s simple to check. So:
1. for random distribution (1st case), and search for inequality/inequality, Pg uses single index.
2. for the 2nd cast (t3 table), it’s using index on (a,b), ignoring (b,a)
3. for 3rd case (t5 table): it uses, quite surprisingly, index on (b,a) (b is much less selective).
There is a very nice website/book about using indices. Have a look at http://use-the-index-luke.com/ if you already have not known it.