On 11th of February, Tom Lane committed a patch:
Remove GROUP BY columns that are functionally dependent on other columns.

If a GROUP BY clause includes all columns of a non-deferred primary key, as well as other columns of the same relation, those other columns are redundant and can be dropped from the grouping; the pkey is enough to ensure that each row of the table corresponds to a separate group. Getting rid of the excess columns will reduce the cost of the sorting or hashing needed to implement GROUP BY, and can indeed remove the need for a sort step altogether.

This seems worth testing for since many query authors are not aware of the GROUP-BY-primary-key exception to the rule about queries not being allowed to reference non-grouped-by columns in their targetlists or HAVING clauses. Thus, redundant GROUP BY items are not uncommon. Also, we can make the test pretty cheap in most queries where it won't help by not looking up a rel's primary key until we've found that at least two of its columns are in GROUP BY.

David Rowley, reviewed by Julien Rouhaud
This is really nice.
Consider this example:
$ CREATE TABLE t1 (id serial PRIMARY KEY, c_1 text, c_2 text, c_3 text, c_4 text);
$ \copy t1 (c_1, c_2, c_3, c_4) FROM base.data
The base.data file was created using this simple Ruby script:
$pc = []
("a".."z").each { |x| $pc.push x }
("A".."Z").each { |x| $pc.push x }
("0".."9").each { |x| $pc.push x }

def random_string
    len = (rand() * 20 + 5).to_i
    (1..len).map { $pc[rand() * $pc.size] }.join
end

f = File.open("base.data", "w")
100000.times { |i| f.write("#{random_string}\t#{random_string}\t#{random_string}\t#{random_string}\n") }
f.close
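If you don't want to bother with an external file, a similar data set could also be generated directly in SQL. This is just an illustrative alternative (it produces hex characters only, which doesn't matter for this test), not what I actually used above:

$ INSERT INTO t1 (c_1, c_2, c_3, c_4)
    SELECT
        substr(md5(random()::text), 1, 5 + (random() * 20)::int),  -- random string, 5 to 25 characters
        substr(md5(random()::text), 1, 5 + (random() * 20)::int),
        substr(md5(random()::text), 1, 5 + (random() * 20)::int),
        substr(md5(random()::text), 1, 5 + (random() * 20)::int)
    FROM generate_series(1, 100000);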
Now, let's add a second table:
$ CREATE TABLE t2 (id serial PRIMARY KEY, t1_id int4 NOT NULL REFERENCES t1 (id));
And let's add some data to it:
$ INSERT INTO t2 (t1_id) SELECT random() * 99999 + 1 FROM generate_series(1,1000000);
INSERT 0 1000000
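If you're following along, it might also be a good idea to ANALYZE both tables at this point, so the planner has fresh statistics for the plans shown below (just a suggestion, output not shown):

$ ANALYZE t1;
$ ANALYZE t2;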
Now, let's assume we want to get all columns from t1, together with a count of matching rows from t2.
Up until 9.0, we would have to write:
$ SELECT t1.id, t1.c_1, t1.c_2, t1.c_3, t1.c_4, COUNT(t2.id) FROM t1 LEFT OUTER JOIN t2 ON t1.id = t2.t1_id GROUP BY t1.id, t1.c_1, t1.c_2, t1.c_3, t1.c_4;
Which obviously worked, with this plan:
                                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=184764.34..203264.34 rows=100000 width=68) (actual time=4855.161..8573.527 rows=100000 loops=1)
   Group Key: t1.id, t1.c_1, t1.c_2, t1.c_3, t1.c_4
   ->  Sort  (cost=184764.34..187264.34 rows=1000000 width=68) (actual time=4855.134..8211.074 rows=1000003 loops=1)
         Sort Key: t1.id, t1.c_1, t1.c_2, t1.c_3, t1.c_4
         Sort Method: external merge  Disk: 79624kB
         ->  Hash Right Join  (cost=4526.00..44090.00 rows=1000000 width=68) (actual time=31.439..795.827 rows=1000003 loops=1)
               Hash Cond: (t2.t1_id = t1.id)
               ->  Seq Scan on t2  (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.007..100.085 rows=1000000 loops=1)
               ->  Hash  (cost=2201.00..2201.00 rows=100000 width=64) (actual time=31.337..31.337 rows=100000 loops=1)
                     Buckets: 8192  Batches: 4  Memory Usage: 2402kB
                     ->  Seq Scan on t1  (cost=0.00..2201.00 rows=100000 width=64) (actual time=0.004..11.283 rows=100000 loops=1)
 Planning time: 60.083 ms
 Execution time: 8589.124 ms
(13 rows)
But it's not optimal – the c_1 to c_4 columns are irrelevant when we're already grouping by t1.id, the primary key. So, in 9.1 we got the ability to write the query like this:
$ SELECT t1.id, t1.c_1, t1.c_2, t1.c_3, t1.c_4, COUNT(t2.id) FROM t1 LEFT OUTER JOIN t2 ON t1.id = t2.t1_id GROUP BY t1.id;
which works, and shows this plan:
                                                                QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=127758.13..155315.14 rows=100000 width=68) (actual time=698.280..1413.087 rows=100000 loops=1)
   Group Key: t1.id
   ->  Merge Left Join  (cost=127758.13..149315.14 rows=1000000 width=68) (actual time=698.265..1283.547 rows=1000003 loops=1)
         Merge Cond: (t1.id = t2.t1_id)
         ->  Index Scan using t1_pkey on t1  (cost=0.29..3808.29 rows=100000 width=64) (actual time=0.006..26.109 rows=100000 loops=1)
         ->  Materialize  (cost=127757.34..132757.34 rows=1000000 width=8) (actual time=698.253..1068.716 rows=1000000 loops=1)
               ->  Sort  (cost=127757.34..130257.34 rows=1000000 width=8) (actual time=698.250..965.620 rows=1000000 loops=1)
                     Sort Key: t2.t1_id
                     Sort Method: external merge  Disk: 17544kB
                     ->  Seq Scan on t2  (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.008..88.914 rows=1000000 loops=1)
 Planning time: 0.164 ms
 Execution time: 1419.439 ms
(12 rows)
Please note that the amount of disk used by the sort step is much lower – after all, the sort now only has to deal with a single int4 column per row (t2.t1_id), and not an int4 plus four text columns.
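As a side note – and this is not part of the comparison above – the remaining sort of t2 spills to disk ("external merge") only because of the small default work_mem. With a larger setting the sort could fit in memory, though the planner might then also pick a different plan altogether:

$ SET work_mem = '128MB';
$ EXPLAIN ANALYZE SELECT t1.id, t1.c_1, t1.c_2, t1.c_3, t1.c_4, COUNT(t2.id) FROM t1 LEFT OUTER JOIN t2 ON t1.id = t2.t1_id GROUP BY t1.id;
$ RESET work_mem;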
Now, with this new patch for 9.6, PostgreSQL is smart enough to remove the unnecessary columns from the grouping, even if the programmer still specifies them:
$ EXPLAIN ANALYZE SELECT t1.id, t1.c_1, t1.c_2, t1.c_3, t1.c_4, COUNT(t2.id) FROM t1 LEFT OUTER JOIN t2 ON t1.id = t2.t1_id GROUP BY t1.id, t1.c_1, t1.c_2, t1.c_3, t1.c_4;
                                                                QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=132183.13..159740.18 rows=100000 width=68) (actual time=457.996..920.404 rows=100000 loops=1)
   Group Key: t1.id
   ->  Merge Left Join  (cost=132183.13..153740.18 rows=1000000 width=68) (actual time=457.986..830.782 rows=1000002 loops=1)
         Merge Cond: (t1.id = t2.t1_id)
         ->  Index Scan using t1_pkey on t1  (cost=0.29..3808.34 rows=100000 width=64) (actual time=0.003..16.410 rows=100000 loops=1)
         ->  Materialize  (cost=132182.34..137182.34 rows=1000000 width=8) (actual time=457.978..694.204 rows=1000000 loops=1)
               ->  Sort  (cost=132182.34..134682.34 rows=1000000 width=8) (actual time=457.974..628.213 rows=1000000 loops=1)
                     Sort Key: t2.t1_id
                     Sort Method: external merge  Disk: 17544kB
                     ->  Seq Scan on t2  (cost=0.00..18850.00 rows=1000000 width=8) (actual time=4.010..51.518 rows=1000000 loops=1)
 Planning time: 0.293 ms
 Execution time: 925.351 ms
(12 rows)
Which is pretty sweet, because it can make the same query, with no changes at all, significantly faster (by reducing the dataset that has to be sorted and grouped). Thanks a lot.
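One more thing – as a quick sanity check (my own addition, not part of the test above), you don't even need the join to see the effect. On 9.6 a plain single-table query like the one below should show only "Group Key: id" in its plan, while older versions keep all three columns in the grouping:

$ EXPLAIN SELECT id, c_1, c_2, COUNT(*) FROM t1 GROUP BY id, c_1, c_2;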
You are right, this functionality is really nice.
This is great – one of those things where I’ve always wondered why databases aren’t clever enough to do it.
I think redundant GROUP BY clauses are more often a symptom of dumb ORMs than of ignorant query authors, though.