During last month or so, Tom Lane commited changes in PostgreSQL, which were foundations for adding hash-based versions of popular features.
I already described first such feature – DISTINCT.
Now, there were 3 more commits which were related to this:
All 3 were committed by Tom, on 7th of August. Commit messages:
Teach the system how to use hashing for UNION. (INTERSECT/EXCEPT will follow, but seem like a separate patch since most of the remaining work is on the executor side.) I took the opportunity to push selection of the grouping operators for set operations into the parser where it belongs. Otherwise this is just a small exercise in making prepunion.c consider both alternatives. As with the recent DISTINCT patch, this means we can UNION on datatypes that can hash but not sort, and it means that UNION without ORDER BY is no longer certain to produce sorted output.
Support hashing for duplicate-elimination in INTERSECT and EXCEPT queries. This completes my project of improving usage of hashing for duplicate elimination (aggregate functions with DISTINCT remain undone, but that's for some other day). As with the previous patches, this means we can INTERSECT/EXCEPT on datatypes that can hash but not sort, and it means that INTERSECT/EXCEPT without ORDER BY are no longer certain to produce sorted output.
Improve INTERSECT/EXCEPT hashing by realizing that we don't need to make any hashtable entries for tuples that are found only in the second input: they can never contribute to the output. Furthermore, this implies that the planner should endeavor to put first the smaller (in number of groups) input relation for an INTERSECT. Implement that, and upgrade prepunion's estimation of the number of rows returned by setops so that there's some amount of sanity in the estimate of which one is smaller.
What it means for average user? Basically union, intersect and except can now work faster.
Since the changes are quite obvious I'll skip specific examples and plans of execution, but I would like to use this space to send personal big THANK YOU to Tom Lane for these changes, as I use set operators a lot, and speedup of them is very welcome 🙂
What happened to the usual depesz-style benchmark? You always did this for PostgreSQL performance enhancements. 🙂
@intgr:
To be honest – I don’t have a pre-8.4 version right now. So I can’t really benchmark 8.4 against anything. And I just got a bit too lazy to install 8.3 just for 5 minutes of benchmarking.
Sorry. I’ll do better next tme.
Hum, so is the order of ‘UNION ALL’ not garanteed any more with 8.4, or is it just for ‘UNION’ ?
I’m using “SELECT foo FROM bar WHERE baz UNION ALL SELECT baz2 LIMIT 1” to return a default value if none is in foo, will I have to add explicit ordering ?
@moltonel:
“union all” – as far as i know *never* guarateed any order.
union did guarantee, but only in pg <= 8.3. your approach at getting default value looks flawed, as it relies on order of rows from union all. why don't you simply use coalesce?
@depesz:
Not using COALESCE because I also have NULLS in my table, and want those if they exist. The default value is non-null. I’m on 8.2 right now, so no “NULLS FIRST” available.
I can fix my method by adding explicit ordering (wont look as nice :p). But if I have missed a smarter way to do this, I’m all ears (I’d rather not create a function for this, though).
Just noticed the followup questions here. The behavior of UNION ALL shouldn’t be changed at all by these patches: it still just appends the results of the second query to the results of the first. What these patches are about is enabling use of hashing to recognize duplicate tuples for the various set-ops that require recognizing that; which is to say, all of them *except* UNION ALL.