On 19th of January, Robert Haas committed patch:
Use abbreviated keys for faster sorting of text datums. This commit extends the SortSupport infrastructure to allow operator classes the option to provide abbreviated representations of Datums; in the case of text, we abbreviate by taking the first few characters of the strxfrm() blob. If the abbreviated comparison is insufficent to resolve the comparison, we fall back on the normal comparator. This can be much faster than the old way of doing sorting if the first few bytes of the string are usually sufficient to resolve the comparison. There is the potential for a performance regression if all of the strings to be sorted are identical for the first 8+ characters and differ only in later positions; therefore, the SortSupport machinery now provides an infrastructure to abort the use of abbreviation if it appears that abbreviation is producing comparatively few distinct keys. HyperLogLog, a streaming cardinality estimator, is included in this commit and used to make that determination for text. Peter Geoghegan, reviewed by me.
To be honest, I somehow missed this commit, but I've read Peter's blogpost about it, and decided to check it out.
For starters – this commit will not give you any more functionality. But it should make certain operations faster.
The biggest winner should be index creation, so I figured I'll test it.
Since the change was added to 9.5, I'll compare index creation time on 9.4.0 and 9.5, and we'll see how it works out.
Both Pg servers are setup on the same hardware with the same config, so while there could be some variance in results, it shouldn't be all that big – aside from effects of speedup given by the patch.
For tests I used 3 different datasets:
- short – 1,000,000 random, unique, strings, 5 characters long
- long – 1,000,000 random, unique, strings, 80 characters long
- long_prefixed – 1,000,000 random, unique, strings, 80 characters long, but first 20 characters have only 100 different possibilities, and the prefix is 18 zeros, and then 2 digits
Sample data:
=$ SELECT * FROM short ORDER BY random() LIMIT 10; x -------- luASGO KANanb xLxNyQ bGWGuI w8xujk wN58nP 9BFQA1 g7cZBn 1XAy54 iVrAXP (10 ROWS) =$ SELECT * FROM long ORDER BY random() LIMIT 10; x ---------------------------------------------------------------------------------- RMdALzft60a7SpM3a4oH70SvJmVMyrwNlUNFKGsqSlXYDX1WvxQbAEix0iHcwiddjriQlfjgeqNxLA7o OtyeodMPArSrdf5XnZDUpPv73sQv4bryTGZ0isfnH14UyPRbcbgGotxDe3Gxmj7UE4cF7RNs9B2sejUz oRvw19mA4BwkSswUr5ie6EOH9bfiZ7Z924qcyWb2nBGNKRL1xqSYoDMU8h7vIB8SQqKkWW1ajXRH1z2h B9h0hw1MFyqE2gPNRujt0VhNjEayd9dzztLW6ArDOvX6f8DCvup5v8zPHuvPE7Ma5OkbsCd7AxLOY0W2 yK7pL1mIIYDo83ZX1NszRaiVlgTMBi0dbDNglQs88IcXWCKzbVq8m8osgys43vfAcHCHgcyDvp81ZLUR fya0uVWSrSMF9j1RReMvDeZDrFl8jRfyFQcyggHTkqPBAKVvAYPugC7AXo6q9GBzcxtVDg6KXCYy4dQd RfSqoc3l4a5DoPmFc7djuRPMJOcBsDIUGXPOH1LKJbE0rm7zBYMh1mU6JEaEE3KWHN6b7sB66KPUj4Lv 8QcL9NQc2Q7uyiUlHN14Bk37sCAxkmdQ9lUcIpqLarFuiIxgCT1Vnl4vs7hMAq4ioZ9vrCwOKrjTGLqp sFiTu8IHbIAcOjWTyaxofjKEHAWHd8fqDAztYrMxaaiAYAPLOnODh5bQvFX10p191dhWpVJJnBaizfMx DSQvqayewNR1uwxQwT7GyYVo0NubofjGYZywLEOOZxyJHnHCQJn4RfBzWQC4n8B8xmYfxhxFy0u5PDHd (10 ROWS) =$ SELECT * FROM long_prefixed ORDER BY random() LIMIT 10; x ---------------------------------------------------------------------------------- 00000000000000000038LQMpDuy15O11PzrWnbaad84GNpAoyOaSkAP9k8U6uAy4J5B1VCkmP81LICyz 00000000000000000017PgIm4SqFCS2EQE6PjyVanMJGtfh7dte28Bm8Yk43yjH4ut43wFTudcLiUesW 00000000000000000015R14GKcjKjX7CY22Paoe4DLqaU5Q1B1xJYGDm7QhsiA2ecGf5Mkw3YyqHcmQm 000000000000000000060MmvCNrOvnzE9tOMmIWWynS7U4Ce5S50TyyQwbmL2srL4gYkXZfLWXS5Vkwp 00000000000000000093rhUKB05HkCzW4z7z7G2K0kliFKHWp5VvZveRPXr1WqxpQU5B9E0L6yn7bfLc 000000000000000000277U8V7J8TG4hhHkuEw5JDYx5GjTWdAH8icIFW2rLzTv2n8K9xSiaFYvdRX842 00000000000000000042cOJ7dX3NEqxvK0wWjkqZ7wEhBE2gUVkWt864vvufvo8BbR8OOHMSFfLl16CG 00000000000000000032a1otXaIAPYCW2MJC8k4A5L54wJJkNXQx0aWxHQEd2rnuSjihAoshJLQOLsUV 00000000000000000052peumks0hKU0BojAA4063BMPINg7OblC1R203cdGh20gDMPKdqXrhY1sBlaIc 00000000000000000089Chw9QVnKR99cbWVw75JXMhWw9QuhygEgn9NxjYgXqJZwLsCcsmCGITTbttdW (10 ROWS)
To test the speed, I ran:
CREATE INDEX q ON _TABLE_ (x); DROP INDEX q;
10 times on each of the tables. Then I took the best time (out of these 10, and only time of create index, not dropping) for comparison. Results:
table: | 9.4: | 9.5: | difference: |
---|---|---|---|
short | 3,801.084 ms | 1,069.608 ms | – 71% |
long | 7,696.992 ms | 4,359.055 ms | – 43% |
long_prefixed | 23,795.578 ms | 24,538.389 ms | + 3% |
So, it looks that in two cases the speedup is there, and is pretty significant. In one case, specifically designed to be worst case scenario – we lost ~ 3% of the performance.
I think that it's definitely worth if – after all, if you have 1 million rows and they all have the same 18 characters – just get rid of the common prefix 🙂
In any way – great stuff, thanks Peter & Robert.
I’m not suggesting this would be a *useful* thing, but what’s the performance on an index on long_prefixed(substr(x,1,10),substr(x,11))?
I read that memcmp() is used for a quick direct-equality test, and if that’s the case there should be a speedup from that.
@Corey:
9.4: 36846.757 ms, 9.5: 16462.407 ms, diff of -55%. Nice 🙂
Which means we can expect speedups on indexes with very low cardinality. While you’re at it, could you test index on the substr(x,1,10) by itself? I’d ask for one on substr(x,11) but that seems like it’d have the same results as the “long” test.
In some collations sorting by prefix of 10 first characters, and then by the rest doesn’t result in the same sort order as sorting by the whole string. For example see this MySQL bug: http://bugs.mysql.com/bug.php?id=12519
@Anssi:
Luckily pg does it correctly, even with 9.5 and abbrev-sorting:
I’m not surprised PostgreSQL does it correctly.
I was referring to the substr index test case, which I believe doesn’t work correctly.
How to create these three tables:short,long,long_prefixed
@puqun: write a short script (one-liner?) that generates the data, and then issue “create table …” and “copy … from …”
@puqun I’ve recently published a [faker_fdw](http://github.com/guedes/faker_fdw) that can helps on generating randomly data for tests. The faker_fdw supports locale e can generate texts, names, phone_number, addresses and a bunch of other things that can be useful when random() is not enough, so it could be another option.