How much speed you’re leaving at the table if you use default locale?

I've been to PGConf.dev recently, and one of the talks was about collations.

The whole talk was interesting (to put it mildly), but the thing that stuck with me is that we really shouldn't be using default collation provider (libc with locale based collation), unless it's really needed, because we're wasting performance. But how much of a hit is it?

Figured that I can simply test it.

With PostgreSQL 17 we have three separate collation providers:

  • libc – default usually
  • icu – handles pretty complex collations related to unicode, with various cool tricks, like proper number sorting
  • new, builtin provider. Very limited in functionality (only C and C.UTF-8 collations), but should be fast

Based on some talks decided to test following collation provider/collations:

  • libc / C
  • libc / C.UTF-8
  • libc / en_US.UTF-8
  • icu / und (which is basically like libc/C.UTF-8)
  • icu / en-US
  • builtin / C
  • builtin / C.UTF-8

For all of them I will test following cases:

  • select specific row, without index, based on equality check of text column – without index
  • select subset of 1000 rows, based on ≥/≤ check – without index
  • create index on table with the same 1 million rows (unique values, but index definition will not use unique)
  • add 2nd million of rows to table that has already 1 million rows inserted, and index created
  • select specific row, without index, based on equality check of text column – with index
  • select subset of 1000 rows, based on ≥/≤ check – with index

First things first, table creation is:

=$ CREATE TABLE test_locale_speed_libc_en (
    id INT8 generated always AS IDENTITY PRIMARY KEY,
    the_text TEXT COLLATE "en_US"
);
=$ CREATE TABLE test_locale_speed_libc_c (
    id INT8 generated always AS IDENTITY PRIMARY KEY,
    the_text TEXT COLLATE "C"
);
=$ CREATE TABLE test_locale_speed_libc_cutf8 (
    id INT8 generated always AS IDENTITY PRIMARY KEY,
    the_text TEXT COLLATE "C.utf8"
);
=$ CREATE TABLE test_locale_speed_icu_und (
    id INT8 generated always AS IDENTITY PRIMARY KEY,
    the_text TEXT COLLATE "und-x-icu"
);
=$ CREATE TABLE test_locale_speed_icu_en (
    id INT8 generated always AS IDENTITY PRIMARY KEY,
    the_text TEXT COLLATE "en-US-x-icu"
);
=$ CREATE TABLE test_locale_speed_builtin_c (
    id INT8 generated always AS IDENTITY PRIMARY KEY,
    the_text TEXT COLLATE "ucs_basic"
);
=$ CREATE TABLE test_locale_speed_builtin_cutf8 (
    id INT8 generated always AS IDENTITY PRIMARY KEY,
    the_text TEXT COLLATE "pg_c_utf8"
);

Then, I will load en-first-1m.txt file into each of the tables, using:

=$ \copy test_locale_speed_X (the_text) FROM en-first-1m.txt

And then we're off to races. Selecting single row will be made by running test-01.sql.

This test gets 100 rows from the table using equality on text column. There are 100 queries, each selecting single row. To compare I took summarized time across 100 queries, and I repeated whole test 5 times picking the best time. First results:

Test case: Time (ms): Time vs. best:
test_locale_speed_builtin_c 4921.880 + 24.82%
test_locale_speed_builtin_cutf8 4898.380 + 24.22%
test_locale_speed_icu_en 5961.120 + 51.18%
test_locale_speed_icu_und 5913.930 + 49.98%
test_locale_speed_libc_c 3943.170
test_locale_speed_libc_cutf8 5966.460 + 51.31%
test_locale_speed_libc_en 5943.030 + 50.72%

Second test: test-02.sql runs 10 queries each selecting around 1000 rows using ≥ and ≤ comparisons.

Again, I sum execution time of all 10 queries, and then repeat the whole thing 5 times picking the best time. Results:

Test case: Time (ms): Time vs. best:
test_locale_speed_builtin_c 741.743 + 27.79%
test_locale_speed_builtin_cutf8 741.690 + 27.78%
test_locale_speed_icu_en 1037.260 + 78.70%
test_locale_speed_icu_und 1021.280 + 75.95%
test_locale_speed_libc_c 580.443
test_locale_speed_libc_cutf8 1034.950 + 78.30%
test_locale_speed_libc_en 1202.010 + 107.08%

Next test: test-03.sql: index creation. Very simple, just drop index if it already exists, and then time how long it takes to create new index on each of the tables. Results:

Test case: Time (ms): Time vs. best:
test_locale_speed_builtin_c 440.399 + 1.12%
test_locale_speed_builtin_cutf8 436.802 + 0.29%
test_locale_speed_icu_en 613.745 + 40.92%
test_locale_speed_icu_und 606.109 + 39.17%
test_locale_speed_libc_c 435.518
test_locale_speed_libc_cutf8 718.431 + 64.96%
test_locale_speed_libc_en 1288.768 + 195.92%

Next test, adding one more million of rows is a bit tricker, as I still want to repeat it 5 times and get best time, but for each test I will have to clean the db and recreate everything from scratch. This needs slightly more complex test file: test-04.sql. If you'd like to repeat it for yourself, you will also need en-second-2m.txt.bz2 file, with second million of rows.

Results:

Test case: Time (ms): Time vs. best:
test_locale_speed_builtin_c 4202.005 + 4.13%
test_locale_speed_builtin_cutf8 4299.999 + 6.55%
test_locale_speed_icu_en 5137.570 + 27.31%
test_locale_speed_icu_und 5102.079 + 26.43%
test_locale_speed_libc_c 4035.496
test_locale_speed_libc_cutf8 4988.285 + 23.61%
test_locale_speed_libc_en 6314.744 + 56.48%

And then back to selecting rows. First test-05.sql, which is like test-01.sql but it will work with 2 million rows, and an index. Results:

Test case: Time (ms): Time vs. best:
test_locale_speed_builtin_c 1.572 + 4.80%
test_locale_speed_builtin_cutf8 1.783 + 18.87%
test_locale_speed_icu_en 2.017 + 34.47%
test_locale_speed_icu_und 2.001 + 33.40%
test_locale_speed_libc_c 1.500
test_locale_speed_libc_cutf8 1.743 + 16.20%
test_locale_speed_libc_en 1.974 + 31.60%

And then back to selecting rows. First test-06.sql, which is like test-02.sql but it will work with 2 million rows, and an index. Results:

Test case: Time (ms): Time vs. best:
test_locale_speed_builtin_c 14.500 + 4.47%
test_locale_speed_builtin_cutf8 14.393 + 3.70%
test_locale_speed_icu_en 14.298 + 3.01%
test_locale_speed_icu_und 13.964 + 0.61%
test_locale_speed_libc_c 14.356 + 3.43%
test_locale_speed_libc_cutf8 13.880
test_locale_speed_libc_en 13.977 + 0.70%

Since all this data can be hard to read a small chart:

[chart showing relative performance of various collations]

It looks that in almost every case libc/C won. In one case it didn't, it was ~ 3.5% slower.

Now, does that say that we should switch all our dbs to use libc/C? Well, not really. For a whole lot of cases C is not sensible. We need ordering that makes sense for us – humans. But for some cases, like identifiers, slugs, anything that doesn't have to strictly conform to order as humans would expect – switching to C collation gives some non trivial performance improvements.

What struck me the most is that builtin C/C.UTF-8 didn't really fare all that well. I don't know why, but especially in seq scan test (which involved checking every single row in table) performance was over 20% worse than libc/C. Otoh, virtually every db I use uses libc/en_US, which proved to be the worst in majority of cases.

Well, whatever the results, I hope you'll find it interesting 🙂

3 thoughts on “How much speed you’re leaving at the table if you use default locale?”

  1. Thanks for the follow-up and for publishing. If I’m reading right, I think these results show the builtin providers are generally faster than ICU and the GNU C Library. But you’re right – I expected libc-C (doesn’t actually using glibc) to be the same performance as the builtin provider, which isn’t what these results show. I’m curious what’s going on here.

  2. Thanks for another interesting article! I’m having a little trouble correlating the bulleted list of collations (beneath “Based on some talks decided to test following collation provider/collations:”) and the CREATE TABLE statements. Are they in the same order?

  3. @Philip:
    no, not the same order, sorry.

    collations are:

    – test_locale_speed_libc_en – libc provider, en_US locale
    – test_locale_speed_libc_c – libc provider, C locale
    – test_locale_speed_libc_cutf8 – libc provider, C.UTF-8 locale
    – test_locale_speed_icu_und – icu provider, und “locale”
    – test_locale_speed_icu_en, icu provider, en-US locale.
    – test_locale_speed_builtin_c, builtin provider, C locale
    – test_locale_speed_builtin_cutf8, builtin provider, C.UTF-8 locale.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.