On 8th of February, Peter Eisentraut committed patch:
Per-column collation support This adds collation support for columns and domains, a COLLATE clause to override it per expression, and B-tree index support. Peter Eisentraut reviewed by Pavel Stehule, Itagaki Takahiro, Robert Haas, Noah Misch Branch ------ master
This is really, really great.
Let's see. I have 9.1 database with American locale (en_US.UTF-8):
$ SELECT name, setting FROM pg_settings WHERE name ~ 'lc_'; name | setting -------------+------------- lc_collate | en_US.UTF-8 lc_ctype | en_US.UTF-8 lc_messages | en_US.UTF-8 lc_monetary | en_US.UTF-8 lc_numeric | en_US.UTF-8 lc_time | en_US.UTF-8 (6 ROWS)
Now, let's assume we want some Polish words:
$ CREATE TABLE normal_polish ( some_text text ); CREATE TABLE $ copy normal_polish FROM '/tmp/polish'; COPY 14 $ SELECT * FROM normal_polish ; some_text ----------- ćma łódka śnieg mama ser car lama żółw zorza tata źdźbło coś alfa szopa (14 ROWS)
and some Czech:
$ CREATE TABLE normal_czech ( some_text text ); CREATE TABLE $ copy normal_czech FROM '/tmp/czech'; COPY 8 (depesz@[LOCAL]:5910) 13:56:54 [depesz] $ SELECT * FROM normal_czech ; some_text ----------- 1234 červený chleba cihla damašek žluťoučký kůň zelí (8 ROWS)
The problem is that while sorted outputs can look sensible to others – they are not properly sorted in Polish and Czech languages:
$ SELECT * FROM normal_czech ORDER BY some_text; some_text ----------- 1234 červený chleba cihla damašek kůň zelí žluťoučký (8 ROWS) $ SELECT * FROM normal_polish ORDER BY some_text; some_text ----------- alfa car ćma coś lama łódka mama ser śnieg szopa tata źdźbło żółw zorza (14 ROWS)
Up to 9.0 you couldn't really do anything about it.
But now, I can:
$ CREATE TABLE collated_polish (some_text text COLLATE "pl_PL.utf8"); CREATE TABLE $ copy collated_polish FROM '/tmp/polish'; COPY 14 $ SELECT * FROM collated_polish ORDER BY some_text; some_text ----------- alfa car coś ćma lama łódka mama ser szopa śnieg tata zorza źdźbło żółw (14 ROWS)
It's interesting, because I had pl_PL locales installed when doing initdb, but I didn't have cs_CZ, so it will fail:
$ CREATE TABLE collated_czech (some_text text COLLATE "cs_CZ.utf8"); ERROR: collation "cs_CZ.utf8" FOR CURRENT DATABASE encoding "UTF8" does NOT exist LINE 1: CREATE TABLE collated_czech (some_text text COLLATE "cs_CZ.u... ^
Luckily, with all locale files for cs_CZ installed, fixing it is simple:
$ CREATE COLLATION "cs_CZ.utf8" ( locale = 'cs_CZ.UTF-8' ); CREATE COLLATION $ CREATE TABLE collated_czech (some_text text COLLATE "cs_CZ.utf8"); CREATE TABLE $ copy collated_czech FROM '/tmp/czech'; COPY 8 $ SELECT * FROM collated_czech ORDER BY some_text; ERROR: could NOT CREATE locale "cs_CZ.UTF-8": Success
The last error is actually pretty bad. I mean – we all know stories about error dialogs saying “error: success", but apparently now it did hit PostgreSQL too.
The problem in here is that I added the locale to system, but didn't restart PostgreSQL. So, a quick restart later:
$ SELECT * FROM collated_czech ORDER BY some_text; some_text ----------- cihla červený damašek chleba kůň zelí žluťoučký 1234 (8 ROWS)
And how can you check what collations there are already? Simple, just:
$ SELECT * FROM pg_collation;
Of course you can have columns of different collations within single table, like this:
$ CREATE TABLE collated_mix (english text, polish text COLLATE "pl_PL.utf8", czech text COLLATE "cs_CZ.utf8", basic text COLLATE ucs_basic); CREATE TABLE $ \d collated_mix TABLE "public.collated_mix" COLUMN | TYPE | Modifiers ---------+------+-------------------- english | text | polish | text | COLLATE pl_PL.utf8 czech | text | COLLATE cs_CZ.utf8 basic | text | COLLATE ucs_basic
Now. I did put there 10000 rows, each containing the same (within a row) values. Each value is 10 character string containing only these characters:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!@#$%^&*()-=_+[]{},./<>?;:"|
Data looks like this:
$ SELECT * FROM collated_mix LIMIT 10; english | polish | czech | basic ------------+------------+------------+------------ [m49rLoOdL | [m49rLoOdL | [m49rLoOdL | [m49rLoOdL 4HIQT-Kc*c | 4HIQT-Kc*c | 4HIQT-Kc*c | 4HIQT-Kc*c )#*aD>$e^k | )#*aD>$e^k | )#*aD>$e^k | )#*aD>$e^k /s5;@wIO/v | /s5;@wIO/v | /s5;@wIO/v | /s5;@wIO/v j;_!NEjhqP | j;_!NEjhqP | j;_!NEjhqP | j;_!NEjhqP .Y0z)EvlsY | .Y0z)EvlsY | .Y0z)EvlsY | .Y0z)EvlsY 4l|9i!^@;y | 4l|9i!^@;y | 4l|9i!^@;y | 4l|9i!^@;y VgwGF.UyZ/ | VgwGF.UyZ/ | VgwGF.UyZ/ | VgwGF.UyZ/ ?RR;|8But# | ?RR;|8But# | ?RR;|8But# | ?RR;|8But# 2xM1&^#kez | 2xM1&^#kez | 2xM1&^#kez | 2xM1&^#kez (10 ROWS)
Sorting the rows using different columns show how's that different:
$ SELECT * FROM collated_mix ORDER BY english LIMIT 5; english | polish | czech | basic ------------+------------+------------+------------ <01a=WS%gp | <01a=WS%gp | <01a=WS%gp | <01a=WS%gp 0:1iRvr64t | 0:1iRvr64t | 0:1iRvr64t | 0:1iRvr64t 01mM;g&BP; | 01mM;g&BP; | 01mM;g&BP; | 01mM;g&BP; :01OJsyOf; | :01OJsyOf; | :01OJsyOf; | :01OJsyOf; 01uoah,MT7 | 01uoah,MT7 | 01uoah,MT7 | 01uoah,MT7 (5 ROWS) $ SELECT * FROM collated_mix ORDER BY polish LIMIT 5; english | polish | czech | basic ------------+------------+------------+------------ <01a=WS%gp | <01a=WS%gp | <01a=WS%gp | <01a=WS%gp 0:1iRvr64t | 0:1iRvr64t | 0:1iRvr64t | 0:1iRvr64t 01mM;g&BP; | 01mM;g&BP; | 01mM;g&BP; | 01mM;g&BP; :01OJsyOf; | :01OJsyOf; | :01OJsyOf; | :01OJsyOf; 01uoah,MT7 | 01uoah,MT7 | 01uoah,MT7 | 01uoah,MT7 (5 ROWS) $ SELECT * FROM collated_mix ORDER BY czech LIMIT 5; english | polish | czech | basic ------------+------------+------------+------------ AAA>P>#>^_ | AAA>P>#>^_ | AAA>P>#>^_ | AAA>P>#>^_ +!AA/B/Zug | +!AA/B/Zug | +!AA/B/Zug | +!AA/B/Zug Aa=Ca)K71b | Aa=Ca)K71b | Aa=Ca)K71b | Aa=Ca)K71b aAFJ^1EwsY | aAFJ^1EwsY | aAFJ^1EwsY | aAFJ^1EwsY aAi<dQAhld | aAi<dQAhld | aAi<dQAhld | aAi<dQAhld (5 ROWS) $ SELECT * FROM collated_mix ORDER BY basic LIMIT 5; english | polish | czech | basic ------------+------------+------------+------------ !!1<bJ;jio | !!1<bJ;jio | !!1<bJ;jio | !!1<bJ;jio !!>N(g"mYq | !!>N(g"mYq | !!>N(g"mYq | !!>N(g"mYq !!N>s{*;c^ | !!N>s{*;c^ | !!N>s{*;c^ | !!N>s{*;c^ !!QuMts/1^ | !!QuMts/1^ | !!QuMts/1^ | !!QuMts/1^ !"I7<d.33g | !"I7<d.33g | !"I7<d.33g | !"I7<d.33g (5 ROWS)
The interesting thing is ucs_basic collation, which works like “C" locale. That is – it doesn't care about real order of letters, just compares their numerical values. Which is usually not helpful, until you'd want to do like, on them:
$ CREATE INDEX i1 ON collated_mix (english); CREATE INDEX $ CREATE INDEX i2 ON collated_mix (polish); CREATE INDEX $ CREATE INDEX i3 ON collated_mix (czech); CREATE INDEX $ CREATE INDEX i4 ON collated_mix (basic); CREATE INDEX
And now queries:
$ EXPLAIN SELECT * FROM collated_mix WHERE english LIKE 'abc%'; QUERY PLAN --------------------------------------------------------------- Seq Scan ON collated_mix (cost=0.00..219.00 ROWS=1 width=44) FILTER: (english ~~ 'abc%'::text) (2 ROWS) $ EXPLAIN SELECT * FROM collated_mix WHERE polish LIKE 'abc%'; QUERY PLAN --------------------------------------------------------------- Seq Scan ON collated_mix (cost=0.00..219.00 ROWS=1 width=44) FILTER: (polish ~~ 'abc%'::text) (2 ROWS) $ EXPLAIN SELECT * FROM collated_mix WHERE czech LIKE 'abc%'; QUERY PLAN --------------------------------------------------------------- Seq Scan ON collated_mix (cost=0.00..219.00 ROWS=1 width=44) FILTER: (czech ~~ 'abc%'::text) (2 ROWS) $ EXPLAIN SELECT * FROM collated_mix WHERE basic LIKE 'abc%'; QUERY PLAN ------------------------------------------------------------------------ INDEX Scan USING i4 ON collated_mix (cost=0.00..8.27 ROWS=1 width=44) INDEX Cond: ((basic >= 'abc'::text) AND (basic < 'abd'::text)) FILTER: (basic ~~ 'abc%'::text) (3 ROWS)
Of course, if you know anything about PostgreSQL, you know that you could use text_pattern_ops, and varchar_pattern_ops to get like ‘…%' from indexes, but the thing is – that you couldn't do it on primary key index!
Which made it necessary to have 2 indexes.
Now, you can have table, with primary key being email in ucs_basic collation, and the like operator will work well:
$ CREATE TABLE users ( email text COLLATE ucs_basic PRIMARY KEY ); NOTICE: CREATE TABLE / PRIMARY KEY will CREATE implicit INDEX "users_pkey" FOR TABLE "users" CREATE TABLE $ EXPLAIN SELECT * FROM users WHERE email LIKE 'dep%'; QUERY PLAN ------------------------------------------------------------------------- Bitmap Heap Scan ON users (cost=4.32..14.47 ROWS=7 width=32) FILTER: (email ~~ 'dep%'::text) -> Bitmap INDEX Scan ON users_pkey (cost=0.00..4.32 ROWS=7 width=0) INDEX Cond: ((email >= 'dep'::text) AND (email < 'deq'::text)) (4 ROWS)
YEAH! Congratulations and thanks (a lot of them) go to author and reviewers.
That’s just superb! Looks like this will be great release! 🙂
A big thank you to Peter and all the reviewers for the feature.
And thank you Depesz for the ‘Waiting for’ series.
Am I the only person who is disappointed that “per-column” collation has been accepted rather than doing something with operator classes?
The actual use case for arbitrary collation is being able to change the locale based on a user-specified property, like Accept-Language on HTTP. If a user comes to my website and speaks French, I need to sort the data I return to him in French collation order. If my website primarily serves users who speak English and French, it would be great to be able to have two indexes, one specified with English collation, and one specified with French collation.
The easiest way to make that a reality, meanwhile, is to make this a property of an operator class. Much like I can use text_pattern_ops in order build indexes or do order by operations using the C locale, even when my database is set to en_US, I should be able to get text_eu_US_pattern_ops. In an /ideal/ world, using a “meta operator class”, like text_locale_ops(‘en_US’).
Can someone please explain to me in what way is this ever a property of a column? What is the use case for having multiple columns, each with their own fixed locale?
@Jay:
Reason is very simple. If it was query-based, than you couldn’t have index on the column.
Indexes make sense only if there is well defined less-than and greater-than operators. Which – if you’d make the collation query-definable – wouldn’t be well defined, because it would be different from one query to another.
@Depesz:
Yes, your right. But wy not use functional indexes, like this:
CREATE INDEX same_text_idx ON test (collation(same_text, ‘pl_PL.utf8’));
CREATE INDEX same_text_idx ON test (collation(same_text, ‘cz_CZ.utf8’));
…
SELECT * FROM test ORDER BY collation(same_text, ‘cz_CZ.utf8’);
?
@Marcin:
Function based collation is possible in earlier Pgs too, just write smart enough collation() function.
So it’s not that it has been chosen instead, but rather – new way has been added, which is based on normal indexes, and one-time definition of data collation (instead of redefinition on every query).
Yes, but it’s pity that PostgreSQL don’t offer method like collation(str, locale) from my example, just “out of the box”;-)
@Depesz:
The problem with defining it with the data is that it isn’t a property of the data. You now expect to find 7 identical columns in a row that all collate differently, but otherwise have the exact same contents. What you want are simply 7 different indexes, and we already have that feature using operator classes: you define indexes for the queries you want to accelerate, and you can specify the collations you want to support. You already can already use a custom operator during ORDER BY (thanks to USING), and you already can use a custom operator for comparison (thanks to, well, operators), but what you can’t do is have a “meta operator class”, something that takes an argument (the name of a collation) and returns an operator class: we don’t have a solution for the fact that there are an arbitrary number of collations. If this one feature were added, then this collation feature would already be supported, and we wouldn’t need to duplicate our data in our rows: the engine is already smart enough to choose the right index from the 7 indexes we have based on the operators being used (and if you don’t believe that, an obvious case is text_pattern_ops, which already smooths over a specific collation issue between whatever you are currently using and C collation for byte-based ordering).