On 27th of November 2024, Peter Eisentraut committed patch:
Support LIKE with nondeterministic collations This allows for example using LIKE with case-insensitive collations. There was previously no internal implementation of this, so it was met with a not-supported error. This adds the internal implementation and removes the error. The implementation follows the specification of the SQL standard for this. Unlike with deterministic collations, the LIKE matching cannot go character by character but has to go substring by substring. For example, if we are matching against LIKE 'foo%bar', we can't start by looking for an 'f', then an 'o', but instead with have to find something that matches 'foo'. This is because the collation could consider substrings of different lengths to be equal. This is all internal to MatchText() in like_match.c. The changes in GenericMatchText() in like.c just pass through the locale information to MatchText(), which was previously not needed. This matches exactly Generic_Text_IC_like() below. ILIKE is not affected. (It's unclear whether ILIKE makes sense under nondeterministic collations.) This also updates match_pattern_prefix() in like_support.c to support optimizing the case of an exact pattern with nondeterministic collations. This was already alluded to in the previous code. (includes documentation examples from Daniel Vérité and test cases from Paul A Jungwirth) Reviewed-by: Jian He <jian.universality@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/700d2e86-bf75-4607-9cf2-f5b7802f6e88@eisentraut.org
This is awesome. But why?
First, what are nondeterministic collations. Despite the name it doesn't mean that they are random. Deterministic collation means that equality can only be achieved when dealing with strings that are byte identical.
Which means, for example, that with deterministic collation “żółw" ( U&'\017c\00f3\0142w' ) and “żółw" ( U&'z\0307o\0301\0142w' ) are not the same, because their byte representation differs.
While this might sound esoteric for you, think about collations that are case insensitive. If you want ‘A' being equal to ‘a' – you need collation that is not deterministic. Like this one:
=$ CREATE COLLATION case_insensitive ( provider = icu, locale = 'und-u-ks-level2', deterministic = FALSE ); =$ CREATE TABLE testit (x text COLLATE case_insensitive ); =$ INSERT INTO testit SELECT unnest('{a,A,b,B}'::text[]); =$ SELECT * FROM testit WHERE x = 'a'; x ─── a A (2 ROWS)
The problem is, though, that will Pg before v18, you can't use like on such columns:
=# SELECT * FROM testit WHERE x LIKE 'a'; ERROR: nondeterministic collations are NOT supported FOR LIKE
But now, in pg18, I can use it:
=$ SELECT * FROM testit WHERE x LIKE 'b'; x ─── b B (2 ROWS)
Sweet. Thanks to all involved 🙂