On 30th of October, Tom Lane committed patch:
Implement lookbehind constraints in our regular-expression engine. A lookbehind constraint is like a lookahead constraint in that it consumes no text; but it checks for existence (or nonexistence) of a match *ending* at the current point in the string, rather than one *starting* at the current point. This is a long-requested feature since it exists in many other regex libraries, but Henry Spencer had never got around to implementing it in the code we use. Just making it work is actually pretty trivial; but naive copying of the logic for lookahead constraints leads to code that often spends O(N^2) time to scan an N-character string, because we have to run the match engine from string start to the current probe point each time the constraint is checked. In typical use-cases a lookbehind constraint will be written at the start of the regex and hence will need to be checked at every character --- so O(N^2) work overall. To fix that, I introduced a third copy of the core DFA matching loop, paralleling the existing longest() and shortest() loops. This version, matchuntil(), can suspend and resume matching given a couple of pointers' worth of storage space. So we need only run it across the string once, stopping at each interesting probe point and then resuming to advance to the next one. I also put in an optimization that simplifies one-character lookahead and lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND constraints, which already existed in the engine. This avoids the overhead of the LACON machinery entirely for these rather common cases. The net result is that lookbehind constraints run a factor of three or so slower than Perl's for multi-character constraints, but faster than Perl's for one-character constraints ... and they work fine for variable-length constraints, which Perl gives up on entirely. So that's not bad from a competitive perspective, and there's room for further optimization if anyone cares. (In reality, raw scan rate across a large input string is probably not that big a deal for Postgres usage anyway; so I'm happy if it's linear.)
Yeah – I have to write about it, because I actually do love regexps, and I always was bitching that our (PostgreSQL) regexp engine sucks. It still does, but slightly less, so I have to write about it.
What was added – lookbehind. What it is is (somewhat) described above, so let's see how to use it. Unfortunately I don't have ready examples there lookbehind would be the only possible choice of solving the problem, but at the very least you'll see the syntax.
Basically, we got two new “operators" in regexp:
- (?<=...) - positive lookbehind
- (?
Let's assume we have following string:
xxaa12 yyab22 zzbb33
And we want to find a number(s) that are preceded by two letters from (‘a', ‘e', ‘i', ‘o', ‘u', ‘y') set (only “12" in our example, as it's preceeded by “aa"), and return the number together with it's prefixing string.
This can be done, with:
$ SELECT regexp_matches( 'xxaa12 yyab22 zzbb33', E'([^0-9[:space:]]+)(?<=[aeiouy]{2})([0-9]+)', 'g' ); regexp_matches ---------------- {xxaa,12} (1 ROW)
Please note that despite having three elements in ()s, we only got two matching elements – that's because lookbehind (and lookahead) are constraints, and not groupings.
And what if we'd want all numbers that are not immediately after two “vowels"?
$ SELECT regexp_matches( 'xxaa12 yyab22 zzbb33', E'([^0-9[:space:]]+)(?<![aeiouy]{2})([0-9]+)', 'g' ); regexp_matches ---------------- {yyab,22} {zzbb,33} (2 ROWS)
Two matches, as expected.
While zero-width constraints are not very popular, they do have their uses, and I, for one, am very happy that we have it now. If we only could get greedy/non-greedy parts of single RE, that would be simply amazing. And named unicode character classes…
Anyway – I'm happy that it got committed, thanks Tom.
Description of second operators seems became HTML’s tag, it ate a part of the description:
Basically, we got two new “operators” in regexp:
(?<=…) – positive lookbehind
(?