I was lately writing some program for a client of mine, which used UPC codes matching.
Since the codes are given in various ways, there was decision to pad the codes with leading zeros – up to 12 characters.
The code has been done, and worked like this:
# SELECT lpad('123456789', 12, '0'); lpad -------------- 000123456789 (1 ROW)
Same trick was used on the “other side" of matching, which worked quite well. Until we found that it actually doesn't.
After some insight, and manual rereading it came to me that lpad truncates string to given length. This was a shock for me. I mean – it is documented, but it seems to be so counter-intuitive (at least for me and guys that work in the client company).
The problem was, that from one source we got the code to be: 0123456789012 and from the other: 123456789012.
After lpad normalization we got:
# SELECT lpad('0123456789012', 12, '0'); lpad -------------- 012345678901 (1 ROW) # SELECT lpad('123456789012', 12, '0'); lpad -------------- 123456789012 (1 ROW)
Which is clearly different, and will not match.
I asked on IRC, and breinbaas suggested this approach:
SELECT repeat('0', 12 - LENGTH(col)) || col
This is how it works:
# SELECT repeat('0', 12 - LENGTH(col)) || col FROM (VALUES ('0123456789012'),('123456789012')) AS x (col); ?COLUMN? --------------- 0123456789012 123456789012 (2 ROWS)
Well, it's still not perfect, but we can change “12" to “20", and it will work ok (actually, we will just strip leading 0s, and compare upc after stripping).
The issue with the repeat approach is that it looks bad. Query seems to be overly complicated, and not easy to understand.
Luckily, it's quite easy to fix. I will just add a function that does the padding, but doesn't truncate:
# CREATE FUNCTION lpad_nt(text, int4, VARCHAR(1)) RETURNS text AS $$ SELECT repeat($3, $2 - LENGTH($1)) || $1; $$ LANGUAGE SQL;
As you can see the function is very simple – it's just a wrapper around repeat/length trick. And this is how it works:
# SELECT lpad_nt(col, 14, '0') FROM (VALUES ('0123456789012'),('123456789012')) AS x (col); lpad_nt ---------------- 00123456789012 00123456789012 (2 ROWS)
And that would actually be all – nothing really fancy, just a reminder to myself to be very cautious when using [rl]pad functions.
Why not just cast for bigint and then to text again?
euler=# select lpad(col::bigint::text, 14, ‘0’) from (values (‘0123456789012’), (‘123456789012’)) as x(col);
lpad
—————-
00123456789012
00123456789012
(2 registros)
well, the simplest reason is: adding such casts would require:
1. me being 100% sure that there are always only numbers and nothing else (and I’m not 100% sure)
or
2. adding proper exception handling for cases when cast will break. and such handling will make the thing simply slower.
Or, you could strip leading zeroes, then lpad…
lpad(ltrim(thecolumn,’0′),14,’0′)
On select * from depesz;, Hubert Lubacziewski exposes an lpad() and rpad() gotcha.
Log Buffer #135