couple of times is was brought to attention of #postgresql channel – how to find longest prefix.
case: you have phone number, and you want to find which carrier it is bound to.
there are at least 3 ways of finding it, and i decided to take a look at which is fastest.
first, datastructure.
basically simple table like:
CREATE TABLE prefixes (id serial PRIMARY KEY, prefix text NOT NULL UNIQUE);
would be enough, but i got data from dim and it contained more columns, and i was lazy enough not to remove them from test data.
so, finally i used this table structure:
CREATE TABLE prefixes ( id serial PRIMARY KEY, prefix text NOT NULL UNIQUE, operator text, something1 text, something2 text );
now, we have 3 separate ways:
1. andrewsn suggested way:
< AndrewSN> depesz: the idea IS TO pick a prefix LENGTH, e.g. 4, AND CREATE an INDEX ON SUBSTRING(prefix FOR 4) WHERE LENGTH(prefix) >= 4 < AndrewSN> depesz: THEN CREATE separate functional indexes ON (prefix) WHERE LENGTH(prefix) = 3, AND = 2, AND = 1 < AndrewSN> depesz: THEN the query conditions look LIKE (assuming $1 IS the NUMBER TO be looked up) < AndrewSN> depesz: WHERE ((LENGTH(prefix) >= 4 AND LENGTH($1) >= 4 AND SUBSTRING(prefix FOR 4)=SUBSTRING($1 FOR 4)) OR (LENGTH(prefix)=3 AND prefix=SUBSTRING($1 FOR 3)) OR (LENGTH(prefix=2) A ND prefix=SUBSTRING($1 FOR 2)) OR (LENGTH(prefix=1) AND prefix=SUBSTRING($1 FOR 1))) AND $1 LIKE (prefix || '%') ORDER BY LENGTH(prefix) DESC < AndrewSN> depesz: which looks insane, I know, but the idea IS that it should be a BitmapAnd OF the functional indexes < AndrewSN> depesz: I mean BitmapOr OF course
2. dim decided to code custom gist index opclass (available on pgfoundry)
3. i thought of simply using “where prefix in (…) order by length(prefix) desc limit 1", where this “in" would be generated by getting all prefixes of phone number.
for example for phone number 0123456789, we would have: prefix in (‘0′, '01', ‘012', ‘0123', ‘01234', ‘012345', …)
andrewsn way has one uncertainty: which prefix length to pick.
since max(length(prefix)) in my test data is 7 i decided to test andrewsn solution 7 times on 7 separate tables with chosen prefix length 1, 2, 3, 4, 5, 6 and 7.
so, finally all tables had the same structure, but they had different indexes:
- my solution: no extra indexes
- dim: CREATE INDEX idx_prefix ON prefixes_dim USING GIST(prefix gist_prefix_ops);
- andrewsn solution with picked length 1:
CREATE INDEX pa1_1 ON prefixes_andrewsn_1 (SUBSTRING(prefix FOR 1));
- andrewsn solution with picked length 2:
CREATE INDEX pa2_1 ON prefixes_andrewsn_2 (prefix) WHERE LENGTH(prefix) = 1; CREATE INDEX pa2_2 ON prefixes_andrewsn_2 (SUBSTRING(prefix FOR 2)) WHERE LENGTH(prefix) >= 2;
- andrewsn solution with picked length 3:
CREATE INDEX pa3_1 ON prefixes_andrewsn_3 (prefix) WHERE LENGTH(prefix) = 1; CREATE INDEX pa3_2 ON prefixes_andrewsn_3 (prefix) WHERE LENGTH(prefix) = 2; CREATE INDEX pa3_3 ON prefixes_andrewsn_3 (SUBSTRING(prefix FOR 3)) WHERE LENGTH(prefix) >= 3;
- andrewsn solution with picked length 4:
CREATE INDEX pa4_1 ON prefixes_andrewsn_4 (prefix) WHERE LENGTH(prefix) = 1; CREATE INDEX pa4_2 ON prefixes_andrewsn_4 (prefix) WHERE LENGTH(prefix) = 2; CREATE INDEX pa4_3 ON prefixes_andrewsn_4 (prefix) WHERE LENGTH(prefix) = 3; CREATE INDEX pa4_4 ON prefixes_andrewsn_4 (SUBSTRING(prefix FOR 4)) WHERE LENGTH(prefix) >= 4;
- andrewsn solution with picked length 5:
CREATE INDEX pa5_1 ON prefixes_andrewsn_5 (prefix) WHERE LENGTH(prefix) = 1; CREATE INDEX pa5_2 ON prefixes_andrewsn_5 (prefix) WHERE LENGTH(prefix) = 2; CREATE INDEX pa5_3 ON prefixes_andrewsn_5 (prefix) WHERE LENGTH(prefix) = 3; CREATE INDEX pa5_4 ON prefixes_andrewsn_5 (prefix) WHERE LENGTH(prefix) = 4; CREATE INDEX pa5_5 ON prefixes_andrewsn_5 (SUBSTRING(prefix FOR 5)) WHERE LENGTH(prefix) >= 5;
- andrewsn solution with picked length 6:
CREATE INDEX pa6_1 ON prefixes_andrewsn_6 (prefix) WHERE LENGTH(prefix) = 1; CREATE INDEX pa6_2 ON prefixes_andrewsn_6 (prefix) WHERE LENGTH(prefix) = 2; CREATE INDEX pa6_3 ON prefixes_andrewsn_6 (prefix) WHERE LENGTH(prefix) = 3; CREATE INDEX pa6_4 ON prefixes_andrewsn_6 (prefix) WHERE LENGTH(prefix) = 4; CREATE INDEX pa6_5 ON prefixes_andrewsn_6 (prefix) WHERE LENGTH(prefix) = 5; CREATE INDEX pa6_6 ON prefixes_andrewsn_6 (SUBSTRING(prefix FOR 6)) WHERE LENGTH(prefix) >= 6;
- andrewsn solution with picked length 7:
CREATE INDEX pa7_1 ON prefixes_andrewsn_7 (prefix) WHERE LENGTH(prefix) = 1; CREATE INDEX pa7_2 ON prefixes_andrewsn_7 (prefix) WHERE LENGTH(prefix) = 2; CREATE INDEX pa7_3 ON prefixes_andrewsn_7 (prefix) WHERE LENGTH(prefix) = 3; CREATE INDEX pa7_4 ON prefixes_andrewsn_7 (prefix) WHERE LENGTH(prefix) = 4; CREATE INDEX pa7_5 ON prefixes_andrewsn_7 (prefix) WHERE LENGTH(prefix) = 5; CREATE INDEX pa7_6 ON prefixes_andrewsn_7 (prefix) WHERE LENGTH(prefix) = 6; CREATE INDEX pa7_7 ON prefixes_andrewsn_7 (SUBSTRING(prefix FOR 7)) WHERE LENGTH(prefix) >= 7;
as for queries.
- depesz way:
SELECT * FROM prefixes_depesz WHERE prefix IN (?, ?, ?, ?, ?, ?, ?) ORDER BY LENGTH(prefix) DESC LIMIT 1
- dim way:
SELECT * FROM prefixes_dim WHERE prefix @> ? ORDER BY LENGTH(prefix) DESC LIMIT 1
- andrewsn1 way:
SELECT * FROM prefixes_andrewsn_1 WHERE SUBSTRING(prefix FOR 1) = ? ORDER BY LENGTH(prefix) DESC LIMIT 1
- andrewsn2 way:
SELECT * FROM prefixes_andrewsn_2 WHERE ( LENGTH(prefix) = 1 AND prefix = ? ) OR ( LENGTH(prefix) >= 2 AND SUBSTRING(prefix FOR 2) = ? ) ORDER BY LENGTH(prefix) DESC LIMIT 1
- andrewsn3 way:
SELECT * FROM pixes_andrewsn_3 WHERE ( LENGTH(prefix) = 1 AND prefix = ? ) OR ( LENGTH(prefix) = 2 AND prefix = ? ) OR ( LENGTH(prefix) >= 3 AND SUBSTRING(prefix FOR 3) = ? ) ORDER BY LENGTH(prefix) DESC LIMIT 1
- andrewsn4 way:
SELECT * FROM pixes_andrewsn_4 WHERE ( LENGTH(prefix) = 1 AND prefix = ? ) OR ( LENGTH(prefix) = 2 AND prefix = ? ) OR ( LENGTH(prefix) = 3 AND prefix = ? ) OR ( LENGTH(prefix) >= 4 AND SUBSTRING(prefix FOR 4) = ? ) ORDER BY LENGTH(prefix) DESC LIMIT 1
- andrewsn5 way:
SELECT * FROM pixes_andrewsn_5 WHERE ( LENGTH(prefix) = 1 AND prefix = ? ) OR ( LENGTH(prefix) = 2 AND prefix = ? ) OR ( LENGTH(prefix) = 3 AND prefix = ? ) OR ( LENGTH(prefix) = 4 AND prefix = ? ) OR ( LENGTH(prefix) >= 5 AND SUBSTRING(prefix FOR 5) = ? ) ORDER BY LENGTH(prefix) DESC LIMIT 1
- andrewsn6 way:
SELECT * FROM pixes_andrewsn_6 WHERE ( LENGTH(prefix) = 1 AND prefix = ? ) OR ( LENGTH(prefix) = 2 AND prefix = ? ) OR ( LENGTH(prefix) = 3 AND prefix = ? ) OR ( LENGTH(prefix) = 4 AND prefix = ? ) OR ( LENGTH(prefix) = 5 AND prefix = ? ) OR ( LENGTH(prefix) >= 6 AND SUBSTRING(prefix FOR 6) = ? ) ORDER BY LENGTH(prefix) DESC LIMIT 1
- andrewsn7 way:
SELECT * FROM prefixes_andrewsn_7 WHERE ( LENGTH(prefix) = 1 AND prefix = ? ) OR ( LENGTH(prefix) = 2 AND prefix = ? ) OR ( LENGTH(prefix) = 3 AND prefix = ? ) OR ( LENGTH(prefix) = 4 AND prefix = ? ) OR ( LENGTH(prefix) = 5 AND prefix = ? ) OR ( LENGTH(prefix) = 6 AND prefix = ? ) OR ( LENGTH(prefix) >= 7 AND SUBSTRING(prefix FOR 7) = ? ) ORDER BY LENGTH(prefix) DESC LIMIT 1
i wrote a simple program which:
- generates list of 100 numbers, each of them is 12digit long
- repeats 1000 times:
- for every method of finding prefix, find prefixes for all 100 numbers
- show time summary
generated output:
depesz : 65.059978 : 100.000 andrewsn3 : 85.131860 : 130.851 andrewsn2 : 90.736707 : 139.466 andrewsn4 : 94.233569 : 144.841 andrewsn5 : 107.538550 : 165.291 andrewsn6 : 122.616948 : 188.468 andrewsn7 : 138.959002 : 213.586 dim : 198.355827 : 304.881 andrewsn1 : 289.945295 : 445.658
first column is data access method name, second – time in seconds (this time was taken to find 100,000 prefixes), third column – how many percent of fastest method it took.
according to this data it looks like the simplest solution is the fastest. but. andrewsn solution is very promising and i think that there might be datasets that it would be faster than mine. dim solution doesn't look impressive at first sight, but as far as i know, this is very early stage of the project, so in long term this solution has very good chance to be the fastest one.
if you'd like to repeat the benchmark yourself, feel free to grab the test script.
as for test data – i'm not sure if i can distribute it, i'll find out, and if yes – i'll link it here. otherwise – just generate your own data, it's not complicated 🙂
Looks like we came to the same conclusion. 😉
http://trac.asterisk2billing.org/cgi-bin/trac.cgi/changeset/469
depesz, can you test how fast compared to others is:
select * from prefixes where $1 like prefix||’%’ and prefix > substring($1 from 1 for 1) order by prefix desc limit 1;
Can’t get perl compiling on my laptop :7
The problem with phone numbers is usually not with the prefix but with the suffix. So, you usually have country_code-area_code-switch_prefix-subscriber_number (e.g. +1 212 123-4567). Why the suffix matter? Because when using caller IDs you end up receiving just part of the number instead of it all (so you might receive, for example, 123-4567 or 212 123-4567). Matching those is a more interesting problem specially because of the diversity and common cases one might encounter (i.e. we can have 212 123-4567 and also 716 123-4567 and both are different numbers). As a rule, if you don’t receive the area code you can assume it is the same as the number that is receiving the call, so you can add that programmatically (and go up to the country code) to help… (And then reduce to your prefix solution.)
Anyway, it is very nice to see this post here. Congratulations.
Hi,
What do you think of my method of searching the longest prefix?
BestRegards~~Piotrek~~pe3no.
CREATE TABLE prefixes (prefix varchar(20) PRIMARY KEY, description varchar(50) NOT NULL)
CREATE TABLE calls (call_id int PRIMARY KEY, called_num varchar(20) NOT NULL)
INSERT INTO prefixes (prefix, description)
VALUES
(‘0048’, ‘Poland’),
(‘004822’, ‘Poland – Warsaw’),
(‘0048228’, ‘Poland – Warsaw – Centrum ;)’),
(‘004860’, ‘Poland – Mobile’),
(‘0048601’, ‘Poland – Mobile – Plus GSM’),
(‘0048602’, ‘Poland – Mobile – Era GSM’)
INSERT INTO calls (call_id, called_num)
VALUES
(1, ‘0048127654321’),
(2, ‘0048229876543’),
(3, ‘0048606987654’),
(4, ‘0048601234567’),
(5, ‘0048602787878’),
(6, ‘0048228262626’),
(7, ‘0048121234567’),
(8, ‘0048221234567’),
(9, ‘0048607123456’),
(10, ‘0048601987654’),
(11, ‘0048602131313’),
(12, ‘0048228888888’)
SELECT c.call_id, c.called_num,
(SELECT MAX(p.prefix) FROM prefixes AS p
WHERE c.called_num LIKE p.prefix || ‘%’) AS selected_prefix
FROM calls AS c
@Kristo Kaiv:
i will test it, but i have to do it at home to have comparable results. so i will post results today in the evening.
@Piotr:
your query will search for prefix for one given number using:
SELECT MAX(p.prefix) FROM prefixes AS p
WHERE ‘SOME-NUMBER’ LIKE p.prefix || ‘%’
which enforces sequential scan and thus is very slow.
it might be good in the example you showed (when you’re finding prefixes for multiple phone numbers at once) but for finding prefix for one phone number it will be very slow (i mean – prefix tables are usually not very big, so it will not be very-very slow, but it will be slow in comparison to any indexable method).
your idea is very nice.
I frote function prefixes:
create or replace function prefixes(varchar)
returns varchar[] as
$$select array(select substring($1 from 1 for i) from generate_series(1, length($1)) g(i))::varchar[];
$$ language sql immutable;
postgres=# select * from prefixesx where costcode = any (prefixes(‘420724191000’)) order by length(costcode) desc limit 1;
costcode_name | costcode
———————————+———-
Czech Republic Mobile – EuroTel | 42072
(1 row)
it’s 4x faster than like on 4800 prefixes.
depesz, we actually use this in live and belive me it does not do seq scan.
explain analyze select * from prefixes where ‘372123123’ like prefix||’%’ and prefix > substring(‘372123123’ from 1 for 1) order by prefix desc limit 1;
QUERY PLAN
————————————————————————————————————————————————
Limit (cost=0.00..19.88 rows=1 width=132) (actual time=5.664..5.665 rows=1 loops=1)
-> Index Scan Backward using prefixes_prefix_key on prefixes (cost=0.00..99.40 rows=5 width=132) (actual time=5.660..5.660 rows=1 loops=1)
Index Cond: (prefix > ‘3’::text)
Filter: (‘372123123’::text ~~ (prefix || ‘%’::text))
Total runtime: 5.701 ms
(5 rows)
forgot to add postgres 8.2 table =
d prefixes
Table “public.prefixes”
Column | Type | Modifiers
————+———+——————————————————-
id | integer | not null default nextval(‘prefixes_id_seq’::regclass)
prefix | text | not null
operator | text |
something1 | text |
something2 | text |
Indexes:
“prefixes_pkey” PRIMARY KEY, btree (id)
“prefixes_prefix_key” UNIQUE, btree (prefix)
@Kristo Kaiv:
i didn’t say your solution uses seq scan. so there is no “belive me” in this.
i know it can use index for this part of query: prefix > substring(’372123123′ from 1 for 1)
but i never said that the query wouldn’t use index.
i said that query of piotr will not use index. and his query is different from yours.
i think while cutting the whole unrelated date stuff from the function i left out some important details also. so the query to test is actually :
explain analyze select * from prefixes where substring(‘372123123’,1,length(‘372123123’)) = prefix and prefix >= substring(‘372123123’ from 1 for 1) and prefix Index Scan Backward using prefixes_prefix_key on prefixes (cost=0.00..2.27 rows=1 width=132) (actual time=0.048..0.048 rows=0 loops=1)
Index Cond: ((‘372123123’::text = prefix) AND (prefix >= ‘3’::text) AND (prefix <= ‘372123123’::text))
Total runtime: 0.078 ms
(4 rows)
@Kristo Kaiv:
hmm .. your wquery (as i look at it) seems to *guarantee* no rows returned.
this part makes is look very bad:
where substring(’372123123′,1,length(’372123123′)) = prefix
this is basically:
where ’372123123′ = prefix
which i find highly unlikely to work (prefixes are usually smaller strings).
it might be that you wanted:
where substring(’372123123′,1,length(prefix)) = prefix
but i’m not sure. in mean time i will test your solution from your first comment:
select * from prefixes where $1 like prefix||’%’ and prefix > substring($1 from 1 for 1) order by prefix desc limit 1;
which looks quite slow, but at the very least looks like it will return correct rows 🙂
i am the master of typos 😛
explain analyze select * from prefixes where substring(‘372123123’,1,length(prefix)) = prefix and prefix >= substring(‘372123123’ from 1 for 1) and prefix Index Scan Backward using prefixes_prefix_key on prefixes (cost=0.00..15.66 rows=1 width=132) (actual time=0.197..0.197 rows=1 loops=1)
Index Cond: ((prefix >= ‘3’::text) AND (prefix <= ‘372123123’::text))
Filter: (“substring”(‘372123123’::text, 1, length(prefix)) = prefix)
Total runtime: 0.271 ms
ok. i tested the method from your first comment. here are results:
depesz : 65.585102 : 100.000
andrewsn3 : 84.334670 : 128.588
andrewsn2 : 89.335328 : 136.213
andrewsn4 : 94.785839 : 144.523
andrewsn5 : 108.169379 : 164.930
andrewsn6 : 122.670495 : 187.040
andrewsn7 : 140.192055 : 213.756
kristo : 192.252063 : 293.134
dim : 199.715589 : 304.514
andrewsn1 : 326.747219 : 498.203
ok. i tested also second version from kristo:
‘kristo’ => “select * from prefixes_kristo where ? like prefix||’%’ and prefix > substring(? from 1 for 1) order by prefix desc limit 1”,
‘kristo2’ => “select * from prefixes_kristo where substring(?,1,length(prefix)) = prefix and prefix >= substring(? from 1 for 1)”,
results:
depesz : 67.580065 : 100.000
kristo : 131.246875 : 194.209
kristo2 : 133.002210 : 196.807
must give your version to the QA guys to test then 🙂 thanks for the feedback. I must say your blog is one of the most interesting ones out there 😉
@Pavel Stehule:
sorry, i somehow managed to overlook your comment.
basically your idea is the same as my, but you make prefix list in postgresql, and i made it in client app.
so, i added your function to my test database, and ran test with this sqls:
‘depesz’ => ‘select * from prefixes_depesz where prefix in (?, ?, ?, ?, ?, ?, ?) order by length(prefix) desc limit 1’,
‘pavel’ => ‘select * from prefixes_depesz where prefix = any(prefixes(?)) order by length(prefix) desc limit 1’,
results:
depesz : 64.798963 : 100.000
pavel : 74.329229 : 114.707
your solution is a bit slower, but it is anyway faster than any other that i tested. so it might be perfect for programmer/languages where generating prefix list is more complicated than in perl.
i also tested your approach, but with function written in pl/perlu:
create or replace function prefixes_pl(text)
returns text[] as
$$
my $num = shift;
return [ map { substr( $num, 0, $_ ) } ( 1 .. length($num) ) ];
$$ language plperlu immutable;
plus, i modified my base approach to test all prefixes, and not only first 7. results:
=> ./prefixes.check.pl
pavel_pl : 67.981123 : 100.000
depesz : 70.161244 : 103.207
pavel : 72.278121 : 106.321
now. that’s a nice one!
final sqls:
‘depesz’ => ‘select * from prefixes_depesz where prefix in (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) order by length(prefix) desc limit 1’,
‘pavel’ => ‘select * from prefixes_depesz where prefix = any(prefixes(?)) order by length(prefix) desc limit 1’,
‘pavel_pl’ => ‘select * from prefixes_depesz where prefix = any(prefixes_pl(?)) order by length(prefix) desc limit 1’,
test codes:
sub test_depesz {
my $num = shift;
return $dbh->selectall_arrayref( $sqls{ ‘depesz’ }, undef, map { substr( $num, 0, $_ ) } ( 1 .. length($num) ) );
}
sub test_pavel {
my $num = shift;
return $dbh->selectall_arrayref( $sqls{ ‘pavel’ }, undef, $num );
}
sub test_pavel_pl {
my $num = shift;
return $dbh->selectall_arrayref( $sqls{ ‘pavel_pl’ }, undef, $num );
}
btw, the prefixes that you are generating in the testscript, are they random or based on the numbers in the prefix table? In real life env something like 95% of searches or more actually find a match.
Kristo Kaiv:
totally random.
i could change/fix it, but frankly – i’m not sure if there is any point.
there is actually, i once created a func that was optimized for prefix not found scenarios and it performed 3x better than the one pasted here. however when doing performance tests on live CDR data it failed miserably. It was using basically the same kind of appoach you used though i am not sure about the exact implementation as it was already almost 1.5 years ago. Thats why i am also so interested in this topic 🙂
I implemented a custom gist index opclass for this about 7 months ago and have been using it since. The usage is the same as dim’s (using the @> operator). I was curious about the performance relative to these methods so I ran the test script:
jordan : 3.669391 : 100.000
depesz : 4.454818 : 121.405
andrewsn2 : 5.786777 : 157.704
andrewsn3 : 6.819685 : 185.853
andrewsn4 : 8.394706 : 228.777
andrewsn1 : 8.956821 : 244.096
andrewsn5 : 10.161373 : 276.923
andrewsn6 : 12.161559 : 331.433
andrewsn7 : 14.305387 : 389.857
I’m unsure of how sensitive the performance is to the distribution of test data, but it seems to be doing quite well here.
@jordan:
that’s great. is there any chance you could get in contact with dim and share the code with him? he has prefix project available on pgfoundry, and i think that we could all use such a robust solution.
The new version of prefix is now implementing a prefix_range datatype and its operators.
As a result, lookups are much more faster than when indexing text.
I’ve also implemented Jordan picksplit() idea, and here are the results:
dim ~/PostgreSQL/pgfoundry/prefix time perl prefixes.CHECK.pl
Generating numbers… done
benching 10 runs: ……….
dim_pr : 4.772673 : 100.000
dimjordan : 4.877317 : 102.193
depesz : 5.029574 : 105.383
andrewsn3 : 5.097337 : 106.803
andrewsn4 : 5.226833 : 109.516
andrewsn2 : 5.435897 : 113.896
andrewsn5 : 5.475583 : 114.728
depesz_pg : 5.661525 : 118.624
andrewsn6 : 5.704492 : 119.524
andrewsn7 : 5.856008 : 122.699
dim : 6.246600 : 130.883
andrewsn1 : 6.302079 : 132.045
——————————–
real 1m5.749s
user 0m2.704s
sys 0m0.612s
dim ~/PostgreSQL/pgfoundry/prefix time perl prefixes.CHECK.pl
Generating numbers… done
benching 100 runs: ……………………………………………………………………………………….
dim_pr : 49.185813 : 100.000
dimjordan : 49.369493 : 100.373
depesz : 49.654974 : 100.954
andrewsn3 : 51.572550 : 104.852
andrewsn4 : 52.429459 : 106.595
andrewsn2 : 52.659229 : 107.062
andrewsn5 : 54.624177 : 111.057
depesz_pg : 55.086981 : 111.998
andrewsn6 : 56.706777 : 115.291
andrewsn7 : 58.600456 : 119.141
dim : 61.699087 : 125.441
andrewsn1 : 68.686213 : 139.646
——————————–
real 11m0.359s
user 0m26.462s
sys 0m5.388s
The new testing code is available at http://pgsql.tapoueh.org/prefix/prefixes.check.pl
The new prefix GiST indexing code is available at the CVS page of the project at pgfoundry, see http://pgfoundry.org/projects/prefix