let's assume we have a simple table:
Table "public.test" Column | Type | Modifiers --------+---------+----------- a | integer | not null b | integer | not null Indexes: "test_pkey" PRIMARY KEY, btree (a, b) "q" UNIQUE, btree (b, a)
now, let's insert some rows to it:
# INSERT INTO test SELECT * FROM generate_series(1,100) AS a, generate_series(1,100) AS b;
INSERT 0 10000
remove rows with a = b:
# DELETE FROM test WHERE a = b;
DELETE 100
and prepare the test case by randomly removing some rows:
# DELETE FROM test WHERE random() < 0.002;
DELETE 17
the question is – find all rows (a,b) for which there is no row (a',b') such that a'=b and b'=a.
in other words – every row (a,b) should be paired. for example, a row with a = 2 and b = 3 is paired by the row with a = 3 and b = 2.
how to find incomplete pairs?
the basic approach would be to do something like:
SELECT * FROM test t WHERE NOT EXISTS (SELECT * FROM test t2 WHERE (t2.a, t2.b) = (t.b, t.a))
looks reasonable?
no. unfortunately no.
the execution plan for this query consists of one seq scan of the test table, plus a nested index scan for every row of it.
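you can check the plan yourself with explain analyze (exact plan shape and timings will of course differ between installations):

EXPLAIN ANALYZE
SELECT * FROM test t
WHERE NOT EXISTS (
    SELECT * FROM test t2 WHERE (t2.a, t2.b) = (t.b, t.a)
);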
does it look bad? with 10k rows, it is not really bad – after all, one seq scan over 10k rows plus 10k index scans is doable.
but what if you have 70 million rows in the table?
in this situation it goes simply to 70 million index scans. quick test showed that (out of 10 tries) i got:
- minimal time: 0.047ms
- maximal time: 215ms
- average time: around 0.9-1ms.
full seq scan of the table took about 40 seconds. assuming the best case (0.047ms per index scan), the index scans alone add up to nearly 1 hour. with the average time of around 1ms, the total escalates to … almost 20 hours!
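the arithmetic itself can be double-checked with a throwaway query (numbers taken from the timings above):

SELECT round(70000000 * 0.047 / 1000 / 60)   AS best_case_minutes, -- ~55 minutes
       round(70000000 * 1.0   / 1000 / 3600) AS avg_case_hours;    -- ~19 hours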
oops. now, i don't have that much time for the task 🙂
better options?
perhaps left outer join approach?
SELECT t1.* FROM test t1 LEFT OUTER JOIN test t2 ON (t1.a, t1.b) = (t2.b, t2.a) WHERE t2.a IS NULL
this becomes 2 sequential scans joined with a hash join (with the small, 10k-row table it becomes a merge join of 2 full-table index scans).
better, but can it be done faster?
actually yes:
SELECT least(a,b), greatest(a,b), COUNT(*) FROM test GROUP BY least(a,b), greatest(a,b) HAVING COUNT(*) = 1;
this query does only 1 table scan, with hash aggregate. time is unbeatable 🙂
how does it work?
basically i thought about finding a way to tell postgresql that the pairs (12,50) and (50,12) are “the same”, so i could use grouping.
there is a simple way to do it – using the least() and greatest() functions. least() always returns the smaller, and greatest() the bigger number.
so, for both (12,50) and (50,12), the output of (least(), greatest()) is (12,50)!
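you can see it with a simple throwaway query:

SELECT least(a, b), greatest(a, b)
FROM (VALUES (12, 50), (50, 12)) AS v(a, b);

both rows return (12,50), so they end up in the same group.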
now that i have a way to “group” them, i can use the standard “having” trick to filter only the groups that contain just 1 row.
effect – total execution time for 70 million rows – 5 minutes 🙂
and how to add missing pairs?
first, let's copy the group info to temp table:
CREATE TEMP TABLE depesz_temp AS SELECT least(a,b) AS x, greatest(a,b) AS y, COUNT(*) FROM test GROUP BY least(a,b), greatest(a,b) HAVING COUNT(*) = 1;
now – we need to find which row exactly is there, and insert its pair.
to make the story short – full insert looks like this:
INSERT INTO test (a,b) SELECT (CASE WHEN t.a = q.x THEN q.y ELSE q.x END), (CASE WHEN t.b = q.x THEN q.y ELSE q.x END) FROM test t, depesz_temp q WHERE (t.a, t.b) = (q.x, q.y) OR (t.b, t.a) = (q.x, q.y)
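afterwards, rerunning the grouping query is an easy sanity check – if everything worked, it should return zero rows:

SELECT least(a,b), greatest(a,b)
FROM test
GROUP BY least(a,b), greatest(a,b)
HAVING COUNT(*) = 1;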
you might be tempted to reuse the least/greatest trick in the where condition of the above insert/select, but if you did – the indexes on (a,b) and (b,a) wouldn't be used, so it would become really slow 🙂
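that is, a rewrite along these lines (shown purely as a what-not-to-do sketch) would force sequential scans:

SELECT t.a, t.b
FROM test t, depesz_temp q
WHERE (least(t.a, t.b), greatest(t.a, t.b)) = (q.x, q.y);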
insert into test (a, b)
select min(b), min(a)
from test
group by least(a,b), greatest(a,b)
having count(*) = 1
but test really should be a view (select a, b from test2 union all select b, a from test2) and test2 should have an “a<b” check constraint, plus a trigger that would insert (b,a) if an attempt was made to insert (a,b) where not (a<b).
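a minimal sketch of that idea might look like this (all names illustrative, and it assumes a postgresql version that supports INSTEAD OF triggers on views):

CREATE TABLE test2 (
    a integer NOT NULL,
    b integer NOT NULL,
    PRIMARY KEY (a, b),
    CHECK (a < b)
);

CREATE VIEW test AS
    SELECT a, b FROM test2
    UNION ALL
    SELECT b, a FROM test2;

CREATE FUNCTION test_ins() RETURNS trigger AS $$
BEGIN
    -- normalize to (smaller, bigger) order before storing
    INSERT INTO test2 (a, b)
        VALUES (least(NEW.a, NEW.b), greatest(NEW.a, NEW.b));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER test_ins INSTEAD OF INSERT ON test
    FOR EACH ROW EXECUTE PROCEDURE test_ins();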
Makes one wonder why both entries are stored, instead of storing one of them and generating the second on the fly.
Anyway, that’s still a neat trick.
@Andreas:
it’s an interesting idea, but then when searching for rows where some given “id” is in any of the fields – it becomes complicated.
@Rod Taylor:
these are connections between nodes in some kind of graph (not directional).
having both of the rows in database simplifies (a lot) searches.
@depesz: Re: searching for rows where some given “id” is in any of the fields – it becomes complicated
No, you would just use the view, which makes it look like both rows are stored.
@Andreas:
complicated not as “for human” but “for computer”.
it becomes *two* index scans instead of *one*
I like
select a,b from test
except
select b,a from test
Anyway: is your solution really faster? I think that for large tables such a group by must be building some kind of index for easy locating of the groups, which would be slower.