this questions pops every once in a while on irc.
some guy has a table, and it contains duplicated rows.
basically there are 2 possible scenarios:
- all columns are duplicated
- only some columns are duplicated
so, how to remove duplicates from such tables?
let's check what to do when all columns are duplicated, like in here:
# SELECT * FROM test; some_text | some_int -----------+---------- a | 1 a | 1 b | 1 b | 2 b | 2 (5 ROWS)
the simplest approach to fix the table is to:
# CREATE temp TABLE temp_test AS SELECT DISTINCT * FROM test; TRUNCATE test; INSERT INTO test SELECT * FROM temp_test; DROP TABLE temp_test;
after this – contents of the table looks like this:
# SELECT * FROM test; some_text | some_int -----------+---------- a | 1 b | 1 b | 2 (3 ROWS)
but what if you don't want to copy values between tables?
it's also quite simple – we will simply add new column, enter some unique data there, and then use it to differentiate the rows:
# CREATE temp SEQUENCE temp_seq; CREATE SEQUENCE # ALTER TABLE test ADD COLUMN unique_id int4; ALTER TABLE # UPDATE test SET unique_id = NEXTVAL('temp_seq'); UPDATE 5 # DELETE FROM test AS t1 WHERE EXISTS (SELECT * FROM test t2 WHERE (t2.some_text, t2.some_int) = (t1.some_text, t1.some_int) AND t2.unique_id > t1.unique_id); DELETE 2 # ALTER TABLE test DROP COLUMN unique_id; ALTER TABLE
after this it is good to do vacuum of the table.
also – if you dont have index on (some_text, some_int) – it will be good to add such a index to speed up the removal.
and what in case not all of the columns are duplicated? like in this case:
# SELECT * FROM test; id | some_text | some_int ----+-----------+---------- 2 | a | 1 3 | a | 1 4 | b | 1 5 | b | 2 6 | b | 2 (5 ROWS)
it's simple – use the trick shown above, but this time you don't have to add new column – you already have it.
interesting thing is when you consider that only some_text should be unique. and if many records have the same some_text – we should keep the one with lowest some_int. and if more then one have the same some_int – keep the newest (with highest id).
how to do it?
simple:
# DELETE FROM test AS t1 WHERE EXISTS (SELECT * FROM test t2 WHERE t2.some_text = t1.some_text AND (t2.some_int < t1.some_int OR (t2.some_int = t1.some_int AND t2.id > t1.id))); DELETE 3 # SELECT * FROM test; id | some_text | some_int ----+-----------+---------- 3 | a | 1 4 | b | 1 (2 ROWS)
Instead of using a Sequence, is it possible to use the CTID to distinguish between two row?
Use ctid
http://www.pgsql.cz/index.php/SQL_Triky#Smaz.C3.A1n.C3.AD_duplicitn.C3.ADch_.C5.99.C3.A1dek.2C
@Richard Broersma Jr.:
it is of course possible (like Pavel showed), but i’m not really enthusiastic about using system-data in standard (ekhem) sql queries.
the idea and usage is great. i just have “moral” issues with it 🙂
@depesz
Your solution is corelated subquery :-(.
@Pavel Stehule:
why “:-(“? something is wrong with them?
@depesc, ofcourse .. on bigger tables can be slow, corelated subqueries are efective in 10% very specific cases and it’s better doesn’t use it
What research book would you suggest?