For example, assume we have the following table named duplicate_rows:
id | name | sport
----+------+---------------
1 | Matt | Skiing
2 | Nate | Rock Climbing
1 | Matt | Skiing
You can see there are 2 entries for Matt. Sure it would be a lot easier to just drop the table and recreate it, but that would defeat the purpose of the example. Here are some of the more common ways to delete the duplicate row.
Adding a Column and Populating it with a Sequence
It is very easy to add a column to a table and can be done with the following command:ALTER TABLE duplicate_rows ADD COLUMN new_pk serial;
Serial is a special data type that gets its value from a sequence. The nice think about it is that it will automatically populate the rows of the table so you don't have to do anything else. Querying the table using
SELECT * FROM duplicate_rows
returns the following results:
id | name | sport | new_pk
----+------+---------------+--------
1 | Matt | Skiing | 1
2 | Nate | Rock Climbing | 2
1 | Matt | Skiing | 3
You can then use the following DELETE statement to remove the duplicate row:
DELETE FROM duplicate_rows WHERE new_pk = 3;
Your table will not contain the correct data. You can drop the column using:
ALTER TABLE duplicate_rows DROP COLUMN new_pk;
In addition to removing the column, the sequence automatically created for the serial data type will also be dropped. Unfortunately there is a side-effect from adding a column and then dropping it. PostgreSQL never really drops the column, it just hides it from you. For those concerned about hidden columns being left around, this solution ends up being a little messy. There must be a better way.
Using One of PostgreSQL's Hidden Columns
PostgreSQL tables contain a number of hidden columns that you can use to help delete duplicate rows. Some suggest using the object ID column or OID but not all tables are created with the OID column. My preference is to use the current tuple ID or CTID column combined with the EXISTS statement as in the following:DELETE FROM duplicate_rows AS d
WHERE EXISTS (SELECT 'x',
FROM duplicate_rows AS s
WHERE s.id = d.id
AND s.name = d.name
AND s.sport = d.sport
AND s.ctid < d.ctid);
The SQL command references itself twice and so it is helpful to use a table alias which appears after AS. The DELETE command aliases the table using the letter "d" while the SELECT command uses the letter "s." The command goes through each row in the duplicate_rows table and runs a query to see if there is another row in the table where one current tuple ID is less than the other.
WARNING: If you make the mistake of using the not equal operator (!= or <>), you will delete both copies of the row and that is not what you want.
The downside to this method is that it can be rather long for a table with a lot of rows as you must compare each column with itself. However it does have the advantage of not creating unneeded rows that become hidden from ou.
No comments:
Post a Comment