Thursday, February 6, 2014

Deleting Duplicate Rows

A good database developer will never create a situation where two rows are exactly alike in a table. Everyone knows that it is important to have a primary key that uniquely defines each row in the table. Sometimes it is an artificial key like a unique integer created by a sequence while other times it a natural key like a social security number combined with the person's name (note: social security numbers do not uniquely identify individuals without a name). Yet somehow someone will be testing something and a duplicate row will appear. Perhaps the primary key gets dropped or one was never created and you all of a sudden discover a duplicate row or two.

For example, assume we have the following table named duplicate_rows:

 id | name |     sport
----+------+---------------
  1 | Matt | Skiing
  2 | Nate | Rock Climbing
  1 | Matt | Skiing

You can see there are 2 entries for Matt. Sure it would be a lot easier to just drop the table and recreate it, but that would defeat the purpose of the example. Here are some of the more common ways to delete the duplicate row.

Adding a Column and Populating it with a Sequence

It is very easy to add a column to a table and can be done with the following command:

ALTER TABLE duplicate_rows ADD COLUMN new_pk serial;

Serial is a special data type that gets its value from a sequence. The nice think about it is that it will automatically populate the rows of the table so you don't have to do anything else. Querying the table using

SELECT * FROM duplicate_rows

returns the following results:

 id | name |     sport     | new_pk
----+------+---------------+--------
  1 | Matt | Skiing        |    1
  2 | Nate | Rock Climbing |    2
  1 | Matt | Skiing        |    3

You can then use the following DELETE statement to remove the duplicate row:

DELETE FROM duplicate_rows WHERE new_pk = 3;

Your table will not contain the correct data. You can drop the column using:

ALTER TABLE duplicate_rows DROP COLUMN new_pk;

In addition to removing the column, the sequence automatically created for the serial data type will also be dropped. Unfortunately there is a side-effect from adding a column and then dropping it. PostgreSQL never really drops the column, it just hides it from you. For those concerned about hidden columns being left around, this solution ends up being a little messy. There must be a better way.

Using One of PostgreSQL's Hidden Columns

PostgreSQL tables contain a number of hidden columns that you can use to help delete duplicate rows. Some suggest using the object ID column or OID but not all tables are created with the OID column. My preference is to use the current tuple ID or CTID column combined with the EXISTS statement as in the following:

DELETE FROM duplicate_rows AS d
      WHERE EXISTS (SELECT 'x',
                      FROM duplicate_rows AS s
                     WHERE s.id    = d.id
                       AND s.name  = d.name
                       AND s.sport = d.sport
                       AND s.ctid  < d.ctid);

The SQL command references itself twice and so it is helpful to use a table alias which appears after AS. The DELETE command aliases the table using the letter "d" while the SELECT command uses the letter "s." The command goes through each row in the duplicate_rows table and runs a query to see if there is another row in the table where one current tuple ID is less than the other.

WARNING: If you make the mistake of using the not equal operator (!= or <>), you will delete both copies of the row and that is not what you want.

The downside to this method is that it can be rather long for a table with a lot of rows as you must compare each column with itself. However it does have the advantage of not creating unneeded rows that become hidden from ou.

No comments:

Post a Comment