Aaron Akin

February 19, 2009

Deleting Duplicate Records

Filed under: TSQL — Aaron Akin @ 6:42 pm

I was helping someone out today on Experts-Exchange who had a quite a few duplicate records in a table that he wanted to get rid of.  Here’s an example of the data.

SELECT * FROM People

GO
LastName FirstName

-------- ---------
Smith    John
Smith    John
Smith    John
Jackson  Tom
Jones    Sarah
Jones    Sarah
Williams Jessica

As you can see, there is nothing in this table that makes each row unique, and therefore you cannot delete one of the records for John Smith without deleting both records.

The easiest way to delete these duplicates is to add a new IDENTITY column to the table.

ALTER TABLE People ADD ID INT IDENTITY(1,1)

GO

SELECT * FROM People
GO
ID          LastName FirstName

----------- -------- ---------
1           Smith    John
2           Smith    John
3           Smith    John
4           Jackson  Tom
5           Jones    Sarah
6           Jones    Sarah
7           Williams Jessica

This will provide us with a unique value across all records, which we can use to find and delete the duplicate rows.

DELETE FROM People WHERE ID NOT IN

  (
  --Return only the ID for records we want to keep
  SELECT ID
    FROM
         (
         --Return one ID for each person
         SELECT MIN(ID) AS ID, LastName, FirstName
           FROM People
           GROUP BY LastName, FirstName
         ) a
  )
GO

SELECT * FROM People
GO
ID          LastName FirstName

----------- -------- ---------
1           Smith    John
4           Jackson  Tom
5           Jones    Sarah
7           Williams Jessica

We are now left with only one record for each person.  You can now drop the ID column from the table.

ALTER TABLE People DROP COLUMN ID

GO

-A

Advertisements

4 Comments »

  1. One of my friends over at http://www.sqlservernation.com, posted another way to delete duplicate records that’s actually much easier if you’re using SQL 2005 or 2008. Instead of having to make changes to the structure of the table, he used a Common Table Expression (CTE) to delete the duplicate records. http://www.sqlservernation.com/blogs/tipweek/archive/2009/02/23/how-to-delete-duplicate-values.aspx

    Here’s an example of using a CTE with the data used in this post. This is for SQL 2005 or 2008 only since CTEs and ranking functions are new to SQL 2005.


    ;WITH CTE_People (FirstName, LastName, Ranking)
    AS
    (
    SELECT FirstName, LastName, DENSE_RANK() OVER(PARTITION BY FirstName, LastName ORDER BY NEWID() ASC)
    FROM People
    )
    DELETE FROM CTE_People
    WHERE Ranking > 1

    Comment by Aaron Akin — February 26, 2009 @ 9:02 pm | Reply

  2. I like. It’s one of those simple things that us novices don’t always think about.

    Comment by Chris — March 9, 2009 @ 7:35 am | Reply

  3. Dude. Stop giving the answers to my interview questions. 😉

    Comment by Erik Klein — March 8, 2010 @ 5:36 am | Reply

    • It sounds like your candidates could use some help! Hope things are going well at GS.

      Comment by Aaron Akin — March 8, 2010 @ 8:28 am | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: