Deleting Duplicate Records

By Bill Graziano on 26 March 2001 | Tags: DELETEs


Seema writes "There is a Table with no key constraints. It has duplicate records. The duplicate records have to be deleted (eg there are 3 similar records, only 2 have to be deleted). I need a single SQL query for this." This is a pretty common question so I thought I'd provide some options.

First, I'll need some duplicates to work with. I use this script to create a table called dup_authors in the pubs database. It selects a subset of the columns and creates some duplicate records. At the end it runs a SELECT statement to identify the duplicate records:
select au_lname, au_fname, city, state, count(*)
from dup_authors
group by au_lname, au_fname, city, state
having count(*) > 1
order by count(*) desc, au_lname, au_fname

The easiest way I know of to identify duplicates is to do a GROUP BY on all the columns in the table. It can get a little cumbersome if you have a large table. My duplicates look something like this:

au_lname        au_fname   city                 state             
--------------- ---------- -------------------- ----- ----------- 
Smith           Meander    Lawrence             KS    3
Bennet          Abraham    Berkeley             CA    2
Carson          Cheryl     Berkeley             CA    2

except there are thirteen additional duplicates identified.

Second, backup your database. Third, make sure you have a good backup of your database.

Temp Table and Truncate

The simplest way to eliminate the duplicate records is to SELECT DISTINCT into a temporary table, truncate the original table and SELECT the records back into the original table. That query looks like this:

select distinct *
into #holding
from dup_authors

truncate table dup_authors

insert dup_authors
select *
from #holding

drop table #holding

If this is a large table, it can quickly fill up your tempdb. This also isn't very fast. It makes a copy of your data and then makes another copy of your data. Also while this script is running, your data is unavailable. It may not be the best solution but it certainly works.

Rename and Copy Back

The second option is to rename the original table to something else, and copy the unique records into the original table. That looks like this:
sp_rename 'dup_authors', 'temp_dup_authors'

select distinct *
into dup_authors
from temp_dup_authors

drop table temp_dup_authors

This has a couple of benefits over the first option. It doesn't use tempdb and it only makes one copy of the data. On the downside, you'll need to rebuild any indexes or constraints on the table when you're done. This one also makes the data unavailable during the process.

Create a Primary Key

Our last option is more complex. It has the benefit of not making a copy of the data and only deleting the records that are duplicates. It's main drawback is that we have to alter the original table and add a sequential record number field to uniquely identify each record. That script looks like this:

-- Add a new column 
-- In real life I'd put an index on it
Alter table dup_authors add NewPK int NULL
go

-- populate the new Primary Key
declare @intCounter int
set @intCounter = 0
update dup_authors
SET @intCounter = NewPK = @intCounter + 1

-- ID the records to delete and get one primary key value also
-- We'll delete all but this primary key
select au_lname, au_fname, city, state, RecCount=count(*), PktoKeep = max(NewPK)
into #dupes
from dup_authors
group by au_lname, au_fname, city, state
having count(*) > 1
order by count(*) desc, au_lname, au_fname

-- delete dupes except one Primary key for each dup record
delete	dup_authors
from	dup_authors a join #dupes d
on	d.au_lname = a.au_lname
and	d.au_fname = a.au_fname
and	d.city = a.city
and	d.state = a.state
where	a.NewPK not in (select PKtoKeep from #dupes)

-- remove the NewPK column
ALTER TABLE dup_authors DROP COLUMN NewPK
go

drop table #dupes

It's actually possible to combine the SELECT INTO #dupes and the DELETE into one DELETE statement. My script is easier to read and understand and shouldn't be much slower. This will run a single delete statement against your table and only remove the duplicate records. If you have a large table the join statement can get kind of large. I guess that's the price you pay for letting duplicates into your database.

We have whole section on Database Design. If you find yourself deleting duplicates I'd take a look at some of the articles. I'd also think about creating some Primary Keys that enforce uniqueness.


- Advertisement -