Deduping very large table - SQL Server Forums

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2005 Forums

Transact-SQL (2005)

Deduping very large table

Author

Topic

shawnafisher
Starting Member

5 Posts

Posted - 2009-04-21 : 17:48:31

Good afternoon all,

I am working on a script that will remove duplicate records from a very large table (40 mil + records). It is taking forever to accomplish, and I have two questions:
1. Does anyone know of a better way than my query below to accomplish this?
2. Could I add indexes to my _DeDup table to help?

Any guidance would be appreciated!


------------------------------------------------------------------------------------------
--DEDUP tDuplicatesExist
------------------------------------------------------------------------------------------
--Back up the data
Select * into MyDatabase.dbo.tDuplicatesExist_090421 from tDuplicatesExist<br/>
--get data to dedup
Select * into tDuplicatesExist_dedup from tDuplicatesExist--Truncate the original table to insert distinct records back in 
Truncate table tDuplicatesExist  SET IDENTITY_INSERT MyDatabase.dbo.tDuplicatesExist ON --Keep orig PK/RecordNumber

Insert tDuplicatesExist
Select     tDuplicatesExist_DeDup.RecordNumber,  --Pk/Unique Identifier
    tDuplicatesExist_DeDup.EmployeeID,
    tDuplicatesExist_DeDup.AccountID,
    tDuplicatesExist_DeDup.TransactionType,
    tDuplicatesExist_DeDup.TransactionData,    tDuplicatesExist_DeDup.TransactionDate		
from tDuplicatesExist
JOIN
(   Select         distinct EmployeeID,
        AccountID,
        TransactionType,
        convert(datetime,convert(varchar(10),TransactionDate,101))TDate,
        Max(RecordNumber) RecordNumber
    From tDuplicatesExist_DeDup d
    Group by 
        EmployeeID,        AccountID,        TransactionType,
        convert(datetime,convert(varchar(10),TransactionDate,101)
) as A
on tDuplicatesExist_DeDup.RecordNumber = A.RecordNumber

SET IDENTITY_INSERT  MyDatabase.dbo.tDuplicatesExist OFF

sodeep
Master Smack Fu Yak Hacker

7174 Posts

Posted - 2009-04-21 : 17:54:00

Can you delete in batches ?

Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2009-04-22 : 04:51:23

I'd be interested to know if this:

(a) works
(b) performs


CREATE TABLE #delList (
	[recordNumber] INT PRIMARY KEY
	)

INSERT INTO #delList ([recordNumber])
SELECT
	dups.[rn]
FROM
	
	(
	SELECT
		[pos] = ROW_NUMBER() OVER(
			PARTITION BY
				[EmployeeID]
				, [AccountID]
				, [TransactionType]
				, [TransactionData]
			ORDER BY
				[RecordNumber])
		, [rn] = [recordNumber]
	FROM
		tDuplicatesExist
	)
	dups
WHERE
	dups.[pos] > 1
GO

DECLARE @rowcount INT
SET @rowcount = 1

WHILE @rowcount > 0 BEGIN
	DELETE TOP (50) tde
	FROM
		tDuplicatesExist tde
		JOIN #delList dl ON dl.[recordNumber] = tde.[recordNumber]
	
	SET @rowcount = @@ROWCOUNT
END

Charlie
===============================================================
Msg 3903, Level 16, State 1, Line 1736
The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2009-04-22 : 04:54:14

First we must now the ratio of the deleted records vs total numbers of records.
If the ratio is high, it's better to create a new table and only insert the valid records, drop the original table and then rename the new table to the name of old table.

E 12°55'05.63"
N 56°04'39.26"

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources