Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 General SQL Server Forums
 New to SQL Server Programming
 Identify Petential duplicate

Author  Topic 

shaggy
Posting Yak Master

248 Posts

Posted - 2015-04-22 : 09:33:56
Hi friends,

There is one report to identify potential duplicate in a table and it is performing poor.I'm now tuning the existing SP and got struck in modifiying it. Can anyone pls guide me to rewrite the query in a best way. I just pasted below an example of query which is now in a report.

The report will be run every week currently the table has 10 million records, and every week there will 5k to 10k will be added up so with that 5k to 10 k we have to check all the 10 miilion rows that if it is duplciated the logic is (surname = surmane or forename = forename or DOB =DOB )

Create table #employee
(
ID int,
empid varchar(100),
surname varchar(100),
forename varchar(100),
DOB datetime,
empregistereddate datetime,
Createdate datetime
)

Insert #employee
select 1,'EMP1','Joe','Bonnie','1980-04-22 18:24:39.397','2014-08-11 18:33:00.920',getdate()
union
select 2,'EMP2','Joe','Corey','1984-04-12 18:24:39.397','2014-07-12 18:33:00.920',getdate()
union
select 3,'EMP3','Chase','Corner','1985-01-01 18:24:39.397','2014-06-13 18:33:00.920',getdate()
union
select 4,'EMP4','Yang','Corner','1980-03-25 18:24:39.397','2014-05-14 18:33:00.920',getdate()
union
select 5,'EMP5','McDowell','Julia','1990-08-22 18:24:39.397','2014-04-15 18:33:00.920',getdate()
union
select 6,'EMP6','Norton','Julia','1991-01-22 18:24:39.397','2014-03-16 18:33:00.920',getdate()
union
select 7,'EMP7','Mayo','Julia','2015-04-22 18:24:39.397','2014-02-17 18:33:00.920',getdate()
union
select 8,'EMP8','David','Allen','2015-04-22 18:24:39.397','2014-01-18 18:33:00.920',getdate()
union
select 9,'EMP9','David','Bonnie','1991-01-22 18:24:39.397','2015-01-01 18:33:00.920',getdate()
union
select 10,'EMP10','Monroe','Bonnie','1991-01-22 18:24:39.397','2015-04-22 18:33:00.920',getdate()



select a.ID,DUP.id DUPid,a.empid,DUP.empid DUPempid,a.surname,DUP.surname DUPsurname,a.forename,DUP.forename DUPforename,a.DOB,DUP.DOB DUPdob,a.empregistereddate,DUP.empregistereddate DUPempregistereddate,a.Createdate,DUP.Createdate DUPCreatedate
FROM
#employee a
INNER JOIN #employee DUP
ON (a.empregistereddate between '2015-01-01 00:33:00.920' and '2015-05-01 18:33:00.920')
AND ( (a.SURNAME = DUP.SURNAME )
OR (a.FORENAME = DUP.FORENAME )
OR a.DOB = DUP.DOB
)
WHERE a.id <> DUP.id

gbritton
Master Smack Fu Yak Hacker

2780 Posts

Posted - 2015-04-22 : 10:31:44
Are the join and where predicates operating on indexed columns?
Go to Top of Page

Kristen
Test

22859 Posts

Posted - 2015-04-22 : 11:21:25
I suspect that it is the OR (in the JOIN) which is making it slow.

It might be faster (assuming columns in the JOIN predicates are indexed) to have three separate queries and UNION them together.

select a.ID,DUP.id DUPid,a.empid,DUP.empid DUPempid,a.surname,DUP.surname DUPsurname,
a.forename,DUP.forename DUPforename,a.DOB,DUP.DOB DUPdob,a.empregistereddate,DUP.empregistereddate DUPempregistereddate,
a.Createdate,DUP.Createdate DUPCreatedate
FROM
(
SELECT [A_ID] = A.ID, [DUP_ID] = DUP.ID
FROM #employee AS A
JOIN #employee AS DUP
ON DUP.id <> A.id
AND DUP.SURNAME = a.SURNAME
WHERE A.empregistereddate between '2015-01-01 00:33:00.920' and '2015-05-01 18:33:00.920'
UNION
SELECT [A_ID] = A.ID, [DUP_ID] = DUP.ID
FROM #employee AS A
JOIN #employee AS DUP
ON DUP.id <> A.id
AND a.FORENAME = DUP.FORENAME
WHERE A.empregistereddate between '2015-01-01 00:33:00.920' and '2015-05-01 18:33:00.920'
UNION
SELECT [A_ID] = A.ID, [DUP_ID] = DUP.ID
FROM #employee AS A
JOIN #employee AS DUP
ON DUP.id <> A.id
AND a.DOB = DUP.DOB
WHERE A.empregistereddate between '2015-01-01 00:33:00.920' and '2015-05-01 18:33:00.920'
) AS X
JOIN #employee AS A
ON A.id = X.A_ID
JOIN #employee AS DUP
ON DUP.id = X.DUP_ID

On the tests I have run, on your sample data but with appropriate indexes, your code runs better than mine ... but on 10M rows it might be different.

I would be interested in the Logical Reads / Scans - e.g. from this:

SET STATISTICS IO ON

... your actual query here ...

SET STATISTICS IO OFF
Go to Top of Page

Kristen
Test

22859 Posts

Posted - 2015-04-22 : 11:39:53
Yeah, I added 8,000 rows to the #employee (old with empregistereddate before the test period) and my query's logical reads was the same, whereas your query's jumped massively ...
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2015-04-22 : 14:44:59
[code]-- SwePeso
SELECT a.ID,
dup.ID AS DUPid,
a.EmpID,
DUP.EmpID AS DUPempid,
a.Surname,
dup.Surname AS DUPsurname,
a.Forename,
dup.Forename AS DUPforename,
a.DOB,
dup.DOB AS DUPdob,
a.EmpRegisteredDate,
dup.EmpRegisteredDate AS DUPempregistereddate,
a.CreateDate,
dup.CreateDate AS DUPCreatedate
FROM #Employee AS a
INNER JOIN #Employee AS dup ON dup.ID > a.ID
WHERE a.EmpRegisteredDate BETWEEN '20150101 00:33:00.920' AND '20150501 18:33:00.920'
AND (a.Surname = dup.Surname OR a.Forename = dup.Forename OR a.DOB = dup.DOB);[/code]Make sure there is an clustered index on ID column.


Microsoft SQL Server MVP, MCT, MCSE, MCSA, MCP, MCITP, MCTS, MCDBA
Go to Top of Page

shaggy
Posting Yak Master

248 Posts

Posted - 2015-04-23 : 01:08:25
Sorry for delay in reply.

gbritton -Yes the Surname,forename,DOB,ID all are indexed seperately.

Kristen - Yeh the OR clause is the culprit,I have tried your query and it got executed in 2 mins, where the original query will run for half a day.

SwePeso - There is no clustered index on this table and your query is crossing the threshold duration.



Go to Top of Page

Kristen
Test

22859 Posts

Posted - 2015-04-23 : 03:00:01
quote:
Originally posted by shaggy

There is no clustered index on this table


You ought to have a clustered index on every table (it helps with some housekeeping tasks). Doesn't have to be the Primary Key, although that is usually the best candidate e.g. for tables involved in JOINs
Go to Top of Page

shaggy
Posting Yak Master

248 Posts

Posted - 2015-04-23 : 09:37:08
Thanks Kristen,

I understand that but now we cannot able to change it overnight but thanks for pointing it.
Go to Top of Page
   

- Advertisement -