deduping 20 files in certain order

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

General SQL Server Forums

New to SQL Server Programming

deduping 20 files in certain order

Author

Topic

tryharder
Starting Member

17 Posts

Posted - 2013-09-30 : 14:29:09

Hi,

I have 20 files to dedupe in a certain order, I was going to build a match key based on surname and 6 characters of address line 1but does anyone have a script to run the dedupe based on the match key and by a unique keycode for each file?

Thanks in advance
Sean

SP

James K
Master Smack Fu Yak Hacker

3873 Posts

Posted - 2013-09-30 : 14:40:42

Like shown below. Didn't get what you meant by "unique keycode". Is that another column in the table that you want to order by to determine which rows to delete?

;WITH cte AS
(
	SELECT ROW_NUMBER() OVER
	(
		PARTITION BY surname + LEFT(addressline1,6)
		ORDER BY (SELECT NULL) -- or some other condition
	) AS RN
	FROM YourTable
)
DELETE FROM cte WHERE RN > 1;

tryharder
Starting Member

17 Posts

Posted - 2013-09-30 : 15:14:53

The keycode is just a way to identify which file that record came from. I'm not the best at explaining so bear with me pls

I have 20 files which are standardised and given a keycodebto identify which file the record comes from, the files are then merged and dedupe is carried starting at file 20 which is deduped against the 19 other files then file 19 is deduped against the other 18 files ect until you end up with a file full of unique records but I also need to know how many records where lost from each of the sub file ( based in keycode) and to which other file so i effectively get a matrix which shows the losses against each file.
Hope this makes sense, if not bear with me!! :-)

SP

James K
Master Smack Fu Yak Hacker

3873 Posts

Posted - 2013-09-30 : 15:52:13

Here is a way to do what you described - it is sort of a bruteforce approch, but the logic should be easy to follow. And, by the way, your explanation makes perfect sense :) Instead of the delete in each query, uncomment the select line and run that to see the count of duplicates.

-- Starting with file #20;
;WITH cte AS
(
	surname + LEFT(addressline1,6) AS DuplicateCriterion,
	SELECT ROW_NUMBER() OVER
	(
		PARTITION BY surname + LEFT(addressline1,6)
		ORDER BY (SELECT NULL) -- or some other condition
	) AS RN
	FROM YourTable
	WHERE keycode = 20;
)
--SELECT DuplicateCriterion,COUNT(*) AS Records FROM cte
--HAVING COUNT(*) > 1
DELETE FROM cte WHERE RN > 1;

-- For File 20 and File 19;
;WITH cte AS
(
	surname + LEFT(addressline1,6) AS DuplicateCriterion,
	SELECT ROW_NUMBER() OVER
	(
		PARTITION BY surname + LEFT(addressline1,6)
		ORDER BY (SELECT NULL) -- or some other condition
	) AS RN
	FROM YourTable
	WHERE keycode IN (19,20)
)
--SELECT DuplicateCriterion,COUNT(*) AS Records FROM cte
--HAVING COUNT(*) > 1
DELETE FROM cte WHERE RN > 1;

-- repeat for the remaining by changing the where clause to
-- WHERE keycode IN (18,19,20)
-- WHERE keycode IN (17,18,19,20)
-- etc.

tryharder
Starting Member

17 Posts

Posted - 2013-09-30 : 16:06:38

Many thanks for your help will give it a go tomorrow.

SP

tryharder
Starting Member

17 Posts

Posted - 2013-10-01 : 16:30:55

Many thanks for your help will give it a go tomorrow.

SP

waterduck
Aged Yak Warrior

982 Posts

Posted - 2013-10-01 : 21:27:50

shouldn't this enuf?


;WITH cte AS
(
	surname + LEFT(addressline1,6) AS DuplicateCriterion,
	SELECT ROW_NUMBER() OVER
	(
		PARTITION BY surname + LEFT(addressline1,6)
		ORDER BY keycode DESC-- or some other condition
	) AS RN
	FROM YourTable
)
DELETE FROM cte WHERE RN > 1;

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources