Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2005 Forums
 Transact-SQL (2005)
 Eliminating Duplicate Records

Author  Topic 

mike.bull
Starting Member

9 Posts

Posted - 2008-05-13 : 09:42:37
Hey There.

I'm in the process of doing a major data clean up and I'm just wondering how I would go about eliminating some redundant data.

The Table Layout

Contracts

CNTRID CONTRACTNUM STARTDATE CUSTOMNUM
=======================================================
0 1234567 091885 A
1 1234567 091885 A
2 1111111 111111 B
3 1234567 081205 A


Equipment

EQUIPID DEVICENAME CNTRID CUSTOMNUM
=======================================================
0 DEVICE1 0 A
1 DEVICE2 2 B
2 DEVICE3 1 A
3 DEVICE4 3 A


You will notice that each customer may have multiple devices. Each device may be tied to a contract, and each contract may have one or more devices tied to it.

In the example above, you will notice in the contracts table the contracts with the IDs 0 and 1.

Fig 1.

CNTRID CONTRACTNUM STARTDATE CUSTOMNUM
=======================================================
0 1234567 091885 A
1 1234567 091885 A


These contracts have the exact same information.

Furthermore, if you look down the table you will notice the contract with the ID 3.

Fig 2.

CNTRID CONTRACTNUM STARTDATE CUSTOMNUM
=======================================================
3 1234567 081205 A

This contract shares the same contract and customer number, but has a different start date.


Now lets take a look devices in the equipment table that refer to these records.

EQUIPID DEVICENAME CNTRID CUSTOMNUM
=======================================================
0 DEVICE1 0 A
2 DEVICE3 1 A
3 DEVICE4 3 A

You will notice that DEVICE1 and DEVICE 3 refer to the contract records that contain identical data. (As shown in 'Fig 1')

My question is as follows:

How do I eliminate the any duplicate records from the contracts table, and update the records in the equipment table with id of the left over contract.

Results Should be as follows:

Contracts

CNTRID CONTRACTNUM STARTDATE CUSTOMNUM
=======================================================
0 1234567 091885 A
2 1111111 111111 B
3 1234567 081205 A


Equipment

EQUIPID DEVICENAME CNTRID CUSTOMNUM
=======================================================
0 DEVICE1 0 A
1 DEVICE2 2 B
2 DEVICE3 0 A
3 DEVICE4 3 A


Any help you may provide would be greatly appreciated!

Thanks
--mike

cat_jesus
Aged Yak Warrior

547 Posts

Posted - 2008-05-13 : 09:51:28
This topic has been covered many times before. You can do a search on it, but I'll give you a quick answer. Do a distinct insert into a temp table of all your records, truncate your source table and then insert back to the source from the temp.

Be sure to test it first.



An infinite universe is the ultimate cartesian product.
Go to Top of Page

visakh16
Very Important crosS Applying yaK Herder

52326 Posts

Posted - 2008-05-13 : 10:49:58
This will do update:-
;
With CTE(RowNo,CNTRID,CONTRACTNUM,STARTDATE,CUSTOMNUM) AS
(
SELECT ROW_NUMBER() OVER (PARTITION BY CONTRACTNUM,STARTDATE,CUSTOMNUM ORDER BY CNTRID) AS RowNo,
*
FROM Contracts
)


UPDATE e
SET e.CNTRID=c2.CNTRID
FROM Equipment e
INNER JOIN CTE c1
ON c.CNTRID =e.CNTRID
AND c.CUSTOMNUM=e.CUSTOMNUM
INNER JOIN CTE c2
ON c2.CONTRACTNUM=c1.CONTRACTNUM
AND c2.STARTDATE=c1.STARTDATE
AND c2.CUSTOMNUM=c1.CUSTOMNUM
AND c2.RowNo=1
AND c1.RowNo>1



and this will delete the duplicates:-
DELETE t
FROM
(SELECT ROW_NUMBER() OVER (PARTITION BY CONTRACTNUM,STARTDATE,CUSTOMNUM ORDER BY CNTRID) AS RowNo,
*
FROM Contracts)t
WHERE t.RowNo >1
Go to Top of Page

mike.bull
Starting Member

9 Posts

Posted - 2008-05-13 : 12:24:22
Thanks a lot. Seem like it will work great.
Go to Top of Page

mike.bull
Starting Member

9 Posts

Posted - 2008-05-13 : 14:42:34
Hey visakh16,

I implemented this query (filling in the fields that i left out of the example). Before i ran the query I did a select distinct of all the fields except the identity field. The result was 300+ records. However, after i ran the query above, I'm left with less then 200 records. Should I not have the same number of rows as the select distinct?


mike
Go to Top of Page

visakh16
Very Important crosS Applying yaK Herder

52326 Posts

Posted - 2008-05-13 : 14:47:10
quote:
Originally posted by mike.bull

Hey visakh16,

I implemented this query (filling in the fields that i left out of the example). Before i ran the query I did a select distinct of all the fields except the identity field. The result was 300+ records. However, after i ran the query above, I'm left with less then 200 records. Should I not have the same number of rows as the select distinct?


mike


Cant say they will be same. Depends on what the values of filled in fields were. If some of those fields had distinct values the count can be different.Are you sure you wont get duplicate values for other columns for unique CONTRACTNUM,STARTDATE,CUSTOMNUM combination?
Go to Top of Page

mike.bull
Starting Member

9 Posts

Posted - 2008-05-13 : 14:59:47
Hey,
I outputted the results before and after to a CSV.

Before:

A0006,2419653,,20040425,,NO,NO,,,NULL,NULL
A0006,C00100,,20031001,,NO,NO,,,NULL,NULL
A0203,3487684,NULL,,,8x5x4,,,,,
A0203,3487684,,20070420,,,,,,NULL,NULL
B0002,2002184,NULL,,,YES,,,,,

After:
A0006,2419653,,20040425,,NO,NO,,,NULL,NULL
A0006,C00100,,20031001,,NO,NO,,,NULL,NULL
A0203,3487684,,20070420,,,,,,NULL,NULL
B0002,2002184,NULL,,,YES,,,,,


The first 5 results show a missing record for A0203.

Do you think null values messing up the inner join?

mike
Go to Top of Page

visakh16
Very Important crosS Applying yaK Herder

52326 Posts

Posted - 2008-05-13 : 15:01:55
quote:
Originally posted by mike.bull

Hey,
I outputted the results before and after to a CSV.

Before:

A0006,2419653,,20040425,,NO,NO,,,NULL,NULL
A0006,C00100,,20031001,,NO,NO,,,NULL,NULL
A0203,3487684,NULL,,,8x5x4,,,,,
A0203,3487684,,20070420,,,,,,NULL,NULL
B0002,2002184,NULL,,,YES,,,,,

After:
A0006,2419653,,20040425,,NO,NO,,,NULL,NULL
A0006,C00100,,20031001,,NO,NO,,,NULL,NULL
A0203,3487684,,20070420,,,,,,NULL,NULL
B0002,2002184,NULL,,,YES,,,,,


The first 5 results show a missing record for A0203.

Do you think null values messing up the inner join?

mike



With the limited info you provuide, i'm unable to tell. Can you just post the values with table columns names of these 5 records before and after?
Go to Top of Page

mike.bull
Starting Member

9 Posts

Posted - 2008-05-13 : 15:42:19
[code]
BEFORE
|CUSTOMNUM | CONTRACT | SMARTNET | SMARTNET_STOP_DATE | SMARTNET_START_DATE | ONSITE | PREMIUM_ONSITE | ONSITE_START_DATE | ONSITE_STOP_DATE | ONSITE_RENEW_DATE | NOTE

A0006|2419653| |20040425| |NO |NO| | |NULL|NULL
A0006|C00100 | |20031001| |NO |NO| | |NULL|NULL
A0203|3487684|NULL| | |8x5x4| | | | |
A0203|3487684| |20070420| | | | | |NULL|NULL
B0002|2002184|NULL| | |YES | | | | |

AFTER
|CUSTOMNUM | CONTRACT | SMARTNET | SMARTNET_STOP_DATE | SMARTNET_START_DATE | ONSITE | PREMIUM_ONSITE | ONSITE_START_DATE | ONSITE_STOP_DATE | ONSITE_RENEW_DATE | NOTE

A0006|2419653| |20040425| |NO |NO| | |NULL|NULL
A0006|C00100 | |20031001| |NO |NO| | |NULL|NULL
A0203|3487684| |20070420| | | | | |NULL|NULL
B0002|2002184|NULL | | |YES| | | | |

This is the query I ended up with


With CTE(RowNo, CUSTCNTR_ID, CUSTOMNUM, CONTRACT, SMARTNET, SMARTNET_START_DATE,
SMARTNET_STOP_DATE, ONSITE, PREMIUM_ONSITE,
ONSITE_START_DATE, ONSITE_STOP_DATE, ONSITE_RENEW_DATE, NOTE) AS
(
SELECT ROW_NUMBER() OVER (PARTITION BY CONTRACT, CUSTOMNUM, SMARTNET, SMARTNET_START_DATE,
SMARTNET_STOP_DATE, ONSITE, PREMIUM_ONSITE, ONSITE_START_DATE,
ONSITE_STOP_DATE, ONSITE_RENEW_DATE, NOTE ORDER BY CUSTCNTR_ID) AS RowNo, *
FROM ooss.ontrack_contracts
)

UPDATE e
SET e.CUSTCNTR_ID=c2.CUSTCNTR_ID
FROM ooss.ontrack_customerequiptmp e
INNER JOIN CTE c1
ON c1.CUSTCNTR_ID = e.CUSTCNTR_ID
AND c1.CUSTOMNUM = e.CUSTOMNUM
INNER JOIN CTE c2
ON c2.CONTRACT=c1.CONTRACT
AND c2.SMARTNET_START_DATE=c1.SMARTNET_START_DATE
AND c2.CUSTOMNUM=c1.CUSTOMNUM
AND c2.SMARTNET_STOP_DATE = c1.SMARTNET_STOP_DATE
AND c2.SMARTNET = c1.SMARTNET
AND c2.ONSITE = c1.ONSITE
AND c2.PREMIUM_ONSITE = c1.PREMIUM_ONSITE
AND c2.ONSITE_START_DATE = c1.ONSITE_START_DATE
AND ISNULL(c2.ONSITE_RENEW_DATE, '') = ISNULL(c1.ONSITE_RENEW_DATE, '')
AND ISNULL(c2.NOTE, '') = ISNULL(c1.NOTE, '')
AND c2.RowNo=1
AND c1.RowNo>1

DELETE t
FROM
(SELECT ROW_NUMBER() OVER (PARTITION BY SMARTNET, SMARTNET_START_DATE,
SMARTNET_STOP_DATE, ONSITE, PREMIUM_ONSITE,
ONSITE_START_DATE, ONSITE_STOP_DATE, ONSITE_RENEW_DATE, NOTE ORDER BY CUSTCNTR_ID) AS RowNo, *
FROM ooss.ontrack_contracts) t
WHERE t.RowNo >1
[/code]

Go to Top of Page

mike.bull
Starting Member

9 Posts

Posted - 2008-05-14 : 08:54:21
Well its fixed. Thanks for your help.

It turns out, that in the partition statement I had customnum and contract reversed. I fixed this and I have the appropriate results.

Thanks again for all your help!

Mike
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-05-14 : 09:36:43
Ok?
The order of tables in PARTITION BY does not matter.



E 12°55'05.25"
N 56°04'39.16"
Go to Top of Page

mike.bull
Starting Member

9 Posts

Posted - 2008-05-14 : 10:21:19
My apologizes,

I just compared the new and the old queries once again, Peso is right the order of the columns in the Partition statement have no effect on the results.


The issue lied in the delete statement at the bottom.


DELETE t
FROM
(SELECT ROW_NUMBER() OVER (PARTITION BY SMARTNET, SMARTNET_START_DATE,
SMARTNET_STOP_DATE, ONSITE, PREMIUM_ONSITE,
ONSITE_START_DATE, ONSITE_STOP_DATE, ONSITE_RENEW_DATE, NOTE ORDER BY CUSTCNTR_ID) AS RowNo, *
FROM ooss.ontrack_contracts) t
WHERE t.RowNo >1



You will notice that I have not included the CUSTOMNUM and the CONTRACT fields in the partition statement here. Whoops.

Anyway, sorry for misleading.

Mike
Go to Top of Page
   

- Advertisement -