Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2005 Forums
 Transact-SQL (2005)
 Complicated Deduping

Author  Topic 

mattrigbye
Starting Member

2 Posts

Posted - 2007-08-02 : 05:28:40
Hi all - I have a tricky problem that I'm trying to solve, and not really getting anywhere. I'm hoping someone might have some ideas.

Basically we receive data from a call centre that is quite messy and completely unnormalised (as I'm sure is usual) and I'm trying to make some use out of it. Here is a simplified form of the data:

ID CustID PhoneID
1    1        10
2    2        10
3    3        11
4    1        12
5    2        12
6    4        13
7    5        11
8    5        15
9    5        18

Some Customers are given multiple CustIDs, and multiple PhoneIDs, so I'm trying to collapse the data on both these fields to identify individual customers.

In the above example, row IDs (1, 2, 4, 5), (3, 7, 8, 9) and (6) would constitute 3 individual customers after grouping by both CustID and PhoneID

Now my question is does anyone know a way to do this with a sql script? I've tried doing an inner join on CustID OR PhoneID, but that obviously doesn't work. I can't think of any other apporaches other than going through it line by line, which I'd rather avoid given that the table has nearly a million records.

Any help or pointers or even just thoughts would be much appreciated, thanks.

Cheers,
Matt

jimf
Master Smack Fu Yak Hacker

2875 Posts

Posted - 2007-08-02 : 06:47:14
If all you want are individual customers, just do SELECT DISTINCT custID from yourTable. If that isn't what you need, post the results you are expecting so we can help you further.

Jim
Go to Top of Page

Ifor
Aged Yak Warrior

700 Posts

Posted - 2007-08-02 : 07:10:49
You could try creating a table to map New CustIDs to Old CustIDs and then use that to get further information.
Something like:

-- *** Test Data ***
CREATE TABLE #temp
(
CustID int NOT NULL
,PhoneID int NOT NULL
)
INSERT INTO #temp
SELECT 1, 10 UNION ALL
SELECT 2, 10 UNION ALL
SELECT 3, 11 UNION ALL
SELECT 1, 12 UNION ALL
SELECT 2, 12 UNION ALL
SELECT 4, 13 UNION ALL
SELECT 5, 11 UNION ALL
SELECT 5, 15 UNION ALL
SELECT 5, 18
-- *** End Test Data ***

CREATE TABLE #New2Old
(
NewCustID int NOT NULL
,OldCustID int NOT NULL
,PRIMARY KEY (NewCustID, OldCustID)
)

;WITH PhoneDup (PhoneID, CustID)
AS
(
SELECT T.PhoneID, T.CustID
FROM #temp T
JOIN (
SELECT T1.PhoneID
FROM #temp T1
JOIN #temp T2
ON T1.PhoneID = T2.PhoneID
AND T1.CustID <> T2.CustID
GROUP BY T1.PhoneID
) D
ON T.PhoneID = D.PhoneID
)
INSERT INTO #New2Old
SELECT DISTINCT
ISNULL(V.CustID, T.CustID) AS NewCustID
,ISNULL(V.DupCustID, T.CustID) AS OldCustID
FROM #Temp T
LEFT JOIN (
SELECT D1.CustID, D2.CustID As DupCustID
FROM PhoneDup D1
JOIN PhoneDup D2
ON D1.PhoneID = D2.PhoneID
AND D1.CustID < D2.CustID
) V
ON T.CustID = V.DupCustID

Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-08-02 : 07:31:07
quote:
Originally posted by mattrigbye

ID CustID PhoneID
1    1        10
2    2        10
3    3        11
4    1        12
5    2        12
6    4        13
7    5        11
8    5        15
9    5        18
What is your expected output?



E 12°55'05.25"
N 56°04'39.16"
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-08-02 : 07:32:06
Also asked here
http://www.sqlservercentral.com/forums/shwmessage.aspx?forumid=338&messageid=387293



E 12°55'05.25"
N 56°04'39.16"
Go to Top of Page

mattrigbye
Starting Member

2 Posts

Posted - 2007-08-03 : 01:15:34
Hey guys

Thanks for your input so far. Sorry i thought I made it clearer in my initial post, but I'll explain it a bit further.

Basically in the output I want to know which customers are actually the same people. Obviously every record with CustID 1 is the same person, but we're also counting everyone with the same PhoneID as the same person too. Thus in many cases, the same customer will have multiple CustIDs, and often multiple PhoneIDs also.

So in this case, there should be 3 unique customers as follows:

ID CustID PhoneID ActualCust
1    1        10        A
2    2        10        A
3    3        11        B
4    1        12        A
5    2        12        A
6    4        13        C
7    5        11        B
8    5        15        B
9    5        18        B

I'm not to fussed about the actual output (more IDs or whatever), I just need ot be able to uniquely identify which rows belong to the same Customer.

I hope things makes things clearer. Thanks again for your help on this. Please let me know if you have any further questions.

Cheers,
Matt
Go to Top of Page
   

- Advertisement -