Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2000 Forums
 Transact-SQL (2000)
 Removing Duplicates (Not the Usual)

Author  Topic 

paul.rowling
Yak Posting Veteran

81 Posts

Posted - 2003-08-06 : 10:25:43
Hi,

I use some software to identify duplicate records based on inital and surname for people who live at the same address.
There is a field in the database called dup_code which stores the identifying value e.g if two records were identified as duplicates they would both be given a dup code of 1. The next preceeding duplicate records would then be given a dup_code of two and so on.
The problem is this: for example the software identifies S Andrews, Sarah Andrews, Sharon Andrews as duplicates because they all have the same initial. What I want to be able to do in this scenario is say that none of them are duplicates as we cannot determine whether S Andrews is the same as Sarah or Sharon. On the other hand if the following scenario arose where S Andrews and Sarah Andrews were marked as duplicates you could assume that these are duplicates and can be left as so. In the first scenario I just want to update the dup_code to 0 as 0 means that the records are not duplicates.
The software automatically detects that Sharon Andrews and Sharon Andrews are duplicates and so there is nothing more I need to do with scenario's such as this.

Can anyone shed any light on this?

Cheers

Paul

X002548
Not Just a Number

15586 Posts

Posted - 2003-08-06 : 12:02:21
Anyway, Paul...You would need to more clearly define a dup...

How many positions of the first name for example (Sue, Susie, Susan, dups?)

Alos what else Birthdate?

I imagine 1st init, last name, birth date, state code..

somethings we've used in the past...by all means not an absoulte

Just the ability to get close...used a total positive match to exclude excats..then get fuzzy...

Use a join of 2 reult sets to get a single row and insert into a work table with an IDENTITY Columns


MOO



Brett

8-)

SELECT POST=NewId()
Go to Top of Page

paul.rowling
Yak Posting Veteran

81 Posts

Posted - 2003-08-06 : 17:33:16
Hi Brett,

Thanks for posting, I think I may have confused things with my previous description. The software that I use actually identifies the duplicates for me based on Initial, Surname, Property and Postcode so that I don't have to worry about doing that. The way that it identifies duplicate records is by entering a value in a dup_code field. The problem is that it's not that clever when you have for example the following records all at the same address:

S Andrews
Sarah Andrews
Sharon Andrews

The software will mark all these records as duplicates because the first letter of the "Forename" field is "S". The software identifies the records as duplicates in the following way:

Forename Surname Dup_Code

S Andrews 1
Sarah Andrews 1
Sharon Andrews 1
Paul Rowling 2
P Rowling 2
Ozzie Osborne 0

In this example any record that does not have a duplicate record is always given a dup_code of 0. All other records are given a unique dup_code which will match with an associated duplicate record.
Because there is no way to identify if S Andrews is a duplicate with Sarah or Sharon, I want to update their dup_code to 0, but leave all others intact.
Does this make sense now........my head hurts!!!

You may ask why I want to do this? Well, the DB that I have stores customer info in a customer table and transactions (such as what the customer bought, when they bought it and how much) in a transactions table. Therefore I want to be able to identify duplicate customer records, transfer the transactions of the duplicate person I want to delete to the person I want to keep and then delete duplicate person. In the instance of the "Andrews" example above, I don't know which person to transfer the transactions to and so because I cannot do this accurately I will just take a hit and leave them as they are.

Any info on this would be greatly appreciated.

Thanks

Paul
Go to Top of Page

vganesh76
Yak Posting Veteran

64 Posts

Posted - 2003-08-07 : 01:28:00
Hi Paul, I have solved ur problem in the following query. Hope it works for u.

Declare @t table (forename varchar(10),surname varchar(20),dupcode int)

insert into @t SELECT 'S', 'Andrews',null
insert into @t SELECT 'Sarah', 'Andrews', null
insert into @t SELECT 'Sharon', 'Andrews', null
insert into @t SELECT 'Paul', 'Rowling', null
insert into @t SELECT 'P', 'Rowling', null
insert into @t SELECT 'Ozzie', 'Osborne', null
insert into @t SELECT 'Ozzie1', 'Osborne', null
Declare @intctr int
Declare @vchname varchar(20)
set @intctr = 0
set @vchname =''
SELECT forename,surname
FROM @t a

UPDATE @t set @intctr = dupcode =@intctr + case when @vchname <> surname then 1 else 0 end,
@vchname = surname
FROM @t a
WHERE (SELECT count(*) FROM @t b WHERE a.surname=b.surname) > 1

UPDATE @t set dupcode = 0
FROM @t a
WHERE (SELECT count(*) FROM @t b WHERE a.surname=b.surname) = 1

SELECT * FROM @t


V.Ganesh
Net AssetManagement.Com
vganesh76@rediffmail.com

Enjoy working
Go to Top of Page

paul.rowling
Yak Posting Veteran

81 Posts

Posted - 2003-08-07 : 04:13:58
Hi Vganesh76,


"insert into @t SELECT 'S', 'Andrews',null
insert into @t SELECT 'Sarah', 'Andrews', null
insert into @t SELECT 'Sharon', 'Andrews', null
insert into @t SELECT 'Paul', 'Rowling', null
insert into @t SELECT 'P', 'Rowling', null
insert into @t SELECT 'Ozzie', 'Osborne', null
insert into @t SELECT 'Ozzie1', 'Osborne', null"

The data that I posted is just an example to help explain my problem. I will be running this on a table with thousands of records, therefore I won't know what records are in there to do the insert statements. Hope this makes sense?

Cheers


Paul
Go to Top of Page

vganesh76
Yak Posting Veteran

64 Posts

Posted - 2003-08-07 : 05:05:47
Hi Paul,
The code would work fine just by replacing the table variable (@t )with the actual table . Hope I dont find any problems with the query.

Enjoy working
Go to Top of Page

Stoad
Freaky Yak Linguist

1983 Posts

Posted - 2003-08-07 : 05:17:39
How about the case:

S Andrews 1
Sarah Andrews 1
Sharon Andrews 1
Sharon Andrews 1

Do you want four 0 instead of 1?

- Vit
Go to Top of Page

Amethystium
Aged Yak Warrior

701 Posts

Posted - 2003-08-07 : 05:41:16
quote:
Originally posted by X002548
What the hell is that all about? (Crespo?)



Assumptions... assumptions.

Don't do it Brett.





----------------
Shadow to Light
Go to Top of Page

Stoad
Freaky Yak Linguist

1983 Posts

Posted - 2003-08-07 : 06:13:43
quote:

Originally posted by Amethystium
Assumptions... assumptions.

Don't do it Brett.

What the hell is that all about? (Brett?)

- Vit
Go to Top of Page

Amethystium
Aged Yak Warrior

701 Posts

Posted - 2003-08-07 : 06:35:12
Vit,

I spotted this question on the forum yesterday and the first answer was quite rude. If you look at Brett's original reply you will find the posters response.

I was actually tempted to report it then I remembered something from the past and the sheer irony of it made me abandon my plan. When I came in this morning I found that the insulting post was taken down but somehow Brett managed to sneak in his own reply before it was taken down (obviously).
I kind of regret my reply to Brett but I just feel really strongly about people making assumptions. Seriously, 90% of mankind’s problems are based on false assumptions.

Anyway...


----------------
Shadow to Light
Go to Top of Page

paul.rowling
Yak Posting Veteran

81 Posts

Posted - 2003-08-07 : 06:46:09
Hi, In response to Stoad, the two Sharon Andrews should be left as 1 because they are duplicates with each other, the other two records would then be updated to 0.
Go to Top of Page

Stoad
Freaky Yak Linguist

1983 Posts

Posted - 2003-08-07 : 07:36:05
update myTable t set dup_code=0
where
1=(select count(*) from myTable tt where
tt.col1=t.col1 and tt.col2=t.col2 and .... and tt.colN=t.colN)

replace 'myTable' with your table name and [col...] with names of
each column of your table. Hope there are not too many of them in there.

- Vit
Go to Top of Page

paul.rowling
Yak Posting Veteran

81 Posts

Posted - 2003-08-07 : 08:35:12
Hi Stoad, thanks for your reply. Sorry to be a pain, but I don't understand this line:

1=(select count(*) from myTable tt where
tt.col1=t.col1 and tt.col2=t.col2 and .... and tt.colN=t.colN)


Can you please explain?

Cheers

Paul
Go to Top of Page

Stoad
Freaky Yak Linguist

1983 Posts

Posted - 2003-08-07 : 08:48:37
This line seek for number of records in the table
with the same values in corresponding columns as they
are in the 'current' record from outer 'update' instruction.
If it is found only one such record then it is obviously 'unique'
one and its dup_code value should be assign to 0. I think so...

- Vit
Go to Top of Page

X002548
Not Just a Number

15586 Posts

Posted - 2003-08-07 : 09:17:45
Finr...Fine...Fine...

GONE!

However, if it walks like a duck annd squawks like a duck and swims like a duck..then I must Assume it's a YAK...

Most of the worlds...oh forget it....

[EDIT]
...and it wasn't an assumption...it was an inquiry
[/EDIT]





Brett

8-)

SELECT POST=NewId()
Go to Top of Page

Amethystium
Aged Yak Warrior

701 Posts

Posted - 2003-08-07 : 09:24:18
Brett mate,

It's cool.

Let's just stick to SQL Server and music!


----------------
Shadow to Light
Go to Top of Page

X002548
Not Just a Number

15586 Posts

Posted - 2003-08-07 : 09:30:47
Yes, better left to the corral...



Brett

8-)

SELECT POST=NewId()
Go to Top of Page
   

- Advertisement -