Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2000 Forums
 Transact-SQL (2000)
 Almost duplicate values - how to find them?

Author  Topic 

perels
Starting Member

23 Posts

Posted - 2007-01-13 : 19:25:29
Hi

My problem is that I have a list of 5000+ rows in a database, and there are rows that are almost duplicate values.

The layout is ProductID and ProductName. In the ProductName it could be something like

Fruity Loops
Fruiti Loops
Fruity Loops.
Fruity loop
fruity loop .


I have created something that will show exact duplicates, but this is not what I want:

SELECT A.ProductID,A.ProductName 
FROM Products A
JOIN (
SELECT ProductID, ProductName FROM Products GROUP BY ProductName,ProductID HAVING COUNT(*) > 1) B on A.ProductName = B.ProductName Order BY A.ProductName


I thought of using SOUNDEX, but I failed miserably trying to accomplish this...

Hope somebody can/will give it try :-)

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-01-14 : 04:23:16
Try the algorithm provided here.
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=51540

It is not fast but get's the job done. You can decide for yourself which accuracy you want.


Peter Larsson
Helsingborg, Sweden
Go to Top of Page

Alan Schofield
Starting Member

23 Posts

Posted - 2007-01-16 : 11:29:05
You can use this simple function I created to do the same thing. It simply strips all the characters out that are non alpha-numeric.
You can then use this in a join to test against.

=========================================================
CREATE FUNCTION fnStripToAlphaNumeric
(
@inputString nvarchar(4000)
)
RETURNS varchar(4000)
AS
BEGIN

DECLARE @Counter as int
DECLARE @strReturnVal varchar(4000)
DECLARE @Len as int
DECLARE @ASCII as int

SET @Counter=0
SET @Len=LEN(@inputString)

SET @strReturnVal = ''

WHILE @Counter<=@Len
BEGIN
SET @Counter = @Counter +1

SET @ascii= ASCII(SUBSTRING(@inputString,@counter,1))

IF(@ascii BETWEEN 65 AND 90) OR (@ascii BETWEEN 97 AND 122) OR (@ascii BETWEEN 48 AND 57)
BEGIN
SET @strReturnVal = @strReturnVal + (SUBSTRING(@inputString,@counter,1))
END
END

RETURN @strReturnVal

END

=========================================================




So in practice you can use the following

select a.* from Products a
join Products b
on dbo.fnStripToAlphaNumeric(a.ProductName) = dbo.fnStripToAlphaNumeric(b.ProductName)
where a.ProductID <> b.ProductID



Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-01-16 : 11:50:31
How would that make "Fruit Loop" and "Fruity Loops" the same?


Peter Larsson
Helsingborg, Sweden
Go to Top of Page

Alan Schofield
Starting Member

23 Posts

Posted - 2007-01-16 : 12:00:40
quote:
Originally posted by Peso

How would that make "Fruit Loop" and "Fruity Loops" the same?


Peter Larsson
Helsingborg, Sweden



Well spotted ! :))

I guess I shold have read this more closely... :(

BTW: Larsson from Helsingborg ??? You've had all the Man United jokes by now then? ;-)
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-01-16 : 12:12:46
Like Larsson was dissed and the United fans screamed "He's too old!"?
I went to Old Trafford to see Beckham just before his move to Spain and the english soccer fans are insane!


Peter Larsson
Helsingborg, Sweden
Go to Top of Page

perels
Starting Member

23 Posts

Posted - 2007-01-16 : 14:43:16
Thanks to you Peso - I haven't heard of Levenshtein before, but I will look into his algorithm :-)
Regarding to your message Alan Schofield, thanks for the effort, but I need something that will look on all characters and match them :-)
Go to Top of Page

perels
Starting Member

23 Posts

Posted - 2007-01-16 : 16:45:58
I am kind of stuck - I have grasped the levenshtein algo and has also read about Double Metaphone, Hammer distance and so on... the problem is now to construct the SQL - could you help me with that?
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-01-16 : 17:27:04
Start with this. It will show you how alike all the combinations of the data are.
--prepare sample data
declare @t table (productname varchar(100))

insert @t
select 'Fruity Loops' union all
select 'Fruiti Loops' union all
select 'Fruity Loops.' union all
select 'Fruity loop' union all
select 'fruity loop'

-- show the variances
select t1.productname,
t2.productname,
dbo.edit_distance(t1.productname, t2.productname)
from @t as t1
cross join @t as t2
where t1.productname < t2.productname
order by t1.productname,
t2.productname
I cannot do more since I have no idea of your business rules how to deal with this information.


Peter Larsson
Helsingborg, Sweden
Go to Top of Page
   

- Advertisement -