Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2000 Forums
 Transact-SQL (2000)
 Code for Levenstein algorithm in SQL?

Author  Topic 

mr2turbo18
Starting Member

10 Posts

Posted - 2006-05-26 : 08:00:28
Hello,

I'm trying to do a string comparison by name and want to use the Levenstein algorithm for my comparison( http://en.wikipedia.org/wiki/Levenstein_distance ). I've searched the internet and had no luck.

Does anyone know where I can obtain the code for this in SQL?

Thanks!
Richard

-Richard the Newb

Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)

7020 Posts

Posted - 2006-05-26 : 08:28:30
Levenshtein Edit Distance Algorithm
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=51540

Better Phonetic Matching Algorithm than Soundex
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=13574

Lools like its spelled "Levenshtein distance". A Google search for that seems to return a lot.

CODO ERGO SUM
Go to Top of Page

Seventhnight
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2006-05-26 : 08:37:50
We should warn you that by nature of that algorithm a few comparisons is fine, but if you use it on a large set of data... prepare to wait

Corey

Co-worker on children "...when I have children, I'm going to beat them. Not because their bad, but becuase I think it would be fun ..."
Go to Top of Page

Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)

7020 Posts

Posted - 2006-05-26 : 08:47:51
There seem to be a lot of others around:
http://www.google.com/search?num=100&hl=en&lr=&c2coff=1&as_qdr=all&q=+%22Levenshtein+distance%22+sql&btnG=Search

Levenshtein Distance Algorithm: TSQL Implementation
by Joseph Gama
http://www.merriampark.com/ldtsql.htm

This may be a good use for a SQL 2005 CLR procedure.



CODO ERGO SUM
Go to Top of Page

mr2turbo18
Starting Member

10 Posts

Posted - 2006-05-26 : 08:57:03
thanks for the links! Kind of funny, wish I had used a string comparison algorithm on my searches(Levenstein vs. Levenshtein)

-Richard the Newb
Go to Top of Page

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2006-05-27 : 09:53:00
quote:
Originally posted by Seventhnight

We should warn you that by nature of that algorithm a few comparisons is fine, but if you use it on a large set of data... prepare to wait



I've often found that I'm only interested pairs of strings that differ by a small amount. For these cases, I've adapted the function so it gives up if it reaches a threshold -- since the distance values in the matrix increase monotonically rightwards and downwards, this is pretty simple to achieve.

Sorry about the lack of documentation!

SET QUOTED_IDENTIFIER ON
GO
SET ANSI_NULLS ON
GO

CREATE FUNCTION edit_distance_within(@s nvarchar(4000), @t nvarchar(4000), @d int)
RETURNS int
AS
BEGIN
DECLARE @sl int, @tl int, @i int, @j int, @sc nchar, @c int, @c1 int,
@cv0 nvarchar(4000), @cv1 nvarchar(4000), @cmin int
SELECT @sl = LEN(@s), @tl = LEN(@t), @cv1 = '', @j = 1, @i = 1, @c = 0
WHILE @j <= @tl
SELECT @cv1 = @cv1 + NCHAR(@j), @j = @j + 1
WHILE @i <= @sl
BEGIN
SELECT @sc = SUBSTRING(@s, @i, 1), @c1 = @i, @c = @i, @cv0 = '', @j = 1, @cmin = 4000
WHILE @j <= @tl
BEGIN
SET @c = @c + 1
SET @c1 = @c1 - CASE WHEN @sc = SUBSTRING(@t, @j, 1) THEN 1 ELSE 0 END
IF @c > @c1 SET @c = @c1
SET @c1 = UNICODE(SUBSTRING(@cv1, @j, 1)) + 1
IF @c > @c1 SET @c = @c1
IF @c < @cmin SET @cmin = @c
SELECT @cv0 = @cv0 + NCHAR(@c), @j = @j + 1
END
IF @cmin > @d BREAK
SELECT @cv1 = @cv0, @i = @i + 1
END
RETURN CASE WHEN @cmin <= @d AND @c <= @d THEN @c ELSE -1 END
END
GO

Go to Top of Page

cmdr_skywalker
Posting Yak Master

159 Posts

Posted - 2006-05-27 : 16:36:49
There are two glaring weakness of soundex: unable to support different first letter words (kristine, christine) and international names. However the first can be remedied by replacing the first character of the first word with the first character of the second word (i.e. cristine, christine) or vice versa (kristine, khristine) will produce the same soundex value or consider only the numeric digit in comparing values (disregard the first letter). This can be done using a customized soundex implementation.

Either way, if accuracy is the issue, use the algorithms above (it may take sometime) but if you want just a quick comparison, use the soundex (or modified soundex).

May the Almighty God bless us all!
Go to Top of Page

Seventhnight
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2006-05-30 : 08:49:01
quote:
Originally posted by Arnold Fribble

quote:
Originally posted by Seventhnight

We should warn you that by nature of that algorithm a few comparisons is fine, but if you use it on a large set of data... prepare to wait



I've often found that I'm only interested pairs of strings that differ by a small amount. For these cases, I've adapted the function so it gives up if it reaches a threshold -- since the distance values in the matrix increase monotonically rightwards and downwards, this is pretty simple to achieve.




Good Idea... !!
[added to library]

Corey

Co-worker on children "...when I have children, I'm going to beat them. Not because their bad, but becuase I think it would be fun ..."
Go to Top of Page
   

- Advertisement -