SQL Server Forums
Profile | Register | Active Topics | Members | Search | Forum FAQ
 
Register Now and get your question answered!
Username:
Password:
Save Password
Forgot your Password?

 All Forums
 SQL Server 2000 Forums
 Transact-SQL (2000)
 Code for Levenstein algorithm in SQL?
 New Topic  Reply to Topic
 Printer Friendly
Author Previous Topic Topic Next Topic  

mr2turbo18
Starting Member

USA
10 Posts

Posted - 05/26/2006 :  08:00:28  Show Profile  Reply with Quote
Hello,

I'm trying to do a string comparison by name and want to use the Levenstein algorithm for my comparison( http://en.wikipedia.org/wiki/Levenstein_distance ). I've searched the internet and had no luck.

Does anyone know where I can obtain the code for this in SQL?

Thanks!
Richard

-Richard the Newb

Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)

USA
7020 Posts

Posted - 05/26/2006 :  08:28:30  Show Profile  Reply with Quote
Levenshtein Edit Distance Algorithm
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=51540

Better Phonetic Matching Algorithm than Soundex
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=13574

Lools like its spelled "Levenshtein distance". A Google search for that seems to return a lot.

CODO ERGO SUM

Edited by - Michael Valentine Jones on 05/26/2006 08:31:19
Go to Top of Page

Seventhnight
Flowing Fount of Yak Knowledge

USA
2878 Posts

Posted - 05/26/2006 :  08:37:50  Show Profile  Visit Seventhnight's Homepage  Reply with Quote
We should warn you that by nature of that algorithm a few comparisons is fine, but if you use it on a large set of data... prepare to wait

Corey

Co-worker on children "...when I have children, I'm going to beat them. Not because their bad, but becuase I think it would be fun ..."
Go to Top of Page

Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)

USA
7020 Posts

Posted - 05/26/2006 :  08:47:51  Show Profile  Reply with Quote
There seem to be a lot of others around:
http://www.google.com/search?num=100&hl=en&lr=&c2coff=1&as_qdr=all&q=+%22Levenshtein+distance%22+sql&btnG=Search

Levenshtein Distance Algorithm: TSQL Implementation
by Joseph Gama
http://www.merriampark.com/ldtsql.htm

This may be a good use for a SQL 2005 CLR procedure.



CODO ERGO SUM

Edited by - Michael Valentine Jones on 05/26/2006 08:52:50
Go to Top of Page

mr2turbo18
Starting Member

USA
10 Posts

Posted - 05/26/2006 :  08:57:03  Show Profile  Reply with Quote
thanks for the links! Kind of funny, wish I had used a string comparison algorithm on my searches(Levenstein vs. Levenshtein)

-Richard the Newb
Go to Top of Page

Arnold Fribble
Yak-finder General

United Kingdom
1961 Posts

Posted - 05/27/2006 :  09:53:00  Show Profile  Reply with Quote
quote:
Originally posted by Seventhnight

We should warn you that by nature of that algorithm a few comparisons is fine, but if you use it on a large set of data... prepare to wait



I've often found that I'm only interested pairs of strings that differ by a small amount. For these cases, I've adapted the function so it gives up if it reaches a threshold -- since the distance values in the matrix increase monotonically rightwards and downwards, this is pretty simple to achieve.

Sorry about the lack of documentation!

SET QUOTED_IDENTIFIER ON 
GO
SET ANSI_NULLS ON 
GO

CREATE FUNCTION edit_distance_within(@s nvarchar(4000), @t nvarchar(4000), @d int)
RETURNS int
AS
BEGIN
  DECLARE @sl int, @tl int, @i int, @j int, @sc nchar, @c int, @c1 int,
    @cv0 nvarchar(4000), @cv1 nvarchar(4000), @cmin int
  SELECT @sl = LEN(@s), @tl = LEN(@t), @cv1 = '', @j = 1, @i = 1, @c = 0
  WHILE @j <= @tl
    SELECT @cv1 = @cv1 + NCHAR(@j), @j = @j + 1
  WHILE @i <= @sl
  BEGIN
    SELECT @sc = SUBSTRING(@s, @i, 1), @c1 = @i, @c = @i, @cv0 = '', @j = 1, @cmin = 4000
    WHILE @j <= @tl
    BEGIN
      SET @c = @c + 1
      SET @c1 = @c1 - CASE WHEN @sc = SUBSTRING(@t, @j, 1) THEN 1 ELSE 0 END
      IF @c > @c1 SET @c = @c1
      SET @c1 = UNICODE(SUBSTRING(@cv1, @j, 1)) + 1
      IF @c > @c1 SET @c = @c1
      IF @c < @cmin SET @cmin = @c
      SELECT @cv0 = @cv0 + NCHAR(@c), @j = @j + 1
    END
    IF @cmin > @d BREAK
    SELECT @cv1 = @cv0, @i = @i + 1
  END
  RETURN CASE WHEN @cmin <= @d AND @c <= @d THEN @c ELSE -1 END
END
GO

Go to Top of Page

cmdr_skywalker
Posting Yak Master

159 Posts

Posted - 05/27/2006 :  16:36:49  Show Profile  Visit cmdr_skywalker's Homepage  Reply with Quote
There are two glaring weakness of soundex: unable to support different first letter words (kristine, christine) and international names. However the first can be remedied by replacing the first character of the first word with the first character of the second word (i.e. cristine, christine) or vice versa (kristine, khristine) will produce the same soundex value or consider only the numeric digit in comparing values (disregard the first letter). This can be done using a customized soundex implementation.

Either way, if accuracy is the issue, use the algorithms above (it may take sometime) but if you want just a quick comparison, use the soundex (or modified soundex).

May the Almighty God bless us all!
Go to Top of Page

Seventhnight
Flowing Fount of Yak Knowledge

USA
2878 Posts

Posted - 05/30/2006 :  08:49:01  Show Profile  Visit Seventhnight's Homepage  Reply with Quote
quote:
Originally posted by Arnold Fribble

quote:
Originally posted by Seventhnight

We should warn you that by nature of that algorithm a few comparisons is fine, but if you use it on a large set of data... prepare to wait



I've often found that I'm only interested pairs of strings that differ by a small amount. For these cases, I've adapted the function so it gives up if it reaches a threshold -- since the distance values in the matrix increase monotonically rightwards and downwards, this is pretty simple to achieve.




Good Idea... !!
[added to library]

Corey

Co-worker on children "...when I have children, I'm going to beat them. Not because their bad, but becuase I think it would be fun ..."
Go to Top of Page
  Previous Topic Topic Next Topic  
 New Topic  Reply to Topic
 Printer Friendly
Jump To:
SQL Server Forums © 2000-2009 SQLTeam Publishing, LLC Go To Top Of Page
This page was generated in 0.11 seconds. Powered By: Snitz Forums 2000