SQL Server Forums
Profile | Register | Active Topics | Members | Search | Forum FAQ
 
Register Now and get your question answered!
Username:
Password:
Save Password
Forgot your Password?

 All Forums
 General SQL Server Forums
 Script Library
 Levenshtein Edit Distance Algorithm
 New Topic  Reply to Topic
 Printer Friendly
Previous Page
Author Previous Topic Topic Next Topic
Page: of 2

SamC
White Water Yakist

USA
3460 Posts

Posted - 06/28/2005 :  08:58:19  Show Profile  Reply with Quote
I have a performance improvement to the function which will speed results when there there is a perfect match.
CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
  DECLARE @s1_len int, @s2_len int, @i int, @j int, @s1_char nchar, @c int, @c_temp int,
    @cv0 varbinary(8000), @cv1 varbinary(8000)
  IF @s1 = @s2
    RETURN 0
  SELECT @s1_len = LEN(@s1), @s2_len = LEN(@s2), @cv1 = 0x0000, @j = 1, @i = 1, @c = 0
  WHILE @j <= @s2_len
    SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1
  WHILE @i <= @s1_len
  BEGIN
    SELECT @s1_char = SUBSTRING(@s1, @i, 1), @c = @i, @cv0 = CAST(@i AS binary(2)), @j = 1
    WHILE @j <= @s2_len
    BEGIN
      SET @c = @c + 1
      SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
        CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
      IF @c > @c_temp SET @c = @c_temp
      SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
      IF @c > @c_temp SET @c = @c_temp
      SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
    END
    SELECT @cv1 = @cv0, @i = @i + 1
  END
  RETURN @c
END

Edited by - SamC on 06/28/2005 08:59:52
Go to Top of Page

elwoos
Flowing Fount of Yak Knowledge

United Kingdom
2050 Posts

Posted - 06/29/2005 :  03:23:51  Show Profile  Reply with Quote
quote:
quote:
--------------------------------------------------------------------------------
Anyway the GP's are the 'good' guys.
--------------------------------------------------------------------------------


Oh, sorry, it wasn't the GPs themselves I was complaining about,


Arnold, I knew you weren't having a go at them. I just wondered how you dealt with the consultants were the data is even worse, to my understanding it's only updated every couple of years at best and much of it is wrong.

I knew about GP's in different surgeries but I hadn't realised that a new code was created for each. As I'm about to try to tidy up the data on my system for these that is a really important snippet. Maybe I'll just wait for our Brave New World!!

Many thanks

steve

(maybe my sig should say NACS instead of Brain)

Alright Brain, you don't like me, and I don't like you. But lets just do this, and I can get back to killing you with beer.

Edited by - elwoos on 06/29/2005 03:25:47
Go to Top of Page

Arnold Fribble
Yak-finder General

United Kingdom
1961 Posts

Posted - 06/29/2005 :  03:43:31  Show Profile  Reply with Quote
quote:
Arnold, I knew you weren't having a go at them. I just wondered how you dealt with the consultants were the data is even worse, to my understanding it's only updated every couple of years at best and much of it is wrong.


To be honest, national coding of consultants isn't something we've had to worry about yet. But I'm sure in this new age of NPfIT or Connecting for Health or whatever the hell it's called this week...

quote:
maybe my sig should say NACS instead of Brain

oh "Brain"! I've been reading that as "Brian"...
Go to Top of Page

jjworld
Starting Member

USA
1 Posts

Posted - 08/02/2006 :  16:12:56  Show Profile  Reply with Quote
Here's a version that I created to meet my specific needs:


CREATE FUNCTION [dbo].[zfsMin3] (@a INT, @b INT, @c INT )
RETURNS INT
AS

BEGIN
DECLARE @Retval INT
IF @a <= @b AND @a <= @c
SET @Retval = @a

IF @b < @a AND @b <= @c
SET @Retval = @b

IF @c < @a AND @c < @b
SET @Retval = @c

RETURN @Retval

END


CREATE FUNCTION [dbo].[zfsLevenshtein] (@Source VARCHAR(2000), @Target VARCHAR(2000) )
--Retranslated by Jerry Johnson, 8/2/2006
RETURNS INT
AS

BEGIN

DECLARE @Distance INT, @Cost INT
DECLARE @SourceLength INT, @TargetLength INT
DECLARE @SourceCounter INT, @TargetCounter INT
DECLARE @Matrix TABLE (X INT, Y INT, Value INT)
DECLARE @Above INT, @Left INT, @Diagonal INT

--Preprocess to remove any characters that already match
DECLARE @MinLength INTEGER
DECLARE @NewSource VARCHAR(50), @NewTarget VARCHAR(50)

SET @MinLength = LEN(@Source)
IF LEN(@Target) < @MinLength
SET @MinLength = LEN(@Target)

SET @SourceCounter = 1
SET @NewSource = ''
SET @NewTarget = ''

WHILE @SourceCounter <= @MinLength
BEGIN
IF SUBSTRING(@Source, @SourceCounter, 1) <> SUBSTRING(@Target, @SourceCounter, 1)
BEGIN
SET @NewSource = @NewSource + SUBSTRING(@Source, @SourceCounter, 1)
SET @NewTarget = @NewTarget + SUBSTRING(@Target, @SourceCounter, 1)
END
SET @SourceCounter = @SourceCounter + 1
END

SET @NewSource = @NewSource + SUBSTRING(@Source, @MinLength + 1, LEN(@Source))
SET @NewTarget = @NewTarget + SUBSTRING(@Target, @MinLength + 1, LEN(@Target))
SET @Source = @NewSource
SET @Target = @NewTarget

--Step 1
SET @SourceLength = LEN(@Source)
SET @TargetLength = LEN(@Target)

IF @SourceLength = 0
BEGIN
SET @Distance = @TargetLength
GOTO Done
END

IF @TargetLength = 0
BEGIN
SET @Distance = @SourceLength
GOTO Done
END

--Step 2
SET @SourceCounter = 1
SET @TargetCounter = 1

INSERT @Matrix VALUES(0, 0, 0)

WHILE @SourceCounter <= @SourceLength
BEGIN
INSERT @Matrix VALUES(@SourceCounter, 0, @SourceCounter)
SET @SourceCounter = @SourceCounter + 1
END

WHILE @TargetCounter <= @TargetLength
BEGIN
INSERT @Matrix VALUES(0, @TargetCounter, @TargetCounter)
SET @TargetCounter = @TargetCounter + 1
END

--Steps 3 and 4
SET @SourceCounter = 1
SET @TargetCounter = 1

WHILE @SourceCounter <= @SourceLength
BEGIN
WHILE @TargetCounter <= @TargetLength
BEGIN

--Step 5
IF @SourceCounter > @TargetLength OR @TargetCounter > @SourceLength
SET @Cost = 1
ELSE
IF SUBSTRING(@Source, @SourceCounter, 1) = SUBSTRING(@Target, @TargetCounter, 1)
SET @Cost = 0
ELSE
SET @Cost = 1

--Step 6
SET @Above = (SELECT Value FROM @Matrix WHERE X = @SourceCounter AND Y = @TargetCounter - 1)
SET @Left = (SELECT Value FROM @Matrix WHERE X = @SourceCounter - 1 AND Y = @TargetCounter)
SET @Diagonal = (SELECT Value FROM @Matrix WHERE X = @SourceCounter - 1 AND Y = @TargetCounter - 1)

SET @Cost = dbo.zfsMin3(@Above + 1, @Left + 1, @Diagonal + @Cost)
INSERT @Matrix VALUES(@SourceCounter, @TargetCounter, @Cost)

SET @TargetCounter = @TargetCounter + 1
END
SET @SourceCounter = @SourceCounter + 1
SET @TargetCounter = 1
END

Done:
--SELECT * FROM @Matrix ORDER BY X, Y

--Step 7
RETURN (SELECT Value FROM @Matrix WHERE X = @SourceLength AND Y = @TargetLength)

END






Go to Top of Page

Murat
Starting Member

1 Posts

Posted - 04/02/2007 :  19:10:08  Show Profile  Reply with Quote
Is there any copyright issue in terms of using this function :)
CREATE FUNCTION edit_distance

MK

quote:
Originally posted by Arnold Fribble

See here www.merriampark.com/ld.htm for information about the algorithm. This page has a link (http://www.merriampark.com/ldtsql.htm) to a T-SQL implementation by Joseph Gama: unfortunately, that function doesn't work. There is a debugged version in the also-referenced package of TSQL functions (http://www.planet-source-code.com/vb/scripts/ShowCode.asp?txtCodeId=502&lngWId=5), but this still has the fundamental problem that it only works on pairs of strings up to 49 characters.


CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
  DECLARE @s1_len int, @s2_len int, @i int, @j int, @s1_char nchar, @c int, @c_temp int,
    @cv0 varbinary(8000), @cv1 varbinary(8000)
  SELECT @s1_len = LEN(@s1), @s2_len = LEN(@s2), @cv1 = 0x0000, @j = 1, @i = 1, @c = 0
  WHILE @j <= @s2_len
    SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1
  WHILE @i <= @s1_len
  BEGIN
    SELECT @s1_char = SUBSTRING(@s1, @i, 1), @c = @i, @cv0 = CAST(@i AS binary(2)), @j = 1
    WHILE @j <= @s2_len
    BEGIN
      SET @c = @c + 1
      SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
        CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
      IF @c > @c_temp SET @c = @c_temp
      SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
      IF @c > @c_temp SET @c = @c_temp
      SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
    END
    SELECT @cv1 = @cv0, @i = @i + 1
  END
  RETURN @c
END



Go to Top of Page

mharr
Starting Member

USA
20 Posts

Posted - 03/20/2008 :  13:56:06  Show Profile  Reply with Quote
quote:
Originally posted by Arnold Fribble

See here www.merriampark.com/ld.htm for information about the algorithm. This page has a link (http://www.merriampark.com/ldtsql.htm) to a T-SQL implementation .....


I stumbled across an inconsistency with this function. It appears that is @s1_len = 0 (i.e. if the first argument is an empty string), the function returns 0, as if it is perfect match to 2nd string.

I suggest adding this line after @s1_len is calculated:
IF (@s1_len=0) RETURN @s2_len




Mark
Go to Top of Page

coolerbob
Aged Yak Warrior

United Kingdom
841 Posts

Posted - 05/29/2008 :  17:49:49  Show Profile  Reply with Quote
quote:
Originally posted by Arnold Fribble

I didn't say it was fast for big strings, just that it worked!
Actually, the reason I posted it this week was that the old version used a temporary table (i,j,cost) and that was ludicrously slow even on short strings, so I rewrote it.

Any ideas on how to make it faster, though, would be handy. Though I don't think any simple changes are going to make it asymptotically better than O(m*n)




Maybe putting the algorithm in a UDA will yield better performance?
Go to Top of Page

andrermagalhaes
Starting Member

2 Posts

Posted - 06/09/2008 :  16:39:03  Show Profile  Reply with Quote
Hi.

I don't even get closer to understand this algorithm, but doesn't work how i imagine.

if you reverse the string doesn't match ?..

edit_distance('ABCD', 'ABCD') // returns 0
edit_distance('ABCD', 'DCBA') // returns 4

why?? it's that correct, and i understand wrong?
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

Sweden
30113 Posts

Posted - 06/10/2008 :  01:10:24  Show Profile  Visit SwePeso's Homepage  Reply with Quote
Google is your friend
http://en.wikipedia.org/wiki/Levenstein_Distance



E 12°55'05.25"
N 56°04'39.16"
Go to Top of Page

andrermagalhaes
Starting Member

2 Posts

Posted - 06/10/2008 :  09:12:59  Show Profile  Reply with Quote
Hmm..

I understand now.

thank's for your google help.
Go to Top of Page

paulrowe
Starting Member

1 Posts

Posted - 08/27/2008 :  22:45:13  Show Profile  Reply with Quote
Here is a version of the Levenshtein algorithm that has a 3999-character limit and works with a one-dimensional array (without it simulating a two-dimensional array). My thanks to Arnold for showing me some tricks to use.

Some notes:
  • It first checks the length of each string and returns the length of the other if one is empty.
  • It then checks the strings for equivalence and returns zero if they are the same.
  • It then checks to see if one string is a sub-string of the other and returns the difference in length if such is the case.
  • Only after these checks have failed will it run the algorithm against the two strings.

Notice that the array length only needs to be as long as the shorter string, but both are confined to 3,999 characters to ensure that the array will be large enough. I expect you can use a nVarChar(Max) as long as neither string exceeds a length of 32,766 characters.

CREATE FUNCTION [dbo].[levenshtein](@s1 nVarChar(3999), @s2 nVarChar(3999))
RETURNS Int
AS BEGIN
  DECLARE @sLeft nVarChar(3999), @sRight nVarChar(3999), @nLeftLength Int, @nRightLength Int, 
    @nsDistance nVarChar(4000), @nLeftPos Int, @nRightPos Int, @nMatrixDiag Int, @nMatrixSide Int, @nMatrixVert Int, 
    @cLeft nVarChar(1), @cRight nVarChar(1), @nCost Int, @nMatrixTemp Int;
  IF (Len(@s1) = 0) RETURN Len(@s2)
  ELSE IF (Len(@s2) = 0) RETURN Len(@s1)
  ELSE IF (@s1 = @s2) RETURN 0
  ELSE IF ((Len(@s1) < Len(@s2)) AND (CharIndex(@s1, @s2) <> 0)) RETURN Len(@s2) - Len(@s1)
  ELSE IF ((Len(@s2) < Len(@s1)) AND (CharIndex(@s2, @s1) <> 0)) RETURN Len(@s1) - Len(@s2)
  ELSE BEGIN
    IF (Len(@s1) > Len(@s2)) BEGIN
      SET @sLeft = @s1;
      SET @sRight = @s2;
    END ELSE BEGIN
      SET @sLeft = @s2;
      SET @sRight = @s1;
    END;
    SET @nLeftLength = Len(@sLeft);
    SET @nRightLength = Len(@sRight);
    SET @nsDistance = NChar(1);
    SET @nRightPos = 1;
    WHILE (@nRightPos <= @nRightLength) BEGIN
      SET @nsDistance = @nsDistance + NChar(@nRightPos + 1);
      SET @nRightPos = @nRightPos + 1;
    END;
    SET @nLeftPos = 1;
    WHILE (@nLeftPos <= @nLeftLength) BEGIN
      SET @cLeft = SubString(@sLeft, @nLeftPos, 1);
      SET @nMatrixDiag = Unicode(SubString(@nsDistance, 1, 1));
      SET @nsDistance = NChar(@nLeftPos + 1) + SubString(@nsDistance, 2, @nRightLength);
      SET @nRightPos = 1;
      WHILE (@nRightPos <= @nRightLength) BEGIN
        SET @cRight = SubString(@sRight, @nRightPos, 1);
        SET @nCost = CASE WHEN (@cRight = @cLeft) THEN 0 ELSE 1 END;
        SET @nMatrixSide = Unicode(SubString(@nsDistance, @nRightPos, 1));
        SET @nMatrixVert = Unicode(SubString(@nsDistance, @nRightPos + 1, 1));
        SET @nMatrixTemp = CASE
          WHEN ((@nMatrixDiag + @nCost <= @nMatrixSide + 1) AND (@nMatrixDiag + @nCost <= @nMatrixVert + 1))
            THEN @nMatrixDiag + @nCost
          WHEN ((@nMatrixSide + 1 <= @nMatrixDiag + @nCost) AND (@nMatrixSide + 1 <= @nMatrixVert + 1)) 
            THEN @nMatrixSide + 1
          ELSE @nMatrixVert + 1
        END;
        IF (@nRightPos < @nRightLength)
          SET @nsDistance = SubString(@nsDistance, 1, @nRightPos) + NChar(@nMatrixTemp) 
            + SubString(@nsDistance, @nRightPos + 2, @nRightLength - @nRightPos)
        ELSE
          SET @nsDistance = SubString(@nsDistance, 1, @nRightPos) + NChar(@nMatrixTemp);
        SET @nMatrixDiag = @nMatrixVert;
        SET @nRightPos = @nRightPos + 1;
      END;
      SET @nLeftPos = @nLeftPos + 1;
    END;
  END;
  RETURN Unicode(SubString(@nsDistance, @nRightLength + 1, 1)) - 1;
END;

Enjoy!

Edited by - paulrowe on 08/27/2008 22:47:06
Go to Top of Page

mozkan
Starting Member

1 Posts

Posted - 01/21/2011 :  03:46:55  Show Profile  Reply with Quote
I use SQL Function and the SQL query at link below.
Works great for me. These 2 togethger give me the best search results like "google's did you mean this"

http://www.findwhouse.com/post.aspx?t=41

This query is made to search product names in a product table. You can customize it for your needs.

Query:

SELECT P.*, ABS(dbo.edit_distance(Name, @Name) + LEN(@Name)) - (LEN(Name) + LEN(@Name) / 3) AS Score, dbo.edit_distance(Name, @Name) AS Score2 FROM viewProduct P 
WHERE (dbo.edit_distance(NAME,@Name) + LEN(@Name)) < (LEN(Name) + (LEN(@Name) / 3) ) 
OR (dbo.edit_distance(Name, @Name) + LEN(Name) < LEN(@Name) + LEN(Name) / 3) ORDER BY Score, Score2

Edited by - mozkan on 01/21/2011 03:49:31
Go to Top of Page
Page: of 2 Previous Topic Topic Next Topic  
Previous Page
 New Topic  Reply to Topic
 Printer Friendly
Jump To:
SQL Server Forums © 2000-2009 SQLTeam Publishing, LLC Go To Top Of Page
This page was generated in 0.11 seconds. Powered By: Snitz Forums 2000