| Author |
Topic  |
|
SamC
White Water Yakist
USA
3459 Posts |
Posted - 06/28/2005 : 08:58:19
|
I have a performance improvement to the function which will speed results when there there is a perfect match. CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
DECLARE @s1_len int, @s2_len int, @i int, @j int, @s1_char nchar, @c int, @c_temp int,
@cv0 varbinary(8000), @cv1 varbinary(8000)
IF @s1 = @s2
RETURN 0
SELECT @s1_len = LEN(@s1), @s2_len = LEN(@s2), @cv1 = 0x0000, @j = 1, @i = 1, @c = 0
WHILE @j <= @s2_len
SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1
WHILE @i <= @s1_len
BEGIN
SELECT @s1_char = SUBSTRING(@s1, @i, 1), @c = @i, @cv0 = CAST(@i AS binary(2)), @j = 1
WHILE @j <= @s2_len
BEGIN
SET @c = @c + 1
SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
IF @c > @c_temp SET @c = @c_temp
SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
IF @c > @c_temp SET @c = @c_temp
SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
END
SELECT @cv1 = @cv0, @i = @i + 1
END
RETURN @c
END |
Edited by - SamC on 06/28/2005 08:59:52 |
 |
|
|
elwoos
Flowing Fount of Yak Knowledge
United Kingdom
2039 Posts |
Posted - 06/29/2005 : 03:23:51
|
quote: quote: -------------------------------------------------------------------------------- Anyway the GP's are the 'good' guys. --------------------------------------------------------------------------------
Oh, sorry, it wasn't the GPs themselves I was complaining about,
Arnold, I knew you weren't having a go at them. I just wondered how you dealt with the consultants were the data is even worse, to my understanding it's only updated every couple of years at best and much of it is wrong.
I knew about GP's in different surgeries but I hadn't realised that a new code was created for each. As I'm about to try to tidy up the data on my system for these that is a really important snippet. Maybe I'll just wait for our Brave New World!!
Many thanks
steve
(maybe my sig should say NACS instead of Brain)
Alright Brain, you don't like me, and I don't like you. But lets just do this, and I can get back to killing you with beer. |
Edited by - elwoos on 06/29/2005 03:25:47 |
 |
|
|
Arnold Fribble
Yak-finder General
United Kingdom
1961 Posts |
Posted - 06/29/2005 : 03:43:31
|
quote: Arnold, I knew you weren't having a go at them. I just wondered how you dealt with the consultants were the data is even worse, to my understanding it's only updated every couple of years at best and much of it is wrong.
To be honest, national coding of consultants isn't something we've had to worry about yet. But I'm sure in this new age of NPfIT or Connecting for Health or whatever the hell it's called this week...
quote: maybe my sig should say NACS instead of Brain
oh "Brain"! I've been reading that as "Brian"... 
|
 |
|
|
jjworld
Starting Member
USA
1 Posts |
Posted - 08/02/2006 : 16:12:56
|
Here's a version that I created to meet my specific needs:
CREATE FUNCTION [dbo].[zfsMin3] (@a INT, @b INT, @c INT ) RETURNS INT AS
BEGIN DECLARE @Retval INT IF @a <= @b AND @a <= @c SET @Retval = @a
IF @b < @a AND @b <= @c SET @Retval = @b
IF @c < @a AND @c < @b SET @Retval = @c
RETURN @Retval
END
CREATE FUNCTION [dbo].[zfsLevenshtein] (@Source VARCHAR(2000), @Target VARCHAR(2000) ) --Retranslated by Jerry Johnson, 8/2/2006 RETURNS INT AS
BEGIN
DECLARE @Distance INT, @Cost INT DECLARE @SourceLength INT, @TargetLength INT DECLARE @SourceCounter INT, @TargetCounter INT DECLARE @Matrix TABLE (X INT, Y INT, Value INT) DECLARE @Above INT, @Left INT, @Diagonal INT
--Preprocess to remove any characters that already match DECLARE @MinLength INTEGER DECLARE @NewSource VARCHAR(50), @NewTarget VARCHAR(50)
SET @MinLength = LEN(@Source) IF LEN(@Target) < @MinLength SET @MinLength = LEN(@Target)
SET @SourceCounter = 1 SET @NewSource = '' SET @NewTarget = ''
WHILE @SourceCounter <= @MinLength BEGIN IF SUBSTRING(@Source, @SourceCounter, 1) <> SUBSTRING(@Target, @SourceCounter, 1) BEGIN SET @NewSource = @NewSource + SUBSTRING(@Source, @SourceCounter, 1) SET @NewTarget = @NewTarget + SUBSTRING(@Target, @SourceCounter, 1) END SET @SourceCounter = @SourceCounter + 1 END
SET @NewSource = @NewSource + SUBSTRING(@Source, @MinLength + 1, LEN(@Source)) SET @NewTarget = @NewTarget + SUBSTRING(@Target, @MinLength + 1, LEN(@Target)) SET @Source = @NewSource SET @Target = @NewTarget
--Step 1 SET @SourceLength = LEN(@Source) SET @TargetLength = LEN(@Target)
IF @SourceLength = 0 BEGIN SET @Distance = @TargetLength GOTO Done END
IF @TargetLength = 0 BEGIN SET @Distance = @SourceLength GOTO Done END
--Step 2 SET @SourceCounter = 1 SET @TargetCounter = 1
INSERT @Matrix VALUES(0, 0, 0)
WHILE @SourceCounter <= @SourceLength BEGIN INSERT @Matrix VALUES(@SourceCounter, 0, @SourceCounter) SET @SourceCounter = @SourceCounter + 1 END
WHILE @TargetCounter <= @TargetLength BEGIN INSERT @Matrix VALUES(0, @TargetCounter, @TargetCounter) SET @TargetCounter = @TargetCounter + 1 END
--Steps 3 and 4 SET @SourceCounter = 1 SET @TargetCounter = 1
WHILE @SourceCounter <= @SourceLength BEGIN WHILE @TargetCounter <= @TargetLength BEGIN
--Step 5 IF @SourceCounter > @TargetLength OR @TargetCounter > @SourceLength SET @Cost = 1 ELSE IF SUBSTRING(@Source, @SourceCounter, 1) = SUBSTRING(@Target, @TargetCounter, 1) SET @Cost = 0 ELSE SET @Cost = 1
--Step 6 SET @Above = (SELECT Value FROM @Matrix WHERE X = @SourceCounter AND Y = @TargetCounter - 1) SET @Left = (SELECT Value FROM @Matrix WHERE X = @SourceCounter - 1 AND Y = @TargetCounter) SET @Diagonal = (SELECT Value FROM @Matrix WHERE X = @SourceCounter - 1 AND Y = @TargetCounter - 1)
SET @Cost = dbo.zfsMin3(@Above + 1, @Left + 1, @Diagonal + @Cost) INSERT @Matrix VALUES(@SourceCounter, @TargetCounter, @Cost)
SET @TargetCounter = @TargetCounter + 1 END SET @SourceCounter = @SourceCounter + 1 SET @TargetCounter = 1 END
Done: --SELECT * FROM @Matrix ORDER BY X, Y
--Step 7 RETURN (SELECT Value FROM @Matrix WHERE X = @SourceLength AND Y = @TargetLength)
END
|
 |
|
|
Murat
Starting Member
1 Posts |
Posted - 04/02/2007 : 19:10:08
|
Is there any copyright issue in terms of using this function :) CREATE FUNCTION edit_distance
MK
quote: Originally posted by Arnold Fribble
See here www.merriampark.com/ld.htm for information about the algorithm. This page has a link (http://www.merriampark.com/ldtsql.htm) to a T-SQL implementation by Joseph Gama: unfortunately, that function doesn't work. There is a debugged version in the also-referenced package of TSQL functions (http://www.planet-source-code.com/vb/scripts/ShowCode.asp?txtCodeId=502&lngWId=5), but this still has the fundamental problem that it only works on pairs of strings up to 49 characters.
CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
DECLARE @s1_len int, @s2_len int, @i int, @j int, @s1_char nchar, @c int, @c_temp int,
@cv0 varbinary(8000), @cv1 varbinary(8000)
SELECT @s1_len = LEN(@s1), @s2_len = LEN(@s2), @cv1 = 0x0000, @j = 1, @i = 1, @c = 0
WHILE @j <= @s2_len
SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1
WHILE @i <= @s1_len
BEGIN
SELECT @s1_char = SUBSTRING(@s1, @i, 1), @c = @i, @cv0 = CAST(@i AS binary(2)), @j = 1
WHILE @j <= @s2_len
BEGIN
SET @c = @c + 1
SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
IF @c > @c_temp SET @c = @c_temp
SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
IF @c > @c_temp SET @c = @c_temp
SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
END
SELECT @cv1 = @cv0, @i = @i + 1
END
RETURN @c
END
|
 |
|
|
mharr
Starting Member
USA
20 Posts |
Posted - 03/20/2008 : 13:56:06
|
quote: Originally posted by Arnold Fribble
See here www.merriampark.com/ld.htm for information about the algorithm. This page has a link (http://www.merriampark.com/ldtsql.htm) to a T-SQL implementation .....
I stumbled across an inconsistency with this function. It appears that is @s1_len = 0 (i.e. if the first argument is an empty string), the function returns 0, as if it is perfect match to 2nd string.
I suggest adding this line after @s1_len is calculated:
IF (@s1_len=0) RETURN @s2_len
Mark |
 |
|
|
coolerbob
Aged Yak Warrior
United Kingdom
841 Posts |
Posted - 05/29/2008 : 17:49:49
|
quote: Originally posted by Arnold Fribble
I didn't say it was fast for big strings, just that it worked!  Actually, the reason I posted it this week was that the old version used a temporary table (i,j,cost) and that was ludicrously slow even on short strings, so I rewrote it.
Any ideas on how to make it faster, though, would be handy. Though I don't think any simple changes are going to make it asymptotically better than O(m*n)
Maybe putting the algorithm in a UDA will yield better performance? |
 |
|
|
andrermagalhaes
Starting Member
2 Posts |
Posted - 06/09/2008 : 16:39:03
|
Hi.
I don't even get closer to understand this algorithm, but doesn't work how i imagine.
if you reverse the string doesn't match ?..
edit_distance('ABCD', 'ABCD') // returns 0 edit_distance('ABCD', 'DCBA') // returns 4
why?? it's that correct, and i understand wrong? |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
Sweden
29138 Posts |
|
|
andrermagalhaes
Starting Member
2 Posts |
Posted - 06/10/2008 : 09:12:59
|
Hmm..
I understand now.
thank's for your google help. |
 |
|
|
paulrowe
Starting Member
1 Posts |
Posted - 08/27/2008 : 22:45:13
|
Here is a version of the Levenshtein algorithm that has a 3999-character limit and works with a one-dimensional array (without it simulating a two-dimensional array). My thanks to Arnold for showing me some tricks to use.
Some notes:
- It first checks the length of each string and returns the length of the other if one is empty.
- It then checks the strings for equivalence and returns zero if they are the same.
- It then checks to see if one string is a sub-string of the other and returns the difference in length if such is the case.
- Only after these checks have failed will it run the algorithm against the two strings.
Notice that the array length only needs to be as long as the shorter string, but both are confined to 3,999 characters to ensure that the array will be large enough. I expect you can use a nVarChar(Max) as long as neither string exceeds a length of 32,766 characters.
CREATE FUNCTION [dbo].[levenshtein](@s1 nVarChar(3999), @s2 nVarChar(3999))
RETURNS Int
AS BEGIN
DECLARE @sLeft nVarChar(3999), @sRight nVarChar(3999), @nLeftLength Int, @nRightLength Int,
@nsDistance nVarChar(4000), @nLeftPos Int, @nRightPos Int, @nMatrixDiag Int, @nMatrixSide Int, @nMatrixVert Int,
@cLeft nVarChar(1), @cRight nVarChar(1), @nCost Int, @nMatrixTemp Int;
IF (Len(@s1) = 0) RETURN Len(@s2)
ELSE IF (Len(@s2) = 0) RETURN Len(@s1)
ELSE IF (@s1 = @s2) RETURN 0
ELSE IF ((Len(@s1) < Len(@s2)) AND (CharIndex(@s1, @s2) <> 0)) RETURN Len(@s2) - Len(@s1)
ELSE IF ((Len(@s2) < Len(@s1)) AND (CharIndex(@s2, @s1) <> 0)) RETURN Len(@s1) - Len(@s2)
ELSE BEGIN
IF (Len(@s1) > Len(@s2)) BEGIN
SET @sLeft = @s1;
SET @sRight = @s2;
END ELSE BEGIN
SET @sLeft = @s2;
SET @sRight = @s1;
END;
SET @nLeftLength = Len(@sLeft);
SET @nRightLength = Len(@sRight);
SET @nsDistance = NChar(1);
SET @nRightPos = 1;
WHILE (@nRightPos <= @nRightLength) BEGIN
SET @nsDistance = @nsDistance + NChar(@nRightPos + 1);
SET @nRightPos = @nRightPos + 1;
END;
SET @nLeftPos = 1;
WHILE (@nLeftPos <= @nLeftLength) BEGIN
SET @cLeft = SubString(@sLeft, @nLeftPos, 1);
SET @nMatrixDiag = Unicode(SubString(@nsDistance, 1, 1));
SET @nsDistance = NChar(@nLeftPos + 1) + SubString(@nsDistance, 2, @nRightLength);
SET @nRightPos = 1;
WHILE (@nRightPos <= @nRightLength) BEGIN
SET @cRight = SubString(@sRight, @nRightPos, 1);
SET @nCost = CASE WHEN (@cRight = @cLeft) THEN 0 ELSE 1 END;
SET @nMatrixSide = Unicode(SubString(@nsDistance, @nRightPos, 1));
SET @nMatrixVert = Unicode(SubString(@nsDistance, @nRightPos + 1, 1));
SET @nMatrixTemp = CASE
WHEN ((@nMatrixDiag + @nCost <= @nMatrixSide + 1) AND (@nMatrixDiag + @nCost <= @nMatrixVert + 1))
THEN @nMatrixDiag + @nCost
WHEN ((@nMatrixSide + 1 <= @nMatrixDiag + @nCost) AND (@nMatrixSide + 1 <= @nMatrixVert + 1))
THEN @nMatrixSide + 1
ELSE @nMatrixVert + 1
END;
IF (@nRightPos < @nRightLength)
SET @nsDistance = SubString(@nsDistance, 1, @nRightPos) + NChar(@nMatrixTemp)
+ SubString(@nsDistance, @nRightPos + 2, @nRightLength - @nRightPos)
ELSE
SET @nsDistance = SubString(@nsDistance, 1, @nRightPos) + NChar(@nMatrixTemp);
SET @nMatrixDiag = @nMatrixVert;
SET @nRightPos = @nRightPos + 1;
END;
SET @nLeftPos = @nLeftPos + 1;
END;
END;
RETURN Unicode(SubString(@nsDistance, @nRightLength + 1, 1)) - 1;
END;
Enjoy! |
Edited by - paulrowe on 08/27/2008 22:47:06 |
 |
|
|
mozkan
Starting Member
1 Posts |
Posted - 01/21/2011 : 03:46:55
|
I use SQL Function and the SQL query at link below. Works great for me. These 2 togethger give me the best search results like "google's did you mean this"
http://www.findwhouse.com/post.aspx?t=41
This query is made to search product names in a product table. You can customize it for your needs.
Query:
SELECT P.*, ABS(dbo.edit_distance(Name, @Name) + LEN(@Name)) - (LEN(Name) + LEN(@Name) / 3) AS Score, dbo.edit_distance(Name, @Name) AS Score2 FROM viewProduct P
WHERE (dbo.edit_distance(NAME,@Name) + LEN(@Name)) < (LEN(Name) + (LEN(@Name) / 3) )
OR (dbo.edit_distance(Name, @Name) + LEN(Name) < LEN(@Name) + LEN(Name) / 3) ORDER BY Score, Score2 |
Edited by - mozkan on 01/21/2011 03:49:31 |
 |
|
Topic  |
|
|
|