Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 General SQL Server Forums
 Script Library
 Levenshtein Edit Distance Algorithm

Author  Topic 

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2005-06-24 : 09:52:40
See here www.merriampark.com/ld.htm for information about the algorithm. This page has a link ([url]http://www.merriampark.com/ldtsql.htm[/url]) to a T-SQL implementation by Joseph Gama: unfortunately, that function doesn't work. There is a debugged version in the also-referenced package of TSQL functions ([url]http://www.planet-source-code.com/vb/scripts/ShowCode.asp?txtCodeId=502&lngWId=5[/url]), but this still has the fundamental problem that it only works on pairs of strings up to 49 characters.


CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
DECLARE @s1_len int, @s2_len int, @i int, @j int, @s1_char nchar, @c int, @c_temp int,
@cv0 varbinary(8000), @cv1 varbinary(8000)
SELECT @s1_len = LEN(@s1), @s2_len = LEN(@s2), @cv1 = 0x0000, @j = 1, @i = 1, @c = 0
WHILE @j <= @s2_len
SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1
WHILE @i <= @s1_len
BEGIN
SELECT @s1_char = SUBSTRING(@s1, @i, 1), @c = @i, @cv0 = CAST(@i AS binary(2)), @j = 1
WHILE @j <= @s2_len
BEGIN
SET @c = @c + 1
SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
IF @c > @c_temp SET @c = @c_temp
SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
IF @c > @c_temp SET @c = @c_temp
SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
END
SELECT @cv1 = @cv0, @i = @i + 1
END
RETURN @c
END

Seventhnight
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2005-06-24 : 11:13:52
speed :
Select admin.dbo.edit_distance(replicate('abcdefghij',399),replicate('abcdefghij',399)) took 1:19 to run and return 0

Select admin.dbo.edit_distance(replicate('abcdefghij',399),replicate('zabcdefghi',399))
took 1:14 to run and return 400 (correct)


Corey

Co-worker on The Wizard of Oz "...those three midgets that came out and danced, the freaked me out when I was little. But they are ok now."
Go to Top of Page

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2005-06-24 : 11:51:42
I didn't say it was fast for big strings, just that it worked!
Actually, the reason I posted it this week was that the old version used a temporary table (i,j,cost) and that was ludicrously slow even on short strings, so I rewrote it.

Any ideas on how to make it faster, though, would be handy. Though I don't think any simple changes are going to make it asymptotically better than O(m*n)
Go to Top of Page

Seventhnight
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2005-06-24 : 11:53:03
i'm thinking

Corey

Co-worker on The Wizard of Oz "...those three midgets that came out and danced, the freaked me out when I was little. But they are ok now."
Go to Top of Page

Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)

7020 Posts

Posted - 2005-06-24 : 12:09:41
Anyone have an example of an application you would use this for?

CODO ERGO SUM
Go to Top of Page

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2005-06-24 : 12:17:24
In my case, I'd been supplied with a table of 14500 GPs (primary care doctors). About 1200 of those, the name in that table didn't match the name in the national table (about 45000 rows) (joining on national GP id). Sorting the mismatches by edit distance I could quickly see what sort of differences there were -- most of them were one or two character differences.
Go to Top of Page

SamC
White Water Yakist

3467 Posts

Posted - 2005-06-24 : 14:07:22
quote:
Originally posted by Michael Valentine Jones

Anyone have an example of an application you would use this for?
You may have already seen the reader challenge I posted using this function.

I'm not finding many takers...
Go to Top of Page

TG
Master Smack Fu Yak Hacker

6065 Posts

Posted - 2005-06-24 : 14:57:42
>>I'm not finding many takers...

My big sister used to give me similar "challenges". ie: I bet you can't wash all those dishes before I'm done watching this TV show.

Be One with the Optimizer
TG
Go to Top of Page

spirit1
Cybernetic Yak Master

11752 Posts

Posted - 2005-06-24 : 18:09:01
well we use it for our CRM...
when adding a new organization or person you can search using distance check or not.
although it's implemented in asp not sql.


Go with the flow & have fun! Else fight the flow
Go to Top of Page

Kristen
Test

22859 Posts

Posted - 2005-06-24 : 23:53:39
"I bet you can't wash all those dishes before I'm done watching this TV show"

Shucks, I wish I'd had a big sister ...

Kristen
Go to Top of Page

elwoos
Master Smack Fu Yak Hacker

2052 Posts

Posted - 2005-06-27 : 04:16:44
Arnold, you could be a life saver. I am going to have to do something similar to you soon


cheers

steve (Possibly only the second person on this board that can pronounce Machynlleth)
Go to Top of Page

SamC
White Water Yakist

3467 Posts

Posted - 2005-06-27 : 09:54:36
quote:
Originally posted by TG

My big sister used to give me similar "challenges". ie: I bet you can't wash all those dishes before I'm done watching this TV show.
Hey! I resemble that remark!

... I mentioned in my post that it seemed better to correct the mistakes rather than pursue approximations (in our case anyway).
Go to Top of Page

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2005-06-27 : 10:26:09
quote:
Originally posted by elwoos

Arnold, you could be a life saver. I am going to have to do something similar to you soon


Have fun! Hope it's easier than the time I tried to write something to separate the initials and surname in the "Organisation Name" column of EGPCUR! Love to know what idiot at NACS thought it would be a good idea to give the tables for GPs, practices, PCTs and health authorities the same schema. And don't get me started on GPs who work in multiple practices...

Go to Top of Page

SamC
White Water Yakist

3467 Posts

Posted - 2005-06-27 : 11:03:11
Couldn't help notice you use SELECT instead of SET. BOL makes a issue about there being some benefit using SET rather than SELECT for non-table based (scalar?) calculations.

Anyone know why? I'm guessing it's a compile time optimization with no runtime benefit...

Sam
Go to Top of Page

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2005-06-27 : 12:36:28
quote:
Originally posted by SamC

Couldn't help notice you use SELECT instead of SET. BOL makes a issue about there being some benefit using SET rather than SELECT for non-table based (scalar?) calculations.


Does it? Oh, I assumed it could spot that there was no actual selecting involved -- after all, if there were, it wouldn't allow the function to be defined! -- and that it would get interpreted the same way as SET. It's quite possible it does make a difference, though. I'll give it a go.
I only used SELECT because I had a load of variables that needed setting and it was getting visually noisy having all those SETs.
Go to Top of Page

elwoos
Master Smack Fu Yak Hacker

2052 Posts

Posted - 2005-06-28 : 03:39:08
Arnold, you mean we can't just ignore them - maybe that's where I've been going wrong . Anyway the GP's are the 'good' guys. What about the consultants! Damn GMC needs to get its act together

The functions I wrote [url]http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=50044[/url] (albeit with errors!) were partially for handling this little lot

Alright Brain, you don't like me, and I don't like you. But lets just do this, and I can get back to killing you with beer.
Go to Top of Page

rrb
SQLTeam Poet Laureate

1479 Posts

Posted - 2005-06-28 : 03:57:23
Mr Fribble, if I haven't already said it a million times... You ROCK!

That's extra cool - more cool than just normally cool!

Thanks

--
I hope that when I die someone will say of me "That guy sure owed me a lot of money"
Go to Top of Page

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2005-06-28 : 03:57:42
quote:
Anyway the GP's are the 'good' guys.

Oh, sorry, it wasn't the GPs themselves I was complaining about, just NACS mishandling of the database where a GP works in multiple practices (or at least, multiple entities represented in the EPRACCUR table). One of my colleagues asked NACS about this and they said that they create a new national GP code (starting with G6 or something) for the GP working in the 'other' practice.
Certainly if you look at something like this:

SELECT A.*
FROM EGPCUR AS A
INNER JOIN (
SELECT "Organisation Name", LEFT(Postcode, 2) AS pc
FROM EGPCUR
GROUP BY "Organisation Name", LEFT(Postcode, 2)
HAVING COUNT(*) > 1
) AS B
ON A."Organisation Name" = B."Organisation Name" AND LEFT(A.Postcode, 2) = pc
ORDER BY A."Organisation Name", A.Postcode

you see (apart from many rows in EGPCUR that don't represent a single person!) quite a number of occurrences of this sort of thing.
Go to Top of Page

SamC
White Water Yakist

3467 Posts

Posted - 2005-06-28 : 07:17:30
Is it a consensus then that if 3 fields exists: Firstname, Middlename, Lastname, that the a distance on (Firstname + ' ' + Middlename + ' ' + Lastname) will be as accurate (and faster) than 3 separate distance calculations on the individual columns?
Go to Top of Page

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2005-06-28 : 07:56:26
quote:
Originally posted by SamC

Is it a consensus then that if 3 fields exists: Firstname, Middlename, Lastname, that the a distance on (Firstname + ' ' + Middlename + ' ' + Lastname) will be as accurate (and faster) than 3 separate distance calculations on the individual columns?


If it is faster to combine the columns, then it's only because of the call overhead. The function takes O(nm) time (where n and m are the lengths of the strings)*, so you're comparing m1n1 + m2n2 + m3n3 to (m1+m2+m3)(n1+n2+n3)
But "as accurate" depends on the data, and what you're trying to compare.

* Probably worse in this implementation, since it's reassigning the costs in the cost vectors.
Go to Top of Page

SamC
White Water Yakist

3467 Posts

Posted - 2005-06-28 : 08:24:39
I could experiment, but let me ask anyway. Suppose your data largely had errors in the middlename being present (or not), and the Firstname, Lastname were (usually) a match.

Would accuracy suffer in the comparison of (m1+m2+m3) versus comparing m1, m2 and m3 individually?

The query will be much simpler in the former, but it seems to me it would be no less accurate than the latter.
Go to Top of Page
    Next Page

- Advertisement -