SQL Server Forums
Profile | Register | Active Topics | Members | Search | Forum FAQ
 
Register Now and get your question answered!
Username:
Password:
Save Password
Forgot your Password?

 All Forums
 General SQL Server Forums
 Script Library
 Levenshtein Edit Distance Algorithm
 New Topic  Reply to Topic
 Printer Friendly
Next Page
Author Previous Topic Topic Next Topic
Page: of 2

Arnold Fribble
Yak-finder General

United Kingdom
1961 Posts

Posted - 06/24/2005 :  09:52:40  Show Profile  Reply with Quote
See here www.merriampark.com/ld.htm for information about the algorithm. This page has a link (http://www.merriampark.com/ldtsql.htm) to a T-SQL implementation by Joseph Gama: unfortunately, that function doesn't work. There is a debugged version in the also-referenced package of TSQL functions (http://www.planet-source-code.com/vb/scripts/ShowCode.asp?txtCodeId=502&lngWId=5), but this still has the fundamental problem that it only works on pairs of strings up to 49 characters.


CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
  DECLARE @s1_len int, @s2_len int, @i int, @j int, @s1_char nchar, @c int, @c_temp int,
    @cv0 varbinary(8000), @cv1 varbinary(8000)
  SELECT @s1_len = LEN(@s1), @s2_len = LEN(@s2), @cv1 = 0x0000, @j = 1, @i = 1, @c = 0
  WHILE @j <= @s2_len
    SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1
  WHILE @i <= @s1_len
  BEGIN
    SELECT @s1_char = SUBSTRING(@s1, @i, 1), @c = @i, @cv0 = CAST(@i AS binary(2)), @j = 1
    WHILE @j <= @s2_len
    BEGIN
      SET @c = @c + 1
      SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
        CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
      IF @c > @c_temp SET @c = @c_temp
      SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
      IF @c > @c_temp SET @c = @c_temp
      SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
    END
    SELECT @cv1 = @cv0, @i = @i + 1
  END
  RETURN @c
END


Edited by - Arnold Fribble on 06/24/2005 09:56:42

Seventhnight
Flowing Fount of Yak Knowledge

USA
2878 Posts

Posted - 06/24/2005 :  11:13:52  Show Profile  Visit Seventhnight's Homepage  Reply with Quote
speed :
Select admin.dbo.edit_distance(replicate('abcdefghij',399),replicate('abcdefghij',399)) took 1:19 to run and return 0

Select admin.dbo.edit_distance(replicate('abcdefghij',399),replicate('zabcdefghi',399))
took 1:14 to run and return 400 (correct)


Corey

Co-worker on The Wizard of Oz "...those three midgets that came out and danced, the freaked me out when I was little. But they are ok now."

Edited by - Seventhnight on 06/24/2005 11:14:14
Go to Top of Page

Arnold Fribble
Yak-finder General

United Kingdom
1961 Posts

Posted - 06/24/2005 :  11:51:42  Show Profile  Reply with Quote
I didn't say it was fast for big strings, just that it worked!
Actually, the reason I posted it this week was that the old version used a temporary table (i,j,cost) and that was ludicrously slow even on short strings, so I rewrote it.

Any ideas on how to make it faster, though, would be handy. Though I don't think any simple changes are going to make it asymptotically better than O(m*n)
Go to Top of Page

Seventhnight
Flowing Fount of Yak Knowledge

USA
2878 Posts

Posted - 06/24/2005 :  11:53:03  Show Profile  Visit Seventhnight's Homepage  Reply with Quote
i'm thinking

Corey

Co-worker on The Wizard of Oz "...those three midgets that came out and danced, the freaked me out when I was little. But they are ok now."
Go to Top of Page

Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)

USA
7020 Posts

Posted - 06/24/2005 :  12:09:41  Show Profile  Reply with Quote
Anyone have an example of an application you would use this for?

CODO ERGO SUM
Go to Top of Page

Arnold Fribble
Yak-finder General

United Kingdom
1961 Posts

Posted - 06/24/2005 :  12:17:24  Show Profile  Reply with Quote
In my case, I'd been supplied with a table of 14500 GPs (primary care doctors). About 1200 of those, the name in that table didn't match the name in the national table (about 45000 rows) (joining on national GP id). Sorting the mismatches by edit distance I could quickly see what sort of differences there were -- most of them were one or two character differences.

Edited by - Arnold Fribble on 06/24/2005 12:25:04
Go to Top of Page

SamC
White Water Yakist

USA
3464 Posts

Posted - 06/24/2005 :  14:07:22  Show Profile  Reply with Quote
quote:
Originally posted by Michael Valentine Jones

Anyone have an example of an application you would use this for?
You may have already seen the reader challenge I posted using this function.

I'm not finding many takers...
Go to Top of Page

TG
Flowing Fount of Yak Knowledge

USA
6062 Posts

Posted - 06/24/2005 :  14:57:42  Show Profile  Reply with Quote
>>I'm not finding many takers...

My big sister used to give me similar "challenges". ie: I bet you can't wash all those dishes before I'm done watching this TV show.

Be One with the Optimizer
TG
Go to Top of Page

spirit1
Cybernetic Yak Master

Slovenia
11751 Posts

Posted - 06/24/2005 :  18:09:01  Show Profile  Visit spirit1's Homepage  Reply with Quote
well we use it for our CRM...
when adding a new organization or person you can search using distance check or not.
although it's implemented in asp not sql.


Go with the flow & have fun! Else fight the flow
Go to Top of Page

Kristen
Test

United Kingdom
22415 Posts

Posted - 06/24/2005 :  23:53:39  Show Profile  Reply with Quote
"I bet you can't wash all those dishes before I'm done watching this TV show"

Shucks, I wish I'd had a big sister ...

Kristen
Go to Top of Page

elwoos
Flowing Fount of Yak Knowledge

United Kingdom
2050 Posts

Posted - 06/27/2005 :  04:16:44  Show Profile  Reply with Quote
Arnold, you could be a life saver. I am going to have to do something similar to you soon


cheers

steve (Possibly only the second person on this board that can pronounce Machynlleth)
Go to Top of Page

SamC
White Water Yakist

USA
3464 Posts

Posted - 06/27/2005 :  09:54:36  Show Profile  Reply with Quote
quote:
Originally posted by TG

My big sister used to give me similar "challenges". ie: I bet you can't wash all those dishes before I'm done watching this TV show.
Hey! I resemble that remark!

... I mentioned in my post that it seemed better to correct the mistakes rather than pursue approximations (in our case anyway).
Go to Top of Page

Arnold Fribble
Yak-finder General

United Kingdom
1961 Posts

Posted - 06/27/2005 :  10:26:09  Show Profile  Reply with Quote
quote:
Originally posted by elwoos

Arnold, you could be a life saver. I am going to have to do something similar to you soon


Have fun! Hope it's easier than the time I tried to write something to separate the initials and surname in the "Organisation Name" column of EGPCUR! Love to know what idiot at NACS thought it would be a good idea to give the tables for GPs, practices, PCTs and health authorities the same schema. And don't get me started on GPs who work in multiple practices...


Edited by - Arnold Fribble on 06/27/2005 10:27:17
Go to Top of Page

SamC
White Water Yakist

USA
3464 Posts

Posted - 06/27/2005 :  11:03:11  Show Profile  Reply with Quote
Couldn't help notice you use SELECT instead of SET. BOL makes a issue about there being some benefit using SET rather than SELECT for non-table based (scalar?) calculations.

Anyone know why? I'm guessing it's a compile time optimization with no runtime benefit...

Sam
Go to Top of Page

Arnold Fribble
Yak-finder General

United Kingdom
1961 Posts

Posted - 06/27/2005 :  12:36:28  Show Profile  Reply with Quote
quote:
Originally posted by SamC

Couldn't help notice you use SELECT instead of SET. BOL makes a issue about there being some benefit using SET rather than SELECT for non-table based (scalar?) calculations.


Does it? Oh, I assumed it could spot that there was no actual selecting involved -- after all, if there were, it wouldn't allow the function to be defined! -- and that it would get interpreted the same way as SET. It's quite possible it does make a difference, though. I'll give it a go.
I only used SELECT because I had a load of variables that needed setting and it was getting visually noisy having all those SETs.
Go to Top of Page

elwoos
Flowing Fount of Yak Knowledge

United Kingdom
2050 Posts

Posted - 06/28/2005 :  03:39:08  Show Profile  Reply with Quote
Arnold, you mean we can't just ignore them - maybe that's where I've been going wrong . Anyway the GP's are the 'good' guys. What about the consultants! Damn GMC needs to get its act together

The functions I wrote http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=50044 (albeit with errors!) were partially for handling this little lot

Alright Brain, you don't like me, and I don't like you. But lets just do this, and I can get back to killing you with beer.
Go to Top of Page

rrb
SQLTeam Poet Laureate

Australia
1479 Posts

Posted - 06/28/2005 :  03:57:23  Show Profile  Reply with Quote
Mr Fribble, if I haven't already said it a million times... You ROCK!

That's extra cool - more cool than just normally cool!

Thanks

--
I hope that when I die someone will say of me "That guy sure owed me a lot of money"
Go to Top of Page

Arnold Fribble
Yak-finder General

United Kingdom
1961 Posts

Posted - 06/28/2005 :  03:57:42  Show Profile  Reply with Quote
quote:
Anyway the GP's are the 'good' guys.

Oh, sorry, it wasn't the GPs themselves I was complaining about, just NACS mishandling of the database where a GP works in multiple practices (or at least, multiple entities represented in the EPRACCUR table). One of my colleagues asked NACS about this and they said that they create a new national GP code (starting with G6 or something) for the GP working in the 'other' practice.
Certainly if you look at something like this:

SELECT A.*
FROM EGPCUR AS A
INNER JOIN (
    SELECT "Organisation Name", LEFT(Postcode, 2) AS pc
    FROM EGPCUR
    GROUP BY "Organisation Name", LEFT(Postcode, 2)
    HAVING COUNT(*) > 1
  ) AS B
  ON A."Organisation Name" = B."Organisation Name" AND LEFT(A.Postcode, 2) = pc
ORDER BY A."Organisation Name", A.Postcode

you see (apart from many rows in EGPCUR that don't represent a single person!) quite a number of occurrences of this sort of thing.

Edited by - Arnold Fribble on 06/28/2005 04:26:36
Go to Top of Page

SamC
White Water Yakist

USA
3464 Posts

Posted - 06/28/2005 :  07:17:30  Show Profile  Reply with Quote
Is it a consensus then that if 3 fields exists: Firstname, Middlename, Lastname, that the a distance on (Firstname + ' ' + Middlename + ' ' + Lastname) will be as accurate (and faster) than 3 separate distance calculations on the individual columns?
Go to Top of Page

Arnold Fribble
Yak-finder General

United Kingdom
1961 Posts

Posted - 06/28/2005 :  07:56:26  Show Profile  Reply with Quote
quote:
Originally posted by SamC

Is it a consensus then that if 3 fields exists: Firstname, Middlename, Lastname, that the a distance on (Firstname + ' ' + Middlename + ' ' + Lastname) will be as accurate (and faster) than 3 separate distance calculations on the individual columns?


If it is faster to combine the columns, then it's only because of the call overhead. The function takes O(nm) time (where n and m are the lengths of the strings)*, so you're comparing m1n1 + m2n2 + m3n3 to (m1+m2+m3)(n1+n2+n3)
But "as accurate" depends on the data, and what you're trying to compare.

* Probably worse in this implementation, since it's reassigning the costs in the cost vectors.
Go to Top of Page

SamC
White Water Yakist

USA
3464 Posts

Posted - 06/28/2005 :  08:24:39  Show Profile  Reply with Quote
I could experiment, but let me ask anyway. Suppose your data largely had errors in the middlename being present (or not), and the Firstname, Lastname were (usually) a match.

Would accuracy suffer in the comparison of (m1+m2+m3) versus comparing m1, m2 and m3 individually?

The query will be much simpler in the former, but it seems to me it would be no less accurate than the latter.
Go to Top of Page
Page: of 2 Previous Topic Topic Next Topic  
Next Page
 New Topic  Reply to Topic
 Printer Friendly
Jump To:
SQL Server Forums © 2000-2009 SQLTeam Publishing, LLC Go To Top Of Page
This page was generated in 0.16 seconds. Powered By: Snitz Forums 2000