Please start any new threads on our new site at http://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

Our new SQL Server Forums are live! Come on over! We've restricted the ability to create new threads on these forums.

 SQL Server Forums Profile | Active Topics | Members | Search | Forum FAQ Register Now and get your question answered!
 All Forums  General SQL Server Forums  Script Library  Levenshtein Edit Distance Algorithm Reply to Topic  Printer Friendly
Author  Topic
Page: of 2

Arnold Fribble
Yak-finder General

United Kingdom
1961 Posts

 Posted - 06/24/2005 :  09:52:40 See here www.merriampark.com/ld.htm for information about the algorithm. This page has a link (http://www.merriampark.com/ldtsql.htm) to a T-SQL implementation by Joseph Gama: unfortunately, that function doesn't work. There is a debugged version in the also-referenced package of TSQL functions (http://www.planet-source-code.com/vb/scripts/ShowCode.asp?txtCodeId=502&lngWId=5), but this still has the fundamental problem that it only works on pairs of strings up to 49 characters.``` CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999)) RETURNS int AS BEGIN DECLARE @s1_len int, @s2_len int, @i int, @j int, @s1_char nchar, @c int, @c_temp int, @cv0 varbinary(8000), @cv1 varbinary(8000) SELECT @s1_len = LEN(@s1), @s2_len = LEN(@s2), @cv1 = 0x0000, @j = 1, @i = 1, @c = 0 WHILE @j <= @s2_len SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1 WHILE @i <= @s1_len BEGIN SELECT @s1_char = SUBSTRING(@s1, @i, 1), @c = @i, @cv0 = CAST(@i AS binary(2)), @j = 1 WHILE @j <= @s2_len BEGIN SET @c = @c + 1 SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) + CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END IF @c > @c_temp SET @c = @c_temp SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1 IF @c > @c_temp SET @c = @c_temp SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1 END SELECT @cv1 = @cv0, @i = @i + 1 END RETURN @c END ``` Edited by - Arnold Fribble on 06/24/2005 09:56:42

Seventhnight
Flowing Fount of Yak Knowledge

USA
2878 Posts

 Posted - 06/24/2005 :  11:13:52 speed :Select admin.dbo.edit_distance(replicate('abcdefghij',399),replicate('abcdefghij',399)) took 1:19 to run and return 0Select admin.dbo.edit_distance(replicate('abcdefghij',399),replicate('zabcdefghi',399))took 1:14 to run and return 400 (correct)CoreyCo-worker on The Wizard of Oz "...those three midgets that came out and danced, the freaked me out when I was little. But they are ok now." Edited by - Seventhnight on 06/24/2005 11:14:14

Arnold Fribble
Yak-finder General

United Kingdom
1961 Posts

 Posted - 06/24/2005 :  11:51:42 I didn't say it was fast for big strings, just that it worked! Actually, the reason I posted it this week was that the old version used a temporary table (i,j,cost) and that was ludicrously slow even on short strings, so I rewrote it.Any ideas on how to make it faster, though, would be handy. Though I don't think any simple changes are going to make it asymptotically better than O(m*n)

Seventhnight
Flowing Fount of Yak Knowledge

USA
2878 Posts

 Posted - 06/24/2005 :  11:53:03 i'm thinking CoreyCo-worker on The Wizard of Oz "...those three midgets that came out and danced, the freaked me out when I was little. But they are ok now."

Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)

USA
7020 Posts

 Posted - 06/24/2005 :  12:09:41 Anyone have an example of an application you would use this for?CODO ERGO SUM

Arnold Fribble
Yak-finder General

United Kingdom
1961 Posts

 Posted - 06/24/2005 :  12:17:24 In my case, I'd been supplied with a table of 14500 GPs (primary care doctors). About 1200 of those, the name in that table didn't match the name in the national table (about 45000 rows) (joining on national GP id). Sorting the mismatches by edit distance I could quickly see what sort of differences there were -- most of them were one or two character differences. Edited by - Arnold Fribble on 06/24/2005 12:25:04

SamC
White Water Yakist

USA
3467 Posts

 Posted - 06/24/2005 :  14:07:22 quote:Originally posted by Michael Valentine JonesAnyone have an example of an application you would use this for?You may have already seen the reader challenge I posted using this function.I'm not finding many takers...

TG
Flowing Fount of Yak Knowledge

USA
6065 Posts

 Posted - 06/24/2005 :  14:57:42 >>I'm not finding many takers...My big sister used to give me similar "challenges". ie: I bet you can't wash all those dishes before I'm done watching this TV show.Be One with the OptimizerTG

spirit1
Cybernetic Yak Master

Slovenia
11752 Posts

 Posted - 06/24/2005 :  18:09:01 well we use it for our CRM...when adding a new organization or person you can search using distance check or not.although it's implemented in asp not sql.Go with the flow & have fun! Else fight the flow

Kristen
Test

United Kingdom
22859 Posts

 Posted - 06/24/2005 :  23:53:39 "I bet you can't wash all those dishes before I'm done watching this TV show"Shucks, I wish I'd had a big sister ... Kristen

elwoos
Flowing Fount of Yak Knowledge

United Kingdom
2052 Posts

 Posted - 06/27/2005 :  04:16:44 Arnold, you could be a life saver. I am going to have to do something similar to you sooncheerssteve (Possibly only the second person on this board that can pronounce Machynlleth)

SamC
White Water Yakist

USA
3467 Posts

 Posted - 06/27/2005 :  09:54:36 quote:Originally posted by TGMy big sister used to give me similar "challenges". ie: I bet you can't wash all those dishes before I'm done watching this TV show.Hey! I resemble that remark!... I mentioned in my post that it seemed better to correct the mistakes rather than pursue approximations (in our case anyway).

Arnold Fribble
Yak-finder General

United Kingdom
1961 Posts

 Posted - 06/27/2005 :  10:26:09 quote:Originally posted by elwoosArnold, you could be a life saver. I am going to have to do something similar to you soonHave fun! Hope it's easier than the time I tried to write something to separate the initials and surname in the "Organisation Name" column of EGPCUR! Love to know what idiot at NACS thought it would be a good idea to give the tables for GPs, practices, PCTs and health authorities the same schema. And don't get me started on GPs who work in multiple practices... Edited by - Arnold Fribble on 06/27/2005 10:27:17

SamC
White Water Yakist

USA
3467 Posts

 Posted - 06/27/2005 :  11:03:11 Couldn't help notice you use SELECT instead of SET. BOL makes a issue about there being some benefit using SET rather than SELECT for non-table based (scalar?) calculations.Anyone know why? I'm guessing it's a compile time optimization with no runtime benefit...Sam

Arnold Fribble
Yak-finder General

United Kingdom
1961 Posts

 Posted - 06/27/2005 :  12:36:28 quote:Originally posted by SamCCouldn't help notice you use SELECT instead of SET. BOL makes a issue about there being some benefit using SET rather than SELECT for non-table based (scalar?) calculations.Does it? Oh, I assumed it could spot that there was no actual selecting involved -- after all, if there were, it wouldn't allow the function to be defined! -- and that it would get interpreted the same way as SET. It's quite possible it does make a difference, though. I'll give it a go. I only used SELECT because I had a load of variables that needed setting and it was getting visually noisy having all those SETs.

elwoos
Flowing Fount of Yak Knowledge

United Kingdom
2052 Posts

 Posted - 06/28/2005 :  03:39:08 Arnold, you mean we can't just ignore them - maybe that's where I've been going wrong . Anyway the GP's are the 'good' guys. What about the consultants! Damn GMC needs to get its act togetherThe functions I wrote http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=50044 (albeit with errors!) were partially for handling this little lotAlright Brain, you don't like me, and I don't like you. But lets just do this, and I can get back to killing you with beer.

rrb
SQLTeam Poet Laureate

Australia
1479 Posts

 Posted - 06/28/2005 :  03:57:23 Mr Fribble, if I haven't already said it a million times... You ROCK!That's extra cool - more cool than just normally cool!Thanks--I hope that when I die someone will say of me "That guy sure owed me a lot of money"

Arnold Fribble
Yak-finder General

United Kingdom
1961 Posts

 Posted - 06/28/2005 :  03:57:42 quote:Anyway the GP's are the 'good' guys.Oh, sorry, it wasn't the GPs themselves I was complaining about, just NACS mishandling of the database where a GP works in multiple practices (or at least, multiple entities represented in the EPRACCUR table). One of my colleagues asked NACS about this and they said that they create a new national GP code (starting with G6 or something) for the GP working in the 'other' practice.Certainly if you look at something like this:``` SELECT A.* FROM EGPCUR AS A INNER JOIN ( SELECT "Organisation Name", LEFT(Postcode, 2) AS pc FROM EGPCUR GROUP BY "Organisation Name", LEFT(Postcode, 2) HAVING COUNT(*) > 1 ) AS B ON A."Organisation Name" = B."Organisation Name" AND LEFT(A.Postcode, 2) = pc ORDER BY A."Organisation Name", A.Postcode ```you see (apart from many rows in EGPCUR that don't represent a single person!) quite a number of occurrences of this sort of thing. Edited by - Arnold Fribble on 06/28/2005 04:26:36

SamC
White Water Yakist

USA
3467 Posts

 Posted - 06/28/2005 :  07:17:30 Is it a consensus then that if 3 fields exists: Firstname, Middlename, Lastname, that the a distance on (Firstname + ' ' + Middlename + ' ' + Lastname) will be as accurate (and faster) than 3 separate distance calculations on the individual columns?

Arnold Fribble
Yak-finder General

United Kingdom
1961 Posts

 Posted - 06/28/2005 :  07:56:26 quote:Originally posted by SamCIs it a consensus then that if 3 fields exists: Firstname, Middlename, Lastname, that the a distance on (Firstname + ' ' + Middlename + ' ' + Lastname) will be as accurate (and faster) than 3 separate distance calculations on the individual columns?If it is faster to combine the columns, then it's only because of the call overhead. The function takes O(nm) time (where n and m are the lengths of the strings)*, so you're comparing m1n1 + m2n2 + m3n3 to (m1+m2+m3)(n1+n2+n3)But "as accurate" depends on the data, and what you're trying to compare.* Probably worse in this implementation, since it's reassigning the costs in the cost vectors.

SamC
White Water Yakist

USA
3467 Posts

 Posted - 06/28/2005 :  08:24:39 I could experiment, but let me ask anyway. Suppose your data largely had errors in the middlename being present (or not), and the Firstname, Lastname were (usually) a match.Would accuracy suffer in the comparison of (m1+m2+m3) versus comparing m1, m2 and m3 individually?The query will be much simpler in the former, but it seems to me it would be no less accurate than the latter.
Page: of 2  Topic
 Reply to Topic  Printer Friendly Jump To: Select Forum General SQL Server Forums       New to SQL Server Programming       New to SQL Server Administration       Script Library       Data Corruption Issues       Database Design and Application Architecture SQL Server 2012 Forums       Transact-SQL (2012)       SQL Server Administration (2012)       SSIS and Import/Export (2012)       Analysis Server and Reporting Services (2012)       Replication (2012)       Availability Groups and DR (2012)       Other SQL Server 2012 Topics SQL Server 2008 Forums       Transact-SQL (2008)       SQL Server Administration (2008)       SSIS and Import/Export (2008)       High Availability (2008)       Replication (2008)       Analysis Server and Reporting Services (2008)       Other SQL Server 2008 Topics SQL Server 2005 Forums       Transact-SQL (2005)       SQL Server Administration (2005)       .NET Inside SQL Server (2005)       SSIS and Import/Export (2005)       Service Broker (2005)       Replication (2005)       High Availability (2005)       Analysis Server and Reporting Services (2005)       Express Edition and Compact Edition (2005)       Other SQL Server Topics (2005) SQL Server 2000 Forums       SQL Server Development (2000)       SQL Server Administration (2000)       Import/Export (DTS) and Replication (2000)       Transact-SQL (2000)       Analysis Services (2000)       MSDE (2000) Development Tools       ASP.NET       Reporting Services Development       Other Development Tools Site Related Forums       Site Related Discussions       Article Discussion       Poll Discussion       The Yak Corral Other Forums       SQL Server 6.5 \ SQL Server 7.0       Other Topics       MS Access       ClearTrace Support Forum Old Forums       CLOSED - General SQL Server       CLOSED - SQL Server 2005/Yukon  -------------------- Home Active Topics Frequently Asked Questions Member Information Search Page
 SQL Server Forums © 2000-2009 SQLTeam Publishing, LLC