| Author |
Topic  |
|
blindman
Flowing Fount of Yak Knowledge
USA
2365 Posts |
Posted - 04/08/2008 : 09:36:39
|
Since the solution I provided actually uses scoring rules similar to what marat so belatedly suggests, there will be no need for them to stumble upon anything. Marat should read through the entire thread before posting, or find a good hobby to better occupy his time.
e4 d5 xd5 Nf6 |
 |
|
|
marat
Yak Posting Veteran
Australia
85 Posts |
Posted - 04/08/2008 : 20:18:33
|
Sorry, you are right, i didn't went through all postings, but I did it now, or almost did. I looked at your fuzzy match algorithm and noticed that in your algorithm you do not checking match on first and the last characters, as a result "select dbo.CompareText('Blak', 'Black')" returns only 57% matching ratio which I think isn't right. Also I want you to take into consideration that I was talking about matching in general, rather fuzzy string matching. Unfortunately this topic will not fit into one BLOG, there are more questions than answers. By the way, Address and Name matching was my professional hobby for the past 6 years, but not anymore. Enjoy string matching as I did. Best Regards marat
|
 |
|
|
blindman
Flowing Fount of Yak Knowledge
USA
2365 Posts |
Posted - 04/09/2008 : 00:33:44
|
My algorithm does not look at individual characters, because that level of granularity would be practically useless in comparing larger strings. Instead, it looks at letter pairs. In the example you gave, "Black" and "Blak" together have seven pairs of consecutive letters, of which four match. Four divided by seven is 0.5714..., or 57%.
e4 d5 xd5 Nf6 |
 |
|
|
marat
Yak Posting Veteran
Australia
85 Posts |
Posted - 04/09/2008 : 01:06:09
|
"My algorithm does not look at individual characters...." Actually, I wasn't suggesting to use individual characters; I suggest, if you add a space in the begining and at the end of the matching strings(after shrinking matching strings) - vu a la, you will get match on the first and last characters working and as a result more precise(increased) matching ratio. Of course, you have to change matching ratio calculation a little bit. You got the point, I believe. |
 |
|
|
blindman
Flowing Fount of Yak Knowledge
USA
2365 Posts |
Posted - 04/09/2008 : 10:47:27
|
Good suggestion:
select dbo.comparetext('John Smith', 'Smith, John')
-----------
73
select dbo.comparetext(' John Smith ', ' Smith, John ')
-----------
90 But...
select dbo.comparetext(' vu a la ', ' Voila ')
-----------
41 ...I think you can do better. Don't you? ;)
e4 d5 xd5 Nf6 |
Edited by - blindman on 04/09/2008 10:48:34 |
 |
|
|
marat
Yak Posting Veteran
Australia
85 Posts |
Posted - 04/10/2008 : 21:06:56
|
You are right, we can do better. I made a few chandges to your program. I think in C# it would be even better(compact and fast), but I left it for C# guys.
--if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[CompareText]')
--and xtype in (N'FN', N'IF', N'TF')) --drop function [dbo].[CompareText] --GO -- --create function CompareText (@String1 varchar (100), @String2 varchar (100)) --returns integer --Function CompareText -- This function accepts two string values and returns an integer value between --zero and one hundred indicating the similarity between the two string. This --algorithm was developed specifically to search for similar names, spelling --variations, and typos in proper names and business names. For this purpose, --it is superior to SOUNDEX, which searches only for similar sounding words. -- Proper name pairs which yield a CompareText value above 80 are very likely to --represent the same person. Pair values greater than 60 should be reviewed --manually to confirm the match. For greater accuracy and efficiency, run the names --through the MatchText function to remove spaces and vowels before submitting them --to comparetext. -- For efficiency in comparing two large lists of names, it is best to join --the sets on another column as well, such as zip code, or city name.
--Usage: select dbo.CompareText('Alan Smith', 'Smith, Alan J.')
--blindman 4/2005 --Adapted from MS Access algorithm developed 1997 --marat 4/2008 adapted from blindman --included first and last letters match; changed scoring logics and used only one run through --for the best results all not alphanumeric characters(or at least, spaces) should be removed before matching starts --**************** DECLARE @String1 varchar (100), @String2 varchar (100) --put you strings here set @String1 = 'Alk' set @String2 = 'Ak'
-- removing not alphanumeric characters should be here --- declare @Possibles integer declare @Hits integer declare @Counter integer declare @Pos integer
set @Possibles = case when len(@String1) >= len(@String2) then len(@String1) else len(@String2) end if @String1 = @String2 --this exact match check was introduced due to getting extra points for the begin --first and last char matching to prevent matching over 100% select 100 goto finish end --print @Possibles set @Hits = 0 set @String1 = '?' + @String1 + '?' set @String2 = '?' + @String2 + '?'
set @Counter = 0 while @Counter < len(@String1) begin --print @Counter set @Pos = charindex(substring(@String1, @Counter, 2), @String2) if @Pos > 0 begin set @Hits = @Hits + 1 --print @Hits --print @Pos select @String2 = stuff(@String2, @Pos , 1, '!') --this was done to eliminate double-hit --print @String2 end set @Counter = @Counter + 1 end /* this was made redundant by marat set @Counter = len(@String2)-1 while @Counter > 0 begin if charindex(substring(@String2, @Counter, 2), @String1) > 0 set @Hits =
@Hits + 1 set @Counter = @Counter - 1 end */ select (100*@Hits)/@Possibles finish: print 'end' --return (100*@Hits)/@Possibles --end
|
 |
|
|
blindman
Flowing Fount of Yak Knowledge
USA
2365 Posts |
Posted - 04/11/2008 : 10:48:59
|
Be careful about using only one "run through". I used two passes, first comparing A to B and then comparing B to A, because otherwise I could get different scores depending upon which string was submitted as which parameter. The scoring should always be the same regardless of the order in which the strings are compared.
Be careful about eliminating numeric characters, as that would make the algorithm useless for matching things like addresses. Take a look at the MATCHTEXT function which removes non-alphanumeric characters. I have often stored this value as a separate column in my tables in applications where frequent name searching was required.
e4 d5 xd5 Nf6 |
 |
|
|
blindman
Flowing Fount of Yak Knowledge
USA
2365 Posts |
Posted - 04/11/2008 : 10:51:02
|
Also, since you say you have in interest in this stuff, I played around with a fractal algorithm for matching names. It was a cool idea, but ultimately in testing it proved less efficient and less accurate than this simple "count the matching pairs" test. What other algorithms have you tried?
e4 d5 xd5 Nf6 |
 |
|
|
marat
Yak Posting Veteran
Australia
85 Posts |
Posted - 04/11/2008 : 19:26:40
|
there is a bug in the cscript I send set @Counter = 0 while @Counter < len(@String1) should be set @Counter = 1 while @Counter < len(@String1)
But this doesn't really matter. You got the point. I wasn't suggesting eliminate numeric characters. I was using matchkeys as well("I have often stored this value as a separate column..") ; this is very convinient and they reflect at some level your matching rules; you can change rules and nothing in the record will change just a matching key. I was successfully using matching pairs algorithm for years. Want to point that for matching addresses and names I was using different approach. We usually divide names and addresses into a different components. Each component has a matching score weight. As an example matching on Surname is better than on First Name and Exact match is better than fuzzy and so on... This is why I was talking about scoring engine which actually impliments business rules on name and address match. In my engine matching was a choice of the row with the highest score. So far it was a booletproof approach for me and produced exceptional results; but as I said I don't do it anymore; this is just a little bit of nostalgie for me. Cheers. |
 |
|
|
marat
Yak Posting Veteran
Australia
85 Posts |
Posted - 04/16/2008 : 20:16:38
|
Sorry for the bugs in the script I posted earlier. Below is a new debugged script. I decided to make matching on first and last characters optional; so user can choose himself. Also with this algorithm position of the matching pair doesn't matter. Next time I will post string matching function where position does matter and which I actually was using myself. if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[fn_CompareString]')
and xtype in (N'FN', N'IF', N'TF')) drop function [dbo].[fn_CompareString] GO -- create function fn_CompareString (@String1 varchar (100), @String2 varchar (100), @FirstIn tinyint = 0, @LastIn tinyint = 0) returns integer --Function CompareString --This function accepts two string values and returns an integer value between --zero and one hundred indicating the similarity between the two string. This --algorithm was developed specifically to search for similar names, spelling --variations, and typos in proper names and business names. For this purpose, --it is superior to SOUNDEX, which searches only for similar sounding words. --SOUNDEX fails when dealing with not Latin names(asian or arabic names as an example) or with it's interpretations --Usage: select dbo.CompareString('Alan Smith', 'Smith, Alan J.', 0, 0) --Input Parameters --@String1 first string --@String2 second string --@FirstIn option to include comparison on the first character --@LastIn option to include comparison on the last character -- --marat 4/2008 adapted from blindman's CompareText function --blindman 4/2005 Adapted from MS Access algorithm developed in 1997 --marat 4/2008 --included first and last letters match; changed scoring logics and used only one run through --and eliminated double hit possibility --marat 16/04/2008 include first and last letters matching option as a parameters, --so now you can chose to use them or not, and did debugging
--**************** begin
declare @Target integer declare @Hits integer declare @Pos integer declare @Index integer declare @String2x varchar (201)
set @String2x = @String2
-- removing not alphanumeric characters should be here if you decide to -- I am removing just my working characters '?' and Space set @String1 = replace(@String1, ' ','') set @String1 = replace(@String1, '?','') set @String2x = replace(@String2x, ' ','') set @String2x = replace(@String2x, '?','') -- set @Hits = 0 set @String1 = case @FirstIn when 1 then '?' else '' end + @String1 + case @LastIn when 1 then '?' else '' end set @String2x = case @FirstIn when 1 then '?' else '' end + @String2x + case @LastIn when 1 then '?' else '' end
set @Target = case when len(@String1) >= len(@String2x) then len(@String1) -1 else len(@String2x) -1 end set @Pos = 1 while @Pos < len(@String1) begin set @Index = charindex(substring(@String1, @Pos, 2), @String2x) if @Index > 0 begin set @Hits = @Hits + 1 select @String2x = left(@String2x,@Index) + ' ' + right(@String2x, len(@String2x) - @Index) --this is to eliminate multi-hit end set @Pos = @Pos + 1 end
finish: return (100*@Hits)/@Target end
|
 |
|
|
blindman
Flowing Fount of Yak Knowledge
USA
2365 Posts |
Posted - 04/17/2008 : 09:01:08
|
Marat, I liked your suggestion about adding a space to the front of each string to include the first letter in the matching algorithm. I tested it extensively and found that it made a slight but consistent improvement.
The rest of your changes...eh, not so much.
Your code seems much more complicated than mine, which is a simple implementation of a simple algorithm. Plus, I have issues with stripping characters from the string within the function (spaces and question marks). This makes the function much less generic, in that it is now unable to match those characters even if the user wants them included. Any reduction of the input strings should be done before the parameters are submitted to the function, otherwise its scope is restricted.
Go ahead and have fun modifying the code and adapting it to your particular application requirements, but as far as putting this forth as a generic pattern matching utility I will be sticking with the code I already have posted.
e4 d5 xd5 Nf6 |
 |
|
|
marat
Yak Posting Veteran
Australia
85 Posts |
Posted - 04/21/2008 : 09:23:53
|
As I promised, below is a string comparison function I developed and was using myself for years. I put matching on the first and last characters as a parameters to let to do comparison test if anybody is interested. I was using comparison on the first and last characters as a compalsory option(hardcoded). I believe this function is generic enough :-) ---
if exists (select 1 from dbo.sysobjects where id = object_id(N'[dbo].[fn_CompareThePair]')) drop function [dbo].[fn_CompareThePair] GO -- create function fn_CompareThePair (@String1 varchar (100), @String2 varchar (100), @FirstIn bit = 1, @LastIn bit = 1) returns integer /************************************************************************************************* Function CompareThePair This function accepts two string values and returns an integer value between zero and one hundred indicating the similarity between the two string. This algorithm was developed specifically to search for the spelling variations of the names and addresses. For this purpose, it is superior to SOUNDEX, which searches only for similar sounding words. SOUNDEX fails when dealing with not Latin names(asian names as an example) or with their interpretations. Usage: select dbo.CompareThePair('Smith', 'Shmidth', 1, 1) Input Parameters @String1 first string @String2 second string @FirstIn option to include comparison on the first character @LastIn option to include comparison on the last character --History * Developed by marat 21/04/2008 based on the string matching algorithm used for address and name matching. ****************************************************************************************************/ begin declare @Target integer declare @Hits integer declare @ExHits integer declare @Pos integer declare @Index integer declare @Cursor integer declare @Keep integer -- set @Hits = 0 set @ExHits = 0 -- if @FirstIn > 0 begin if left(@string1,1) = left(@string2,1) set @ExHits = @ExHits + 1 end if @LastIn > 0 begin if right(@string1,1) = right(@string2,1) set @ExHits = @ExHits + 1 end
set @Target = case when len(@String1) >= len(@String2) then len(@String1) -1 + @FirstIn + @LastIn else len(@String2) -1 + @FirstIn + @LastIn end --Run I set @Pos = 1 set @Cursor = 1 while @Pos < len(@String1) begin set @Index = charindex(substring(@String1, @Pos, 2), @String2, @Cursor) if @Index > 0 begin set @Cursor = @Index + 1 set @Hits = @Hits + 1 end set @Pos = @Pos + 1 end
set @Keep = @Hits --keep first result --Run II set @Pos = 1 set @Cursor = 1 set @Hits = 0 while @Pos < len(@String2) begin set @Index = charindex(substring(@String2, @Pos, 2), @String1, @Cursor) if @Index > 0 begin set @Cursor = @Index + 1 set @Hits = @Hits + 1 end set @Pos = @Pos + 1 end
finish: if @Keep > @Hits set @Hits = @Keep return 100*(@Hits + @ExHits)/@Target end |
 |
|
|
blindman
Flowing Fount of Yak Knowledge
USA
2365 Posts |
Posted - 04/21/2008 : 12:22:37
|
The function you developed years ago has the same comment text as mine, uses the same variable names as mine, the same algorithm as mine, and is dated "21/04/2008". The most significant difference I can see is that in this version of my function you are no longer crediting me for the code you copied. If you tried that in a college course you would be flunked, if not expelled.
quote: Originally posted by marat
As I promised, below is a string comparison function I developed and was using myself for years. I put matching on the first and last characters as a parameters to let to do comparison test if anybody is interested. I was using comparison on the first and last characters as a compalsory option(hardcoded). I believe this function is generic enough :-) ---
if exists (select 1 from dbo.sysobjects where id = object_id(N'[dbo].[fn_CompareThePair]')) drop function [dbo].[fn_CompareThePair] GO -- create function fn_CompareThePair (@String1 varchar (100), @String2 varchar (100), @FirstIn bit = 1, @LastIn bit = 1) returns integer /************************************************************************************************* Function CompareThePair This function accepts two string values and returns an integer value between zero and one hundred indicating the similarity between the two string. This algorithm was developed specifically to search for the spelling variations of the names and addresses. For this purpose, it is superior to SOUNDEX, which searches only for similar sounding words. SOUNDEX fails when dealing with not Latin names(asian names as an example) or with their interpretations. Usage: select dbo.CompareThePair('Smith', 'Shmidth', 1, 1) Input Parameters @String1 first string @String2 second string @FirstIn option to include comparison on the first character @LastIn option to include comparison on the last character --History * Developed by marat 21/04/2008 based on the string matching algorithm used for address and name matching. ****************************************************************************************************/ begin declare @Target integer declare @Hits integer declare @ExHits integer declare @Pos integer declare @Index integer declare @Cursor integer declare @Keep integer -- set @Hits = 0 set @ExHits = 0 -- if @FirstIn > 0 begin if left(@string1,1) = left(@string2,1) set @ExHits = @ExHits + 1 end if @LastIn > 0 begin if right(@string1,1) = right(@string2,1) set @ExHits = @ExHits + 1 end
set @Target = case when len(@String1) >= len(@String2) then len(@String1) -1 + @FirstIn + @LastIn else len(@String2) -1 + @FirstIn + @LastIn end --Run I set @Pos = 1 set @Cursor = 1 while @Pos < len(@String1) begin set @Index = charindex(substring(@String1, @Pos, 2), @String2, @Cursor) if @Index > 0 begin set @Cursor = @Index + 1 set @Hits = @Hits + 1 end set @Pos = @Pos + 1 end
set @Keep = @Hits --keep first result --Run II set @Pos = 1 set @Cursor = 1 set @Hits = 0 while @Pos < len(@String2) begin set @Index = charindex(substring(@String2, @Pos, 2), @String1, @Cursor) if @Index > 0 begin set @Cursor = @Index + 1 set @Hits = @Hits + 1 end set @Pos = @Pos + 1 end
finish: if @Keep > @Hits set @Hits = @Keep return 100*(@Hits + @ExHits)/@Target end
e4 d5 xd5 Nf6 |
 |
|
|
marat
Yak Posting Veteran
Australia
85 Posts |
Posted - 04/21/2008 : 21:49:39
|
Sorry, blindman if I hurt your feelings. I wasn't thinking that naming conventions actually will make any difference to the algorithm itself, which I developed years ago using PROGRESS and later adapted it in VB6 and it also was adapted in C++ by another developer(I didn't tell him which naming conventions to use) In my algorithm, if you noticed, the core features are: using pair match(the same as yours), reverse matching(same as yours but with different scoring),matching first and last letters, and using forward only cursor to prevent scoring on double hit, which I think is important. None of the last 2 features aren't presented in your function and the second feature isn't quite as yours. Match on the first and the last characters was used to better handle missing second and second last characters. Not seeing the difference, means actually not understanding how it all works. But after all I want to thank you for being a good opponent to me. Actually your comments about using space in the begining and at the end of the string made me a change to the function - not the algorithm. I was using spaces in front and at the end of the strings because my matchkeys didn't have spaces and it was absolutely valid in my case. So please tell me which part of your function I copied.
|
 |
|
Topic  |
|
|
|