SQL Server Forums
Profile | Register | Active Topics | Members | Search | Forum FAQ
 
Register Now and get your question answered!
Username:
Password:
Save Password
Forgot your Password?

 All Forums
 General SQL Server Forums
 New to SQL Server Programming
 Extact Match
 New Topic  Reply to Topic
 Printer Friendly
Previous Page
Author Previous Topic Topic Next Topic
Page: of 2

blindman
Flowing Fount of Yak Knowledge

USA
2365 Posts

Posted - 04/08/2008 :  09:36:39  Show Profile  Reply with Quote
Since the solution I provided actually uses scoring rules similar to what marat so belatedly suggests, there will be no need for them to stumble upon anything. Marat should read through the entire thread before posting, or find a good hobby to better occupy his time.

e4 d5 xd5 Nf6
Go to Top of Page

marat
Yak Posting Veteran

Australia
85 Posts

Posted - 04/08/2008 :  20:18:33  Show Profile  Reply with Quote
Sorry,
you are right, i didn't went through all postings, but I did it now, or almost did.
I looked at your fuzzy match algorithm and noticed that in your algorithm you do not checking match on first and the last characters,
as a result "select dbo.CompareText('Blak', 'Black')" returns only 57% matching ratio which I think isn't right.
Also I want you to take into consideration that I was talking about matching in general, rather fuzzy string matching.
Unfortunately this topic will not fit into one BLOG, there are more questions than answers.
By the way, Address and Name matching was my professional hobby for the past 6 years, but not anymore.
Enjoy string matching as I did.
Best Regards
marat
Go to Top of Page

blindman
Flowing Fount of Yak Knowledge

USA
2365 Posts

Posted - 04/09/2008 :  00:33:44  Show Profile  Reply with Quote
My algorithm does not look at individual characters, because that level of granularity would be practically useless in comparing larger strings. Instead, it looks at letter pairs. In the example you gave, "Black" and "Blak" together have seven pairs of consecutive letters, of which four match. Four divided by seven is 0.5714..., or 57%.

e4 d5 xd5 Nf6
Go to Top of Page

marat
Yak Posting Veteran

Australia
85 Posts

Posted - 04/09/2008 :  01:06:09  Show Profile  Reply with Quote
"My algorithm does not look at individual characters...."
Actually, I wasn't suggesting to use individual characters;
I suggest, if you add a space in the begining and at the end of the matching strings(after shrinking matching strings) - vu a la, you will get match on the first and last characters working and as a result more precise(increased) matching ratio.
Of course, you have to change matching ratio calculation a little bit.
You got the point, I believe.
Go to Top of Page

blindman
Flowing Fount of Yak Knowledge

USA
2365 Posts

Posted - 04/09/2008 :  10:47:27  Show Profile  Reply with Quote
Good suggestion:
select dbo.comparetext('John Smith', 'Smith, John')
----------- 
73

select dbo.comparetext(' John Smith ', ' Smith, John ')
----------- 
90

But...
select dbo.comparetext(' vu a la ', ' Voila ')
----------- 
41

...I think you can do better. Don't you? ;)

e4 d5 xd5 Nf6

Edited by - blindman on 04/09/2008 10:48:34
Go to Top of Page

marat
Yak Posting Veteran

Australia
85 Posts

Posted - 04/10/2008 :  21:06:56  Show Profile  Reply with Quote
You are right, we can do better.
I made a few chandges to your program.
I think in C# it would be even better(compact and fast), but I left it for C# guys.

--if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[CompareText]')

--and xtype in (N'FN', N'IF', N'TF'))
--drop function [dbo].[CompareText]
--GO
--
--create function CompareText (@String1 varchar (100), @String2 varchar (100))
--returns integer
--Function CompareText
-- This function accepts two string values and returns an integer value between
--zero and one hundred indicating the similarity between the two string. This
--algorithm was developed specifically to search for similar names, spelling
--variations, and typos in proper names and business names. For this purpose,
--it is superior to SOUNDEX, which searches only for similar sounding words.
-- Proper name pairs which yield a CompareText value above 80 are very likely to
--represent the same person. Pair values greater than 60 should be reviewed
--manually to confirm the match. For greater accuracy and efficiency, run the names
--through the MatchText function to remove spaces and vowels before submitting them
--to comparetext.
-- For efficiency in comparing two large lists of names, it is best to join
--the sets on another column as well, such as zip code, or city name.

--Usage: select dbo.CompareText('Alan Smith', 'Smith, Alan J.')

--blindman 4/2005
--Adapted from MS Access algorithm developed 1997
--marat 4/2008 adapted from blindman
--included first and last letters match; changed scoring logics and used only one run through
--for the best results all not alphanumeric characters(or at least, spaces) should be removed before matching starts
--****************
DECLARE @String1 varchar (100),
@String2 varchar (100)
--put you strings here
set @String1 = 'Alk'
set @String2 = 'Ak'

-- removing not alphanumeric characters should be here
---
declare @Possibles integer
declare @Hits integer
declare @Counter integer
declare @Pos integer

set @Possibles = case when len(@String1) >= len(@String2) then len(@String1) else len(@String2) end
if @String1 = @String2 --this exact match check was introduced due to getting extra points for the
begin --first and last char matching to prevent matching over 100%
select 100
goto finish
end
--print @Possibles
set @Hits = 0
set @String1 = '?' + @String1 + '?'
set @String2 = '?' + @String2 + '?'

set @Counter = 0
while @Counter < len(@String1)
begin
--print @Counter
set @Pos = charindex(substring(@String1, @Counter, 2), @String2)
if @Pos > 0
begin
set @Hits = @Hits + 1
--print @Hits
--print @Pos
select @String2 = stuff(@String2, @Pos , 1, '!') --this was done to eliminate double-hit
--print @String2
end
set @Counter = @Counter + 1
end
/* this was made redundant by marat
set @Counter = len(@String2)-1
while @Counter > 0
begin
if charindex(substring(@String2, @Counter, 2), @String1) > 0 set @Hits =

@Hits + 1
set @Counter = @Counter - 1
end
*/
select (100*@Hits)/@Possibles
finish:
print 'end'
--return (100*@Hits)/@Possibles
--end

Go to Top of Page

blindman
Flowing Fount of Yak Knowledge

USA
2365 Posts

Posted - 04/11/2008 :  10:48:59  Show Profile  Reply with Quote
Be careful about using only one "run through". I used two passes, first comparing A to B and then comparing B to A, because otherwise I could get different scores depending upon which string was submitted as which parameter. The scoring should always be the same regardless of the order in which the strings are compared.

Be careful about eliminating numeric characters, as that would make the algorithm useless for matching things like addresses. Take a look at the MATCHTEXT function which removes non-alphanumeric characters. I have often stored this value as a separate column in my tables in applications where frequent name searching was required.

e4 d5 xd5 Nf6
Go to Top of Page

blindman
Flowing Fount of Yak Knowledge

USA
2365 Posts

Posted - 04/11/2008 :  10:51:02  Show Profile  Reply with Quote
Also, since you say you have in interest in this stuff, I played around with a fractal algorithm for matching names. It was a cool idea, but ultimately in testing it proved less efficient and less accurate than this simple "count the matching pairs" test. What other algorithms have you tried?

e4 d5 xd5 Nf6
Go to Top of Page

marat
Yak Posting Veteran

Australia
85 Posts

Posted - 04/11/2008 :  19:26:40  Show Profile  Reply with Quote
there is a bug in the cscript I send
set @Counter = 0
while @Counter < len(@String1)
should be
set @Counter = 1
while @Counter < len(@String1)

But this doesn't really matter. You got the point.
I wasn't suggesting eliminate numeric characters.
I was using matchkeys as well("I have often stored this value as a separate column..") ; this is very convinient and they reflect at some level your matching rules; you can change rules and nothing in the record will change just a matching key.
I was successfully using matching pairs algorithm for years.
Want to point that for matching addresses and names I was using different approach.
We usually divide names and addresses into a different components. Each component has a matching score weight.
As an example matching on Surname is better than on First Name and Exact match is better than fuzzy and so on...
This is why I was talking about scoring engine which actually impliments business rules on name and address match.
In my engine matching was a choice of the row with the highest score.
So far it was a booletproof approach for me and produced exceptional results; but as I said I don't do it anymore;
this is just a little bit of nostalgie for me.
Cheers.
Go to Top of Page

marat
Yak Posting Veteran

Australia
85 Posts

Posted - 04/16/2008 :  20:16:38  Show Profile  Reply with Quote
Sorry for the bugs in the script I posted earlier.
Below is a new debugged script. I decided to make matching on first and last characters optional; so user can choose himself.
Also with this algorithm position of the matching pair doesn't matter.
Next time I will post string matching function where position does matter and which I actually was using myself.
if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[fn_CompareString]')

and xtype in (N'FN', N'IF', N'TF'))
drop function [dbo].[fn_CompareString]
GO
--
create function fn_CompareString (@String1 varchar (100), @String2 varchar (100), @FirstIn tinyint = 0, @LastIn tinyint = 0)
returns integer
--Function CompareString
--This function accepts two string values and returns an integer value between
--zero and one hundred indicating the similarity between the two string. This
--algorithm was developed specifically to search for similar names, spelling
--variations, and typos in proper names and business names. For this purpose,
--it is superior to SOUNDEX, which searches only for similar sounding words.
--SOUNDEX fails when dealing with not Latin names(asian or arabic names as an example) or with it's interpretations
--Usage: select dbo.CompareString('Alan Smith', 'Smith, Alan J.', 0, 0)
--Input Parameters
--@String1 first string
--@String2 second string
--@FirstIn option to include comparison on the first character
--@LastIn option to include comparison on the last character
--
--marat 4/2008 adapted from blindman's CompareText function
--blindman 4/2005 Adapted from MS Access algorithm developed in 1997
--marat 4/2008
--included first and last letters match; changed scoring logics and used only one run through
--and eliminated double hit possibility
--marat 16/04/2008 include first and last letters matching option as a parameters,
--so now you can chose to use them or not, and did debugging

--****************
begin

declare @Target integer
declare @Hits integer
declare @Pos integer
declare @Index integer
declare @String2x varchar (201)

set @String2x = @String2

-- removing not alphanumeric characters should be here if you decide to
-- I am removing just my working characters '?' and Space
set @String1 = replace(@String1, ' ','')
set @String1 = replace(@String1, '?','')
set @String2x = replace(@String2x, ' ','')
set @String2x = replace(@String2x, '?','')
--
set @Hits = 0
set @String1 = case @FirstIn when 1 then '?' else '' end + @String1 + case @LastIn when 1 then '?' else '' end
set @String2x = case @FirstIn when 1 then '?' else '' end + @String2x + case @LastIn when 1 then '?' else '' end

set @Target =
case when len(@String1) >= len(@String2x) then len(@String1) -1 else len(@String2x) -1
end
set @Pos = 1
while @Pos < len(@String1)
begin
set @Index = charindex(substring(@String1, @Pos, 2), @String2x)
if @Index > 0
begin
set @Hits = @Hits + 1
select @String2x = left(@String2x,@Index) + ' ' + right(@String2x, len(@String2x) - @Index) --this is to eliminate multi-hit
end
set @Pos = @Pos + 1
end

finish:
return (100*@Hits)/@Target
end

Go to Top of Page

blindman
Flowing Fount of Yak Knowledge

USA
2365 Posts

Posted - 04/17/2008 :  09:01:08  Show Profile  Reply with Quote
Marat, I liked your suggestion about adding a space to the front of each string to include the first letter in the matching algorithm. I tested it extensively and found that it made a slight but consistent improvement.

The rest of your changes...eh, not so much.

Your code seems much more complicated than mine, which is a simple implementation of a simple algorithm. Plus, I have issues with stripping characters from the string within the function (spaces and question marks). This makes the function much less generic, in that it is now unable to match those characters even if the user wants them included. Any reduction of the input strings should be done before the parameters are submitted to the function, otherwise its scope is restricted.

Go ahead and have fun modifying the code and adapting it to your particular application requirements, but as far as putting this forth as a generic pattern matching utility I will be sticking with the code I already have posted.

e4 d5 xd5 Nf6
Go to Top of Page

marat
Yak Posting Veteran

Australia
85 Posts

Posted - 04/21/2008 :  09:23:53  Show Profile  Reply with Quote
As I promised, below is a string comparison function I developed and was using myself for years.
I put matching on the first and last characters as a parameters to let to do comparison test if anybody is interested.
I was using comparison on the first and last characters as a compalsory option(hardcoded).
I believe this function is generic enough :-)
---

if exists (select 1 from dbo.sysobjects where id = object_id(N'[dbo].[fn_CompareThePair]'))
drop function [dbo].[fn_CompareThePair]
GO
--
create function fn_CompareThePair
(@String1 varchar (100), @String2 varchar (100), @FirstIn bit = 1, @LastIn bit = 1)
returns integer
/*************************************************************************************************
Function CompareThePair
This function accepts two string values and returns an integer value between
zero and one hundred indicating the similarity between the two string. This
algorithm was developed specifically to search for the spelling
variations of the names and addresses. For this purpose,
it is superior to SOUNDEX, which searches only for similar sounding words.
SOUNDEX fails when dealing with not Latin names(asian names as an example)
or with their interpretations.
Usage: select dbo.CompareThePair('Smith', 'Shmidth', 1, 1)
Input Parameters
@String1 first string
@String2 second string
@FirstIn option to include comparison on the first character
@LastIn option to include comparison on the last character
--History
* Developed by marat 21/04/2008 based on the string matching algorithm used for address and name matching.

****************************************************************************************************/
begin

declare @Target integer
declare @Hits integer
declare @ExHits integer
declare @Pos integer
declare @Index integer
declare @Cursor integer
declare @Keep integer

--
set @Hits = 0
set @ExHits = 0
--
if @FirstIn > 0
begin
if left(@string1,1) = left(@string2,1) set @ExHits = @ExHits + 1
end
if @LastIn > 0
begin
if right(@string1,1) = right(@string2,1) set @ExHits = @ExHits + 1
end


set @Target =
case when len(@String1) >= len(@String2) then len(@String1) -1 + @FirstIn + @LastIn
else len(@String2) -1 + @FirstIn + @LastIn
end
--Run I
set @Pos = 1
set @Cursor = 1
while @Pos < len(@String1)
begin
set @Index = charindex(substring(@String1, @Pos, 2), @String2, @Cursor)
if @Index > 0
begin
set @Cursor = @Index + 1
set @Hits = @Hits + 1
end
set @Pos = @Pos + 1
end

set @Keep = @Hits --keep first result
--Run II
set @Pos = 1
set @Cursor = 1
set @Hits = 0
while @Pos < len(@String2)
begin
set @Index = charindex(substring(@String2, @Pos, 2), @String1, @Cursor)
if @Index > 0
begin
set @Cursor = @Index + 1
set @Hits = @Hits + 1
end
set @Pos = @Pos + 1
end

finish:
if @Keep > @Hits set @Hits = @Keep
return 100*(@Hits + @ExHits)/@Target
end
Go to Top of Page

blindman
Flowing Fount of Yak Knowledge

USA
2365 Posts

Posted - 04/21/2008 :  12:22:37  Show Profile  Reply with Quote
The function you developed years ago has the same comment text as mine, uses the same variable names as mine, the same algorithm as mine, and is dated "21/04/2008". The most significant difference I can see is that in this version of my function you are no longer crediting me for the code you copied. If you tried that in a college course you would be flunked, if not expelled.

quote:
Originally posted by marat

As I promised, below is a string comparison function I developed and was using myself for years.
I put matching on the first and last characters as a parameters to let to do comparison test if anybody is interested.
I was using comparison on the first and last characters as a compalsory option(hardcoded).
I believe this function is generic enough :-)
---

if exists (select 1 from dbo.sysobjects where id = object_id(N'[dbo].[fn_CompareThePair]'))
drop function [dbo].[fn_CompareThePair]
GO
--
create function fn_CompareThePair
(@String1 varchar (100), @String2 varchar (100), @FirstIn bit = 1, @LastIn bit = 1)
returns integer
/*************************************************************************************************
Function CompareThePair
This function accepts two string values and returns an integer value between
zero and one hundred indicating the similarity between the two string. This
algorithm was developed specifically to search for the spelling
variations of the names and addresses. For this purpose,
it is superior to SOUNDEX, which searches only for similar sounding words.
SOUNDEX fails when dealing with not Latin names(asian names as an example)
or with their interpretations.
Usage: select dbo.CompareThePair('Smith', 'Shmidth', 1, 1)
Input Parameters
@String1 first string
@String2 second string
@FirstIn option to include comparison on the first character
@LastIn option to include comparison on the last character
--History
* Developed by marat 21/04/2008 based on the string matching algorithm used for address and name matching.

****************************************************************************************************/
begin

declare @Target integer
declare @Hits integer
declare @ExHits integer
declare @Pos integer
declare @Index integer
declare @Cursor integer
declare @Keep integer

--
set @Hits = 0
set @ExHits = 0
--
if @FirstIn > 0
begin
if left(@string1,1) = left(@string2,1) set @ExHits = @ExHits + 1
end
if @LastIn > 0
begin
if right(@string1,1) = right(@string2,1) set @ExHits = @ExHits + 1
end


set @Target =
case when len(@String1) >= len(@String2) then len(@String1) -1 + @FirstIn + @LastIn
else len(@String2) -1 + @FirstIn + @LastIn
end
--Run I
set @Pos = 1
set @Cursor = 1
while @Pos < len(@String1)
begin
set @Index = charindex(substring(@String1, @Pos, 2), @String2, @Cursor)
if @Index > 0
begin
set @Cursor = @Index + 1
set @Hits = @Hits + 1
end
set @Pos = @Pos + 1
end

set @Keep = @Hits --keep first result
--Run II
set @Pos = 1
set @Cursor = 1
set @Hits = 0
while @Pos < len(@String2)
begin
set @Index = charindex(substring(@String2, @Pos, 2), @String1, @Cursor)
if @Index > 0
begin
set @Cursor = @Index + 1
set @Hits = @Hits + 1
end
set @Pos = @Pos + 1
end

finish:
if @Keep > @Hits set @Hits = @Keep
return 100*(@Hits + @ExHits)/@Target
end



e4 d5 xd5 Nf6
Go to Top of Page

marat
Yak Posting Veteran

Australia
85 Posts

Posted - 04/21/2008 :  21:49:39  Show Profile  Reply with Quote
Sorry, blindman if I hurt your feelings.
I wasn't thinking that naming conventions actually will make any difference to the algorithm itself, which I developed years ago using PROGRESS and later adapted it in VB6 and it also was adapted in C++ by another developer(I didn't tell him which naming conventions to use)
In my algorithm, if you noticed, the core features are:
using pair match(the same as yours), reverse matching(same as yours but with different scoring),matching first and last letters, and using forward only cursor to prevent scoring on double hit, which I think is important. None of the last 2 features aren't presented in your function and the second feature isn't quite as yours.
Match on the first and the last characters was used to better handle missing second and second last characters.
Not seeing the difference, means actually not understanding how it all works.
But after all I want to thank you for being a good opponent to me. Actually your comments about using space in the begining and at the end of the string made me a change to the function - not the algorithm.
I was using spaces in front and at the end of the strings because my matchkeys didn't have spaces and it was absolutely valid in my case.
So please tell me which part of your function I copied.

Go to Top of Page
Page: of 2 Previous Topic Topic Next Topic  
Previous Page
 New Topic  Reply to Topic
 Printer Friendly
Jump To:
SQL Server Forums © 2000-2009 SQLTeam Publishing, LLC Go To Top Of Page
This page was generated in 0.31 seconds. Powered By: Snitz Forums 2000