Similarity Searching - SQL Server Forums

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

General SQL Server Forums

Database Design and Application Architecture

Similarity Searching

Author

Topic

bogus
Starting Member

41 Posts

Posted - 2007-11-08 : 19:16:03

This started in one thread, but since it was for beginners, I didn't want anyones brain to melt...

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=92096

I said this:

quote:

This is getting into a deeper topic beyond just sounding alike.

I am trying to create a similarity search so I can clean up our database... it isn't going to be easy, that's for sure.

Here in lies my data problem, an example:

Let's take something simple, like, gee, a university. A good one is UCLA - University of California, Los Angeles. Do you have any idea how many variations of that can exist on a database? A LOT! UCLA, U.C.L.A., UC,LA, Univ. Cal. LA., the list is literally endless. The same applies for Penn State... PSU, Penn State, P-State... not as many, but you get the idea.

The goal is to be able to bring those similar names back together.

I have been researching the Smith-Waterman algorithm, but they are all rather slow... and prepping a database with a gram based index is just painful... it could take hours to put that together.

For those that don't know, gram is a part of a word. For example, "University," would be stored as UN, NI, IV, and so on... that's a 2 byte gram, a 3 byte gram would be UNI, NIV, IVE, etc.

Think of all the databits that need to be compiled... and then maintained!

Here is a good example of how a search can work:

You all watch CSI, right? Great show. Good fun, serious leaps of faith. They run fingerprint searches, and on their computers, it's these trick graphics with images that are matched. That is how the human mind matches prints, but even then, we use the same basic data the computer uses - reference points on the print.

Take a quick look at a finger, and you will see swirls and junctions and things of that nature (I am not going to get to tech on that, but I used to work in identity theft investigations for a police department on the east coast). These points become part of a mathmatical equation that is then converted to a number. That number is stored.

Whenever a new print is to be tested, it is manually referenced for it's points and then the number is created and compared to everything in that index. Bam, 15 seconds later you have a 99.9% match. Very impressive. It will also preset a top 10 match list.

I do have a question: what is the nature of the similarity searching in SQL 2005?

To save from reading that other thread I hi-jacked, Kristen was kind enough to reply with this:

quote:

Our clients use organisations that specialise in de-duping data for this kind of fuzzy-matching

Which I replied with:

quote:

yea, I was afraid of that.

We are out to create a model that will allow for data quality in real time. For example, a new client comes into our system, let's say from a trade show.

They fill out an info card and we take the contact data and add it to our CRM.

For example, they fill out the card with this info:

John Smith
Some Huge Technical Company
123 Main Street
Anytown, NY 17999 USA

Well... Our search model would take that info and pass it by our RDBMS and return a top ten hit list of matches.

Some Huge Technical Company can become SHTC or Some Huge Tech Co., or any other combo. But because it looks similar, we can then apply the new contact, Mr. Smith, to the right firm, instead of creating three or four (or more) versions of Some Huge Technical Company.

That kinda mass fuzzy matching is easy, we need something that keeps us from needing to do mass updates on a regular basis.

So far, we have determined that a separate database server will be required. It would contain the metrics needed to narrow the search down to the point where the search won't take 30 seconds. We are resisting the idea of grams, simply because of the overhead needed... a single address could create 400 entries in a database table.

I have found a couple of products, but I am terrorified about pricing... I think coding will be my best solution. Amazon listed a couple of very interesting looking books, pricy, yes, but I suspect cheaper than buying the technology.

And finally, Kristen's final reply:

quote:

I did a fair amount of work to just try to match "new accounts".

We looked at Telephone number (unique, so a high indicator if it matches, but then we found lots of addresses had the same "agency phone number" )

Then ZIP code (in the UK our PostCode generally relates to < 20 properties), then lines of address - mixing them around to try to get State / Street matches even if they were switched around in the Address fields, or someone entered an extra address line - like "4th floor" as the first address line.

We had a copy of the "accounts" table that we had cleaned up a bit. Removed all trailing spaces and punctuation. Also all embedded punctuation converted to "space" and adjacent spaces removed, so:

"10, The High Street,"
"10 The High Street"

We made abbreviations consistent:

"10 The High St"
"10 The High Street"

and we probably took out and "the" and other noise words.

Then we tried matching based on that "sanitised" version.

But it took DAYS AND DAYS - of iterative processing - "Gee, look, here's yet-another-variation-of-rule-X" ...

That really tee-d be off, I don't have the stomach for any single job that takes "days" ...

Just in case any of that gives you any ideas

Kristen

The gist:

I need to know what SQL Server 2005 offers in regards to SimSearching. Also, are there any OTC products that will interface with VB.NET 2005 - and not cost a mint.

Thanks!!!

dinakar
Master Smack Fu Yak Hacker

2507 Posts

Posted - 2007-11-08 : 19:38:46

I dont think SQL Server (2005/2000) offers you anything built-in for this kind of fuzzy logic matching. You'd either have to build it yourself or look for 3rd party solutions...

Dinakar Nethi
************************
Life is short. Enjoy it.
************************
http://weblogs.sqlteam.com/dinakar/

bogus
Starting Member

41 Posts

Posted - 2007-11-08 : 19:47:59

Interesting... I had read/heard that 2005 was supposed to offer something of this nature.

I wonder if it was a proposed improvment...

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources