Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2005 Forums
 Transact-SQL (2005)
 Grouping similar records

Author  Topic 

im1dermike
Posting Yak Master

222 Posts

Posted - 2008-07-17 : 12:32:30
I'm working with a table which has tens of thousands of records of company records which were manually entered. After doing a distinct query, an example of some results is as follows:

K-Mart
Kmart
K Mart

I'm wondering if there is a way to group these records together somehow. Something that would you allow you to define a percent similarity would obviously be ideal, but I'm not sure if that's possible.

Thanks

visakh16
Very Important crosS Applying yaK Herder

52326 Posts

Posted - 2008-07-17 : 12:36:12
may be this:-

GROUP BY REPLACE(REPLACE(field,' ',''),'-','')
Go to Top of Page

im1dermike
Posting Yak Master

222 Posts

Posted - 2008-07-17 : 12:39:54
My example was a simple one. Here's another example:

Cheltenham Nursing
CHELTENHAM NURSING REHAB
CHELTENHAM NURSING & REHAB
Cheltenham Nursing & Rehabilitation
Go to Top of Page

visakh16
Very Important crosS Applying yaK Herder

52326 Posts

Posted - 2008-07-17 : 12:41:57
try using SOUNDEX or DIFFERENCE functions then:-

http://doc.ddart.net/mssql/sql70/setu-sus_8.htm
Go to Top of Page

rohitkumar
Constraint Violating Yak Guru

472 Posts

Posted - 2008-07-17 : 12:42:37
SELECT 'Cheltenham Nursing', soundex('Cheltenham Nursing')
union all
SELECT 'CHELTENHAM NURSING REHAB', soundex('CHELTENHAM NURSING REHAB')
union all
SELECT 'CHELTENHAM NURSING & REHAB', soundex('CHELTENHAM NURSING & REHAB')
union all
SELECT 'Cheltenham Nursing & Rehabilitation', soundex('Cheltenham Nursing & Rehabilitation')
Go to Top of Page

im1dermike
Posting Yak Master

222 Posts

Posted - 2008-07-17 : 12:46:18
rohitkumar: are you serious?
Go to Top of Page

im1dermike
Posting Yak Master

222 Posts

Posted - 2008-07-17 : 12:50:26
SOUNDEX is an interesting function, but only using that won't resolve this issue. There are just too many company names that have the same code.

I found this, but couldn't really make out what the guy was doing:
http://www.experts-exchange.com/Microsoft/Development/MS-SQL-Server/SQL-Server-2005/Q_23047590.html

Anyone?
Go to Top of Page

rohitkumar
Constraint Violating Yak Guru

472 Posts

Posted - 2008-07-17 : 12:54:53
yes I am...using soundex you can very vaguely group similar records together.
try exploring Full text Indexing if you can achieve it more accurately
Go to Top of Page

rohitkumar
Constraint Violating Yak Guru

472 Posts

Posted - 2008-07-17 : 12:56:16
experts-exchange is a paid website and I don't have access.
Go to Top of Page

im1dermike
Posting Yak Master

222 Posts

Posted - 2008-07-17 : 12:57:05
It surely isn't feasible to create such a query when there are 36,000 distinct company name records...
Go to Top of Page

im1dermike
Posting Yak Master

222 Posts

Posted - 2008-07-17 : 12:57:25
That's what I thought, but you just have to scroll down lower on the page.
Go to Top of Page

im1dermike
Posting Yak Master

222 Posts

Posted - 2008-07-17 : 13:07:20
Also, it might be helpful to note that there are other columns that could also be used to help the grouping. Address and phone number, but they have similar difference issues.
Go to Top of Page

im1dermike
Posting Yak Master

222 Posts

Posted - 2008-07-17 : 16:50:40
So no one has ever done anything like this? I find that hard to believe.
Go to Top of Page

rohitkumar
Constraint Violating Yak Guru

472 Posts

Posted - 2008-07-17 : 16:59:15
read this

http://www.sqlteam.com/article/using-fuzzy-lookup-transformations-in-sql-server-integration-services
Go to Top of Page

im1dermike
Posting Yak Master

222 Posts

Posted - 2008-07-18 : 09:26:28
rohitkumar: Awesome. It looks like Fuzzy Grouping is exactly what I'm looking for. Thanks so much.

http://msdn.microsoft.com/en-us/library/ms345128.aspx#fzdtssql05_topic4
Go to Top of Page

rohitkumar
Constraint Violating Yak Guru

472 Posts

Posted - 2008-07-18 : 09:37:55
Great! do teach us know when you have a solution in place.
Go to Top of Page

im1dermike
Posting Yak Master

222 Posts

Posted - 2008-07-18 : 10:39:41
I'm trying to post an example on here, but HTML is off and I don't know how to post a table otherwise. Little help?
Go to Top of Page

visakh16
Very Important crosS Applying yaK Herder

52326 Posts

Posted - 2008-07-18 : 10:59:55
quote:
Originally posted by im1dermike

I'm trying to post an example on here, but HTML is off and I don't know how to post a table otherwise. Little help?


just copy the data from your table after selecting results to text option and paste it here within code tags (selecting # from top button)
Go to Top of Page

im1dermike
Posting Yak Master

222 Posts

Posted - 2008-07-18 : 11:13:01
Well I've been playing around with it for a couple hours and it's pretty awesome. The link I provided in my last post has a decent description of how to actually implement it. It seems a little finicky as far as creating the package, but the instructions provided work if you follow them closely.

The key is that you have to experiment with what threshold will work for your data. I've been working with 50% because my data is filthy and it seems to be decent. Here's an example:
Employer_clean                                      Employer
------------------------------------------------------------------------------------------------
Belmont & Parkside Retirement House Belmont & Parkside Retirement House
Belmont and Parkside Center Belmont and Parkside
Belmont and Parkside Center Belmont and Parkside Center
Belmont Center for Comprehensive Treatment BELMONT CENTER FOR
Belmont Center for Comprehensive Treatment Belmont Center for Comprehensive Treatment
belmont charter scho Belmont Carter School
belmont charter scho Belmont Charter
belmont charter scho belmont charter scho
belmont charter scho BELMONT CHARTER SCHOOL
belmont charter scho BELMONT ELEMT CHARTER SCHL
BELMONT COURT BELMONT COURT
BELMONT ELEMENTARY SCHOOL BELMONT ELEMENTARY SCHOOL
Belmont Improvment Assoc. Belmont Improvment Assoc.
Belmont Parkside Ave Belmont Parkside Ave
Belmont Parkside Geriatric Center Belmont & Parkside Geriatric
Belmont Parkside Geriatric Center BELMONT GERIATRIC CENTER
Belmont Parkside Geriatric Center BELMONT PARKSIDE GERIATIC
Belmont Parkside Geriatric Center BELMONT PARKSIDE GERIATRIC
Belmont Parkside Geriatric Center Belmont Parkside Geriatric Center
Belmont Parkside Healthcare Center Belmont & Parkside
Belmont Parkside Healthcare Center BELMONT & PARKSIDE CENTER
Belmont Parkside Healthcare Center Belmont & Parkside Health Care Center
Belmont Parkside Healthcare Center Belmont & Parkside Healthcare Center
Belmont Parkside Healthcare Center Belmont Center
Belmont Parkside Healthcare Center Belmont Healthcare
Belmont Parkside Healthcare Center Belmont Parkside
Belmont Parkside Healthcare Center Belmont Parkside HealthCare
Belmont Parkside Healthcare Center Belmont Parkside Healthcare Center
Belmont Parkside Nursing Home Belmont Nursing Home
Belmont Parkside Nursing Home BELMONT PARKSIDE NURSING
Belmont Parkside Nursing Home Belmont Parkside Nursing Home
Belmont Treatment Ctr Belmont Treatment Ctr


The column on the left is the grouped employer after the package is run and the column on the right is the original employer string.

What I'm working on now is running another fuzzy grouping package on these results to try and further refine the grouping. I'm not sure how successful that will be, though. As I said, the threshold above is at 50% and results in 12 groups from an original 32 distinct employer strings. When I run a subsequent package on those results at a threshold of 40%, I get the same 12. When I change the subsquent package to 30%, though, I get 11 groups. So it does look like you can further refine your groupings, but there appears to be a limit. Additionally, being that I'm working with almost 36,000 distinct employer strings in my original table, it will be impossible to verify how certain thresholds work on all my data.

Hope this helps.
Go to Top of Page

rohitkumar
Constraint Violating Yak Guru

472 Posts

Posted - 2008-07-18 : 13:16:46
thanks for it Mike. Results are impressive and yes it will need a couple of iterations before you have a fairly cleaned up data.
Go to Top of Page

im1dermike
Posting Yak Master

222 Posts

Posted - 2008-07-18 : 14:43:14
Thanks for bringing Fuzzy Grouping to my/our attention!
Go to Top of Page
    Next Page

- Advertisement -