| Author |
Topic |
|
im1dermike
Posting Yak Master
222 Posts |
Posted - 2008-07-17 : 12:32:30
|
| I'm working with a table which has tens of thousands of records of company records which were manually entered. After doing a distinct query, an example of some results is as follows:K-MartKmartK MartI'm wondering if there is a way to group these records together somehow. Something that would you allow you to define a percent similarity would obviously be ideal, but I'm not sure if that's possible.Thanks |
|
|
visakh16
Very Important crosS Applying yaK Herder
52326 Posts |
Posted - 2008-07-17 : 12:36:12
|
| may be this:-GROUP BY REPLACE(REPLACE(field,' ',''),'-','') |
 |
|
|
im1dermike
Posting Yak Master
222 Posts |
Posted - 2008-07-17 : 12:39:54
|
| My example was a simple one. Here's another example:Cheltenham NursingCHELTENHAM NURSING REHABCHELTENHAM NURSING & REHABCheltenham Nursing & Rehabilitation |
 |
|
|
visakh16
Very Important crosS Applying yaK Herder
52326 Posts |
Posted - 2008-07-17 : 12:41:57
|
| try using SOUNDEX or DIFFERENCE functions then:-http://doc.ddart.net/mssql/sql70/setu-sus_8.htm |
 |
|
|
rohitkumar
Constraint Violating Yak Guru
472 Posts |
Posted - 2008-07-17 : 12:42:37
|
| SELECT 'Cheltenham Nursing', soundex('Cheltenham Nursing')union allSELECT 'CHELTENHAM NURSING REHAB', soundex('CHELTENHAM NURSING REHAB')union allSELECT 'CHELTENHAM NURSING & REHAB', soundex('CHELTENHAM NURSING & REHAB')union allSELECT 'Cheltenham Nursing & Rehabilitation', soundex('Cheltenham Nursing & Rehabilitation') |
 |
|
|
im1dermike
Posting Yak Master
222 Posts |
Posted - 2008-07-17 : 12:46:18
|
| rohitkumar: are you serious? |
 |
|
|
im1dermike
Posting Yak Master
222 Posts |
|
|
rohitkumar
Constraint Violating Yak Guru
472 Posts |
Posted - 2008-07-17 : 12:54:53
|
| yes I am...using soundex you can very vaguely group similar records together.try exploring Full text Indexing if you can achieve it more accurately |
 |
|
|
rohitkumar
Constraint Violating Yak Guru
472 Posts |
Posted - 2008-07-17 : 12:56:16
|
| experts-exchange is a paid website and I don't have access. |
 |
|
|
im1dermike
Posting Yak Master
222 Posts |
Posted - 2008-07-17 : 12:57:05
|
| It surely isn't feasible to create such a query when there are 36,000 distinct company name records... |
 |
|
|
im1dermike
Posting Yak Master
222 Posts |
Posted - 2008-07-17 : 12:57:25
|
| That's what I thought, but you just have to scroll down lower on the page. |
 |
|
|
im1dermike
Posting Yak Master
222 Posts |
Posted - 2008-07-17 : 13:07:20
|
| Also, it might be helpful to note that there are other columns that could also be used to help the grouping. Address and phone number, but they have similar difference issues. |
 |
|
|
im1dermike
Posting Yak Master
222 Posts |
Posted - 2008-07-17 : 16:50:40
|
| So no one has ever done anything like this? I find that hard to believe. |
 |
|
|
rohitkumar
Constraint Violating Yak Guru
472 Posts |
Posted - 2008-07-17 : 16:59:15
|
| read thishttp://www.sqlteam.com/article/using-fuzzy-lookup-transformations-in-sql-server-integration-services |
 |
|
|
im1dermike
Posting Yak Master
222 Posts |
Posted - 2008-07-18 : 09:26:28
|
| rohitkumar: Awesome. It looks like Fuzzy Grouping is exactly what I'm looking for. Thanks so much.http://msdn.microsoft.com/en-us/library/ms345128.aspx#fzdtssql05_topic4 |
 |
|
|
rohitkumar
Constraint Violating Yak Guru
472 Posts |
Posted - 2008-07-18 : 09:37:55
|
| Great! do teach us know when you have a solution in place. |
 |
|
|
im1dermike
Posting Yak Master
222 Posts |
Posted - 2008-07-18 : 10:39:41
|
| I'm trying to post an example on here, but HTML is off and I don't know how to post a table otherwise. Little help? |
 |
|
|
visakh16
Very Important crosS Applying yaK Herder
52326 Posts |
Posted - 2008-07-18 : 10:59:55
|
quote: Originally posted by im1dermike I'm trying to post an example on here, but HTML is off and I don't know how to post a table otherwise. Little help?
just copy the data from your table after selecting results to text option and paste it here within code tags (selecting # from top button) |
 |
|
|
im1dermike
Posting Yak Master
222 Posts |
Posted - 2008-07-18 : 11:13:01
|
Well I've been playing around with it for a couple hours and it's pretty awesome. The link I provided in my last post has a decent description of how to actually implement it. It seems a little finicky as far as creating the package, but the instructions provided work if you follow them closely.The key is that you have to experiment with what threshold will work for your data. I've been working with 50% because my data is filthy and it seems to be decent. Here's an example:Employer_clean Employer------------------------------------------------------------------------------------------------Belmont & Parkside Retirement House Belmont & Parkside Retirement HouseBelmont and Parkside Center Belmont and ParksideBelmont and Parkside Center Belmont and Parkside CenterBelmont Center for Comprehensive Treatment BELMONT CENTER FORBelmont Center for Comprehensive Treatment Belmont Center for Comprehensive Treatmentbelmont charter scho Belmont Carter Schoolbelmont charter scho Belmont Charterbelmont charter scho belmont charter schobelmont charter scho BELMONT CHARTER SCHOOLbelmont charter scho BELMONT ELEMT CHARTER SCHLBELMONT COURT BELMONT COURTBELMONT ELEMENTARY SCHOOL BELMONT ELEMENTARY SCHOOLBelmont Improvment Assoc. Belmont Improvment Assoc.Belmont Parkside Ave Belmont Parkside AveBelmont Parkside Geriatric Center Belmont & Parkside GeriatricBelmont Parkside Geriatric Center BELMONT GERIATRIC CENTERBelmont Parkside Geriatric Center BELMONT PARKSIDE GERIATICBelmont Parkside Geriatric Center BELMONT PARKSIDE GERIATRICBelmont Parkside Geriatric Center Belmont Parkside Geriatric CenterBelmont Parkside Healthcare Center Belmont & ParksideBelmont Parkside Healthcare Center BELMONT & PARKSIDE CENTERBelmont Parkside Healthcare Center Belmont & Parkside Health Care CenterBelmont Parkside Healthcare Center Belmont & Parkside Healthcare CenterBelmont Parkside Healthcare Center Belmont CenterBelmont Parkside Healthcare Center Belmont HealthcareBelmont Parkside Healthcare Center Belmont ParksideBelmont Parkside Healthcare Center Belmont Parkside HealthCareBelmont Parkside Healthcare Center Belmont Parkside Healthcare CenterBelmont Parkside Nursing Home Belmont Nursing HomeBelmont Parkside Nursing Home BELMONT PARKSIDE NURSINGBelmont Parkside Nursing Home Belmont Parkside Nursing HomeBelmont Treatment Ctr Belmont Treatment Ctr The column on the left is the grouped employer after the package is run and the column on the right is the original employer string.What I'm working on now is running another fuzzy grouping package on these results to try and further refine the grouping. I'm not sure how successful that will be, though. As I said, the threshold above is at 50% and results in 12 groups from an original 32 distinct employer strings. When I run a subsequent package on those results at a threshold of 40%, I get the same 12. When I change the subsquent package to 30%, though, I get 11 groups. So it does look like you can further refine your groupings, but there appears to be a limit. Additionally, being that I'm working with almost 36,000 distinct employer strings in my original table, it will be impossible to verify how certain thresholds work on all my data.Hope this helps. |
 |
|
|
rohitkumar
Constraint Violating Yak Guru
472 Posts |
Posted - 2008-07-18 : 13:16:46
|
| thanks for it Mike. Results are impressive and yes it will need a couple of iterations before you have a fairly cleaned up data. |
 |
|
|
im1dermike
Posting Yak Master
222 Posts |
Posted - 2008-07-18 : 14:43:14
|
| Thanks for bringing Fuzzy Grouping to my/our attention! |
 |
|
|
Next Page
|