Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2008 Forums
 Transact-SQL (2008)
 Query to Remove Duplicates with Near by Match Data

Author  Topic 

Kalaiselvan
Posting Yak Master

112 Posts

Posted - 2013-10-23 : 07:37:48
Hi,
I got a Table with some Customer Info. Where as the Address Columns were Splitted into Many columns.
Hence some Customer Address were Loaded in various Ways. Need to Overcome this Issue and Required Unique set of Records of Customers.

Table: (CustomerInfo)
[NAME] [ADDRESS1] [ADDRESS2] [ADDRESS3] [AREA] [CITY]
BABU V NO.41 31st street 2nd Avenue PADI CHENNAI
BABU V 31st Street No.41 2nd Aven PADI CHENNAI
BABU V 31st street, PADI CH
ARUN ADYAR CHENNAI
ARUN NO.50 3rd Cross St ADYAR CHENNAI
ARUN 3rd Cross St ADYAR CHE
ARUN No.75 AA Avenue CHOOLAI CHE


Like the Above table I have got nearly millions of Rows. With some Name Duplication and Also with Un Matched Address Column.

But from above Table, Required output will be 3 Rows.

TABLE: (CustomerInfo Unique)
[NAME] [ADDRESS] [CITY]
BABU V No.41 2nd Avenue 31st Steet, PADI CHENNAI
ARUN NO.50 3rd Cross Street ADYAR CHENNAI
ARUN NO.75 AA Avenue CHOOLAI CHENNAI

Output Table has Combined Address Column. Where as Address Near by Match is taken as Single Row.
Please help me How to fix this Using Query or else using any Fuzzy Grouping or Fuzzy Logic Transformation in SSIS.

Regards,
Kalai

visakh16
Very Important crosS Applying yaK Herder

52326 Posts

Posted - 2013-10-23 : 08:36:19
you need use fuzzy lookup for this. Or use some data quality tools like Mellisa data component or DQS task in SSIS. DQS is available in SSIS from 2012 onwards. Mellisa data is third party component.

------------------------------------------------------------------------------------------------------
SQL Server MVP
http://visakhm.blogspot.com/
https://www.facebook.com/VmBlogs
Go to Top of Page

Kalaiselvan
Posting Yak Master

112 Posts

Posted - 2013-10-23 : 14:24:27
Can u Explain me in detail in how to tackle this Issue. I can use Fuzzy Logic and not other as mentioned.
Or Can Help with SQL Queries too...

Regards,
Kalai
Go to Top of Page

visakh16
Very Important crosS Applying yaK Herder

52326 Posts

Posted - 2013-10-24 : 01:46:08
You need to have a master table with standard entries for each group ie for example among

NO.41 31st street 2nd Avenue PADI CHENNAI
31st Street No.41 2nd Aven PADI CHENNAI
31st street, PADI CH
..

store one standard value (most accurate) in table

Then in SSIS use lookup task from your source data to this table. This will capture all exact matches in lookup match output. Now take nomatch output and link it to Fuzzy Lookup task based on the Address value. Then get output onto a staging table or file. Then check this table/file content. It will have few additional columns like Similarity,Confidence which will have a value based on how similar value is to actual value. Based on values you get you need to determine what similarity and confidence values gives you expected matches (this has to be done by trial and error method doing few runs). Then add a conditional split task to fuzzy match output and create an output specifying your values for similarity and confidence. Then link this output to match output of previous lookup task via Union all and you'll get required pattern matches in your final output

So package will look like below inside data flow task

Your source tasks
|
|
Lookup Task
/ / / Match NoMatch
| |
| |
| Fuzzy Lookup
| |
| |
| Conditional Split
| /
| /
| /
| /
| /
| /
Union all


------------------------------------------------------------------------------------------------------
SQL Server MVP
http://visakhm.blogspot.com/
https://www.facebook.com/VmBlogs
Go to Top of Page
   

- Advertisement -