Using Fuzzy Lookup Transformations in SQL Server Integration Services
By Guest Authors
on 22 January 2007
| 3 Comments
| Tags: Import/Export, SSIS
This article comes to us from Michael K. Campbell. Michael writes "Humans can instantly spot the difference between "411 Madison Avenue" and "411 Madisan Av". Moreover, because human brains are wired semantically, they can discern that both addresses are likely the same even though the second one has been misspelled. Computers, unfortunately, are typically unaware of these semantic similarities, which can lead to improperly checked data." He then goes on to show how the Fuzzy Lookup Transformation in SQL Server Integration Services (SSIS) can be used to make this exact same match. And he includes the application so you can test it yourself.
Left unchecked, poor data integrity costs businesses billions of dollars each year by negatively impacting business intelligence and decision support systems. Poor data integrity can also wreak havoc with inventory and contract management applications - to say nothing of all of the improperly addressed junk-mail that must pile up at the post office.
To help businesses overcome the problems associated with poor data integrity, specialized vendors offer a variety of solutions to help detect and correct subtle differences in semantically identical data through a process known as data cleansing. For organizations with SQL Server 2005, the Fuzzy Lookup Transformation from SQL Server Integration Services (SSIS) can be leveraged to create data cleansing solutions by detecting semantically equivalent matches which can then be cleansed as needed.
Behind the scenes the Fuzzy Lookup operation builds token-based indexes (in the form of tables) against approved values in a reference table. As each piece of non-cleansed data is processed, SSIS compares it against the tokenized index and generates a Similarity percentage along with an accompanying Confidence factor (also expressed as a percentage). Both of these values can then be logically evaluated to help strike any desired balance between possibility and certainty. The process works incredibly well and provides an excellent balance between functionality and manageability. I also found the entire process to be very approachable in terms of learning curve.
Testing SSIS' Fuzzy Lookup Functionality
To test SSIS data cleansing, I created a list of parts pulled from the AdventureWorks database and stored it in a flat file using the code in Listing 1 output as an .rpt text file.
Listing 1: Creating Bogus Data
SET NOCOUNT ON
WITH Source (PartName, NumberOfPartsTaken)
SELECT TOP 122
CHARINDEX('A',CAST(NEWID() as varchar(36)),0) [NumberOfPartsTaken]
INNER JOIN Production.ProductInventory i ON p.ProductID = i.ProductID
i.LocationID = 6 -- miscellaneous storage
ORDER BY NEWID()
NumberOfPartsTaken > 0
Once the data was exported, I opened it up in NotePad and made a number of formatting and spelling changes to simulate the type of semantic problems typically encountered in Exact, Transform, and Load (ETL) operations where data has been hand-entered.
Once the 'sample' data was created, I opened up SQL Server 2005 Business Intelligence Studio and created a new Integration Services Project. Using a Flat File Source linked to my .rpt file, I routed my 'sloppy' input into a Fuzzy Lookup Transformation to create 'fuzzy matches' against incoming part names in the AdventureWorks.Production.Product table. While you can set threshold criteria directly within the Fuzzy Lookup Transformation itself, you can also opt to have _Similarity and _Confidence columns appended to a default output stream from the transform, which you can then use to evaluate the possible matches with a Conditional Transform operation. This is the route I took, and, as you can see in Figure 1, leveraging a Conditional Split Transformation, I was able to easily set up my own criteria defining Solid and Likely matches, with anything not meeting those criteria being output as Non Matches. (Errors were routed out in their own error stream as normal.)
Figure 1: Assigning output paths based on confidence levels
With rules in place to parcel results into designated output streams, I routed Solid Matches, which were the bulk of the output, directly to a SQL Server destination to simulate a normal ETL endpoint (see Figure 2). Each of the other three output streams from the Conditional Split were then 'coupled' with a Derived Column Transformation that added a new _Match column (i.e., a custom column that I created) to the output for each path, and assigned a literal value of "LIKELY", "NON-MATCH", and "ERROR" for each output type as seen in Figure 2.
Figure 2: The entire package (and direction of output paths based on matching criteria)
I then used a Union All Transformation (see Figure 2) to combine the output from each of the derived column paths into a single result set to make it easy for humans decipher what would need to be done with results that weren't cleansed programmatically. In this way, they can look at all of the raw data as well as the value provided in the derived column to help guide their decisions. Figure 3 shows some sample output routed in to the Human Intervention endpoint as displayed by a Data Viewer. Looking at the results, it's pretty easy to spot that I've got my tolerances set quite high, but the nice thing about using SSIS' Fuzzy Lookup for data cleansing is that you can easily tune the entire process in iterative fashion using sample data to help get it exactly as you'd like it to be.
Figure 3: Examining the output of the "Human Intervention" Endpoint
If you'd like to get a better idea of how the whole process works, download the accompanying sample application which includes the entire project displayed in Figure 2. You may determine that my approach was overkill (I'm leaning that way myself - due to the number of output paths), but hopefully it will provide you with a good overview of the ways you can handle the output from a Fuzzy Lookup Transformation, and save you a bit of time in attempting the fairly small learning curve needed to use SSIS for data cleansing.
What I liked best about the Fuzzy Lookup Transform is that it's part of a very flexible and very powerful ETL framework that makes loading, cleansing, and outputting massaged data easy - even to and from heterogeneous locations. There is a learning curve involved (even if you've had plenty of DTS experience), and the designer and tools definitely have a strong 'Microsoft Version 1' feel to them, but other Data Cleansing applications are likely to have their own learning curves and warts as well.
Overall, data cleansing with SQL Server Integration Services offers highly flexible yet easy to manage functionality that creates consistent, reproducible reports that organizations can easily cater to their own needs. I highly recommend it because of its excellent blend of functionality as well as its attractive price tag.
Michael K. Campbell is a professional SQL Server consultant with years of experience as a DBA and database developer.