Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2008 Forums
 Transact-SQL (2008)
 Sets instead of Iteration for Data Clean up

Author  Topic 

Rainmann
Starting Member

5 Posts

Posted - 2011-08-22 : 16:21:49
I have some data I need to clean up. Because of poor database design there are orphaned records. The orphans, however, can usually be matched back up to valid parent records. What I’m not sure of is how to best build the statement to fix up the data.

First an example of a valid record:
Table: TeamMaster
TeamID: 1
TeamName: Green Bay Packers

In almost all cases the orphaned records are a subset of valid records and might look something like:
BadTeamName: Green Bay Pckrs

What I’d like to do is take the bad team name and do a ‘like’ search over the valid records, trimming the bad team name by a single letter until I get a match.

So my first query would look like:
Select * from Teams where TeamName like 'Green Bay Pckrs%'
and would return no matches.

And my final query would look like:
Select * from Teams where TeamName like 'Green Bay P%'
and would find the valid record.

I can do the above queries by trimming, but is there a better way? Also, my starting point will be a work table or result set in memory. I can do it via iteration but am not sure how to write set based operations to perform the same task.

I’m aware that I might wind up with some records still orphaned, but this will greatly minimize the manual work needed.

TG
Master Smack Fu Yak Hacker

6065 Posts

Posted - 2011-08-22 : 17:35:03
Best way to start this type of thing is to get an idea of how many orphans and the nature of the mis-matches.

select t.teamName, count(*)
from teams t
left outer join teamMaster tm on tm.TeamName = t.TeamName
where tm.teamName is null
order by t.teamName

--to see by order of number of occurrences
--order by count(*) desc

I assume your "fixes" will be to set the TeamID, right? Your child table should just have the ID not the teamName repeated.
You can first update the TeamID for all direct matches. Then do specific updates starting with the largest occurrence count.

Then fix your model so that you have foreign key constraints that don't allow orphaned rows.

Be One with the Optimizer
TG
Go to Top of Page

Rainmann
Starting Member

5 Posts

Posted - 2011-08-23 : 09:56:57
Yes to all the things you mentioned. Changing to Id's. Changing direct matches first. Adding proper constraints to prevent this from happening again in the future.

In this particular table I have 40,000 out of 50,000 that can be replaced as direct matches. Of the 10,000 that remain there are only 113 distinct records that are causing an issue.

However, there are many other tables with the same problem. I'm planning in implementing this same fix in all of the other tables as well.
Go to Top of Page
   

- Advertisement -