Frederick Volking
Starting Member
9 Posts |
Posted - 2003-05-28 : 15:36:32

We have a business need, meaning we can actually PAY for a solution if said solution is a prepackaged tool set (otherwise, we have to write the routines ourselves internally).

We're faced with transforming several databases into one. Unfortunately, all incoming data is real GIGO (Garbage In, Garbage Out), so data clean-up is essential. I'm looking for tools that will act against SQL Server 2000 and will make an "intelligent best guess" for:

Names:
- Parse one single field into First, Last, Middle, and Suffix.
- Capitalize a name properly. Must accommodate a set of predefined acronyms such as IBM, IRS, UPS, etc.

Addresses:
- Strip, parse, or split a state or country name (or acronym) from the trailing edge of a composite city/state/country field.

Phone Numbers:
- Parse one single field into AreaCode, Prefix, Suffix, Extension.

My management understands that even the best will be a best guess, but they insist we accomplish the best best-guess possible. Also, speed of the tool is irrelevant since we're only going to do this once (I freaking hope!).

I'm sure some of the world's bulk mailing houses have already written these routines. The problem is finding a source.

Any help would be greatly appreciated.

Frederick Volking
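For the phone-number case, one rough in-database approach is to normalize away the punctuation first and then slice by position. A sketch only: the table and column names here (RawContacts, Phone) are hypothetical, and it assumes extensions are marked with a bare 'x' and that anything that doesn't normalize to at least 10 digits goes to a manual-review pile.

```sql
-- Sketch: parse a free-form US phone field into parts.
-- Hypothetical table: RawContacts(Phone varchar(50)).
SELECT
    Phone,
    SUBSTRING(Digits, 1, 3)   AS AreaCode,
    SUBSTRING(Digits, 4, 3)   AS Prefix,
    SUBSTRING(Digits, 7, 4)   AS Suffix,
    SUBSTRING(Digits, 11, 10) AS Extension   -- whatever digits remain past 10
FROM
    (SELECT Phone,
            -- strip common punctuation and a literal 'x' extension marker
            REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(Phone,
                '(', ''), ')', ''), '-', ''), ' ', ''), '.', ''), 'x', '') AS Digits
     FROM RawContacts) AS d
WHERE LEN(Digits) >= 10   -- anything shorter needs manual review
```

Rows that fail the WHERE clause are exactly the "suspect" population that has to be interrogated separately, which is the filtering approach discussed in the replies below.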
X002548
Not Just a Number
15586 Posts |
Posted - 2003-05-28 : 16:02:29

You forgot that your deadline was yesterday...

Search the site for some code discussing this, but even with a best guess you'll still have problems with a percentage of the data. I know of a product called Finalist that will check addresses to see if they're correct after you've "cleaned them up."

You can even write something that would tell you of possible problems. For example, if the majority of your name field data is like:

Adam B West

then you would expect the majority of the data to have 2 spaces. Anything outside of that would be suspect and should be interrogated.

I guess what I'm saying is that anything you do will require analysis, will probably be data-centric to whatever your data conforms to, and you'll definitely have some manual cleanup.

Start coding. And post questions. Unless you're in the NY Metro area....

Brett
8-)
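The two-space test above doesn't need a loop; a set-based sketch (table and column names hypothetical) using the classic REPLACE trick:

```sql
-- Flag rows whose space count differs from the expected 2
-- (the 'First Middle Last' shape). Hypothetical table: CustNames.
SELECT CustName,
       LEN(CustName) - LEN(REPLACE(CustName, ' ', '')) AS SpaceCount
FROM   CustNames
WHERE  LEN(CustName) - LEN(REPLACE(CustName, ' ', '')) <> 2
```

Doubled or leading spaces inflate the count, which is a feature here: those rows are suspect too and get flagged along with everything else.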
robvolk
Most Valuable Yak
15732 Posts |
Posted - 2003-05-28 : 22:16:32

I'd also suggest that if you don't find a ready-made product or solution that does the total clean-up, you break the data cleansing into stages. Don't try to write be-all end-all code that does EVERYTHING. Identify data that matches a certain pattern and run a cleansing routine against just that data. It doesn't matter if you have 50, 75, or even 100 separate patterns; you WILL spend less time writing 100 specialized cleansing routines than you will writing one kitchen-sink routine to handle them all. If, after you've cleansed everything, you still have stuff left over, then unless it is a prohibitively high volume of data, you're better off fixing it by hand.

Also make sure that your cleansing routines MARK the data that's already been cleaned, so that you don't end up washing it through another routine that could possibly mess it up.

I can't find the link now, but I had a possible answer for your other post about parsing addresses into separate columns. That one is a perfect example of what I mean: a short, concise, does-one-thing-and-one-thing-only type of procedure. It also demonstrates how adding some basic but desirable features can quickly bloat and complicate the process and possibly render it unmanageable.
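Marking already-cleaned rows can be as simple as a status column that each narrow routine stamps when it claims a row. A sketch, with hypothetical table, column, and routine names:

```sql
-- Each specialized routine claims only the rows it recognizes and
-- stamps them, so later routines skip anything already handled.
ALTER TABLE CustNames ADD CleanedBy varchar(30) NULL
GO

-- Routine #1 handles only the 'WORD WORD' (first/last, one space) shape.
UPDATE CustNames
SET    CleanedBy = 'routine_01_first_last'
WHERE  CleanedBy IS NULL
  AND  CustName LIKE '% %'          -- at least one space
  AND  CustName NOT LIKE '% % %'    -- but not two or more

-- Whatever is still NULL at the end is the hand-fix pile.
SELECT CustName FROM CustNames WHERE CleanedBy IS NULL
```

The stamp also gives you an audit trail: you can see which routine touched each row if one of them turns out to have mangled something.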
ValterBorges
Master Smack Fu Yak Hacker
1429 Posts |
Posted - 2003-05-28 : 23:14:25

Contact me via email (ValterBorges@msn.com) and I will give you my phone number.

Sounds like a weekend job.
Frederick Volking
Starting Member
9 Posts |
Posted - 2003-05-29 : 07:19:24

Robvolk ... Automation is important. Manual is not feasible. Several tables have 2+ million rows. Can you imagine 2,000,000 rows of freaking garbage!
robvolk
Most Valuable Yak
15732 Posts |
Posted - 2003-05-29 : 08:08:50

That wasn't my point at all. I never said do ALL of the processing manually. If you can clean up all but 3,000 rows, for example, you'll probably get through those remaining 3,000 rows faster by hand than by trying to write a procedure to do it automatically. You said so yourself in the other thread regarding proper case: you're going to find data that defeats every set of rules you can program for, and you'll be left with no choice but to clean it up manually. Wait till you try parsing addresses that have a street number and an apartment number, never in the same place twice, with "apt" and "unit" in the same line, and both misspelled.
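As an illustration of why every rule set leaks, here is a minimal proper-case sketch (the function name is my own invention, not from this thread). The one rule it applies, "capitalize after a space, lowercase everything else," is exactly what mangles MacDonald, O'Brien, "van der Berg", and acronyms like IBM, each in a different way:

```sql
CREATE FUNCTION udf_ProperCase (@s varchar(255))
RETURNS varchar(255)
AS
BEGIN
    DECLARE @i int, @out varchar(255), @c char(1), @prev char(1)
    SELECT @i = 1, @out = '', @prev = ' '
    WHILE @i <= LEN(@s)
    BEGIN
        SET @c = SUBSTRING(@s, @i, 1)
        -- Capitalize after a space, lowercase everything else:
        -- precisely the rule that breaks MacX, O'Brien, and IBM.
        IF @prev = ' '
            SET @out = @out + UPPER(@c)
        ELSE
            SET @out = @out + LOWER(@c)
        SELECT @prev = @c, @i = @i + 1
    END
    RETURN @out
END
```

You can bolt on an exception table of acronyms and name particles, but each exception list you add is another rule that some real row will eventually defeat, which is the point being made above.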
X002548
Not Just a Number
15586 Posts |
Posted - 2003-05-29 : 09:37:56

That's my point...you need to narrow your scope. Set aside what fits your rules of acceptance and handle what you "know" to be invalid. I'm working on a sample....

quote:
    You can even write something that would tell you of possible problems. For example, if the majority of your name field data is like:

    Adam B West

    then you would expect the majority of the data to have 2 spaces. Anything outside of that would be suspect and should be interrogated.

The "anything outside" should be smallish in nature (hopefully). Once you isolate your problems and have a "clean" population, conversion should be simple. Something like:

    CREATE TABLE CustNamesOld (CustName varchar(255))
    GO

    CREATE TABLE CustNamesNew (
        FirstName  varchar(30),
        LastName   varchar(50),
        MiddleInit char(1)
    )
    GO

    INSERT INTO CustNamesOld (CustName)
    SELECT 'AARON W BURR' UNION ALL
    SELECT 'BENJAMIN P FRANKLIN' UNION ALL
    SELECT 'CAPTAIN JAMES TIBERIUS KIRK' UNION ALL
    SELECT 'DOUGLAS FAIRBANKS JR.' UNION ALL
    SELECT 'ELLA R FITZGERALD' UNION ALL
    SELECT 'FRANKLIN DELANO ROOSEVELT' UNION ALL
    SELECT 'GORDIE HOWE' UNION ALL
    SELECT 'HAROLDS PUB AND GRILL' UNION ALL
    SELECT 'MS. IVANA TRUMP' UNION ALL  -- Hey, I had to
    SELECT 'JACKSON BROWNE' UNION ALL
    SELECT 'KATHY P IRELAND' UNION ALL  -- Ditto
    SELECT 'LOUIS S ARMSTRONG' UNION ALL
    SELECT 'MIKE THE MAD DOG RUSSO' UNION ALL
    SELECT 'NATHAN HALE' UNION ALL
    SELECT 'MR. OSCAR P WILDE' UNION ALL
    SELECT 'PIERRE C TURGEON' UNION ALL
    SELECT 'QUIGLEYS TAVERN'
    GO

    -- Build something to help filter: counts the words in a string
    CREATE FUNCTION udf_WORDS (@str varchar(8000))
    RETURNS int
    AS
    BEGIN
        DECLARE @Words int, @Pos int, @x int
        SELECT @Words = 0, @Pos = 1, @x = -1
        WHILE (@x <> 0)
        BEGIN
            SET @x = CHARINDEX(' ', @str, @Pos)
            SET @Pos = @x + 1
            SET @Words = @Words + 1
        END
        RETURN @Words
    END
    GO

    -- First try to identify the problem cases; this will be your filter
    SELECT *, dbo.udf_WORDS(CustName) AS WordCount
    FROM CustNamesOld
    WHERE
        -- Business terms
        (CustName LIKE '%AND %' OR CustName LIKE '%INC%'
         OR CustName LIKE '%TAVERN%' OR CustName LIKE '%GRILL%')  -- etc.
        -- Titles and prefixes (with and without the trailing period)
        OR (CustName LIKE 'MR %' OR CustName LIKE 'MR. %'
         OR CustName LIKE 'MS %' OR CustName LIKE 'MS. %'
         OR CustName LIKE 'MRS %' OR CustName LIKE 'MRS. %'
         OR CustName LIKE 'CAPTAIN %')  -- etc.
        -- Non-conforming names: anything other than 2 or 3 words
        OR (dbo.udf_WORDS(CustName) NOT IN (2, 3))

    DROP TABLE CustNamesOld
    GO
    DROP TABLE CustNamesNew
    GO
    DROP FUNCTION udf_WORDS
    GO

And Valter, I have to disagree...technically it might be a weekend project...IF you have the rules. The rules will be an iterative process that requires interrogation of the data to determine what the best data-centric filter will be...

Anyway, good luck

Brett
8-)
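For the rows that survive Brett's filter, the "conversion should be simple" step could look like the sketch below (run before the sample's DROP cleanup, since it reuses CustNamesOld and udf_WORDS). It handles only the three-word 'FIRST M LAST' shape with a one-character middle token; everything else stays behind for its own routine.

```sql
-- Move only the clean 'FIRST M LAST' rows into the new shape.
INSERT INTO CustNamesNew (FirstName, MiddleInit, LastName)
SELECT LEFT(CustName, CHARINDEX(' ', CustName) - 1),
       SUBSTRING(CustName, CHARINDEX(' ', CustName) + 1, 1),
       SUBSTRING(CustName,
                 CHARINDEX(' ', CustName, CHARINDEX(' ', CustName) + 1) + 1,
                 LEN(CustName))
FROM   CustNamesOld
WHERE  dbo.udf_WORDS(CustName) = 3
  -- middle token is exactly one character: a space follows it
  AND  SUBSTRING(CustName, CHARINDEX(' ', CustName) + 2, 1) = ' '
```

A sibling INSERT with udf_WORDS(CustName) = 2 would cover the 'FIRST LAST' rows; each shape gets its own small, verifiable statement rather than one routine trying to do everything.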
X002548
Not Just a Number
15586 Posts |
Posted - 2003-06-02 : 10:55:23

Frederick,

Did any of this help?

Brett
8-)