
 GIGO clean-up ... HELP!


Frederick Volking
Starting Member

9 Posts

Posted - 2003-05-28 : 15:36:32
We have a business need, meaning we can actually PAY for a solution if said solution is a prepackaged tool set (otherwise, we have to write the routines ourselves internally).

We're faced with transforming several databases into one. Unfortunately, all incoming data is real GIGO (Garbage In Garbage Out). So data clean-up is essential.

I'm looking for tools which will act against SQL Server 2000 and will make an "Intelligent Best-Guess" for:

Names:
- Parse one single field into First, Last, Middle, & Suffix.
- Capitalize a name properly. Must accommodate a set of predefined acronyms such as IBM, IRS, UPS, etc.

Addresses:
- Strip, parse, or split a state or country name (or acronym) from the trailing edge of a composite city/state/country field.

Phone Numbers:
- Parse one single field into AreaCode, Prefix, Suffix, & Extension.
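Of the three, the phone-number piece is the most mechanical, so here's a rough, hypothetical T-SQL sketch (the variable names and the assumption of 10-digit US-style numbers with an optional "x" extension are mine, not a packaged tool): strip the punctuation, peel off anything after the "x" as the extension, then split the remaining digits positionally. Anything that doesn't come out to exactly ten digits belongs on the exception pile, not in the parsed output.

```sql
-- Hypothetical sketch: assumes NANP-style numbers such as
-- '(314) 555-1212 x204'; anything else should be flagged, not parsed.
DECLARE @Phone varchar(50), @Digits varchar(50), @ExtPos int
SET @Phone = '(314) 555-1212 x204'

-- Peel off an extension marked by X (case-insensitive via UPPER)
SET @ExtPos = CHARINDEX('X', UPPER(@Phone))
SET @Digits = CASE WHEN @ExtPos > 0
                   THEN LEFT(@Phone, @ExtPos - 1)
                   ELSE @Phone END

-- Strip common punctuation so only digits remain
SET @Digits = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
              @Digits, '(', ''), ')', ''), '-', ''), ' ', ''), '.', '')

-- Split positionally; only trust the result when LEN(@Digits) = 10
SELECT SUBSTRING(@Digits, 1, 3) AS AreaCode,
       SUBSTRING(@Digits, 4, 3) AS Prefix,
       SUBSTRING(@Digits, 7, 4) AS Suffix,
       CASE WHEN @ExtPos > 0
            THEN LTRIM(SUBSTRING(@Phone, @ExtPos + 1, 50))
            ELSE '' END AS Extension
```

The same peel-and-split idea wrapped in a function could drive a set-based UPDATE across the whole table, with a `WHERE LEN(...) = 10` guard routing everything else to manual review.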

My management understands that even the best will be a best-guess, but they insist we accomplish the best best-guess possible. Also, speed of the tool is irrelevant since we're only going to do this once (I freeking hope!)

I'm sure some of the world's Bulk Mailing Houses have already written these routines. Problem is finding a source.

Any help would be greatly appreciated.

Frederick Volking


X002548
Not Just a Number

15586 Posts

Posted - 2003-05-28 : 16:02:29
You forgot that your deadline was yesterday..

Search the site for some code discussing this, but even with a best guess you'll still have problems with a percentage of the data.

I know of a product called Finalist that will check addresses to see if they're correct after you've "cleaned them up".

You can even write something that would tell you of possible problems. For example if the majority of your name field data is like:

Adam B West

Then you would expect the majority of the data to have 2 spaces.

Anything outside of that would be suspect and should be interrogated.

I guess what I'm saying is that anything you do will require analysis, will probably need to be specific to the patterns your data conforms to, and you'll definitely have some manual cleanup.

Start coding.

And post questions.

Unless you're in the NY Metro area....



Brett

8-)

robvolk
Most Valuable Yak

15732 Posts

Posted - 2003-05-28 : 22:16:32
I'd also suggest that if you don't find a ready-made product or solution that does the total clean-up, you break the data cleansing into stages. Don't try to write be-all end-all code that does EVERYTHING. Identify data that matches a certain pattern and run a cleaning routine against just that data. It doesn't matter if you have 50, 75, or even 100 separate patterns; you WILL spend less time writing 100 specialized cleansing routines than you will writing one kitchen-sink routine to handle them all. After you've cleansed everything, if you still have stuff left over, unless it is a prohibitively high volume of data you're better off fixing it by hand.

Also make sure that your cleansing routines MARK the data that's already been cleaned so that you don't end up washing it through another routine that could possibly mess it up.
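To make the "mark what's been cleaned" idea concrete, here's a hypothetical sketch (the table, column, and routine names are assumptions): give the table a nullable audit column, have each pattern routine stamp only the rows it handled, and whatever is never stamped becomes the manual pile.

```sql
-- Hypothetical sketch: each cleansing routine stamps the rows it
-- touched so later routines skip them and leftovers are easy to find.
ALTER TABLE CustNamesOld ADD CleanedBy varchar(30) NULL
GO
-- Routine #1: plain "FIRST LAST" names (exactly one space, no periods)
UPDATE CustNamesOld
SET    CustName  = CustName,          -- real cleansing logic goes here
       CleanedBy = 'FIRST_LAST'
WHERE  CleanedBy IS NULL              -- never re-wash a cleaned row
  AND  CustName LIKE '% %'
  AND  CustName NOT LIKE '% % %'
  AND  CustName NOT LIKE '%.%'

-- ...routines #2 through #N, each with its own pattern and stamp...

-- Whatever is still unstamped after all routines is the manual pile
SELECT * FROM CustNamesOld WHERE CleanedBy IS NULL
```

The stamp doubles as an audit trail: if routine #37 turns out to have a bug, `WHERE CleanedBy = 'ROUTINE_37'` finds exactly the rows it touched.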

I can't find the link now, but I had a possible answer for your other post about parsing addresses into separate columns. That one is a perfect example of what I mean: a short, concise, does-one-thing-and-one-thing-only type of procedure. It also demonstrates how adding some basic but desirable features can quickly bloat and complicate the process and possibly render it unmanageable.


ValterBorges
Master Smack Fu Yak Hacker

1429 Posts

Posted - 2003-05-28 : 23:14:25
Contact me via email
ValterBorges@msn.com

I will give you my phone number.
Sounds like a weekend job.


Frederick Volking
Starting Member

9 Posts

Posted - 2003-05-29 : 07:19:24
Robvolk ... Automation is important. Manual is not feasible. Several tables have 2+ million rows. Can you imagine 2,000,000 rows of freeking garbage!


robvolk
Most Valuable Yak

15732 Posts

Posted - 2003-05-29 : 08:08:50
That wasn't my point at all. I never said do ALL of the processing manually. If you can clean up all but 3,000 rows, for example, you'll probably get through those remaining 3,000 rows faster by hand than by trying to write a procedure to do it automatically. You said so yourself in the other thread regarding proper case: you're going to find data that defeats every set of rules you can program for, and you'll be left with no choice but to clean it up manually.

Wait till you try parsing addresses that have a street number and apartment number, never in the same place twice, and with "apt" and "unit" in the same line, and both misspelled.
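For exactly that kind of mess, one hedged approach is not to parse at all on the first pass: a sketch (the table and column names here are assumptions) that just flags address lines mixing or misspelling unit designators, so they land in the manual pile instead of poisoning the automated routines.

```sql
-- Hypothetical sketch: flag suspect address lines rather than parse them.
-- Table and column names are assumed for illustration.
SELECT *
FROM   Addresses
WHERE  (AddrLine LIKE '%APT%' AND AddrLine LIKE '%UNIT%')  -- both designators
   OR   AddrLine LIKE '%APPT%'                             -- common misspelling
   OR   AddrLine LIKE '%UINT%'                             -- another misspelling
```

The misspelling list grows as you interrogate the data; that's the iterative part, and it's cheaper than teaching a parser every variant up front.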


X002548
Not Just a Number

15586 Posts

Posted - 2003-05-29 : 09:37:56
That's my point...you need to narrow your scope. Eliminate what fits your rules of acceptance and handle what you "know" to be invalid.

I'm working on a sample....

quote:

You can even write something that would tell you of possible problems. For example if the majority of your name field data is like:

Adam B West

Then you would expect the majority of the data to have 2 spaces.

Anything outside of that would be suspect and should be interrogated.



Anything outside of that should (hopefully) be smallish in nature:

Once you isolate your problems and have a "clean" population, conversion should be simple.

Something like:


CREATE TABLE CustNamesOld(CustName varchar(255))
GO
CREATE TABLE CustNamesNew(FirstName varchar(30), LastName varchar(50), MiddleInit char(1))
GO
INSERT INTO CustNamesOld(CustName)
SELECT 'AAROON W BURR' Union All
SELECT 'BENJAMIN P FRANKLIN' Union All
SELECT 'CAPTAIN JAMES TIBERIUS KIRK' Union All
SELECT 'DOUGLAS FAIRBANKS JR.' Union All
SELECT 'ELLA R FITZGERALD' Union All
SELECT 'FRANKLIN DELANO ROOSEVELT' Union All
SELECT 'GORDIE HOWE' Union All
SELECT 'HAROLDS PUB AND GRILL' Union All
SELECT 'MS. IVANA TRUMP' Union All -- Hey I had too
SELECT 'JACKSON BROWNE' Union All
SELECT 'KATHY P IRELAND' Union All -- Ditto
SELECT 'LOUIS S ARMSTRONG' Union All
SELECT 'MIKE THE MAD DOG RUSSO' Union All
SELECT 'NATHAN HALE' Union All
SELECT 'MR. OSCAR P WILDE' Union All
SELECT 'PIERRE C TURGEON' Union All
SELECT 'QUIGLEYS TAVERN'

-- Build something to help filter

-- Returns the number of space-separated words in @str
CREATE FUNCTION udf_WORDS
(@str varchar(8000))
RETURNS int
AS
BEGIN
DECLARE @Words int, @Pos int, @x int
SELECT @Words = 0, @Pos = 1, @x = -1

-- Walk the string space by space; the final pass, where no more
-- spaces are found (@x = 0), counts the last word
WHILE (@x <> 0)
BEGIN
SET @x = CHARINDEX(' ', @str, @Pos)
SET @Pos = @x + 1
SET @Words = @Words + 1
END

RETURN @Words
END

-- First try and identify the problem cases, this will be your filter

SELECT *, dbo.udf_WORDS(CustName) AS Words FROM CustNamesOld
WHERE
-- Business terms
(CustName LIKE '% AND %' OR CustName LIKE '% INC%' OR CustName LIKE '%TAVERN%' OR CustName LIKE '%GRILL%') -- etc.
-- Titles and Prefixes ([. ] also catches the 'MR.' / 'MS.' forms)
OR (CustName LIKE 'MR[. ]%' OR CustName LIKE 'MS[. ]%' OR CustName LIKE 'MRS[. ]%' OR CustName LIKE '%CAPTAIN %') -- etc.
-- Non-conforming names: anything other than 2 or 3 words
OR (dbo.udf_WORDS(CustName) NOT IN (2,3))

DROP TABLE CustNamesOld
GO
DROP TABLE CustNamesNew
GO
DROP FUNCTION udf_WORDS
GO


And Valter, I have to disagree...technically it might be a weekend project...IF you have the rules...but the rules will be an iterative process that requires interrogation of the data to determine what the best data-centric filter will be...

Anyway, good luck.



Brett

8-)

X002548
Not Just a Number

15586 Posts

Posted - 2003-06-02 : 10:55:23
Frederick,

Did any of this help?



Brett

8-)
   
