Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2008 Forums
 Transact-SQL (2008)
 Pattern matching for columns to clean data

Author  Topic 

Concatch
Starting Member

4 Posts

Posted - 2011-09-26 : 02:55:23
Hi,

I have 3 columns. Column_1 - First name, Column_2 - Last_name and Column_3 - Free text field. In Column_3 sometimes the first name is entered or sometimes the last name is entered. At other times it may be a combination or something like 'mr. Smith' or 'the insured - Mr Happy'. There are even instances where names are misspelled ie 'Michael' is captured as 'Micheal' or 'Susan' is captured as 'Susanne'. To make matters worse there are cases when there is text and a numerical value ie 'Michael ID:86504928392' Some of the cell entries are completely unrelated to Column_1 or Column_2. What I'm trying to do is clean up the data.

This is my aim. If in Column_3 there is a match of some sort to either column Column_1 or Column_2 then replace what is in Column_3 with the word INSURED. If there is no match then replace with UNKNOWN.

I received some help for 24 of the rows but I have 39 000. Here is the code so far:

declare @test table
( col1 varchar(25), col2 varchar(25), col3 varchar(300))
insert into @test
select 'Jacob', 'Nieuwenhuys', 'MYSELF' union all
select 'Ina Nell', null, 'DENNIS - 8901255105083' union all
select 'Moeketsi J', 'Thebe', 'MOEKETSI J THEBE' union all
select 'Vicent Vizo', 'Ngobeni', 'VINCENT' union all
select 'Thapelo Emmanuel', 'Matongwane', 'THAPELO' union all
select ' Elsie Margareta Cornelia', 'Visser', 'PATRICK HOMAN' union all
select 'Riaan', 'Swanepoel', 'NOTED REGULAR DRIVER' union all
select 'Felicia Thandi', 'Maleho', 'FELICA' union all
select 'Howard Noel', 'Viljoen', 'GLOLASS CLAIM' union all
select 'Sibongiseni K', 'Buthelezi','SIBONGISENI' union all
select 'Winston Sindile', 'Tyanti', 'STATIONERY' union all
select 'Rayno', 'Verwey', 'MRS VERWY- 6209280834087' union all
select 'Sipho', 'Ngobe', '"NTOMBI NGOBE -6505110263086 - 0795294600 I WAS COMING FROM CHURCH AND I WAS ENTERING MY HOUSE THERE WAS THIS SPEEDING CAR BEHIND ME AND I WAS FORCED TO DRIVE INTO MY HOUSE VERY FAST AND I SCRATCH MY CAR AGAINST THE WALL"' union all
select 'Mouring Marnix Marius', 'Kolhoff', 'MOURING MARNIX MARIUS KOLHOFF' union all
select 'Piet', 'Van Der Walt', 'MARILIZE DAUGHTER ID:8511140147081' union all
select 'Jonas', 'Sikhosana', 'SAMUKELISIWE NGEMA - 9007280304086' union all
select 'Suzan', 'Phala', 'THUSO PHALA - 8605275664082 - 0761814997' union all
select 'Alicia', 'Austin', 'IAN CUMMING MY PARTNER-CUMMING IAN ARTHUR BOY8111105329087' union all

select 'Vivian', 'Naidoo', 'THE INSURED - VIVIAN' union all
select 'Josephine Maitlhoko', 'Molefe', 'BROTHER - MOLEFE - TBA - 072 7372 482' union all
select 'Ishmael Stephen', 'Mabunda', 'ISHAMAEL MABUNDA' union all
select 'Johanna Magrita', 'Chester', '6611255093088 MR BP HAVINGA' union all
select 'Lesley Eleanoir', 'Plutsick', 'REG DRIVER - COLIN' union all
select 'Makhuto', 'Manaka', 'KOBELA SIBUYI - INSURED''S DAUGHTER'

select
ltrim(col1) as col1,
ltrim(col2) as col2,
case when col3 like '%' + col1 + '%'
or col3 like '%' + col2 + '%'
or col1 like '%' + col3 + '%'
or col2 like '%' + col3 + '%'
then 'INSURED'
else 'unknown'
end as New_Col3
,col3
from @test

Is there a way to obtain the result for all 39 000 entries without having to input each row as seen in the code above.

Thanx!!!

visakh16
Very Important crosS Applying yaK Herder

52326 Posts

Posted - 2011-09-26 : 02:59:21
you can use PATINDEX for finding exact pattern matches and for similar name matches (to account for spelling differences) use fuzzy matching algorithm like below

http://sqlblindman.googlepages.com/fuzzysearchalgorithm




------------------------------------------------------------------------------------------------------
SQL Server MVP
http://visakhm.blogspot.com/

Go to Top of Page
   

- Advertisement -