The "John Smith" Problem of Householding

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2008 Forums

Transact-SQL (2008)

The "John Smith" Problem of Householding

Author

Topic

clay_mckinney
Starting Member

14 Posts

Posted - 2011-07-07 : 14:11:34

I am working on a "Householding" project where I get a list of accounts (flat file) and I'm supposed to determine which ones are related to each other. So, if two accounts have the same email address, or phone number, or mailing address, then they are "householded" together with a common key.

One way we want to group them is by name. So, if there are two accounts with the name Rudebaga Jameson, we want to assume that this is the same guy and household them together.

The problem is that with a list of 150,000 accounts, some very common names are going to create enough false positives to cause a problem. We are calling this the "John Smith" problem.

Normally, our business needs allow us to be very permissive. False positives are usually acceptable to us. For this reason, if there is a match on any one field, it's considered a matched record. Therefore it's useless to corroberate the name matches with another field. A rule like "Only count a match on 'Name' if there is also a match on another field" would be useless because if there was a match on another field we'd already consider the records matched.

Do any of you out there have a strategy suggestion for how to deal with this? I'd appreciate any input.

- Clay McKinney

TG
Master Smack Fu Yak Hacker

6065 Posts

Posted - 2011-07-07 : 14:30:19

Not sure what you're asking becase based on what you said you are accomplishing your goal. What specifically do you want to change? The only thing I can think of is disregard the name - just "household" on (email or phone or address)

Be One with the Optimizer
TG

clay_mckinney
Starting Member

14 Posts

Posted - 2011-07-07 : 15:08:28

Disregarding the name is possibly the CORRECT answer, but a requirement from the sales team is we have to be able to say we are using the name for matching. Our non-technical salespeople have to sell this to non-technical other people. Also, a name match might be the only match and might be a valid match on most names. It's only a problem for extremely common names, like "John Smith".

Currently my plan is to highlight all the ones where the name is the only matched field so that a human can look at them and make a human decision about the likelihood the match is valid. This is a plausible plan for us, but it would be better if I could take human intervention out completely.

Perhaps, if someone knew the highest percentage a common name appears in the general population, I could use that as a guide. Like if someone knew "The most popular name in the world is Sally Johnson. .8% of the US population is named Sally Johnson." Then I could know that if the same name appeared in more than .8% of my records, I should ignore it. Or something.

The reason I'm posing this is because it seems like the kind of thing everyone will have to deal with at some point, and it's probably been solved before. Yet I don't know how to search for the answer, since it's a question of strategy rather than syntax.

Also, I'd like to apologize to all of you out there named John Smith or Sally Johnson. I don't mean any offence by using your name as an example.

- Clay McKinney

TG
Master Smack Fu Yak Hacker

6065 Posts

Posted - 2011-07-07 : 15:12:10

I think the only folks that may take offense are ones named Rudebaga Jameson

Be One with the Optimizer
TG

TG
Master Smack Fu Yak Hacker

6065 Posts

Posted - 2011-07-07 : 15:26:43

I've never needed to solve this particular problem so feel free to ignore these follow up questions.

- what geographic area does your population cover?
- how big is your population?
- do you capture ANY other data points related to the name besides the other 3 attributes you mentioned?
- what is/are the source(s) of your population data ie: Are you likely to have multiple, varying instances of the same person? Do you store from which source a particular entry derives?
- Do you have a definition for what constitutes a "household"?

Be One with the Optimizer
TG

clay_mckinney
Starting Member

14 Posts

Posted - 2011-07-07 : 15:43:19

Our customers are in 49 of the 50 states in the USA. Each individual file we get though would be from just one metropolitan area.

We get different things to match on from different customers. Address, Email, and Phone were just easy examples.

Population sizes vary by customer from 5,000 to 150,000 or so.

The data comes from human entry. It's the Wild West. So, we could have Mr. John Smith, John Smith, John C. Smith, etc. But I'm solving this problem with some scrubbing rules and SSIS's fuzzy matching.

A "household" is a group of records that we believe to be related to each other closely enough that if one of them is a good customer we don't want to tick off any of the others. If two people's names are on one account, then wherever else either name appears, all of those records are in the "household". It's better for us to accidentally lump someone in than to accidentally leave someone out of the household. But with "The John Smith Problem" the lumping in gets pushed to a silly extreme.

- Clay McKinney

russell
Pyro-ma-ni-yak

5072 Posts

Posted - 2011-07-07 : 16:40:01

Ughh, I've lived this one before. False positives are never acceptable in our case. False negatives are. It's the difference between (1)considering two different customers to be a single account or (2) accidentally making 2 accounts for a single customer. The former is never ok, the latter is, and will resolve itself over time if the customer is active.

We only use name when we match on at least one of the following:
- Phone
- Address/City/Zip
You would be surprised how many not-so-common names represent different people or businesses.

We do several passes. We try to match on Phone 1st, then Address/City/Zip.

If we have email addresses, that is an almost positive match. I say almost positive because we do import data from other sources. Our own tables force email to be unique. Can't force address or phone to be unique because people and businesses move without telling you. Phone tends to be the best field to match on in my experience.

You also need to check for mis-spellings: Jones, Jonse.
And for abbreviations (and for missing street identifiers):
- 123 North Main Street
- 123 N. Main St.
- 123 N. Main

And for bad spacing:
- 123N Main St.

So...we make several passes, matching on various fields and assign scores to those that aren't either a positive match or a positive non-match.

We can scrub with about 95% accuracy. The rest we get real close and pass it to an analyst to manually review.

clay_mckinney
Starting Member

14 Posts

Posted - 2011-07-07 : 17:43:28

Very helpful comments.

For abbreviations, were basically changing
- 123 North Main Street to 123NMainST
- 123 N. Main St. to 123NMainSt
- 123 N. Main to 123NMain
- 123N Main St. to 123NMainSt

SQL would consider all these to be a match except the third one. I'm going to put this through SSIS's fuzzy match to hopefully take care of that case and also to take care of things like 123 N. Mian Str.

I'm especially heartened to hear that you "get real close and pass it to an analyst to manually review." Now I'll feel less guilty if I have to do that. :)

- Clay McKinney

russell
Pyro-ma-ni-yak

5072 Posts

Posted - 2011-07-07 : 20:32:01

Glad it helped a little. Scrunning user input will always have its caveats.

Lumbago
Norsk Yak Master

3271 Posts

Posted - 2011-07-08 : 05:08:58

feel free to ignore the following advice/thoughts completely as I have very limited experience in this area. But in SQL Server 2008 R2 (are you on R2?) there are some new features called Master Data Services that has some pretty decent udf's you can use for fuzzy matching. The udf "mdq.Similarity" basically compares two strings of data an scores them based on their similarity and the closer the score is to 1 the more similar they are.

Now by using such a function; would it be possible that if a string (i.e. the name) gets x matches with a score of 0.9 or more you'll need at least one other match with at least a 0.9 for the records to qualify as being a part of the same household? You can read about the mdq.Similarity here:
http://thefirstsql.com/2011/02/07/regular-expressions-advanced-string-matching-and-new-split-function-sql-server-2008-r2/

- Lumbago
My blog-> http://thefirstsql.com/2011/07/08/how-to-find-gaps-in-identity-columns-at-the-speed-of-light/

clay_mckinney
Starting Member

14 Posts

Posted - 2011-07-08 : 11:06:50

Thanks Lumbago. We are on 2008, so I will check into this to see if I can get them to upgrade to R2. Upgrading to R2 might be actually faster than moving parts of the project out to SSIS just to get fuzzy matching.

- Clay McKinney

clay_mckinney
Starting Member

14 Posts

Posted - 2011-07-09 : 11:13:10

http://msdn.microsoft.com/en-us/library/ee633742.aspx says that Master Data Services is only available in Enterprise Edition or higher. At $8500/processor, Enterprise is not within our budget for this project. So, for the fuzzy matching, I'll be moving this process from a sproc to an SSIS package. The fuzzy matching stuff creates more matches.

"The John Smith Problem" is the opposite end of the spectrum, too many matches. For this my current plan is to isolate the suspect rows with a query and then let a human analyst review them.

I'm still open to further suggestions. You guys have all been very helpful.

- Clay McKinney

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources