Please start any new threads on our new
site at https://forums.sqlteam.com. We've got lots of great SQL Server
experts to answer whatever question you can come up with.
Author |
Topic |
s_a19
Starting Member
15 Posts |
Posted - 2003-08-01 : 07:32:17
|
hy everyone, im having this problemi am transferring data from one table to another table, but there is compatability issues, as in one table, this is contact which can have numerous amounts in, usually less than 3 people, for exampleS Scott, Mr Scott Smith, Scott Smithas you can see they are in different formats, but are always seperated by a comma, the above is only an example and can be different peopleand i need to transfer it, into another table.with title inits name surnameas the requirements.this one has me seriously stumped.thanx |
|
Page47
Master Smack Fu Yak Hacker
2878 Posts |
|
AndrewMurphy
Master Smack Fu Yak Hacker
2916 Posts |
Posted - 2003-08-01 : 07:53:51
|
I'd tackle this in multiple passes....1. sort out the titles (there's a very limited range of valid titles, which can be found on inspection or by a query - select distinct(left(name,patindex('% %',name)-1)) from customerwhere patindex('% %',name)-1 > 0)removing the titles from the input table after they have been 'moved' into the new table2. sort out the surnames....slight reverse of the above..again removeonly issue you have here is that not all suffixes are surnames...ie company, ltd, jnr, etc....3. anything left over could be split into 1st name, middle name (or dumped/concatanated onto 1st name)4. Initials would be a formula on 1st name....ie left(1stname,1)The source data having 'multiple customers' per-row should be sorted first....look for CSV on this site....that particular issue has come up often before....and then deal with getting the unstructured names into a structure....you won't be able to get 100% of the names right, but you could come very close to make an automated solution better than no solution. |
|
|
GreySky
Starting Member
20 Posts |
Posted - 2003-08-01 : 15:52:55
|
I have written a heuristic name parser that will parse names into:-Prefix-First-Middle-Last-Suffix-Titlecomponents. It is industrial strength, can handle single names as well as name composites such as "David C. & Nancy Atkins" (splits into two distinct names: David-C.-Atkins and Nancy-Atkins), and generally is the best heuristic name parser I've ever seen (shucks I wrote it!).It runs in VB (is an Access database), and I chiefly use it to aid data conversions from those icky "I've never seen a relational database before" formats to a relational design.If want it, email me at atkinsdc@bellsouth.net or datkins@atl.invesco.com.David Atkins, MCP |
|
|
X002548
Not Just a Number
15586 Posts |
Posted - 2003-08-01 : 15:59:01
|
How do you know to put what where?You must accomodate some fallout, no?What defines out of scope parsing?quote: Grey skies are gonna clear up, so put on a happy face
Brett8-)SELECT POST=NewId() |
|
|
s_a19
Starting Member
15 Posts |
Posted - 2003-08-03 : 05:27:49
|
thanx for all the answers, i ideally need to do it, in mysql using oracle. but if i can do it another way. i will |
|
|
robvolk
Most Valuable Yak
15732 Posts |
Posted - 2003-08-03 : 08:07:34
|
quote: i ideally need to do it, in mysql using oracle. but if i can do it another way. i will
I don't know how you're going to parse MySQL data using Oracle, but posting the question on a SQL Server site probably won't help much. Don't worry, the CSV parsing technique will work with any database product, and once you parse out the multiple comma-separated names, you can modify the technique to parse out the pieces of each name. |
|
|
s_a19
Starting Member
15 Posts |
Posted - 2003-08-03 : 08:21:10
|
ohh right sorry, but everything hopefully should still be ok.what type of things would i need to modify? |
|
|
robvolk
Most Valuable Yak
15732 Posts |
Posted - 2003-08-03 : 08:26:17
|
Instead of using a comma as a separator, you'd use a space. If you have data like David describes, then you might also need to look at parsing ampersands and periods too.I'd recommend doing it in stages, try parsing simple names first, using Charindex and Substring:SELECT SubString(FullName, 1, CharIndex(' ', FullName)) AS FirstName,SubString(FullName, CharIndex(' ', FullName)+1, 100) AS LastNameFROM myTableOnce you get those parsed, then you can modify the expression to handle 3 names, Mr./Mrs./Dr. prefixes, and Jr./III/Esq. suffixes and such. |
|
|
s_a19
Starting Member
15 Posts |
Posted - 2003-08-03 : 08:30:29
|
yeah i was thinking about just doing the ones, such asA Smith, etc. which i feel should give me some feeling as to what is required for the other parts to complete. |
|
|
s_a19
Starting Member
15 Posts |
Posted - 2003-08-04 : 05:56:31
|
i have just come to run one of the queries, like belowselect substring(contact, 1, CharIndex('',Contact)) AS Namesubstring(contact, CharIndex('',Contact)+1,100) AS Surnamefrom address;and it comes up with from keyword not found were expected. neone know why?thanx |
|
|
GreySky
Starting Member
20 Posts |
Posted - 2003-08-04 : 08:23:03
|
Dear X002548:If you understand what a heuristic is, then you would realize that it makes assumptions. Of course, if the input is garbage, the output will be garbage; SISO rules still apply.Quite frankly, I've had numerous replies very similar to yours regarding my name parser. I would only suggest you take a look at it to see what it can do. Those many who have requested it have returned very positive comments.If you tell me how to parse "Grey skies are gonna clear up, so put on a happy face " into a name, then I'll tell my parser how to do it : ) That's what I mean by a heuristic, I tell the computer how to parse a name based upon how I would do it myself.--Parser says:First: GreyMiddle: skies are gonna clearLast: upTitle: so put on a happy face--Typical parser heuristics-Prefix: Pre-defined (arrayed)-First: First word after a prefix-Middle: Between first and last, but especially if an initial-Last: Last word before suffix/title (after last-name compositing)-Suffix: Pre-defined (arrayed)-Title: Post suffix or post comma-Ampersand: Indicates new name for name compositing and binding-Last name composites: Pre-defined (e.g., Van Gough, de la Hoya)-Out-of-Scope: Instr() use scanning for illegals typically used for company names (e.g., numbers, phrases like 'institut')-Name binding and compositing logicIt fills up 18 pages if you paste it into a Word document. So, yeah, until you've seen it ...David Atkins, MCPquote: Originally posted by X002548 How do you know to put what where?You must accomodate some fallout, no?What defines out of scope parsing?quote: Grey skies are gonna clear up, so put on a happy face
Brett8-)SELECT POST=NewId()
|
|
|
X002548
Not Just a Number
15586 Posts |
Posted - 2003-08-04 : 09:09:33
|
quote: If you understand what a heuristic is, then you would realize that it makes assumptions. Of course, if the input is garbage, the output will be garbage; SISO rules still apply.
Pretty big assumption...I think I'll continue with absolutes...such that if something falls outside of the rules, someone will need to know about and correct it...From Websters:quote: heuristicPronunciation: hyû'ristik Matching Terms: heuristic program, heuristic rule, heuristics testing WordNet Dictionary Definition: [n] a commonsense rule (or set of rules) intended to increase the probability of solving some problem [adj] ofo or relating to a general formulation that serves to guide investigation [adj] (computer science) relating to or using a heuristic rule Synonyms: heuristic program, heuristic rule, trial-and-error Antonyms: algorithmic See Also: formula, lateral thinking, rule Webster's 1913 Dictionary Definition: \Heu*ris"tic\, a. [Gr. ? to discover.]Serving to discover or find out. Computing Dictionary Definition: 1. A rule of thumb, simplification, or educated guess that reduces or limits the search for solutions in domains that are difficult and poorly understood. Unlike algorithms, heuristics do not guarantee optimal, or even feasible, solutions and are often used with no theoretical guarantee.2. approximation algorithm. See Also: algorithm, programming
Brett8-)SELECT POST=NewId() |
|
|
GreySky
Starting Member
20 Posts |
Posted - 2003-08-04 : 09:55:22
|
quote: Pretty big assumption...I think I'll continue with absolutes...such that if something falls outside of the rules, someone will need to know about and correct it...
#1 - You assume the "rules" are known, rigid, and used without exception over all non heterogenous data sources you may have to work with.#2 - How would you solve NP-hard problems, such as the 3d bin packing problem? Would you tell your customer, "Sorry, it's not absolute"? And, yes, I have one of these too, adapted from Vigo, Pisinger, Martello--except it's actually in production and rotates cases to aid in finding an "optimal solution."Not everything fits so easily within a box, but still needs a solution. Perfection is not always a requirement.David Atkins, MCP |
|
|
X002548
Not Just a Number
15586 Posts |
Posted - 2003-08-04 : 10:17:05
|
quote: Originally posted by GreySkyPerfection is not always a requirement.
OK, what ever you say...An exception handling process will always be part of any system I build...as long as I can make sure I "know" about it...that's why I try to positively identify things...if it don't fit, you must acquit"Sorry sir, that $1,000.00 deduction was just a mistake..."Brett8-)SELECT POST=NewId() |
|
|
GreySky
Starting Member
20 Posts |
Posted - 2003-08-04 : 10:32:47
|
Exception handling has nothing to do with heuristics. Exception handling is for unexpected situations (typically an exception thrown by the processor itself or explicitly raised by an application)A heuristic solution that is incorrect *doesn't know it's wrong.*Our points are made.Davidquote: Originally posted by X002548
quote: Originally posted by GreySkyPerfection is not always a requirement.
OK, what ever you say...An exception handling process will always be part of any system I build...as long as I can make sure I "know" about it...that's why I try to positively identify things...if it don't fit, you must acquit"Sorry sir, that $1,000.00 deduction was just a mistake..."Brett8-)SELECT POST=NewId()
|
|
|
X002548
Not Just a Number
15586 Posts |
Posted - 2003-08-04 : 10:35:23
|
Thanks...I'll have to read up on the subject...Brett8-)SELECT POST=NewId() |
|
|
Amethystium
Aged Yak Warrior
701 Posts |
Posted - 2003-08-04 : 12:37:35
|
You got your copy of Aphelion yet?----------------Shadow to Light |
|
|
|
|
|
|
|