| Author |
Topic |
|
yalgaar
Starting Member
43 Posts |
Posted - 2006-02-28 : 10:54:06
|
| I have a table which has only 1 column. I would like to write a select statement which will give me list of all duplicates. I mean duplicates are not literal. Let me give an example of the data:The TitanicTitanicBruce AllmightyABCDCheaper by the DozenDozenMeet the ParentsIf you look at the above list: 1) The Titanic and 2)Titanic and not duplicates, but they are same.So what I want is a list where even if it finds 1 word as same, it should list it.Any ideas? |
|
|
Srinika
Master Smack Fu Yak Hacker
1378 Posts |
Posted - 2006-02-28 : 11:08:38
|
| Ur definition of duplicity is a bit hard to understandIs it need to compare something like "The Titanic" and "The Ship Titanic is a" Or is it just a single word against the more than one word"How Are You" and "Are" |
 |
|
|
yalgaar
Starting Member
43 Posts |
Posted - 2006-02-28 : 11:30:52
|
| "Titanic" and "The Titanic" will be duplicate since there is one word common."ABCD" and "hello Sam" will not be duplicate.I need the list of all duplicate records. |
 |
|
|
blindman
Master Smack Fu Yak Hacker
2365 Posts |
Posted - 2006-02-28 : 11:37:15
|
| So if your list includes "Lord Of The Rings" and "The Graduate", you want to count them as the same movie because they each contain the word "the"?You may want to rethink this... |
 |
|
|
yalgaar
Starting Member
43 Posts |
Posted - 2006-02-28 : 12:00:52
|
quote: Originally posted by blindman So if your list includes "Lord Of The Rings" and "The Graduate", you want to count them as the same movie because they each contain the word "the"?You may want to rethink this...
Interesting Qustions. Did not think of it in that perspective. To answer that question, Can I supppy words while running the query to skip certain words. I would then have the query skip certain words like "the" "of" "A" ect |
 |
|
|
blindman
Master Smack Fu Yak Hacker
2365 Posts |
Posted - 2006-02-28 : 12:40:12
|
| ..or the word "dogs"? Since that is used in both "Reservoir Dogs" and "All Dogs Go To Heaven"... Or how about "Mary", which is used in both "Mary Poppins" and "There's Something About Mary"?I think what you are looking for is a fuzzy search algorithm. I have one, if you are interested. It returns a value between 0 and 1 indicating the degree of similiarity between two string values. Be aware the fuzzy algorithms are NEVER 100% correct. You always have to trade of the probability of missing two duplicates against the probability of mistakenly associating two values. |
 |
|
|
Srinika
Master Smack Fu Yak Hacker
1378 Posts |
Posted - 2006-02-28 : 12:46:17
|
| yalgaar,My Question again (for clarification)can u have a situation like "The Titanic" and "The Ship Titanic is a" where u wan to have duplicate as "Titanic" ?Ur search may take a long time if the field of each record contains lot of words (even if u eliminate some words as "the") - because the field value needs to be broken to each word and those needs to be checked against each field whether those exists or noteg.Let the Feild contains following 3 recordsRec1 : ABCD PQR ST UVW XYZRec2 : LMN OPQ ST U XYZRec3 : BCD LMN PQR ST Rec1's each word, ie ABCD PQR ST UVW XYZ should be checked against Rec2 & Rec3So for these 3 records the iteration is done 5 x 2 + 5 x 2 + 4 x 2U will get results as Rec1 & Rec2 "Duplicating - as per ur definition" 2 times (with regard to data items ST & XYZ)Rec1 & Rec3 "Duplicating - as per ur definition" 2 times (with regard to data items ST & PQR)Rec2 & Rec3 "Duplicating - as per ur definition" 2 times (with regard to data items ST & LMN)Is that what u want? |
 |
|
|
yalgaar
Starting Member
43 Posts |
Posted - 2006-02-28 : 12:57:27
|
quote: Originally posted by blindman So if your list includes "Lord Of The Rings" and "The Graduate", you want to count them as the same movie because they each contain the word "the"?You may want to rethink this...
Yes, they will be listed as same movie unless I say exclude word "the" |
 |
|
|
yalgaar
Starting Member
43 Posts |
Posted - 2006-02-28 : 12:59:19
|
quote: Originally posted by Srinika yalgaar,My Question again (for clarification)can u have a situation like "The Titanic" and "The Ship Titanic is a" where u wan to have duplicate as "Titanic" ?Ur search may take a long time if the field of each record contains lot of words (even if u eliminate some words as "the") - because the field value needs to be broken to each word and those needs to be checked against each field whether those exists or noteg.Let the Feild contains following 3 recordsRec1 : ABCD PQR ST UVW XYZRec2 : LMN OPQ ST U XYZRec3 : BCD LMN PQR ST Rec1's each word, ie ABCD PQR ST UVW XYZ should be checked against Rec2 & Rec3So for these 3 records the iteration is done 5 x 2 + 5 x 2 + 4 x 2U will get results as Rec1 & Rec2 "Duplicating - as per ur definition" 2 times (with regard to data items ST & XYZ)Rec1 & Rec3 "Duplicating - as per ur definition" 2 times (with regard to data items ST & PQR)Rec2 & Rec3 "Duplicating - as per ur definition" 2 times (with regard to data items ST & LMN)Is that what u want?
The records will be considered duplicate since all 3 of the word "ST" but they will not be considered duplicate of I exclude the word "ST" while running the query.Do you think it will take a very long time to run if there are around 3000 records? |
 |
|
|
blindman
Master Smack Fu Yak Hacker
2365 Posts |
Posted - 2006-02-28 : 13:33:14
|
| You would have to work hard to contrive a query that would take a long time in SQL Server with only 3000 records. |
 |
|
|
yalgaar
Starting Member
43 Posts |
Posted - 2006-02-28 : 13:59:10
|
| So what is the final query? |
 |
|
|
Billkamm
Posting Yak Master
124 Posts |
Posted - 2006-02-28 : 14:02:09
|
| blindman: no I wouldn't. I would just create a loop that caused a timed delay :) |
 |
|
|
blindman
Master Smack Fu Yak Hacker
2365 Posts |
Posted - 2006-02-28 : 15:00:55
|
quote: Originally posted by Billkamm blindman: no I wouldn't. I would just create a loop that caused a timed delay :)
Cheating! Where's the elegance? Where's the creativity? Where's the obfuscation! Zero points for style, and I'm sorry, but you do not advance to the next round of America's Top Data Modeller. |
 |
|
|
yalgaar
Starting Member
43 Posts |
Posted - 2006-02-28 : 15:34:33
|
| Getting back at the original question, can somebody help me with this query which will report me all rows that have any duplicate words except for the ones that I mention in the query?Thanks |
 |
|
|
blindman
Master Smack Fu Yak Hacker
2365 Posts |
Posted - 2006-02-28 : 15:54:30
|
Yalgar, I'm sure somebody could come up with a solution that fits your exact requirements after a lot of thought and coding, and the results would be functionally useless. The very idea of matching strings on the mere coincidence of sharing a single word, except for some undefined and ever-expanding list of ignored words, is completely impractical, at least for the Movie Title example you have given.Here is a function I use frequently to do fuzzy searches on personal names:drop function CompareTextgocreate function CompareText(@String1 varchar(100), @String2 varchar(100))returns intasbegin--Function CompareText--blindman, 2005--Compares two strings and returns a value between 0 and 100 indicating a measure of their--similarity.--Test variables-- declare @String1 varchar(100)-- declare @String2 varchar(100)-- set @String1 = 'Bruce Alan Lindman'-- set @String2 = 'Lindman, Bruce A'-- set @String1 = dbo.MatchText(@String1)-- set @String2 = dbo.MatchText(@String2)declare @Possibles intdeclare @Hits intdeclare @Counter intset @Possibles = len(@String1) + len(@String2) - 2set @Hits = 0set @Counter = len(@String1)-1while @Counter > 0 begin if charindex(substring(@String1, @Counter, 2), @String2) > 0 set @Hits = @Hits + 1 set @Counter = @Counter - 1 endset @Counter = len(@String2)-1while @Counter > 0 begin if charindex(substring(@String2, @Counter, 2), @String1) > 0 set @Hits = @Hits + 1 set @Counter = @Counter - 1 end-- select cast(100.0*@Hits/@Possibles as int)return cast(100.0*@Hits/@Possibles as int)end I hope it will be of help to you. |
 |
|
|
blindman
Master Smack Fu Yak Hacker
2365 Posts |
Posted - 2006-02-28 : 16:02:28
|
So you can see how it works:set nocount ondeclare @MovieTitles table (MovieTitle varchar(50))insert into @MovieTitles (MovieTitle)select 'The Titanic'UNIONselect 'Titanic'UNIONselect 'Bruce Allmighty'UNIONselect 'ABCD'UNIONselect 'Cheaper by the Dozen'UNIONselect 'Dozen'UNIONselect 'The Dirty Dozen'UNIONselect 'Meet the Parents'UNIONselect 'Meat the Parents'UNIONselect 'Mary Poppins'UNIONselect 'There''s something about Mary'select A.MovieTitle MovieTitle, B.MovieTitle DuplicateTitle, dbo.CompareText(A.MovieTitle, B.MovieTitle) MatchValuefrom @MovieTitles A inner join @MovieTitles B on A.MovieTitle < B.MovieTitlewhere dbo.CompareText(A.MovieTitle, B.MovieTitle) > 50order by A.MovieTitle asc, dbo.CompareText(A.MovieTitle, B.MovieTitle) desc Output:MovieTitle DuplicateTitle MatchValue -------------------------------------------------- -------------------------------------------------- ----------- Cheaper by the Dozen The Dirty Dozen 60Meat the Parents Meet the Parents 86The Titanic Titanic 75 |
 |
|
|
|