Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 General SQL Server Forums
 New to SQL Server Programming
 How to search

Author  Topic 

yalgaar
Starting Member

43 Posts

Posted - 2006-02-28 : 10:54:06
I have a table which has only 1 column. I would like to write a select statement which will give me list of all duplicates. I mean duplicates are not literal.

Let me give an example of the data:

The Titanic
Titanic
Bruce Allmighty
ABCD
Cheaper by the Dozen
Dozen
Meet the Parents

If you look at the above list: 1) The Titanic and 2)Titanic and not duplicates, but they are same.

So what I want is a list where even if it finds 1 word as same, it should list it.


Any ideas?

Srinika
Master Smack Fu Yak Hacker

1378 Posts

Posted - 2006-02-28 : 11:08:38
Ur definition of duplicity is a bit hard to understand

Is it need to compare something like
"The Titanic" and "The Ship Titanic is a"

Or is it just a single word against the more than one word
"How Are You" and "Are"
Go to Top of Page

yalgaar
Starting Member

43 Posts

Posted - 2006-02-28 : 11:30:52
"Titanic" and "The Titanic" will be duplicate since there is one word common.

"ABCD" and "hello Sam" will not be duplicate.

I need the list of all duplicate records.
Go to Top of Page

blindman
Master Smack Fu Yak Hacker

2365 Posts

Posted - 2006-02-28 : 11:37:15
So if your list includes "Lord Of The Rings" and "The Graduate", you want to count them as the same movie because they each contain the word "the"?

You may want to rethink this...
Go to Top of Page

yalgaar
Starting Member

43 Posts

Posted - 2006-02-28 : 12:00:52
quote:
Originally posted by blindman

So if your list includes "Lord Of The Rings" and "The Graduate", you want to count them as the same movie because they each contain the word "the"?

You may want to rethink this...



Interesting Qustions. Did not think of it in that perspective. To answer that question, Can I supppy words while running the query to skip certain words. I would then have the query skip certain words like "the" "of" "A" ect
Go to Top of Page

blindman
Master Smack Fu Yak Hacker

2365 Posts

Posted - 2006-02-28 : 12:40:12
..or the word "dogs"? Since that is used in both "Reservoir Dogs" and "All Dogs Go To Heaven"... Or how about "Mary", which is used in both "Mary Poppins" and "There's Something About Mary"?

I think what you are looking for is a fuzzy search algorithm. I have one, if you are interested. It returns a value between 0 and 1 indicating the degree of similiarity between two string values. Be aware the fuzzy algorithms are NEVER 100% correct. You always have to trade of the probability of missing two duplicates against the probability of mistakenly associating two values.
Go to Top of Page

Srinika
Master Smack Fu Yak Hacker

1378 Posts

Posted - 2006-02-28 : 12:46:17
yalgaar,
My Question again (for clarification)
can u have a situation like
"The Titanic" and "The Ship Titanic is a" where u wan to have duplicate as "Titanic" ?

Ur search may take a long time if the field of each record contains lot of words (even if u eliminate some words as "the") - because the field value needs to be broken to each word and those needs to be checked against each field whether those exists or not

eg.
Let the Feild contains following 3 records
Rec1 : ABCD PQR ST UVW XYZ
Rec2 : LMN OPQ ST U XYZ
Rec3 : BCD LMN PQR ST

Rec1's each word, ie ABCD PQR ST UVW XYZ should be checked against Rec2 & Rec3
So for these 3 records the iteration is done 5 x 2 + 5 x 2 + 4 x 2
U will get results as
Rec1 & Rec2 "Duplicating - as per ur definition" 2 times (with regard to data items ST & XYZ)
Rec1 & Rec3 "Duplicating - as per ur definition" 2 times (with regard to data items ST & PQR)
Rec2 & Rec3 "Duplicating - as per ur definition" 2 times (with regard to data items ST & LMN)

Is that what u want?

Go to Top of Page

yalgaar
Starting Member

43 Posts

Posted - 2006-02-28 : 12:57:27
quote:
Originally posted by blindman

So if your list includes "Lord Of The Rings" and "The Graduate", you want to count them as the same movie because they each contain the word "the"?

You may want to rethink this...



Yes, they will be listed as same movie unless I say exclude word "the"
Go to Top of Page

yalgaar
Starting Member

43 Posts

Posted - 2006-02-28 : 12:59:19
quote:
Originally posted by Srinika

yalgaar,
My Question again (for clarification)
can u have a situation like
"The Titanic" and "The Ship Titanic is a" where u wan to have duplicate as "Titanic" ?

Ur search may take a long time if the field of each record contains lot of words (even if u eliminate some words as "the") - because the field value needs to be broken to each word and those needs to be checked against each field whether those exists or not

eg.
Let the Feild contains following 3 records
Rec1 : ABCD PQR ST UVW XYZ
Rec2 : LMN OPQ ST U XYZ
Rec3 : BCD LMN PQR ST

Rec1's each word, ie ABCD PQR ST UVW XYZ should be checked against Rec2 & Rec3
So for these 3 records the iteration is done 5 x 2 + 5 x 2 + 4 x 2
U will get results as
Rec1 & Rec2 "Duplicating - as per ur definition" 2 times (with regard to data items ST & XYZ)
Rec1 & Rec3 "Duplicating - as per ur definition" 2 times (with regard to data items ST & PQR)
Rec2 & Rec3 "Duplicating - as per ur definition" 2 times (with regard to data items ST & LMN)

Is that what u want?





The records will be considered duplicate since all 3 of the word "ST" but they will not be considered duplicate of I exclude the word "ST" while running the query.

Do you think it will take a very long time to run if there are around 3000 records?
Go to Top of Page

blindman
Master Smack Fu Yak Hacker

2365 Posts

Posted - 2006-02-28 : 13:33:14
You would have to work hard to contrive a query that would take a long time in SQL Server with only 3000 records.
Go to Top of Page

yalgaar
Starting Member

43 Posts

Posted - 2006-02-28 : 13:59:10
So what is the final query?
Go to Top of Page

Billkamm
Posting Yak Master

124 Posts

Posted - 2006-02-28 : 14:02:09
blindman: no I wouldn't. I would just create a loop that caused a timed delay :)
Go to Top of Page

blindman
Master Smack Fu Yak Hacker

2365 Posts

Posted - 2006-02-28 : 15:00:55
quote:
Originally posted by Billkamm

blindman: no I wouldn't. I would just create a loop that caused a timed delay :)

Cheating! Where's the elegance? Where's the creativity? Where's the obfuscation! Zero points for style, and I'm sorry, but you do not advance to the next round of America's Top Data Modeller.
Go to Top of Page

yalgaar
Starting Member

43 Posts

Posted - 2006-02-28 : 15:34:33
Getting back at the original question, can somebody help me with this query which will report me all rows that have any duplicate words except for the ones that I mention in the query?

Thanks
Go to Top of Page

blindman
Master Smack Fu Yak Hacker

2365 Posts

Posted - 2006-02-28 : 15:54:30
Yalgar, I'm sure somebody could come up with a solution that fits your exact requirements after a lot of thought and coding, and the results would be functionally useless. The very idea of matching strings on the mere coincidence of sharing a single word, except for some undefined and ever-expanding list of ignored words, is completely impractical, at least for the Movie Title example you have given.

Here is a function I use frequently to do fuzzy searches on personal names:
drop function CompareText
go
create function CompareText(@String1 varchar(100), @String2 varchar(100))
returns int
as
begin
--Function CompareText
--blindman, 2005
--Compares two strings and returns a value between 0 and 100 indicating a measure of their
--similarity.

--Test variables
-- declare @String1 varchar(100)
-- declare @String2 varchar(100)
-- set @String1 = 'Bruce Alan Lindman'
-- set @String2 = 'Lindman, Bruce A'
-- set @String1 = dbo.MatchText(@String1)
-- set @String2 = dbo.MatchText(@String2)

declare @Possibles int
declare @Hits int
declare @Counter int
set @Possibles = len(@String1) + len(@String2) - 2
set @Hits = 0
set @Counter = len(@String1)-1
while @Counter > 0
begin
if charindex(substring(@String1, @Counter, 2), @String2) > 0 set @Hits = @Hits + 1
set @Counter = @Counter - 1
end
set @Counter = len(@String2)-1
while @Counter > 0
begin
if charindex(substring(@String2, @Counter, 2), @String1) > 0 set @Hits = @Hits + 1
set @Counter = @Counter - 1
end
-- select cast(100.0*@Hits/@Possibles as int)
return cast(100.0*@Hits/@Possibles as int)
end

I hope it will be of help to you.
Go to Top of Page

blindman
Master Smack Fu Yak Hacker

2365 Posts

Posted - 2006-02-28 : 16:02:28
So you can see how it works:
set nocount on

declare @MovieTitles table (MovieTitle varchar(50))

insert into @MovieTitles (MovieTitle)
select 'The Titanic'
UNION
select 'Titanic'
UNION
select 'Bruce Allmighty'
UNION
select 'ABCD'
UNION
select 'Cheaper by the Dozen'
UNION
select 'Dozen'
UNION
select 'The Dirty Dozen'
UNION
select 'Meet the Parents'
UNION
select 'Meat the Parents'
UNION
select 'Mary Poppins'
UNION
select 'There''s something about Mary'


select A.MovieTitle MovieTitle,
B.MovieTitle DuplicateTitle,
dbo.CompareText(A.MovieTitle, B.MovieTitle) MatchValue
from @MovieTitles A
inner join @MovieTitles B on A.MovieTitle < B.MovieTitle
where dbo.CompareText(A.MovieTitle, B.MovieTitle) > 50
order by A.MovieTitle asc,
dbo.CompareText(A.MovieTitle, B.MovieTitle) desc

Output:
MovieTitle                                         DuplicateTitle                                     MatchValue  
-------------------------------------------------- -------------------------------------------------- -----------
Cheaper by the Dozen The Dirty Dozen 60
Meat the Parents Meet the Parents 86
The Titanic Titanic 75
Go to Top of Page
   

- Advertisement -