Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2008 Forums
 Transact-SQL (2008)
 De-duplicating similar but not identical URLs

Author  Topic 

zgall1
Starting Member

9 Posts

Posted - 2014-09-25 : 11:51:46
I have a dataset with thousands of URLs stored in a column called Website (type VARCHAR) in a table called WebsiteData. There are many pairs of URLs (stored in separate rows) that are identical except that one begins with www, e.g. www.google.com and the other does not, e.g., google.com. How would I design a SQL query that identifies these pseudo-duplicates and deletes the version that does not start with www?

Bustaz Kool
Master Smack Fu Yak Hacker

1834 Posts

Posted - 2014-09-25 : 12:19:06
[code];with FindDups
as (
select Replace(Website, 'www.', '') RawWebsite
from WebsiteData
group by Replace(Website, 'www.', '')
having count(*) > 1
)
select w.Website
from WebsiteData w
where Replace(Website, 'www.', '') in (select RawWebsite from FindDups)[/code]Can you take it from here?



Too often we enjoy the comfort of opinion without the discomfort of thought. - John F. Kennedy
Go to Top of Page

zgall1
Starting Member

9 Posts

Posted - 2014-09-25 : 13:19:18
Thanks!
Go to Top of Page

Bustaz Kool
Master Smack Fu Yak Hacker

1834 Posts

Posted - 2014-09-25 : 18:28:12
m'kay



Too often we enjoy the comfort of opinion without the discomfort of thought. - John F. Kennedy
Go to Top of Page
   

- Advertisement -