Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2005 Forums
 Transact-SQL (2005)
 Need to find (near) duplicate records

Author  Topic 

hismightiness
Posting Yak Master

164 Posts

Posted - 2009-01-26 : 14:06:06
I am having a problem trying to isolate a subset of records that are duplicates. Basically, I want to find all records that have the same country id and city name, but different state id numbers.

Here is the table, and a T-SQL example of what I am trying to do:

CREATE TABLE [dbo].[city](
[city_id] [int] IDENTITY(1,1) NOT NULL,
[city_name] [nvarchar](75) NOT NULL,
[state_id] [int] NULL,
[country_id] [int] NULL
);

SELECT *
FROM [dbo].[city]
WHERE [country_id] = [country_id] AND [state_id] <> [state_id];


- - - -
- Will -
- - - -
http://www.willstrohl.com

tkizer
Almighty SQL Goddess

38200 Posts

Posted - 2009-01-26 : 14:28:17
This will get you the duplicates:

SELECT country_id, city_name
FROM city
GROUP BY country_id, city_name
HAVING COUNT(*) > 1

Now we can use that as a derived table, although I'm sure ROW_NUMBER() can be used here too. Try this:


SELECT t.country_id, t.city_name, c.state_id, c.city_id
FROM
(
SELECT country_id, city_name
FROM city
GROUP BY country_id, city_name
HAVING COUNT(*) > 1
) t
INNER JOIN city c
ON t.country_id = c.country_id AND t.city_name = c.city_name
WHERE c.state_id <> t.state_id


Untested, I just eyeballed it.



Tara Kizer
Microsoft MVP for Windows Server System - SQL Server
http://weblogs.sqlteam.com/tarad/

Subscribe to my blog
Go to Top of Page

hismightiness
Posting Yak Master

164 Posts

Posted - 2009-01-26 : 16:22:25
Your second example looked like it was going to do the trick since I intend to have the full record. However, there are two slight problems.

1 - I know almost nothing about derived tables.
2 - This query throws an error since [state_id] is not one of the columns returned from the first (grouped) table.

SELECT t.[country_id], t.[city_name], c.[state_id], c.[city_id]
FROM
(
SELECT [country_id], [city_name]
FROM [dbo].[city]
GROUP BY [country_id], [city_name]
HAVING COUNT(*) > 1
) t
INNER JOIN [dbo].[city] c
ON t.[country_id] = c.[country_id] AND t.[city_name] = c.[city_name]
WHERE c.[state_id] <> t.[state_id];


- - - -
- Will -
- - - -
http://www.willstrohl.com
Go to Top of Page

tkizer
Almighty SQL Goddess

38200 Posts

Posted - 2009-01-26 : 16:25:32
Oops yes indeed. I'm confident there's a better solution (namely something with ROW_NUMBER()), but this is all I can think of at the moment:


SELECT t.[country_id], t.[city_name], c2.[state_id], c2.[city_id]
FROM
(
SELECT [country_id], [city_name]
FROM [dbo].[city]
GROUP BY [country_id], [city_name]
HAVING COUNT(*) > 1
) t
INNER JOIN [dbo].[city] c1
ON t.[country_id] = c1.[country_id] AND t.[city_name] = c1.[city_name]
INNER JOIN dbo.city c2
ON c1.city_id = c2.city_id
WHERE c1.[state_id] <> c2.[state_id];


Tara Kizer
Microsoft MVP for Windows Server System - SQL Server
http://weblogs.sqlteam.com/tarad/

Subscribe to my blog
Go to Top of Page

hismightiness
Posting Yak Master

164 Posts

Posted - 2009-01-26 : 17:00:52
Thank you so much for your eager and quick responses. However, that last example didn't work either. Here is an example subset of data that might help you help me. :)

INSERT INTO [dbo].[city]([city_name],[state_id],[country_id])VALUES('ACAPULCO',109,137);
INSERT INTO [dbo].[city]([city_name],[state_id],[country_id])VALUES('ADELAIDE',80,13);
INSERT INTO [dbo].[city]([city_name],[state_id],[country_id])VALUES('ADELAIDE',83,13);
INSERT INTO [dbo].[city]([city_name],[state_id],[country_id])VALUES('ALBURY',83,13);
INSERT INTO [dbo].[city]([city_name],[state_id],[country_id])VALUES('ALBURY',98,13);
INSERT INTO [dbo].[city]([city_name],[state_id],[country_id])VALUES('ALICE SPRGS NOT',83,13);
INSERT INTO [dbo].[city]([city_name],[state_id],[country_id])VALUES('ALICE SPRINGS',78,13);
INSERT INTO [dbo].[city]([city_name],[state_id],[country_id])VALUES('APODACA',113,137);
INSERT INTO [dbo].[city]([city_name],[state_id],[country_id])VALUES('ARMIDALE',83,13);
INSERT INTO [dbo].[city]([city_name],[state_id],[country_id])VALUES('AVALON',83,13);
INSERT INTO [dbo].[city]([city_name],[state_id],[country_id])VALUES('AYERS ROCK',78,13);
INSERT INTO [dbo].[city]([city_name],[state_id],[country_id])VALUES('AYERS ROCK NOT',83,13);
INSERT INTO [dbo].[city]([city_name],[state_id],[country_id])VALUES('BAGOTVILLE',95,40);
INSERT INTO [dbo].[city]([city_name],[state_id],[country_id])VALUES('BALLINA',83,13);
INSERT INTO [dbo].[city]([city_name],[state_id],[country_id])VALUES('BALLINA',98,13);
INSERT INTO [dbo].[city]([city_name],[state_id],[country_id])VALUES('BATH',106,224);
INSERT INTO [dbo].[city]([city_name],[state_id],[country_id])VALUES('BATHURST',83,13);
INSERT INTO [dbo].[city]([city_name],[state_id],[country_id])VALUES('BONAVENTURE',93,40);
INSERT INTO [dbo].[city]([city_name],[state_id],[country_id])VALUES('BRISBANE',83,13);
INSERT INTO [dbo].[city]([city_name],[state_id],[country_id])VALUES('BRISBANE',99,13);
INSERT INTO [dbo].[city]([city_name],[state_id],[country_id])VALUES('BROKEN HILL',83,13);
INSERT INTO [dbo].[city]([city_name],[state_id],[country_id])VALUES('BROKEN HILL',98,13);
INSERT INTO [dbo].[city]([city_name],[state_id],[country_id])VALUES('BROOME',83,13);
INSERT INTO [dbo].[city]([city_name],[state_id],[country_id])VALUES('BUNDABERG',83,13);
INSERT INTO [dbo].[city]([city_name],[state_id],[country_id])VALUES('BURNIE',83,13);


- - - -
- Will -
- - - -
http://www.willstrohl.com
Go to Top of Page

tkizer
Almighty SQL Goddess

38200 Posts

Posted - 2009-01-26 : 17:20:44
I'm going to defer to the ROW_NUMBER() pros like visakh and Peso.

Tara Kizer
Microsoft MVP for Windows Server System - SQL Server
http://weblogs.sqlteam.com/tarad/

Subscribe to my blog
Go to Top of Page

sodeep
Master Smack Fu Yak Hacker

7174 Posts

Posted - 2009-01-26 : 19:58:40
--Using Row_Number()

Select p.city_id,t.city_name,p.state_id,t.country_id
from dbo.city p inner join
(Select Row_number()over(partition by city_name,country_id order by City_name)as seq,
city_name,country_id from dbo.city)t
on t.city_name = p.city_name
and t.country_id = p.country_id
Where t.seq> 1


--Otherway

Select p.city_id,t.city_name,p.state_id,t.country_id
from dbo.city p inner join
(Select city_name,country_id,count(*)as CityCount
from dbo.city
group by city_name,country_id
Having Count(*)>1)t
on t.city_name = p.city_name
and t.country_id = p.country_id
Go to Top of Page

hismightiness
Posting Yak Master

164 Posts

Posted - 2009-01-27 : 11:20:36
Thanks, sodeep... Your "otherway" example did the trick.

- - - -
- Will -
- - - -
http://www.willstrohl.com
Go to Top of Page

visakh16
Very Important crosS Applying yaK Herder

52326 Posts

Posted - 2009-01-27 : 11:47:13
quote:
Originally posted by sodeep

--Using Row_Number()

Select p.city_id,t.city_name,p.state_id,t.country_id
from dbo.city p inner join
(Select Row_number()over(partition by city_name,country_id order by City_name)as seq,
city_name,country_id from dbo.city)t
on t.city_name = p.city_name
and t.country_id = p.country_id
Where t.seq> 1


--Otherway

Select p.city_id,t.city_name,p.state_id,t.country_id
from dbo.city p inner join
(Select city_name,country_id,count(*)as CityCount
from dbo.city
group by city_name,country_id
Having Count(*)>1)t
on t.city_name = p.city_name
and t.country_id = p.country_id



you dont require the join in first case. you can just use like this

SELECT city_id,city_name,state_id,country_id
FROM
(
SELECT COUNT([state_id]) OVER (PARTITION BY city_name,country_id ) AS StateCount,*
FROM Table
)t
WHERE StateCount>1
Go to Top of Page

sodeep
Master Smack Fu Yak Hacker

7174 Posts

Posted - 2009-01-27 : 16:58:31
quote:
Originally posted by hismightiness

Thanks, sodeep... Your "otherway" example did the trick.

- - - -
- Will -
- - - -
http://www.willstrohl.com



You are Welcome.
Go to Top of Page

sodeep
Master Smack Fu Yak Hacker

7174 Posts

Posted - 2009-01-27 : 16:59:37
quote:
Originally posted by visakh16

quote:
Originally posted by sodeep

--Using Row_Number()

Select p.city_id,t.city_name,p.state_id,t.country_id
from dbo.city p inner join
(Select Row_number()over(partition by city_name,country_id order by City_name)as seq,
city_name,country_id from dbo.city)t
on t.city_name = p.city_name
and t.country_id = p.country_id
Where t.seq> 1


--Otherway

Select p.city_id,t.city_name,p.state_id,t.country_id
from dbo.city p inner join
(Select city_name,country_id,count(*)as CityCount
from dbo.city
group by city_name,country_id
Having Count(*)>1)t
on t.city_name = p.city_name
and t.country_id = p.country_id



you dont require the join in first case. you can just use like this

SELECT city_id,city_name,state_id,country_id
FROM
(
SELECT COUNT([state_id]) OVER (PARTITION BY city_name,country_id ) AS StateCount,*
FROM Table
)t
WHERE StateCount>1




Although First case works well ,this is simpler way.
Go to Top of Page
   

- Advertisement -