Duplicate records in the database

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

General SQL Server Forums

New to SQL Server Programming

Duplicate records in the database

Author

Topic

abuhassan

105 Posts

Posted - 2006-09-13 : 09:33:41

i have duplicate records, but they have unique ids.

the duplicates resulted as a result of using a view to enter the data using BCP to load data from text file that had many dulicate records as part of consolidting the data i would like to get rid of the duplicate

is there some way of checking if the Name, Address, Telephone and Image are the same as on a record with another then remove them.

the table has the following colunms:

ID, Name, Address, Telephone, Image

1, adam, 45 tree gate road, 00000 000000, adam.jpg
2, uno, 44 garstang road, 00001 000001, uno.jpg
3, adam, 45 tree gate road, 00000 000000, adam.jpg
4, brian, belgarth house, 00022 000022, adam.jpg
5, karen, 3 chester close, 00002 000002, adam.jpg
6, uno, 44 garstang road, 00001 000001, uno.jpg
7, adam, 45 tree gate road, 00000 000000, adam.jpg

from the sample data you can see that there are multiple records is there a way of clensing the data using a sql command?

is there any issues such as performance on clensing it etc?

are there any strategies that should be implemented at the data tier so as to prevent duplication of data?

chiragkhabaria
Master Smack Fu Yak Hacker

1907 Posts

Posted - 2006-09-13 : 10:09:15

Somthing like this


Declare @TAble TAble 
(
  [ID] int, 
  [Name] varchar(10),
  [Address] varchar(100),
  [Telephone] varchar(100),
  [Image] varchar(10)
)

INsert @Table 
Select 1, 'adam', '45 tree gate road', '00000 000000', 'adam.jpg' Union All 
Select 2, 'uno', '44 garstang road', '00001 000001', 'uno.jpg' Union All 
Select 3, 'adam', '45 tree gate road', '00000 000000', 'adam.jpg'Union All
Select 4, 'brian', 'belgarth house', '00022 000022', 'adam.jpg' Union All
Select 5, 'karen','3 chester close', '00002 000002', 'adam.jpg' Union All
Select 6, 'uno','44 garstang road', '00001 000001', 'uno.jpg' Union All
Select 7, 'adam','45 tree gate road', '00000 000000', 'adam.jpg' 



Delete From @Table Where ID Not IN 
(
	Select [ID] From @Table t1 Where [ID] In 
	(
		Select Min([ID]) From @Table t2 Where t1.[Name] = t2.[Name] And t1.[Address]  = t2.[Address] 
		And t1.[Telephone] = t2.[Telephone] And t1.[Image] = t2.[Image]
	)
)

Select * From @Table

Chirag

eyechart
Master Smack Fu Yak Hacker

3575 Posts

Posted - 2006-09-13 : 10:14:48

quote:
Originally posted by abuhassan

are there any strategies that should be implemented at the data tier so as to prevent duplication of data?

put a proper primary key on your table. Or add a unique key to your table on the columns that guarantee uniqueness.

-ec

chiragkhabaria
Master Smack Fu Yak Hacker

1907 Posts

Posted - 2006-09-13 : 10:25:23

quote:
Originally posted by eyechart
put a proper primary key on your table. Or add a unique key to your table on the columns that guarantee uniqueness.
-ec

I guess he has the unique key, but other records are duplicated.

Chirag

eyechart
Master Smack Fu Yak Hacker

3575 Posts

Posted - 2006-09-13 : 10:30:55

quote:
Originally posted by chiragkhabaria

I guess he has the unique key, but other records are duplicated.

Chirag

I mean create a composite unique key using several columns, not just on the identity column. This is exactly the argument against using a PK on the identity column - it does not guarantee uniqueness of your data.

-ec

abuhassan

105 Posts

Posted - 2006-09-13 : 10:54:16

Hi

I have a primary key but the problem was that the data was kept in various text files and when i import tha data i import the data into the view of the table using BCP to load the data from the text file to the table.

the data that comes in text files doesnt have a primary key.

hence i have duplicate records also some text files are exactly the same with different names hence the data is duplicated accidently and the same set of data is loaded twice.

is it possible to have a composite key when im using BCP will that detect dulicate records and avoid inserting them?

eyechart
Master Smack Fu Yak Hacker

3575 Posts

Posted - 2006-09-13 : 11:10:48

quote:
Originally posted by abuhassan

Hi

I have a primary key but the problem was that the data was kept in various text files and when i import tha data i import the data into the view of the table using BCP to load the data from the text file to the table.

the data that comes in text files doesnt have a primary key.

hence i have duplicate records also some text files are exactly the same with different names hence the data is duplicated accidently and the same set of data is loaded twice.

is it possible to have a composite key when im using BCP will that detect dulicate records and avoid inserting them?

you have a primary key on your identity column, right? do you have any other unique indexes or constraints on this table?

-ec

abuhassan

105 Posts

Posted - 2006-09-13 : 11:14:17

Currently no only one ID colunm as the primary key

eyechart
Master Smack Fu Yak Hacker

3575 Posts

Posted - 2006-09-13 : 11:22:14

quote:
Originally posted by abuhassan

Currently no only one ID colunm as the primary key

that is why you got duplicate rows inserted in your table. A unique identity value was generated for every inserted row so the PK constraint was never violated. But that did not guarantee that the rest of the data inserted into the table was actually unique. do you understand what I am getting at?

-ec

abuhassan

105 Posts

Posted - 2006-09-13 : 11:26:59

yes ..but if i have a composite that connsists of four colunms how do i get the data in...

please note the problem is that the data is in various text files and the text file dont hav a primary key..

is there a way to manage the import of data differently inorder to resolve this?

LoztInSpace
Aged Yak Warrior

940 Posts

Posted - 2006-09-13 : 20:21:01

DTS

chiragkhabaria
Master Smack Fu Yak Hacker

1907 Posts

Posted - 2006-09-14 : 01:24:54

did you try the query which i posted for removing the duplicate values ??

Chirag

abuhassan

105 Posts

Posted - 2006-09-18 : 05:28:23

hi Chirag sorry for not replyin earlier

Thanks for the query it works.

i have two questions:

Question 1
==========
does this have a penality on performance for example if the remove duplicate script is run at certain intervals?

Question 2
==========
is there a better way for example when the data is being imported from a text file that it would detect the duplicate values and prevent from entering them?

currently i use the folowing command to load the data:


ALTER  PROCEDURE spLoadHistory @path varchar(300) AS 
Declare @bulk_cmd varchar(1000)
SET @bulk_cmd = 'BULK INSERT Cust..VW_CustHistoryLoad
FROM ''' + @path+ ''' 
WITH (FIELDTERMINATOR = '','', ROWTERMINATOR = '''+CHAR(10)+''')'
EXEC(@bulk_cmd)

chiragkhabaria
Master Smack Fu Yak Hacker

1907 Posts

Posted - 2006-09-18 : 06:22:03

How about using EC's solution, create the composite keys on the column where you want to prevent duplicate entries.

About the performance, its depends on how frequent you are planning to run the script and how many records are inserted, ofcourse it will have over head once the records in the table
increases.

How about, importing the data in the temp table and then while inserting into main table,
and while inserting into the main table eliminate the duplicate records?

Chirag

abuhassan

105 Posts

Posted - 2006-09-18 : 06:45:17

the data apart from the primary key is four out of five colunms

if i make the four other colunms as composite key woud that not affect the data base performance?

also i think thats using too many colunms for a composite key is that correct?

chiragkhabaria
Master Smack Fu Yak Hacker

1907 Posts

Posted - 2006-09-18 : 06:55:24

I dont think so it will affect the performance.

you can do the following thing, make other 4 columns has composite key and on which is primary key right now, you can make it as an unqiue key.

So it will not enter duplicate records.

I dont know about the performance issue, but you can acheive the same using Instead of Insert Trigger too.
Chirag

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2006-09-18 : 07:03:48

This has slightly better performance

DELETE		t1
FROM		@Table t1
WHERE		t1.ID NOT IN	(
					SELECT		MAX(t2.ID)
					FROM		@Table t2
					GROUP BY	t2.Name,
							t2.Address,
							t2.Telephone,
							t2.Image
				)

Peter Larsson
Helsingborg, Sweden

abuhassan

105 Posts

Posted - 2006-09-18 : 07:08:39

Thanks for the Info...

How do i create / alter the table to create a composite key? leaving the primary key as it is? will that affect the way im loading the data because im using a view to which im inserting the data via the procedure? Im not inserting the data into to the actual table?

You mentioned:
"but you can acheive the same using Instead of Insert Trigger too."

what do you mean? Whats a Trigger?

chiragkhabaria
Master Smack Fu Yak Hacker

1907 Posts

Posted - 2006-09-18 : 07:14:01

I dont think it will affect the way you are loading the data.

for Creating the composite key, do it via EM, and then get the script of the structure changed.

Triggers are the Events which are fired on the Tables, when any DML statements is fired on that table.

Normally Triggers are used to implement the bussiness rules.

So while inserting the records in the table, we can remove the duplicate records and insert the unique records.

Chirag

abuhassan

105 Posts

Posted - 2006-09-18 : 07:21:54

thanks

Wanderer
Master Smack Fu Yak Hacker

1168 Posts

Posted - 2006-09-18 : 07:40:18

quote:
Originally posted by abuhassan

i have duplicate records, but they have unique ids.

the duplicates resulted as a result of using a view to enter the data using BCP to load data from text file that had many dulicate records as part of consolidting the data i would like to get rid of the duplicate

is there some way of checking if the Name, Address, Telephone and Image are the same as on a record with another then remove them.

the table has the following colunms:

ID, Name, Address, Telephone, Image

1, adam, 45 tree gate road, 00000 000000, adam.jpg
2, uno, 44 garstang road, 00001 000001, uno.jpg
3, adam, 45 tree gate road, 00000 000000, adam.jpg
4, brian, belgarth house, 00022 000022, adam.jpg
5, karen, 3 chester close, 00002 000002, adam.jpg
6, uno, 44 garstang road, 00001 000001, uno.jpg
7, adam, 45 tree gate road, 00000 000000, adam.jpg

from the sample data you can see that there are multiple records is there a way of clensing the data using a sql command?

is there any issues such as performance on clensing it etc?

are there any strategies that should be implemented at the data tier so as to prevent duplication of data?

Something to check:
- Does this table with duplicates in it have any children rows? If so, you are in for a fun time, since you will need to select the row of the duplicate to retain, move children from dup-parent row to 'soon-to-be-unqiue' row, and only then do the remove.

I think a unqiue index on the composite key is a requirement. Consider the impact on inserts (which may be reasonably low, depending on frequency) versus to cost and impact on availability of the de-dup operation you are having to go through now. Now consider the impact of application issues where code incorrectly handles duplicates and mis-behaves/breaks (mis-behaves could be worse, since it may act in an unexpected manner, and you may not be aware of the problem until some time in the future).

Depending on the number of rows impacted, and the only requirement of the table, and the manner in which it is used, you may want to consider the following process (recently used to de-dup a 1.2 million row predominantly read only table ((updated by in-frequent batch jobs)) to the 70K it should have been, and de-dup that tables parent table from 59K to 3K, on a reasonable frequently used tables on a 24x7 transaction web-site).

1> Backup log
2> report on data counts in live tables (select count(*) from table with (nolock))
3> check for Temp tables, remove if they exist
4> create New Temp tables based on Live tables schema
5> insert required data into the TempParent and TempChild tables. This will contain only 1 entry for parent row,
and will retain the ID generated keys necessary to retain data integrity
6> removing foreign keys from live tables to prepare for renaming them
7> set seed value for identity to ensure no clashes for future insert of Parent rows
8> rename 'live' tables to _old
9> rename 'Temp' tables to 'live'
10> drop primary keys from _old tables (since PK names are unique in DB)
11> add primary keys and existing indexes to new 'live' tables
12> add foreign keys to new 'live' tables
13> report on data counts in live and _old tables
14> add unique index on candidate key to ensure future unqiueness on parent table.

*##* *##* *##* *##*

Chaos, Disorder and Panic ... my work is done here!

Next Page

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources