Please start any new threads on our new
site at https://forums.sqlteam.com. We've got lots of great SQL Server
experts to answer whatever question you can come up with.
| Author |
Topic |
|
abuhassan
105 Posts |
Posted - 2006-09-13 : 09:33:41
|
| i have duplicate records, but they have unique ids.the duplicates resulted as a result of using a view to enter the data using BCP to load data from text file that had many dulicate records as part of consolidting the data i would like to get rid of the duplicate is there some way of checking if the Name, Address, Telephone and Image are the same as on a record with another then remove them.the table has the following colunms:ID, Name, Address, Telephone, Image1, adam, 45 tree gate road, 00000 000000, adam.jpg2, uno, 44 garstang road, 00001 000001, uno.jpg3, adam, 45 tree gate road, 00000 000000, adam.jpg4, brian, belgarth house, 00022 000022, adam.jpg5, karen, 3 chester close, 00002 000002, adam.jpg6, uno, 44 garstang road, 00001 000001, uno.jpg7, adam, 45 tree gate road, 00000 000000, adam.jpgfrom the sample data you can see that there are multiple records is there a way of clensing the data using a sql command?is there any issues such as performance on clensing it etc?are there any strategies that should be implemented at the data tier so as to prevent duplication of data? |
|
|
chiragkhabaria
Master Smack Fu Yak Hacker
1907 Posts |
Posted - 2006-09-13 : 10:09:15
|
Somthing like this Declare @TAble TAble ( [ID] int, [Name] varchar(10), [Address] varchar(100), [Telephone] varchar(100), [Image] varchar(10))INsert @Table Select 1, 'adam', '45 tree gate road', '00000 000000', 'adam.jpg' Union All Select 2, 'uno', '44 garstang road', '00001 000001', 'uno.jpg' Union All Select 3, 'adam', '45 tree gate road', '00000 000000', 'adam.jpg'Union AllSelect 4, 'brian', 'belgarth house', '00022 000022', 'adam.jpg' Union AllSelect 5, 'karen','3 chester close', '00002 000002', 'adam.jpg' Union AllSelect 6, 'uno','44 garstang road', '00001 000001', 'uno.jpg' Union AllSelect 7, 'adam','45 tree gate road', '00000 000000', 'adam.jpg' Delete From @Table Where ID Not IN ( Select [ID] From @Table t1 Where [ID] In ( Select Min([ID]) From @Table t2 Where t1.[Name] = t2.[Name] And t1.[Address] = t2.[Address] And t1.[Telephone] = t2.[Telephone] And t1.[Image] = t2.[Image] ))Select * From @Table Chirag |
 |
|
|
eyechart
Master Smack Fu Yak Hacker
3575 Posts |
Posted - 2006-09-13 : 10:14:48
|
quote: Originally posted by abuhassanare there any strategies that should be implemented at the data tier so as to prevent duplication of data?
put a proper primary key on your table. Or add a unique key to your table on the columns that guarantee uniqueness.-ec |
 |
|
|
chiragkhabaria
Master Smack Fu Yak Hacker
1907 Posts |
Posted - 2006-09-13 : 10:25:23
|
quote: Originally posted by eyechartput a proper primary key on your table. Or add a unique key to your table on the columns that guarantee uniqueness.-ec
I guess he has the unique key, but other records are duplicated.Chirag |
 |
|
|
eyechart
Master Smack Fu Yak Hacker
3575 Posts |
Posted - 2006-09-13 : 10:30:55
|
quote: Originally posted by chiragkhabariaI guess he has the unique key, but other records are duplicated.Chirag
I mean create a composite unique key using several columns, not just on the identity column. This is exactly the argument against using a PK on the identity column - it does not guarantee uniqueness of your data.-ec |
 |
|
|
abuhassan
105 Posts |
Posted - 2006-09-13 : 10:54:16
|
| Hi I have a primary key but the problem was that the data was kept in various text files and when i import tha data i import the data into the view of the table using BCP to load the data from the text file to the table.the data that comes in text files doesnt have a primary key.hence i have duplicate records also some text files are exactly the same with different names hence the data is duplicated accidently and the same set of data is loaded twice.is it possible to have a composite key when im using BCP will that detect dulicate records and avoid inserting them? |
 |
|
|
eyechart
Master Smack Fu Yak Hacker
3575 Posts |
Posted - 2006-09-13 : 11:10:48
|
quote: Originally posted by abuhassan Hi I have a primary key but the problem was that the data was kept in various text files and when i import tha data i import the data into the view of the table using BCP to load the data from the text file to the table.the data that comes in text files doesnt have a primary key.hence i have duplicate records also some text files are exactly the same with different names hence the data is duplicated accidently and the same set of data is loaded twice.is it possible to have a composite key when im using BCP will that detect dulicate records and avoid inserting them?
you have a primary key on your identity column, right? do you have any other unique indexes or constraints on this table?-ec |
 |
|
|
abuhassan
105 Posts |
Posted - 2006-09-13 : 11:14:17
|
| Currently no only one ID colunm as the primary key |
 |
|
|
eyechart
Master Smack Fu Yak Hacker
3575 Posts |
Posted - 2006-09-13 : 11:22:14
|
quote: Originally posted by abuhassan Currently no only one ID colunm as the primary key
that is why you got duplicate rows inserted in your table. A unique identity value was generated for every inserted row so the PK constraint was never violated. But that did not guarantee that the rest of the data inserted into the table was actually unique. do you understand what I am getting at?-ec |
 |
|
|
abuhassan
105 Posts |
Posted - 2006-09-13 : 11:26:59
|
| yes ..but if i have a composite that connsists of four colunms how do i get the data in... please note the problem is that the data is in various text files and the text file dont hav a primary key.. is there a way to manage the import of data differently inorder to resolve this? |
 |
|
|
LoztInSpace
Aged Yak Warrior
940 Posts |
Posted - 2006-09-13 : 20:21:01
|
| DTS |
 |
|
|
chiragkhabaria
Master Smack Fu Yak Hacker
1907 Posts |
Posted - 2006-09-14 : 01:24:54
|
| did you try the query which i posted for removing the duplicate values ??Chirag |
 |
|
|
abuhassan
105 Posts |
Posted - 2006-09-18 : 05:28:23
|
hi Chirag sorry for not replyin earlier Thanks for the query it works.i have two questions:Question 1==========does this have a penality on performance for example if the remove duplicate script is run at certain intervals?Question 2==========is there a better way for example when the data is being imported from a text file that it would detect the duplicate values and prevent from entering them?currently i use the folowing command to load the data:ALTER PROCEDURE spLoadHistory @path varchar(300) AS Declare @bulk_cmd varchar(1000)SET @bulk_cmd = 'BULK INSERT Cust..VW_CustHistoryLoadFROM ''' + @path+ ''' WITH (FIELDTERMINATOR = '','', ROWTERMINATOR = '''+CHAR(10)+''')'EXEC(@bulk_cmd) |
 |
|
|
chiragkhabaria
Master Smack Fu Yak Hacker
1907 Posts |
Posted - 2006-09-18 : 06:22:03
|
| How about using EC's solution, create the composite keys on the column where you want to prevent duplicate entries. About the performance, its depends on how frequent you are planning to run the script and how many records are inserted, ofcourse it will have over head once the records in the table increases. How about, importing the data in the temp table and then while inserting into main table, and while inserting into the main table eliminate the duplicate records?Chirag |
 |
|
|
abuhassan
105 Posts |
Posted - 2006-09-18 : 06:45:17
|
| the data apart from the primary key is four out of five colunmsif i make the four other colunms as composite key woud that not affect the data base performance? also i think thats using too many colunms for a composite key is that correct? |
 |
|
|
chiragkhabaria
Master Smack Fu Yak Hacker
1907 Posts |
Posted - 2006-09-18 : 06:55:24
|
| I dont think so it will affect the performance.you can do the following thing, make other 4 columns has composite key and on which is primary key right now, you can make it as an unqiue key.So it will not enter duplicate records. I dont know about the performance issue, but you can acheive the same using Instead of Insert Trigger too. Chirag |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2006-09-18 : 07:03:48
|
This has slightly better performanceDELETE t1FROM @Table t1WHERE t1.ID NOT IN ( SELECT MAX(t2.ID) FROM @Table t2 GROUP BY t2.Name, t2.Address, t2.Telephone, t2.Image ) Peter LarssonHelsingborg, Sweden |
 |
|
|
abuhassan
105 Posts |
Posted - 2006-09-18 : 07:08:39
|
| Thanks for the Info...How do i create / alter the table to create a composite key? leaving the primary key as it is? will that affect the way im loading the data because im using a view to which im inserting the data via the procedure? Im not inserting the data into to the actual table? You mentioned:"but you can acheive the same using Instead of Insert Trigger too."what do you mean? Whats a Trigger? |
 |
|
|
chiragkhabaria
Master Smack Fu Yak Hacker
1907 Posts |
Posted - 2006-09-18 : 07:14:01
|
| I dont think it will affect the way you are loading the data. for Creating the composite key, do it via EM, and then get the script of the structure changed. Triggers are the Events which are fired on the Tables, when any DML statements is fired on that table. Normally Triggers are used to implement the bussiness rules. So while inserting the records in the table, we can remove the duplicate records and insert the unique records.Chirag |
 |
|
|
abuhassan
105 Posts |
Posted - 2006-09-18 : 07:21:54
|
| thanks |
 |
|
|
Wanderer
Master Smack Fu Yak Hacker
1168 Posts |
Posted - 2006-09-18 : 07:40:18
|
quote: Originally posted by abuhassan i have duplicate records, but they have unique ids.the duplicates resulted as a result of using a view to enter the data using BCP to load data from text file that had many dulicate records as part of consolidting the data i would like to get rid of the duplicate is there some way of checking if the Name, Address, Telephone and Image are the same as on a record with another then remove them.the table has the following colunms:ID, Name, Address, Telephone, Image1, adam, 45 tree gate road, 00000 000000, adam.jpg2, uno, 44 garstang road, 00001 000001, uno.jpg3, adam, 45 tree gate road, 00000 000000, adam.jpg4, brian, belgarth house, 00022 000022, adam.jpg5, karen, 3 chester close, 00002 000002, adam.jpg6, uno, 44 garstang road, 00001 000001, uno.jpg7, adam, 45 tree gate road, 00000 000000, adam.jpgfrom the sample data you can see that there are multiple records is there a way of clensing the data using a sql command?is there any issues such as performance on clensing it etc?are there any strategies that should be implemented at the data tier so as to prevent duplication of data?
Something to check:- Does this table with duplicates in it have any children rows? If so, you are in for a fun time, since you will need to select the row of the duplicate to retain, move children from dup-parent row to 'soon-to-be-unqiue' row, and only then do the remove.I think a unqiue index on the composite key is a requirement. Consider the impact on inserts (which may be reasonably low, depending on frequency) versus to cost and impact on availability of the de-dup operation you are having to go through now. Now consider the impact of application issues where code incorrectly handles duplicates and mis-behaves/breaks (mis-behaves could be worse, since it may act in an unexpected manner, and you may not be aware of the problem until some time in the future).Depending on the number of rows impacted, and the only requirement of the table, and the manner in which it is used, you may want to consider the following process (recently used to de-dup a 1.2 million row predominantly read only table ((updated by in-frequent batch jobs)) to the 70K it should have been, and de-dup that tables parent table from 59K to 3K, on a reasonable frequently used tables on a 24x7 transaction web-site).1> Backup log2> report on data counts in live tables (select count(*) from table with (nolock))3> check for Temp tables, remove if they exist4> create New Temp tables based on Live tables schema5> insert required data into the TempParent and TempChild tables. This will contain only 1 entry for parent row, and will retain the ID generated keys necessary to retain data integrity6> removing foreign keys from live tables to prepare for renaming them7> set seed value for identity to ensure no clashes for future insert of Parent rows8> rename 'live' tables to _old9> rename 'Temp' tables to 'live'10> drop primary keys from _old tables (since PK names are unique in DB)11> add primary keys and existing indexes to new 'live' tables12> add foreign keys to new 'live' tables13> report on data counts in live and _old tables14> add unique index on candidate key to ensure future unqiueness on parent table.*##* *##* *##* *##* Chaos, Disorder and Panic ... my work is done here! |
 |
|
|
Next Page
|
|
|
|
|