Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 General SQL Server Forums
 New to SQL Server Programming
 Query formulation question from SQL Newbie

Author  Topic 

steveis
Starting Member

1 Post

Posted - 2007-10-20 : 14:07:19
I am bulk-loading a large amount of data into a group of related tables. Each table has an identity column as its primary key.

A problem is introduced because I need to break up the bulk-loads into multiple sessions, due to performance and memory constraints. I can get all the foreign key references correct within the scope of a single bulk-load session. However, there are many duplicate rows when you look at the job as a whole.

So, I need to find all the duplicate rows, collapse them down to a single row, and update all the foreign key references in other tables to reference this single row.

I can find all the duplicate rows and generate a table consisting of the primary key ids and the duplicated values, ie:

ID Value
---+-----
1 | Fred
2 | Fred
5 | Ethel
7 | Ethel
9 | Ethel

Now, all I need to do is DELETE all the duplicate rows (in the example above, rows 2, 7 and 9) and generate UPDATE statements for the other tables that reference this one which replace all the duplicate keys with the chosen one (ie. replace all references to foreign key 2 with 1, and all references to foreign key 7 and 9 with 5).

But this is where I'm getting stuck with SQL. I haven't been able to figure out a way to extract just the first row for each group of Values, which contains the primary key I will use to substitute, then generate the correct subquery for the UPDATEs.

I know I can do this sort of thing procedurally, but I'm wondering if there is also an elegant way to do this in SQL. Thanks so much for the help!!

rmiao
Master Smack Fu Yak Hacker

7266 Posts

Posted - 2007-10-20 : 16:22:59
Try following to delete dup rows:

declare @col1 same type as col1
declare @col2 same type as col2
declare @dupcnt int

declare dups cursor
for select col1, col2, (count(*) - 1) as Duplicates from datatbl
group by col1, col2 having count(*) > 1 order by col1, col2

open dups

while(1=1)
begin
fetch next from dups into @col1, @col2, @dupcnt
if (@@fetch_status < 0)
break
set rowcount @dupcnt
delete datatbl where col1 = @col1 and col2 = @col2
end

deallocate dups

set rowcount 0
Go to Top of Page
   

- Advertisement -