Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2005 Forums
 Transact-SQL (2005)
 Best Way to avoid duplicates

Author  Topic 

matty1stop
Starting Member

37 Posts

Posted - 2007-04-19 : 08:50:11
I'm attempting to create a stored procedure that would transfer data daily from a data warehouse to a data mart. I'm running into slow performance so I'm trying to find the best way to structure the feed to avoid adding duplicate data.

The query I'm running is more complex than my example, but just to start out simple lets say I have a table in the warehouse called Transactions with two fields: UserID (varchar(32)) and ProcessTime(datetime). There is one record in this table for every transaction a user processes.

I want to populate a table (DailyTranCount, UserID (varchar(32), ProcessDate (smalldatetime), TransactionCount (int) ) in the data mart with a daily count of the user's transactions.
The insert statement would look something like this:

insert into DailyTranCount
select userid, convert(smalldatetime, processtime, 101), count(*)
from Transactions
group by userid, convert(smalldatetime, processtime, 101)

To avoid adding duplicate data is it best to:
A) use NOT EXISTS
B) use NOT IN
C) use a left join to the DailyTranCount including a null check in the where clause

Thanks for reading and please let me know if there is any other information I should have provided.

Matt

harsh_athalye
Master Smack Fu Yak Hacker

5581 Posts

Posted - 2007-04-19 : 09:08:11
My best bet would be to use LEFT JOIN with check on NULL.

Harsh Athalye
India.
"The IMPOSSIBLE is often UNTRIED"
Go to Top of Page

LoztInSpace
Aged Yak Warrior

940 Posts

Posted - 2007-04-19 : 23:19:37
Try it & see the query plans.
I recon the optimiser will make IN the same as a join but it depends on other factors
Go to Top of Page

Vinnie881
Master Smack Fu Yak Hacker

1231 Posts

Posted - 2007-04-20 : 00:51:33
The general concensous is NOT EXISTS is better to use in this scenerio (It's been explained to me in a previous post that it's normally better practice to not need to add and join another table unless it's required for the query), but LoztinSpace is correct, check the execution plan on both to see what performs best for you.
Go to Top of Page
   

- Advertisement -