Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2008 Forums
 Transact-SQL (2008)
 compare of 2 big tables

Author  Topic 

pelegk2
Aged Yak Warrior

723 Posts

Posted - 2010-03-17 : 15:43:40
i have a table with 170million rows (tableA),
and a second table which have about 500,000 rows(tableB).
i want to compare a certain column from tableA (type bigint)
to check if a value of it exist in tableB.
i tried to do this using inner join, making a select with IN expression, but it runs for hours.
any idea how i can improve te result?
(the column from tableA dosent have an index, and the column from tableB is defined as primary key)

Thanks
Peleg

webfred
Master Smack Fu Yak Hacker

8781 Posts

Posted - 2010-03-17 : 15:57:50
I would create the needed index.


No, you're never too old to Yak'n'Roll if you're too young to die.
Go to Top of Page

Kristen
Test

22859 Posts

Posted - 2010-03-17 : 16:19:40
" i tried to do this using inner join, making a select with IN expression, but it runs for hours."

Is it a JOIN or is it an IN clause?

An IN clause may make a Sort and DISTINCT on the whole of TableB

A JOIN would not have to do that, and if the value is the PK in TableB then a BookMark Lookup, or equivalent, should eb very quick (even if the Value column is not indexes in TableA)

(In writing that I realise that the Value Column is the PK for TableB, and thus QueryN Optimiser should not consider needing to do ORDER BY and DISTINCT as the values in TableB must be unique ... Best to look at the Query Plan and see what's what).

I have known EXISTS and IN produce disastrous query plans, where JOIN does not.

But JOIN will return multiple records if there are duplicates on TableB of course - which EXISTS and IN will not do.
Go to Top of Page

pelegk2
Aged Yak Warrior

723 Posts

Posted - 2010-03-18 : 04:42:29
i used the Exists and it seems to help,
the problem is that i split the compare into bulk of 1 million rows each times(which take 1.18 min), but when i try to make a query
with a parameter (@minid), the time increases
select distinct ReasonRef from Con_RecipientToMessages (nolock)where Con_RecipientToMessages.Id>=@minid and Con_RecipientToMessages.Id<= @minid+1000000and exists (select ReasonRef from #tblReasonList(nolock) where Con_RecipientToMessages.ReasonRef=#tblReasonList.ReasonRef)
what can i do with the query where i use a variable to improve performance?
Go to Top of Page

Kristen
Test

22859 Posts

Posted - 2010-03-18 : 05:02:35
Does a JOIN perform differently?

select distinct RTM.ReasonRef
from Con_RecipientToMessages (nolock) AS RTM
JOIN #tblReasonList(nolock) AS RL
ON RL.ReasonRef = RTM.ReasonRef
where RTM.Id >= @minid
and RTM.Id <= @minid+1000000

Which is the smaller table? If I understood your first message correctly #tblReasonList is the smaller table and ReasonRef is the PK?

If so then

SELECT RL.ReasonRef
FROM #tblReasonList(nolock) AS RL
WHERE EXISTS
(
SELECT *
FROM Con_RecipientToMessages (nolock) AS RTM
WHERE RTM.ReasonRef = RL.ReasonRef
AND RTM.Id >= @minid
AND RTM.Id <= @minid+1000000
)

as this avoids the Distinct.

I don't like to see those NOLOCK there - serious possibility of highly undesirable side effects. Especially on the #tblReasonList - no other process can be referencing that, other than you, surely??

If you have blocking / deadlock issues use READ_COMMITTED_SNAPSHOT instead. It has none of the disastrous side effects of NOLOCK and all the upside - u9nless you have some special circumstances.
Go to Top of Page

pelegk2
Aged Yak Warrior

723 Posts

Posted - 2010-03-18 : 05:23:38
Kristen first thanks for the tips,
the thing that i want from this query is check if exist any value from one table inthe other table (it dosen't realy matter the value itself
or how many will be find)
for the ReasonRef - it's not a PK, but current i am building an index,so maybe it will improve the performance.

Go to Top of Page

pelegk2
Aged Yak Warrior

723 Posts

Posted - 2010-03-18 : 06:03:22
Thanks Kristen
the second query+the index improved it to 50-110 seconds search over 1million rows:)

the thing is that with parameters the query is still slower - any idea?
Go to Top of Page

Kristen
Test

22859 Posts

Posted - 2010-03-18 : 06:16:55
Get rid of those NOLOCKs please!
Go to Top of Page

pelegk2
Aged Yak Warrior

723 Posts

Posted - 2010-03-18 : 06:50:17
quote:
Originally posted by Kristen

Get rid of those NOLOCKs please!



from the temporary table (i did so) or from the second one as well?
Go to Top of Page

Kristen
Test

22859 Posts

Posted - 2010-03-18 : 07:38:51
UP to you.

Are you happy that some rows from Con_RecipientToMessages may not be included at all?

That is possible with NOLOCK, and that possibility is the case everywhere you use it - so if you have that in your code in lots of places they all have that risk. NOLOCK can also cause a row to be included twice (won't be a problem here as you are using DISTINCT)

Why do you use it? Like I said, if you having blocking/deadlock problems then use READ_COMMITTED_SNAPSHOT instead.
Go to Top of Page
   

- Advertisement -