Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2000 Forums
 SQL Server Development (2000)
 Are IN statements as good as JOINs?

Author  Topic 

sqlghost
Starting Member

23 Posts

Posted - 2005-07-31 : 13:02:08
Hi,

I'm writing an application using SQL server 2000, and I've come to a dillemma.

Programatically, it is tremendeously simpler for me to use an IN statement than it is to use a regular JOIN. I'm afraid this might cause performance issues, which could be a problem as this application might have to handle large returned lists from the IN statement?

Should I use the simpler IN method or work on a more complex solution with JOINs?

Kristen
Test

22859 Posts

Posted - 2005-07-31 : 14:08:20
Assuming you are meaning the difference between

SELECT *
FROM MyTableA
WHERE MyColA IN
(
SELECT MyColB
FROM MyTableB
WHERE SomeColB = 'FOO'
)

and something like

SELECT *
FROM MyTableA
JOIN MyTableB
ON MyColB = MyColA
WHERE SomeColB = 'FOO'

then I would expect the second method to be faster. But a test with some real world data & data volumes wouldn't do any harm!

Kristen
Go to Top of Page

rockmoose
SQL Natt Alfen

3279 Posts

Posted - 2005-07-31 : 14:27:23
It's not given that the one method is faster than the other.
The query optimizer can "rewrite" and change join order internally of the query in order to make the best execution plan it can estimate.

You would need to run real tests.
I suspect though that ANSI JOIN syntax is preferred, and the query optimizer makes the best job out of optimizing those.

I don't agree that the IN syntax is easier than the JOIN syntax.
At least not generally.

Post some sample query that You are doing, so that we can see what's going on.



rockmoose
Go to Top of Page

sqlghost
Starting Member

23 Posts

Posted - 2005-07-31 : 15:48:41
I'll try and explain this in a simple manner, since the database structure of my application is quite complex.

For simplicitie's sake, lets say I'll reffer to two of my tables: articles and comments. So, articles has all my articles with an identity int column, and comments holds comments to all the articles, with an int to identify which article the comment reffers to.

Now, the search I want to write has the ability to add an unlimited amount of conditions. You click add, select a condition type (lets say "Articles that have comments containing the words") and enter a value. If you want, you can add another condition of the same, or a different type (lets say "added after a date").

So, the way handle this, I have a header:
SELECT * FROM ARTICLES
and then I do a loop through all conditions adding each one with an AND between them and a WHERE at the start. Now, this only works if each condition is "self-contained". This is simple with IN statements, I just use:
ArticleId IN (SELECT ArticleId FROM ....)
If I want to use JOINs, I need to add a JOIN statement before or after my where clause. I can't add the JOINs in the middle, so my conditions arn't self contained, not to mention I need to JOIN every table only once.

So, is this serious? I'm gonna be running like 50-100k rows (database isn't filled yet, so I can't really test for performance).

Also, since this is a search, I'm not too worried about it taking 2-3 seconds longer. I'm more worried about memory issues (these searches could have several nested tables) and suffering a large performance loss. If the difference is like under 30%, I'd stick with the nested statement. Is this what I should expect from this method, or should I start looking into JOINs or perhaps something completely different?

Thanks for the help.
Go to Top of Page

rockmoose
SQL Natt Alfen

3279 Posts

Posted - 2005-07-31 : 16:20:05
Here is an alternative, that is pretty simple;

--------- Header (Can be created as a view) -------
SELECT a.* --<-- enumerate columns iof using * of course
FROM ARTICLES a
JOIN COMMENTS c ON a.ArticleId = c.ArticleId
JOIN XXXX x ON a.ArticleId = x.ArticleId
JOIN YYYY Y ON a.ArticleId = y.ArticleId
WHERE 1=1
--------- Genereated search criteria --------------
AND c.Comment IN('bla','blah','blaha')
AND x.www IN('bla','blah','blaha')
AND a.EntryDate > '20050101'


In the example above, since the table YYYY is no referenced in the genereated WHERE clause,
and since no columns are selected from it in the SELECT clause,
the query optimizer will efficiently remove it from the query plan.

If YOu create the header as a View, You will not even have to alias the tables.
And the client just selects and filters from that view.

Performancewise, this approach will be better if there are multiple filters on the same table.
Since in Your approach ther will be > 1 subselect from that table.
(if the query optimizer is really smart it could optimize that, but I'm not sure it will)
If that is not the case, I think the performance is the same.

But You will have to test

I hope this helps, and gives You some options.

rockmoose
Go to Top of Page

Kristen
Test

22859 Posts

Posted - 2005-07-31 : 16:35:04
So are you proposing something like

SELECT * FROM ARTICLES
WHERE ArticleId IN (SELECT ArticleId FROM COMMENTS WHERE CommentText LIKE '%FOO%')
AND ArticleId IN (SELECT ArticleId FROM COMMENTS WHERE CommentText LIKE '%BAR%')

if so I reckon its as easy to generate syntax like:

SELECT *
FROM ARTICLES A
JOIN COMMENTS C
ON C.ArticleId = A.ArticleId
AND
(
CommentText LIKE '%FOO%'
AND CommentText LIKE '%BAR%'
)

however, if you are going to use something along the line of LIKE '%FOO%' then its going to be hideously slow - the wildcard is going to force a table scan - possibly even two in this example.

Couple of possible alternatives:

1) make a table of keywords (split the Comments to give a list of the individual words, store those and then query against those)
2) Use the free text ability of SQL - in which case you can just search for a bunch of words in one strike (a bit like using Google) - and my best guess would be that dynamically generating the syntax for this would be pretty easy
3) Use a refining search

By (3) I mean store the results of matching the first word into a temporary table and then "refine" it using the result from the second, and so on. Alternatively keep ALL the individual resultsets, but add weighting - so an article that has a comment that matches both "foo" and "bar" scores 2 points, whereas one that only contains "foo" scores 1. Sort by that Ranking.

We use (3) when people search for words in a product description on our eCommerce sites. We split all the descriptions of the products into a table of Keywords. Admin folk then define additional keywords that are not in the description, or words that ARE in the description but are bad keywords - such as "Please note, FOO does not work with BAR" would mean that "BAR" was manually removed from the keywords.

We also have a table of aliases - so [given a toy shopping site] "Child" would have an alias of "Children" but possibly also of "boy" and "girl" etc.

(We also have a table of Noise words, so "the" etc. are removed from the keywords auto-generated from the Descripions - but, again, the Admin use can choose to add them back where that is important - we have an admin report that shows the frequence of words that are searched for, and they base their keywording decisions upon which words people are actually looking for)

Kristen
Go to Top of Page

sqlghost
Starting Member

23 Posts

Posted - 2005-07-31 : 16:44:42
rockmoose, that sounds like the best solution. I just need to make sure I don't run into more problems with this in other parts of my program. In fact, I can already think of one possible case in which I might use an IN clause, but that's just one of the many condition types I intend to add to the search, so that is acceptable for me.

Thanks again for all the help.
Go to Top of Page
   

- Advertisement -