Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 General SQL Server Forums
 New to SQL Server Programming
 Within table Intersection

Author  Topic 

raulcleary
Starting Member

14 Posts

Posted - 2008-06-03 : 12:09:23
Hello all,
I'm new to SQL and my Teach Yourself in Blah Blah Blah book doesn't index anything helpful to my task. I have a single
TABLE ratings (rid INTEGER PRIMARY KEY, mvid INTEGER, uid INTEGER, rating INTEGER)
with about 100,000,000 rows. I would ultimately like to be able to select rows that are the intersection of two mvid on uid or vice versa -- that is, all rows whose uid is present in the set of rows where mvid=a AND in the set where mvid=b (or vice versa). Something like:
rid mvid uid rating
----------------------
1 123 76 3
2 123 31 4
3 123 44 2
4 211 65 2
5 211 76 4
6 211 44 5
7 535 93 3
8 535 76 2

SELECT * FROM ratings INTO @temp_table WHERE mvid=123
SELECT * FROM ratings WHERE uid IN @temp_table AND mvid=211
returns
5 211 76 4
6 211 44 5

FIRST, I don't know how to accomplish this intersection at all and the only idea I've had is querying the first clause of the intersection and storing it in a temporary/variable table then selecting from that and the original table for the second clause (not that I know how to do this), but I'm afraid this will be very inefficient for the >O(n^2) queries I must perform, so SECOND, should I build derived intersection tables from the results so as to have ~O(1) when repeating the queries later, or will SQL be doing sufficient behind-the-scenes magic? Is there an efficient SQL statement that could generate table(s) of the set of intersections if necessary? Thanks

visakh16
Very Important crosS Applying yaK Herder

52326 Posts

Posted - 2008-06-03 : 12:35:14
SELECT MAX(CASE WHEN mvid=211 THEN rid ELSE NULL END) as rid,
MAX(CASE WHEN mvid=211 THEN mvid ELSE NULL END) as mvid,
MAX(CASE WHEN mvid=211 THEN uid ELSE NULL END) as uid,
MAX(CASE WHEN mvid=211 THEN rating ELSE NULL END) as rating,
FROM ratings
WHERE mvid IN (123,211)
GROUP BY mvid
HAVING COUNT(DISTINCT mvid) =2
Go to Top of Page

raulcleary
Starting Member

14 Posts

Posted - 2008-06-05 : 11:41:55
Thanks for the reply, but this doesn't seem to work; it always returns the empty set. It doesn't find intersections between two sets I know to have a nonempty intersection, not even when both mvid are equal (i.e. set intersection with a set itself). What do you think of:
SELECT * FROM ratings WHERE movie_id = 15124 AND user_id IN (SELECT user_id FROM ratings WHERE movie_id=5317)

The query is currently running and will take all day. If it finishes. And may not work.
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-06-05 : 11:53:38
What you want is a query that returns all data from table ratings where users has rated BOTH movieX AND movieY ?



E 12°55'05.25"
N 56°04'39.16"
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-06-05 : 11:56:22
INNER JOIN is MUCH faster than IN.
SELECT		r.*
FROM Ratings AS r
INNER JOIN Ratings AS u ON u.UserID = r.UserID
WHERE r.MovieID = 15124
AND u.MovieID = 5317



E 12°55'05.25"
N 56°04'39.16"
Go to Top of Page

raulcleary
Starting Member

14 Posts

Posted - 2008-06-05 : 19:55:27
Basically yes. I don't always need more than the user_id of people who have rated both movies, but I will need the information often. I think I ought to index it, but I'm not sure how the size/time of that works out -- 18k movies, 480k users, 100M ratings -- is there a meaningful index on a huge lump of data?

Hmmm...that INNER JOIN ran for 418 minutes before I killed it. My original syntax took 3.5 minutes (which is still unacceptable).
Go to Top of Page

tkizer
Almighty SQL Goddess

38200 Posts

Posted - 2008-06-05 : 20:01:18
You should index the following:

Ratings.UserID
Ratings.MovieID

You may want to add include columns to the UserID index so that it covers the returned columns.

Tara Kizer
Microsoft MVP for Windows Server System - SQL Server
http://weblogs.sqlteam.com/tarad/

Database maintenance routines:
http://weblogs.sqlteam.com/tarad/archive/2004/07/02/1705.aspx
Go to Top of Page

raulcleary
Starting Member

14 Posts

Posted - 2008-06-09 : 17:02:40
quote:
Originally posted by Peso

INNER JOIN is MUCH faster than IN.
SELECT		r.*
FROM Ratings AS r
INNER JOIN Ratings AS u ON u.UserID = r.UserID
WHERE r.MovieID = 15124
AND u.MovieID = 5317

This is still running so slow I've never let it finish. After indexing movie_id (the longer indexing keeps getting killed by timeouts), my other syntax takes 6s, but faster would still be better. Is there maybe a typo in the above code? Thanks!
Go to Top of Page

tkizer
Almighty SQL Goddess

38200 Posts

Posted - 2008-06-09 : 17:04:36
If there was a typo, you'd receive an error.

Tara Kizer
Microsoft MVP for Windows Server System - SQL Server
http://weblogs.sqlteam.com/tarad/

Subscribe to my blog
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-06-10 : 01:13:03
Replace "r.*" with a column list containing only the records you really want.
It takes a few queries until SQL Server has enough statistics to run the query fast.


E 12°55'05.25"
N 56°04'39.16"
Go to Top of Page

raulcleary
Starting Member

14 Posts

Posted - 2008-06-11 : 10:58:04
quote:
Originally posted by tkizer
If there was a typo, you'd receive an error.

I meant more a braino than a typo.
Go to Top of Page

raulcleary
Starting Member

14 Posts

Posted - 2008-06-11 : 10:59:36
quote:
Originally posted by Peso
Replace "r.*" with a column list containing only the records you really want.
It takes a few queries until SQL Server has enough statistics to run the query fast.
Will do. Thanks.
Go to Top of Page

Lamprey
Master Smack Fu Yak Hacker

4614 Posts

Posted - 2008-06-11 : 11:42:20
I have not tested performance, but if you are using SQL 2005 there is the INTERSECT operator:
SELECT A.uid FROM @Rating AS A WHERE mvid=123
INTERSECT
SELECT uid FROM @Rating AS B WHERE mvid=211
Go to Top of Page

raulcleary
Starting Member

14 Posts

Posted - 2008-06-21 : 16:36:26
Okay, so now that I have this simple version working at last, now how do I use the results of a query be the set the next query iterates over without putting that logic outside the SQL. A simple example would be how I return a list of the movie_id's with the averages for the ratings of that movie.
SELECT movie_id, avg(ratings) FROM ratings WHERE movie_id = 1
gives me the average for the first movie and
SELECT DISTINCT movid_id FROM ratings
gives me a list of the movie_id's, but I don't know how to get the first to iterate over the second. I'm currently running
SELECT r.movie_id, avg(r.rating)
FROM ratings AS r
INNER JOIN ratings AS u
ON r.movie_id = u.movie_id
WHERE r.movie_id IN (1,2,3);
trying to get it to just do the first 3, but it returns the single average of the set of all three. In the future, the set to be iterated over will be like the intersections you all already helped me with. Thanks!
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-06-21 : 16:57:01
No need to self-join here.

SELECT movie_id, avg(rating)
FROM ratings
WHERE movie_id IN (1, 2, 3)
GROUP BY movie_id



E 12°55'05.25"
N 56°04'39.16"
Go to Top of Page

raulcleary
Starting Member

14 Posts

Posted - 2008-06-22 : 14:46:53
quote:
Originally posted by Peso

No need to self-join here.

D'oh. I tried that, but in fact must not have. Thanks!
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-06-22 : 14:49:18
If rating are integers, you should include
avg(1.0 * rating)

instead, so that the result is not converted to integer and nither truncated nor rounded.



E 12°55'05.25"
N 56°04'39.16"
Go to Top of Page

raulcleary
Starting Member

14 Posts

Posted - 2008-06-24 : 03:39:59
Okay, time for my daily lame (but I'm getting better). Trying to build a table of Hamming distances. Calculating a single distance is easy enough
SELECT COUNT(user_id) FROM
(SELECT user_id FROM ratings WHERE movie_id=1
INTERSECT
SELECT user_id FROM ratings WHERE movie_id=2);
but I'd again like to iterate over the whole set of movie_id's. I'd actually like to get the output to be in the form MOVIE A|MOVIE B|COUNT(user_id) to insert into the Hamming distance table where the pair of movie_id's forms the primary key.
Go to Top of Page

visakh16
Very Important crosS Applying yaK Herder

52326 Posts

Posted - 2008-06-24 : 03:46:12
So you want two movie ids at a time with count of users in both without taking each movie more than once?
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-06-24 : 04:05:25
Is it something like this you want?
DECLARE	@Sample TABLE
(
rid INT,
mvid INT,
uid INT,
rating INT
)

INSERT @Sample
SELECT 1, 123, 76, 3 UNION ALL
SELECT 2, 123, 31, 4 UNION ALL
SELECT 3, 123, 44, 2 UNION ALL
SELECT 4, 211, 65, 2 UNION ALL
SELECT 5, 211, 76, 4 UNION ALL
SELECT 6, 211, 44, 5 UNION ALL
SELECT 7, 535, 93, 3 UNION ALL
SELECT 8, 535, 76, 2

SELECT *
FROM @Sample

SELECT s1.mvid AS MovieA,
s2.mvid AS MovieB,
COUNT(*) AS Voters
FROM @Sample AS s1
INNER JOIN @Sample AS s2 ON s2.uid = s1.uid
WHERE s1.mvid < s2.mvid
GROUP BY s1.mvid,
s2.mvid
ORDER BY s1.mvid,
s2.mvid



E 12°55'05.25"
N 56°04'39.16"
Go to Top of Page

raulcleary
Starting Member

14 Posts

Posted - 2008-06-24 : 14:57:34
quote:
Originally posted by Peso

Is it something like this you want?

Exactly like that as it turns out. Thank you again.
Go to Top of Page
    Next Page

- Advertisement -