| Author |
Topic |
|
raulcleary
Starting Member
14 Posts |
Posted - 2008-06-03 : 12:09:23
|
Hello all,I'm new to SQL and my Teach Yourself in Blah Blah Blah book doesn't index anything helpful to my task. I have a single TABLE ratings (rid INTEGER PRIMARY KEY, mvid INTEGER, uid INTEGER, rating INTEGER) with about 100,000,000 rows. I would ultimately like to be able to select rows that are the intersection of two mvid on uid or vice versa -- that is, all rows whose uid is present in the set of rows where mvid=a AND in the set where mvid=b (or vice versa). Something like:rid mvid uid rating----------------------1 123 76 3 2 123 31 43 123 44 24 211 65 25 211 76 46 211 44 57 535 93 38 535 76 2SELECT * FROM ratings INTO @temp_table WHERE mvid=123SELECT * FROM ratings WHERE uid IN @temp_table AND mvid=211 returns5 211 76 46 211 44 5 FIRST, I don't know how to accomplish this intersection at all and the only idea I've had is querying the first clause of the intersection and storing it in a temporary/variable table then selecting from that and the original table for the second clause (not that I know how to do this), but I'm afraid this will be very inefficient for the >O(n^2) queries I must perform, so SECOND, should I build derived intersection tables from the results so as to have ~O(1) when repeating the queries later, or will SQL be doing sufficient behind-the-scenes magic? Is there an efficient SQL statement that could generate table(s) of the set of intersections if necessary? Thanks |
|
|
visakh16
Very Important crosS Applying yaK Herder
52326 Posts |
Posted - 2008-06-03 : 12:35:14
|
| SELECT MAX(CASE WHEN mvid=211 THEN rid ELSE NULL END) as rid, MAX(CASE WHEN mvid=211 THEN mvid ELSE NULL END) as mvid, MAX(CASE WHEN mvid=211 THEN uid ELSE NULL END) as uid, MAX(CASE WHEN mvid=211 THEN rating ELSE NULL END) as rating, FROM ratings WHERE mvid IN (123,211)GROUP BY mvidHAVING COUNT(DISTINCT mvid) =2 |
 |
|
|
raulcleary
Starting Member
14 Posts |
Posted - 2008-06-05 : 11:41:55
|
Thanks for the reply, but this doesn't seem to work; it always returns the empty set. It doesn't find intersections between two sets I know to have a nonempty intersection, not even when both mvid are equal (i.e. set intersection with a set itself). What do you think of:SELECT * FROM ratings WHERE movie_id = 15124 AND user_id IN (SELECT user_id FROM ratings WHERE movie_id=5317) The query is currently running and will take all day. If it finishes. And may not work. |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-06-05 : 11:53:38
|
What you want is a query that returns all data from table ratings where users has rated BOTH movieX AND movieY ? E 12°55'05.25"N 56°04'39.16" |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-06-05 : 11:56:22
|
INNER JOIN is MUCH faster than IN.SELECT r.*FROM Ratings AS rINNER JOIN Ratings AS u ON u.UserID = r.UserIDWHERE r.MovieID = 15124 AND u.MovieID = 5317 E 12°55'05.25"N 56°04'39.16" |
 |
|
|
raulcleary
Starting Member
14 Posts |
Posted - 2008-06-05 : 19:55:27
|
| Basically yes. I don't always need more than the user_id of people who have rated both movies, but I will need the information often. I think I ought to index it, but I'm not sure how the size/time of that works out -- 18k movies, 480k users, 100M ratings -- is there a meaningful index on a huge lump of data?Hmmm...that INNER JOIN ran for 418 minutes before I killed it. My original syntax took 3.5 minutes (which is still unacceptable). |
 |
|
|
tkizer
Almighty SQL Goddess
38200 Posts |
Posted - 2008-06-05 : 20:01:18
|
| You should index the following:Ratings.UserIDRatings.MovieIDYou may want to add include columns to the UserID index so that it covers the returned columns.Tara KizerMicrosoft MVP for Windows Server System - SQL Serverhttp://weblogs.sqlteam.com/tarad/Database maintenance routines:http://weblogs.sqlteam.com/tarad/archive/2004/07/02/1705.aspx |
 |
|
|
raulcleary
Starting Member
14 Posts |
Posted - 2008-06-09 : 17:02:40
|
quote: Originally posted by Peso INNER JOIN is MUCH faster than IN.SELECT r.*FROM Ratings AS rINNER JOIN Ratings AS u ON u.UserID = r.UserIDWHERE r.MovieID = 15124 AND u.MovieID = 5317
This is still running so slow I've never let it finish. After indexing movie_id (the longer indexing keeps getting killed by timeouts), my other syntax takes 6s, but faster would still be better. Is there maybe a typo in the above code? Thanks! |
 |
|
|
tkizer
Almighty SQL Goddess
38200 Posts |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-06-10 : 01:13:03
|
Replace "r.*" with a column list containing only the records you really want.It takes a few queries until SQL Server has enough statistics to run the query fast. E 12°55'05.25"N 56°04'39.16" |
 |
|
|
raulcleary
Starting Member
14 Posts |
Posted - 2008-06-11 : 10:58:04
|
quote: Originally posted by tkizerIf there was a typo, you'd receive an error.
I meant more a braino than a typo. |
 |
|
|
raulcleary
Starting Member
14 Posts |
Posted - 2008-06-11 : 10:59:36
|
quote: Originally posted by PesoReplace "r.*" with a column list containing only the records you really want.It takes a few queries until SQL Server has enough statistics to run the query fast.
Will do. Thanks. |
 |
|
|
Lamprey
Master Smack Fu Yak Hacker
4614 Posts |
Posted - 2008-06-11 : 11:42:20
|
I have not tested performance, but if you are using SQL 2005 there is the INTERSECT operator:SELECT A.uid FROM @Rating AS A WHERE mvid=123INTERSECTSELECT uid FROM @Rating AS B WHERE mvid=211 |
 |
|
|
raulcleary
Starting Member
14 Posts |
Posted - 2008-06-21 : 16:36:26
|
Okay, so now that I have this simple version working at last, now how do I use the results of a query be the set the next query iterates over without putting that logic outside the SQL. A simple example would be how I return a list of the movie_id's with the averages for the ratings of that movie. SELECT movie_id, avg(ratings) FROM ratings WHERE movie_id = 1 gives me the average for the first movie and SELECT DISTINCT movid_id FROM ratings gives me a list of the movie_id's, but I don't know how to get the first to iterate over the second. I'm currently running SELECT r.movie_id, avg(r.rating)FROM ratings AS rINNER JOIN ratings AS uON r.movie_id = u.movie_idWHERE r.movie_id IN (1,2,3); trying to get it to just do the first 3, but it returns the single average of the set of all three. In the future, the set to be iterated over will be like the intersections you all already helped me with. Thanks! |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-06-21 : 16:57:01
|
No need to self-join here.SELECT movie_id, avg(rating)FROM ratingsWHERE movie_id IN (1, 2, 3)GROUP BY movie_id E 12°55'05.25"N 56°04'39.16" |
 |
|
|
raulcleary
Starting Member
14 Posts |
Posted - 2008-06-22 : 14:46:53
|
quote: Originally posted by Peso No need to self-join here.
D'oh. I tried that, but in fact must not have. Thanks! |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-06-22 : 14:49:18
|
If rating are integers, you should includeavg(1.0 * rating)instead, so that the result is not converted to integer and nither truncated nor rounded. E 12°55'05.25"N 56°04'39.16" |
 |
|
|
raulcleary
Starting Member
14 Posts |
Posted - 2008-06-24 : 03:39:59
|
Okay, time for my daily lame (but I'm getting better). Trying to build a table of Hamming distances. Calculating a single distance is easy enough SELECT COUNT(user_id) FROM(SELECT user_id FROM ratings WHERE movie_id=1INTERSECTSELECT user_id FROM ratings WHERE movie_id=2); but I'd again like to iterate over the whole set of movie_id's. I'd actually like to get the output to be in the form MOVIE A|MOVIE B|COUNT(user_id) to insert into the Hamming distance table where the pair of movie_id's forms the primary key. |
 |
|
|
visakh16
Very Important crosS Applying yaK Herder
52326 Posts |
Posted - 2008-06-24 : 03:46:12
|
| So you want two movie ids at a time with count of users in both without taking each movie more than once? |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-06-24 : 04:05:25
|
Is it something like this you want?DECLARE @Sample TABLE ( rid INT, mvid INT, uid INT, rating INT )INSERT @SampleSELECT 1, 123, 76, 3 UNION ALLSELECT 2, 123, 31, 4 UNION ALLSELECT 3, 123, 44, 2 UNION ALLSELECT 4, 211, 65, 2 UNION ALLSELECT 5, 211, 76, 4 UNION ALLSELECT 6, 211, 44, 5 UNION ALLSELECT 7, 535, 93, 3 UNION ALLSELECT 8, 535, 76, 2SELECT *FROM @SampleSELECT s1.mvid AS MovieA, s2.mvid AS MovieB, COUNT(*) AS VotersFROM @Sample AS s1INNER JOIN @Sample AS s2 ON s2.uid = s1.uidWHERE s1.mvid < s2.mvidGROUP BY s1.mvid, s2.mvidORDER BY s1.mvid, s2.mvid E 12°55'05.25"N 56°04'39.16" |
 |
|
|
raulcleary
Starting Member
14 Posts |
Posted - 2008-06-24 : 14:57:34
|
quote: Originally posted by Peso Is it something like this you want?
Exactly like that as it turns out. Thank you again. |
 |
|
|
Next Page
|