wgpubs
Yak Posting Veteran
67 Posts
Posted - 2003-07-10 : 19:01:41
I'm trying to reproduce something similar to the random-picture functionality seen on sites like hotornot.com, amihot.com, etc. I've reviewed just about every article on this site (I think) regarding various techniques to fetch a random single record, BUT I have yet to find one that performs well with a large table and various criteria specified in the WHERE clause.

Here is the scenario:

1. I've created a simple table consisting of photo-specific info (photoid, gender, category, etc.).
2. I've artificially created over 1 million records.
3. I need to be able to fetch a random record based on gender and/or category.

What is the best, most efficient way to make that happen??? Any ideas on solving this dilemma would be extremely appreciated.

thanks
tkizer
Almighty SQL Goddess
38200 Posts
Posted - 2003-07-10 : 19:05:43
Have you tried this out?

SELECT TOP 1 *
FROM Table1
WHERE Gender = 'M'
ORDER BY NEWID()

SELECT TOP 1 *
FROM Table1
WHERE Category = 'Sports'
ORDER BY NEWID()

Tara
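A quick way to see this technique in action outside SQL Server: in SQLite, RANDOM() plays the same role that NEWID() plays in Tara's T-SQL, a throwaway random sort key per row. This is an illustrative sketch with made-up sample data, not code from the thread:

```python
import sqlite3

# Tiny stand-in for the photo table described in the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (photoid INTEGER PRIMARY KEY, gender TEXT)")
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?)",
    [(i, "M" if i % 2 else "F") for i in range(1, 1001)],
)

# Equivalent of: SELECT TOP 1 * FROM Table1 WHERE Gender = 'M' ORDER BY NEWID()
# Every matching row gets a random sort key; keep the first one.
row = conn.execute(
    "SELECT photoid, gender FROM Table1 "
    "WHERE gender = 'M' ORDER BY RANDOM() LIMIT 1"
).fetchone()
print(row)
```

The cost profile is the same as with NEWID(): every row that survives the WHERE clause must be assigned a random key and sorted, which is exactly why the thread finds it slow at a million rows.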
wgpubs
Yak Posting Veteran
67 Posts
Posted - 2003-07-10 : 19:13:12
yes I have ... while this technique works great for a relatively small table, it becomes unbearably slow as the size of the table increases.

Any other ideas?

thanks

quote:
Have you tried this out?

SELECT TOP 1 *
FROM Table1
WHERE Gender = 'M'
ORDER BY NEWID()

SELECT TOP 1 *
FROM Table1
WHERE Category = 'Sports'
ORDER BY NEWID()

Tara
tkizer
Almighty SQL Goddess
38200 Posts
Posted - 2003-07-10 : 19:19:40
How about setting the rowcount value so that you get a sample of data, then get a random record from that sample:

SET ROWCOUNT 10000

SELECT *
FROM Table1
WHERE Category = 'Sports'
ORDER BY NEWID()

SET ROWCOUNT 0

Change the ROWCOUNT value from 10000 to whatever size you want the sample to be. You should change ROWCOUNT back to 0, as in my example, so that it doesn't affect other queries.

Tara
wgpubs
Yak Posting Veteran
67 Posts
Posted - 2003-07-10 : 19:33:55
actually the performance is worse using this code :(

quote:
How about setting the rowcount value so that you get a sample of data, then get a random record from that sample:

SET ROWCOUNT 10000

SELECT *
FROM Table1
WHERE Category = 'Sports'
ORDER BY NEWID()

SET ROWCOUNT 0

Change the ROWCOUNT value from 10000 to whatever size you want the sample to be. You should change ROWCOUNT back to 0, as in my example, so that it doesn't affect other queries.

Tara
tkizer
Almighty SQL Goddess
38200 Posts
Posted - 2003-07-10 : 19:45:15
Well the performance should have been better. Do you have an index on the column in your WHERE clause? Anyway, have you tried this:

[url]http://www.aspfaqs.com/aspfaqs/ShowFAQ.asp?FAQID=173[/url]

If you have tried it, then let us know what you have tried so we don't keep asking if you have tried something.

Tara
wgpubs
Yak Posting Veteran
67 Posts
Posted - 2003-07-10 : 21:12:39
unfortunately the technique described in the article requires a contiguous set of ID values ... which would not work in my case since the WHERE clause does not return a contiguous set of ID values.

In terms of what I have tried ... basically I've gone through and tried just about every technique described in the various articles on random selection on this site. The one that works the best so far is described at http://www.sqlteam.com/item.asp?ItemID=576. The query essentially looks like this:

SELECT ID, yada
FROM Foo
WHERE ID = (SELECT CAST((rand() * @RowCount) as int) + 1)

The problem: this too requires a contiguous set of IDs. If I can find a way to use almost the identical query WITHOUT having to rely on a sequential set of IDs ... that seems to me to be ideal. Any other ideas? Appreciate the help ... thanks!

quote:
Well the performance should have been better. Do you have an index on the column in your WHERE clause? Anyway, have you tried this:

[url]http://www.aspfaqs.com/aspfaqs/ShowFAQ.asp?FAQID=173[/url]

If you have tried it, then let us know what you have tried so we don't keep asking if you have tried something.

Tara
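The arithmetic behind that sqlteam query, and the reason it needs contiguous IDs, can be sketched in a few lines of Python. The row count here is a made-up illustration value, not a measurement from the thread:

```python
import random

# Hypothetical: pretend Foo has exactly row_count rows whose IDs are
# the unbroken sequence 1..row_count (the contiguity assumption).
row_count = 1_000_000

# Python equivalent of: SELECT CAST((rand() * @RowCount) as int) + 1
random_id = int(random.random() * row_count) + 1

# The lookup WHERE ID = random_id is then a single index seek -- fast --
# but only because every value in [1, row_count] is guaranteed to exist.
# Filter the table by gender or category and the surviving IDs have
# gaps, so random_id may match no row at all and the query comes back
# empty: that is the contiguity problem described in this post.
```

Picking a uniform integer is cheap; the hard part the thread is wrestling with is mapping that integer onto a gappy, filtered ID space without scanning it.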
jsmith8858
Dr. Cross Join
7423 Posts
Posted - 2003-07-10 : 21:37:22
you COULD say something like this:

-- First, get your min and max range:
select @minRange = min(ID), @maxRange = max(ID)
FROM Foo

-- next, get the first ID LESS than a random number in that range:
SELECT @RandomID = Max(ID)
FROM Foo
WHERE ID <= (SELECT CAST((rand() * (@MaxRange - @MinRange) + @MinRange) as int) + 1)

-- now, return it:
SELECT *
FROM Foo
WHERE ID = @RandomID

HOWEVER ... this will not be perfectly random. For example, if you have this data:

2
3
6
7
8

Then @MinRange will be 2, @MaxRange will be 8, and you will be generating a number from 2 to 8. But the distribution goes like this:

Random Number Generated -> ID Returned
2 -> 2
3 -> 3
4 -> 3
5 -> 3
6 -> 6
7 -> 7
8 -> 8

Note that the ID of 3 has a better chance of being returned than the numbers 2, 6, 7 or 8.

Somewhere near, the answer lies .... Arnold, where are you ???

By the way -- i can totally see why the NEWID() thing will be slow. By definition, SQL Server will need to process EVERY SINGLE ROW in the table that it is considering, generate this NEWID() field, and then return the highest. No indexing will help in that particular case, hopefully that makes sense.

- Jeff
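The bias Jeff describes is easy to verify with a small simulation over his example IDs. This is an illustrative sketch, not code from the thread:

```python
import random
from collections import Counter

ids = [2, 3, 6, 7, 8]          # the example IDs from the post
lo, hi = min(ids), max(ids)    # @MinRange = 2, @MaxRange = 8

counts = Counter()
for _ in range(70_000):
    r = random.randint(lo, hi)                    # random value in [2, 8]
    # Max(ID) WHERE ID <= r: the largest existing ID not exceeding r
    counts[max(i for i in ids if i <= r)] += 1

# ID 3 absorbs the draws 3, 4 and 5, so it wins roughly three times
# as often as 2, 6, 7 or 8, matching the distribution table above.
print(counts)
```

The skew is proportional to the size of the ID gap in front of each row, so on a heavily filtered million-row table the rows just after large gaps would be dramatically over-represented.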
robvolk
Most Valuable Yak
15732 Posts
Posted - 2003-07-10 : 21:57:22
quote: No indexing will help in that particular case, hopefully that makes sense.
Perhaps it might:

SELECT * FROM Table1
WHERE photoid = (SELECT TOP 1 photoid FROM Table1 ORDER BY NewID())

In that case, the subquery can use the index on photoid and scan it instead of hitting the table. It will return a single ID value which the outer query can use to seek the table. If photoid is a clustered index then it's less likely to help. This also doesn't take into consideration the WHERE clause you need, but the same concept applies: if the WHERE clause can use indexes and you SELECT only the key column in the subquery, you can minimize the amount of data that needs to be read in order to return the results you want.
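Rob's narrow-subquery-then-seek pattern can be sketched in SQLite as well. The table shape and secondary index here are hypothetical stand-ins for the photo table; the point is that the inner query only touches indexed key columns while the outer query fetches the one wide row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Table1 ("
    "photoid INTEGER PRIMARY KEY, gender TEXT, big_payload TEXT)"
)
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?)",
    [(i, "M" if i % 2 else "F", "x" * 200) for i in range(1, 501)],
)
# Narrow covering index so the subquery never reads big_payload.
conn.execute("CREATE INDEX ix_gender ON Table1 (gender, photoid)")

# Inner query: random pick over narrow index rows only.
# Outer query: one keyed lookup to fetch the full (wide) row.
row = conn.execute(
    "SELECT * FROM Table1 WHERE photoid = "
    "(SELECT photoid FROM Table1 WHERE gender = 'M' "
    " ORDER BY RANDOM() LIMIT 1)"
).fetchone()
print(row[0], row[1])
```

The random sort still visits every qualifying key, but sorting 8-byte keys instead of full photo rows is where the byte savings Rob quantifies below come from.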
jsmith8858
Dr. Cross Join
7423 Posts
Posted - 2003-07-10 : 22:20:24
Only one way to find out! Run this in the Northwind database and look at the plans:

-- first, ordering by an indexed field:
select top 1 OrderID from orders order by OrderID Desc
select top 1 * from orders order by OrderID Desc

-- next, ordering by newid():
select top 1 OrderID from orders order by NewID()
select top 1 * from orders order by NewID()

-- next, ordering by a non-indexed field:
select top 1 OrderID from orders order by Freight Desc
select top 1 * from orders order by Freight Desc

-- finally, using the technique Rob suggests:
select * from orders where orderID = (select top 1 orderID from orders order by NewID())

It does seem faster than SELECT TOP 1 * .... not bad! I didn't think it would make a difference. Still not an ideal solution, but it does improve performance a little.

Edited by - jsmith8858 on 07/10/2003 22:31:45
robvolk
Most Valuable Yak
15732 Posts
Posted - 2003-07-11 : 00:43:07
quote: Still not an ideal solution, but it does improve performance a little.
Maybe not ideal, but I think it can be as close as possible with the table design. I dug into something that may reveal a key improvement: the row counts and row sizes for the two query versions:

select * from orders where orderid = (select top 1 orderid from orders order by newid())

select top 1 * from orders order by newid()

The first version has to process a total of 1663 rows, the second 1661. BUT... the row sizes of the first one are MUCH smaller than the rows processed by the second. Bigger rows mean more bytes/pages read and processed. By my calculations, the first query only has to read 50,342 bytes, the second reads 410,267 bytes; an eight-fold increase. Considering that this is from a source table of only 830 rows, you can see how much of a greater difference there would be on a 1 million+ row table.

Of course, as Jeff said, the only way to find out is to test it. Definitely try testing as many index combinations as you can and measure the results repeatedly, THEN compare them. I think you can find one that will give you much better performance than the method you're using now. A covering index on category and gender would probably be best, and perhaps another index on gender only.