
 Random Row Select and hotornot.com


wgpubs
Yak Posting Veteran

67 Posts

Posted - 2003-07-10 : 19:01:41
I'm trying to reproduce something similar to the random picture functionality seen on sites like hotornot.com, amihot.com, etc. I've reviewed just about every article on this site (I think) regarding various techniques to fetch a random single record, BUT I have yet to find one that performs well with a large table and various criteria specified in the WHERE clause.

Here is the scenario:

1. I've created a simple table consisting of photo-specific info (photoid, gender, category, etc.)
2. I've artificially created over 1 million records.
3. I need to be able to fetch a random record based on gender and/or category

What is the best, most efficient way to make that happen? Any ideas on solving this dilemma would be extremely appreciated.

thanks

tkizer
Almighty SQL Goddess

38200 Posts

Posted - 2003-07-10 : 19:05:43
Have you tried this out:

SELECT TOP 1 *
FROM Table1
WHERE Gender = 'M'
ORDER BY NEWID()

SELECT TOP 1 *
FROM Table1
WHERE Category = 'Sports'
ORDER BY NEWID()

Tara

wgpubs
Yak Posting Veteran

67 Posts

Posted - 2003-07-10 : 19:13:12
Yes I have. While this technique works great for a relatively small table, it becomes unbearably slow as the size of the table increases.

Any other ideas?

thanks

quote:

Have you tried this out:

SELECT TOP 1 *
FROM Table1
WHERE Gender = 'M'
ORDER BY NEWID()

SELECT TOP 1 *
FROM Table1
WHERE Category = 'Sports'
ORDER BY NEWID()

Tara




tkizer
Almighty SQL Goddess

38200 Posts

Posted - 2003-07-10 : 19:19:40
How about setting the ROWCOUNT value so that you get a sample of the data, then getting a random record from that sample:

SET ROWCOUNT 10000

SELECT *
FROM Table1
WHERE Category = 'Sports'
ORDER BY NEWID()

SET ROWCOUNT 0


Change the ROWCOUNT value from 10000 to whatever size you want the sample to be. You should set ROWCOUNT back to 0, as in my example, so that it doesn't affect subsequent queries.

Tara

wgpubs
Yak Posting Veteran

67 Posts

Posted - 2003-07-10 : 19:33:55
Actually, the performance is worse using this code :(


quote:

How about setting the ROWCOUNT value so that you get a sample of the data, then getting a random record from that sample:

SET ROWCOUNT 10000

SELECT *
FROM Table1
WHERE Category = 'Sports'
ORDER BY NEWID()

SET ROWCOUNT 0


Change the ROWCOUNT value from 10000 to whatever size you want the sample to be. You should set ROWCOUNT back to 0, as in my example, so that it doesn't affect subsequent queries.

Tara




tkizer
Almighty SQL Goddess

38200 Posts

Posted - 2003-07-10 : 19:45:15
Well the performance should have been better. Do you have an index on the column in your WHERE clause?

Anyway, have you tried this:
[url]http://www.aspfaqs.com/aspfaqs/ShowFAQ.asp?FAQID=173[/url]

If you have tried it, then let us know what you have tried so we don't keep asking if you have tried something.

Tara

wgpubs
Yak Posting Veteran

67 Posts

Posted - 2003-07-10 : 21:12:39
Unfortunately the technique described in the article requires a contiguous set of ID values, which would not work in my case since the WHERE clause does not return a contiguous set of IDs.

In terms of what I have tried: I've gone through just about every technique described in the various articles on random selection on this site. The one that works best so far is described at http://www.sqlteam.com/item.asp?ItemID=576. The query essentially looks like this:

SELECT ID, yada
FROM Foo
WHERE ID = (SELECT CAST((rand() * @RowCount) as int) + 1)

The problem: this too requires a contiguous set of IDs. If I could use almost the identical query WITHOUT having to rely on a sequential set of IDs, that would be ideal.
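For reference, the full pattern that query assumes looks roughly like this (the @RowCount setup is filled in by me; the table and column names are just the examples from above, and it assumes Foo.ID is a gap-free identity starting at 1):

```sql
-- Sketch of the full pattern from the article; assumes a contiguous ID column
DECLARE @RowCount int
SELECT @RowCount = COUNT(*) FROM Foo

SELECT ID, yada
FROM Foo
WHERE ID = (SELECT CAST((rand() * @RowCount) as int) + 1)
```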

Any other ideas? Appreciate the help ... thanks!

quote:

Well the performance should have been better. Do you have an index on the column in your WHERE clause?

Anyway, have you tried this:
[url]http://www.aspfaqs.com/aspfaqs/ShowFAQ.asp?FAQID=173[/url]

If you have tried it, then let us know what you have tried so we don't keep asking if you have tried something.

Tara




jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2003-07-10 : 21:37:22
you COULD say something like this:

DECLARE @minRange int, @maxRange int, @RandomID int

-- First, get your min and max range:
SELECT @minRange = MIN(ID), @maxRange = MAX(ID)
FROM Foo

-- next, get the largest ID less than or equal to a random number in that range:
SELECT @RandomID = MAX(ID)
FROM Foo
WHERE ID <= CAST(rand() * (@maxRange - @minRange + 1) as int) + @minRange

-- now, return it:
SELECT *
FROM Foo
WHERE ID = @RandomID


HOWEVER ... this will not be perfectly random. For example, if you have this data:


2
3
6
7
8

Then @minRange will be 2, @maxRange will be 8, and you will be generating a number from 2 to 8.

But the distribution goes like this:

Random Number Generated -> ID Returned
2 -> 2
3 -> 3
4 -> 3
5 -> 3
6 -> 6
7 -> 7
8 -> 8

Note that the ID of 3 has a better chance of being returned than 2, 6, 7, or 8.
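You can verify that mapping directly; a throwaway script along these lines (temp table and column names are arbitrary) reproduces the table above:

```sql
-- Demonstrate the skew: for each candidate random number 2..8,
-- show which ID the MAX(ID) <= n lookup would return
CREATE TABLE #Gap (ID int)
INSERT INTO #Gap
SELECT 2 UNION SELECT 3 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8

SELECT n.Num AS RandomNumber,
       (SELECT MAX(ID) FROM #Gap WHERE ID <= n.Num) AS ReturnedID
FROM (SELECT 2 AS Num UNION SELECT 3 UNION SELECT 4 UNION SELECT 5
      UNION SELECT 6 UNION SELECT 7 UNION SELECT 8) AS n

DROP TABLE #Gap
```

ID 3 comes back for three of the seven possible numbers, so it is three times as likely to be picked as any other row.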

Somewhere near, the answer lies .... Arnold, where are you ???


By the way, I can totally see why the NEWID() technique will be slow. By definition, SQL Server must process EVERY SINGLE ROW in the table it is considering, generate a NEWID() value for each, sort by it, and then return the top row. No indexing will help in that particular case; hopefully that makes sense.

- Jeff

robvolk
Most Valuable Yak

15732 Posts

Posted - 2003-07-10 : 21:57:22
quote:
No indexing will help in that particular case, hopefully that makes sense.
Perhaps it might:

SELECT * FROM Table1 WHERE photoid=(SELECT TOP 1 photoid FROM Table1 ORDER BY NewID())

In that case, the subquery can use the index on photoid and scan it instead of hitting the table. It will return a single ID value which the outer query can use to seek the table. If photoid is a clustered index then it's less likely to help. This also doesn't take into consideration the WHERE clause you need, but the same concept applies: if the WHERE clause can use indexes and you SELECT only the key column in the subquery, you can minimize the amount of data that needs to be read in order to return the results you want.
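As a sketch of how that might look with the filters from the original post (table name, column names, and the index are assumptions based on earlier in the thread, not tested code):

```sql
-- Subquery touches only index keys; the outer query does a single seek on photoid.
-- Assumes an index whose keys cover (Gender, Category, photoid).
SELECT *
FROM Photos
WHERE photoid = (SELECT TOP 1 photoid
                 FROM Photos
                 WHERE Gender = 'M' AND Category = 'Sports'
                 ORDER BY NEWID())
```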


jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2003-07-10 : 22:20:24
Only one way to find out!

Run these in the Northwind database and look at the execution plans:

-- first, ordering by an indexed field:

select top 1 OrderID from orders order by orderID Desc
select top 1 * from orders order by OrderID Desc

-- next, ordering by newid()
select top 1 OrderID from orders order by NewID()
select top 1 * from orders order by NewID()

-- next, ordering by a non-indexed field:
select top 1 OrderID from orders order by Freight Desc
select top 1 * from orders order by Freight Desc

-- finally, using the technique Rob suggests:
select * from orders where orderID = (select top 1 orderID from orders order by NewID())


It does seem faster than SELECT TOP 1 * ... ORDER BY NEWID(). Not bad! I didn't think it would make a difference. Still not an ideal solution, but it does improve performance a little.

Edited by - jsmith8858 on 07/10/2003 22:31:45

robvolk
Most Valuable Yak

15732 Posts

Posted - 2003-07-11 : 00:43:07
quote:
Still not an ideal solution, but it does improve performance a little.
Maybe not ideal, but I think it gets as close as possible given the table design. I dug into something that may reveal a key difference: the row counts and row sizes for the two query versions:

select * from orders where orderid=(select top 1 orderid from orders order by newid())

select top 1 * from orders order by newid()


The first version has to process a total of 1663 rows, the second 1661. BUT... the rows processed by the first are MUCH smaller than the rows processed by the second, and bigger rows mean more bytes and pages read and processed. By my calculations, the first query only has to read 50,342 bytes, while the second reads 410,267 bytes: an eight-fold increase. Considering that this comes from a source table of only 830 rows, you can see how much greater the difference would be on a table of 1 million+ rows.

Of course, as Jeff said, the only way to find out is to test it. Definitely test as many index combinations as you can, measure the results repeatedly, and THEN compare them. I think you can find one that will give you much better performance than the method you're using now. A covering index on category and gender would probably be best, and perhaps another index on gender only.
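On SQL Server 2000 there are no INCLUDE columns, so "covering" means putting everything in the key. The candidates might be sketched like this; the index names, table name, and column order are assumptions to test, not recommendations:

```sql
-- Candidate covering indexes for the random-pick subquery (sketch only)
CREATE INDEX IX_Photos_Cat_Gender_PhotoID ON Photos (Category, Gender, photoid)
CREATE INDEX IX_Photos_Gender_PhotoID ON Photos (Gender, photoid)
```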

   
