| Author |
Topic |
|
bjoerns
Posting Yak Master
154 Posts |
Posted - 2008-08-06 : 10:30:35
|
Hi there,I want to update table B with values randomly chosen from table A, for examplecreate table A( id int primary key, value int)create table B( id int primary key, value int)insert into Aselect 1, 4 union allselect 2, 4 union allselect 3, 4 union allselect 4, 4 union allselect 5, 2 union allselect 6, 2insert into Bselect 1, NULL union allselect 2, NULL union allselect 3, NULL union allselect 4, NULL union allselect 5, NULL union allselect 6, NULLupdate Bset value = ( select top 1 value from A where B.id > 0 order by newid())select * from B The where clause is necessary because an uncorrelated subquery is evaluated only once. My solution is not only ugly, it also performs badly (my real tables have like 500000 rows). Two questions:(1) The execution plan shows a sort. Does SQL Server really sort the whole table just to find the minimum (top 1)?(2) Is there a better or more elegant way to update table B the way I want to do it?Thanks,Bjoern(54224 row(s) affected)The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION. |
|
|
LoztInSpace
Aged Yak Warrior
940 Posts |
Posted - 2008-08-06 : 10:42:49
|
| There is no correlation between A and B so it is not doing whatever you think it is (i.e. choosing from A). It's not clear what you want. |
 |
|
|
bjoerns
Posting Yak Master
154 Posts |
Posted - 2008-08-06 : 10:59:51
|
| Try it with and without where-clause, that's why I posted the code.With where-clause, it does exactly what I want, namely populate B with random values, but with a similar distribution as in table A.The question is why SQL Server does a sort to find top 1 value, i.e. the minimum, and how can I avoid that to improve performance. |
 |
|
|
visakh16
Very Important crosS Applying yaK Herder
52326 Posts |
Posted - 2008-08-06 : 13:03:47
|
quote: Originally posted by bjoerns Try it with and without where-clause, that's why I posted the code.With where-clause, it does exactly what I want, namely populate B with random values, but with a similar distribution as in table A.The question is why SQL Server does a sort to find top 1 value, i.e. the minimum, and how can I avoid that to improve performance.
you've an order by specified. so sql server will order the results out for you in the inner query. i didnt understand that purpose of where clause though. what difference will it make? |
 |
|
|
Lamprey
Master Smack Fu Yak Hacker
4614 Posts |
Posted - 2008-08-06 : 13:16:00
|
quote: Originally posted by bjoerns Try it with and without where-clause, that's why I posted the code.With where-clause, it does exactly what I want, namely populate B with random values, but with a similar distribution as in table A.The question is why SQL Server does a sort to find top 1 value, i.e. the minimum, and how can I avoid that to improve performance.
Because it is assigning a NEWID to every row then ordering the whole thing then taking the TOP rowid.One thing you might try is SELECTing the DISTINCT values into a table variable with a CLUSTERED INDEX on the value. Then you could use that sub set to order by NEWID instead of the intire source table. |
 |
|
|
bjoerns
Posting Yak Master
154 Posts |
Posted - 2008-08-06 : 13:21:46
|
| Hi Visakh,it's TOP 1 not TOP 1000 so I expected the analyzer to be smart enough to scan the table for the minimum value instead of sorting the whole thing.The difference is: without the where clause the query is evaluated once and the whole table B is assigned the same value. With the where clause, it's evaluated for each row of table B, leading to different values. Bjoern |
 |
|
|
visakh16
Very Important crosS Applying yaK Herder
52326 Posts |
Posted - 2008-08-06 : 13:25:19
|
quote: Originally posted by bjoerns Hi Visakh,it's TOP 1 not TOP 1000 so I expected the analyzer to be smart enough to scan the table for the minimum value instead of sorting the whole thing.The difference is: without the where clause the query is evaluated once and the whole table B is assigned the same value. With the where clause, it's evaluated for each row of table B, leading to different values. Bjoern
be it 1 or 1000 it has to first sort based on id generated by newid() to determine the top record(s).Ok. understood the purpose. thanks. |
 |
|
|
bjoerns
Posting Yak Master
154 Posts |
Posted - 2008-08-06 : 13:30:58
|
quote: Because it is assigning a NEWID to every row then ordering the whole thing then taking the TOP rowid.
I know, I just wonder why. Since it's TOP 1 it could simply scan for the minimum value. quote: One thing you might try is SELECTing the DISTINCT values into a table
The distinct would lead to evenly distributed values in table B. I want to keep the distribution from table A, in the example 2/3 value 4 and 1/3 value 2. |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-08-06 : 13:52:18
|
With "random" do you mean that you can get 6 4's?Or should all numbers "shuffle" and be used only once?Do the A and B tables hold exactly the same primary key numbers? E 12°55'05.25"N 56°04'39.16" |
 |
|
|
bjoerns
Posting Yak Master
154 Posts |
Posted - 2008-08-07 : 04:30:34
|
| No shuffle and 6 4's is okay, yes. The real tables are large.And no, primary key numbers can be disjoint even.Thanks,Bjoern |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-08-07 : 05:14:09
|
Here is a "shuffle" method. Feel free to modifiy it to deal with your needs.DECLARE @a TABLE (ID INT PRIMARY KEY, Value INT)INSERT @aSELECT 1, 4 UNION ALLSELECT 2, 4 UNION ALLSELECT 3, 4 UNION ALLSELECT 4, 4 UNION ALLSELECT 5, 2 UNION ALLSELECT 6, 2DECLARE @b TABLE (ID INT PRIMARY KEY, Value INT)INSERT @bSELECT 1, NULL UNION ALLSELECT 2, NULL UNION ALLSELECT 3, NULL UNION ALLSELECT 4, NULL UNION ALLSELECT 5, NULL UNION ALLSELECT 6, NULLUPDATE tSET t.Value = a.ValueFROM @b AS tINNER JOIN ( SELECT ID, ROW_NUMBER() OVER (ORDER BY NEWID()) AS JoinID FROM @b ) AS x ON x.ID = t.IDINNER JOIN ( SELECT ID, ROW_NUMBER() OVER (ORDER BY NEWID()) AS JoinID, Value FROM @a ) AS a ON a.JoinID = x.JoinIDSELECT *FROM @b E 12°55'05.25"N 56°04'39.16" |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-08-07 : 06:08:05
|
[code]-- Prepare sample dataDECLARE @a TABLE (ID INT PRIMARY KEY, Value INT)INSERT @aSELECT 1, 4 UNION ALLSELECT 2, 4 UNION ALLSELECT 3, 4 UNION ALLSELECT 4, 4 UNION ALLSELECT 5, 2 UNION ALLSELECT 6, 2DECLARE @b TABLE (ID INT PRIMARY KEY, Value INT)INSERT @bSELECT 1, NULL UNION ALLSELECT 2, NULL UNION ALLSELECT 3, NULL UNION ALLSELECT 4, NULL UNION ALLSELECT 5, NULL UNION ALLSELECT 6, NULL-- Prepare stagingDECLARE @Stage TABLE (Value INT, Items INT, Running INT)INSERT @StageSELECT Value, COUNT(*), 0FROM @aGROUP BY Value-- Do a running totalDECLARE @Running INTUPDATE @StageSET @Running = Running = Items + COALESCE(@Running, -1)-- Do the actual UPDATEUPDATE bSET b.Value = s.ValueFROM @b AS bINNER JOIN @Stage AS s ON s.Running >= ABS(CHECKSUM(NEWID())) % (@Running + 1)-- Show the resultSELECT *FROM @b[/code] E 12°55'05.25"N 56°04'39.16" |
 |
|
|
bjoerns
Posting Yak Master
154 Posts |
Posted - 2008-08-07 : 07:45:55
|
Thanks a lot, Peso! The second solution does exactly what I want, and I learned a lot.As for the shuffle, it is sufficient to randomise once:UPDATE tSET t.Value = a.ValueFROM @b AS t/*INNER JOIN ( SELECT ID, ROW_NUMBER() OVER (ORDER BY NEWID()) AS JoinID FROM @b ) AS x ON x.joinID = t.ID*/INNER JOIN ( SELECT ID, ROW_NUMBER() OVER (ORDER BY NEWID()) AS JoinID, Value FROM @a ) AS a ON a.JoinID = t.ID But if I run your code, I don't get a shuffle, which boggles me. Aren't two shuffles combined supposed to result in another shuffle?And the still remaining question is... why doesn't the optimiser simplify a TOP 1 to a scan for the minimum?Bjoern |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-08-07 : 09:03:42
|
quote: Originally posted by bjoerns And the still remaining question is... why doesn't the optimiser simplify a TOP 1 to a scan for the minimum?
Beacuse of the ORDER BY clause. E 12°55'05.25"N 56°04'39.16" |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-08-07 : 09:05:18
|
quote: Originally posted by bjoerns As for the shuffle, it is sufficient to randomise once
I don't think so, because neither @a table nor @b table need to have their ID's in sequential order.What I do is create a sequential order for each set of data, regardless of sequence in @a or @b.DECLARE @a TABLE (ID INT PRIMARY KEY, Value INT)INSERT @aSELECT 101, 4 UNION ALLSELECT 202, 4 UNION ALLSELECT 303, 4 UNION ALLSELECT 404, 4 UNION ALLSELECT 505, 2 UNION ALLSELECT 606, 2DECLARE @b TABLE (ID INT PRIMARY KEY, Value INT)INSERT @bSELECT 1001, NULL UNION ALLSELECT 2002, NULL UNION ALLSELECT 3003, NULL UNION ALLSELECT 4004, NULL UNION ALLSELECT 5005, NULL UNION ALLSELECT 6006, NULLUPDATE tSET t.Value = a.ValueFROM @b AS tINNER JOIN ( SELECT ID, ROW_NUMBER() OVER (ORDER BY NEWID()) AS JoinID FROM @b ) AS x ON x.ID = t.IDINNER JOIN ( SELECT ID, ROW_NUMBER() OVER (ORDER BY NEWID()) AS JoinID, Value FROM @a ) AS a ON a.JoinID = x.JoinIDSELECT *FROM @b E 12°55'05.25"N 56°04'39.16" |
 |
|
|
bjoerns
Posting Yak Master
154 Posts |
Posted - 2008-08-07 : 10:09:22
|
| But when I run your code I still get results with five or six 4s... that scares me since it's not what I expect :-) I expect an one-to-one matching.Bjoern(54224 row(s) affected)The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION. |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-08-07 : 10:21:12
|
It's because there is a distribution built-in.Since there are only 2 2's the likelyness to get a 2 is 1/3 (33.33%) because there are also 4 4's and the likelyness to get a 4 is 66.67%. That is what the running total is for.It's very easy to rewrite so you get a 50/50 chance instead. E 12°55'05.25"N 56°04'39.16" |
 |
|
|
bjoerns
Posting Yak Master
154 Posts |
Posted - 2008-08-07 : 10:28:52
|
| Sorry if I don't get it but you assign both tables a unique joinID between 1 and 6, join them 1:1 so I expect the result to have always four 4s and two 2s?! |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-08-07 : 10:31:25
|
Oh, yo meant the SHUFFLE routine? E 12°55'05.25"N 56°04'39.16" |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-08-07 : 10:37:32
|
Here is a work-around with a staging table for shuffle routine.DECLARE @a TABLE (ID INT PRIMARY KEY, Value INT)INSERT @aSELECT 101, 4 UNION ALLSELECT 202, 4 UNION ALLSELECT 303, 4 UNION ALLSELECT 404, 4 UNION ALLSELECT 505, 2 UNION ALLSELECT 606, 2DECLARE @b TABLE (ID INT PRIMARY KEY, Value INT)INSERT @bSELECT 1001, NULL UNION ALLSELECT 2002, NULL UNION ALLSELECT 3003, NULL UNION ALLSELECT 4004, NULL UNION ALLSELECT 5005, NULL UNION ALLSELECT 6006, NULLDECLARE @Stage TABLE (ID INT, Value INT)INSERT @StageSELECT t.ID, a.ValueFROM @b AS tINNER JOIN ( SELECT ID, ROW_NUMBER() OVER (ORDER BY NEWID()) AS JoinID FROM @b ) AS x ON x.ID = t.IDINNER JOIN ( SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS JoinID, Value FROM @a ) AS a ON a.JoinID = x.JoinIDUPDATE bSET b.Value = s.ValueFROM @b AS bINNER JOIN @Stage AS s ON s.ID = b.IDSELECT *FROM @b E 12°55'05.25"N 56°04'39.16" |
 |
|
|
bjoerns
Posting Yak Master
154 Posts |
Posted - 2008-08-07 : 11:05:08
|
Yes, I'm referring to your post from 09:05:18. b.ID x.joinID a.joinID a.ID 6006 1 1 303 2002 2 2 101 1001 3 3 505 4004 4 4 606 3003 5 5 202 5005 6 6 404 This implies a 1:1 relation between A and B so there should be always four 4s in table B. But there aren't?! |
 |
|
|
Next Page
|