Update with random values - SQL Server Forums

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2005 Forums

Transact-SQL (2005)

Update with random values

Author

Topic

bjoerns
Posting Yak Master

154 Posts

Posted - 2008-08-06 : 10:30:35

Hi there,

I want to update table B with values randomly chosen from table A, for example


create table A(
	id int primary key,
	value int
)

create table B(
	id int primary key,
	value int
)

insert into A
select 1, 4 union all
select 2, 4 union all
select 3, 4 union all
select 4, 4 union all
select 5, 2 union all
select 6, 2

insert into B
select 1, NULL union all
select 2, NULL union all
select 3, NULL union all
select 4, NULL union all
select 5, NULL union all
select 6, NULL

update B
set value = (
	select top 1 value
	from A
	where B.id > 0
	order by newid()
)
select * from B

The where clause is necessary because an uncorrelated subquery is evaluated only once. My solution is not only ugly, it also performs badly (my real tables have like 500000 rows). Two questions:

(1) The execution plan shows a sort. Does SQL Server really sort the whole table just to find the minimum (top 1)?

(2) Is there a better or more elegant way to update table B the way I want to do it?

Thanks,
Bjoern

(54224 row(s) affected)

The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION.

LoztInSpace
Aged Yak Warrior

940 Posts

Posted - 2008-08-06 : 10:42:49

There is no correlation between A and B so it is not doing whatever you think it is (i.e. choosing from A). It's not clear what you want.

bjoerns
Posting Yak Master

154 Posts

Posted - 2008-08-06 : 10:59:51

Try it with and without where-clause, that's why I posted the code.

With where-clause, it does exactly what I want, namely populate B with random values, but with a similar distribution as in table A.

The question is why SQL Server does a sort to find top 1 value, i.e. the minimum, and how can I avoid that to improve performance.

visakh16
Very Important crosS Applying yaK Herder

52326 Posts

Posted - 2008-08-06 : 13:03:47

quote:
Originally posted by bjoerns

Try it with and without where-clause, that's why I posted the code.

With where-clause, it does exactly what I want, namely populate B with random values, but with a similar distribution as in table A.

The question is why SQL Server does a sort to find top 1 value, i.e. the minimum, and how can I avoid that to improve performance.

you've an order by specified. so sql server will order the results out for you in the inner query. i didnt understand that purpose of where clause though. what difference will it make?

Lamprey
Master Smack Fu Yak Hacker

4614 Posts

Posted - 2008-08-06 : 13:16:00

quote:
Originally posted by bjoerns

Try it with and without where-clause, that's why I posted the code.

With where-clause, it does exactly what I want, namely populate B with random values, but with a similar distribution as in table A.

The question is why SQL Server does a sort to find top 1 value, i.e. the minimum, and how can I avoid that to improve performance.

Because it is assigning a NEWID to every row then ordering the whole thing then taking the TOP rowid.

One thing you might try is SELECTing the DISTINCT values into a table variable with a CLUSTERED INDEX on the value. Then you could use that sub set to order by NEWID instead of the intire source table.

bjoerns
Posting Yak Master

154 Posts

Posted - 2008-08-06 : 13:21:46

Hi Visakh,

it's TOP 1 not TOP 1000 so I expected the analyzer to be smart enough to scan the table for the minimum value instead of sorting the whole thing.

The difference is: without the where clause the query is evaluated once and the whole table B is assigned the same value. With the where clause, it's evaluated for each row of table B, leading to different values.

Bjoern

visakh16
Very Important crosS Applying yaK Herder

52326 Posts

Posted - 2008-08-06 : 13:25:19

quote:
Originally posted by bjoerns

Hi Visakh,

it's TOP 1 not TOP 1000 so I expected the analyzer to be smart enough to scan the table for the minimum value instead of sorting the whole thing.

The difference is: without the where clause the query is evaluated once and the whole table B is assigned the same value. With the where clause, it's evaluated for each row of table B, leading to different values.

Bjoern

be it 1 or 1000 it has to first sort based on id generated by newid() to determine the top record(s).
Ok. understood the purpose. thanks.

bjoerns
Posting Yak Master

154 Posts

Posted - 2008-08-06 : 13:30:58

quote:
Because it is assigning a NEWID to every row then ordering the whole thing then taking the TOP rowid.

I know, I just wonder why. Since it's TOP 1 it could simply scan for the minimum value.

quote:
One thing you might try is SELECTing the DISTINCT values into a table

The distinct would lead to evenly distributed values in table B. I want to keep the distribution from table A, in the example 2/3 value 4 and 1/3 value 2.

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-08-06 : 13:52:18

With "random" do you mean that you can get 6 4's?
Or should all numbers "shuffle" and be used only once?
Do the A and B tables hold exactly the same primary key numbers?

E 12°55'05.25"
N 56°04'39.16"

bjoerns
Posting Yak Master

154 Posts

Posted - 2008-08-07 : 04:30:34

No shuffle and 6 4's is okay, yes. The real tables are large.
And no, primary key numbers can be disjoint even.

Thanks,
Bjoern

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-08-07 : 05:14:09

Here is a "shuffle" method. Feel free to modifiy it to deal with your needs.

DECLARE @a TABLE (ID INT PRIMARY KEY, Value INT)

INSERT	@a
SELECT	1, 4 UNION ALL
SELECT	2, 4 UNION ALL
SELECT	3, 4 UNION ALL
SELECT	4, 4 UNION ALL
SELECT	5, 2 UNION ALL
SELECT	6, 2

DECLARE	@b TABLE (ID INT PRIMARY KEY, Value INT)

INSERT	@b
SELECT	1, NULL UNION ALL
SELECT	2, NULL UNION ALL
SELECT	3, NULL UNION ALL
SELECT	4, NULL UNION ALL
SELECT	5, NULL UNION ALL
SELECT	6, NULL

UPDATE		t
SET		t.Value = a.Value
FROM		@b AS t
INNER JOIN	(
			SELECT	ID,
				ROW_NUMBER() OVER (ORDER BY NEWID()) AS JoinID
			FROM	@b
		) AS x ON x.ID = t.ID
INNER JOIN	(
			SELECT	ID,
				ROW_NUMBER() OVER (ORDER BY NEWID()) AS JoinID,
				Value
			FROM	@a
		) AS a ON a.JoinID = x.JoinID

SELECT	*
FROM	@b

E 12°55'05.25"
N 56°04'39.16"

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-08-07 : 06:08:05

[code]-- Prepare sample data
DECLARE @a TABLE (ID INT PRIMARY KEY, Value INT)

INSERT @a
SELECT 1, 4 UNION ALL
SELECT 2, 4 UNION ALL
SELECT 3, 4 UNION ALL
SELECT 4, 4 UNION ALL
SELECT 5, 2 UNION ALL
SELECT 6, 2

DECLARE @b TABLE (ID INT PRIMARY KEY, Value INT)

INSERT @b
SELECT 1, NULL UNION ALL
SELECT 2, NULL UNION ALL
SELECT 3, NULL UNION ALL
SELECT 4, NULL UNION ALL
SELECT 5, NULL UNION ALL
SELECT 6, NULL

-- Prepare staging
DECLARE @Stage TABLE (Value INT, Items INT, Running INT)

INSERT @Stage
SELECT Value,
COUNT(*),
0
FROM @a
GROUP BY Value

-- Do a running total
DECLARE @Running INT

UPDATE @Stage
SET @Running = Running = Items + COALESCE(@Running, -1)

-- Do the actual UPDATE
UPDATE b
SET b.Value = s.Value
FROM @b AS b
INNER JOIN @Stage AS s ON s.Running >= ABS(CHECKSUM(NEWID())) % (@Running + 1)

-- Show the result
SELECT *
FROM @b[/code]

E 12°55'05.25"
N 56°04'39.16"

bjoerns
Posting Yak Master

154 Posts

Posted - 2008-08-07 : 07:45:55

Thanks a lot, Peso! The second solution does exactly what I want, and I learned a lot.

As for the shuffle, it is sufficient to randomise once:

UPDATE		t
SET		t.Value = a.Value
FROM		@b AS t
/*
INNER JOIN	(
			SELECT	ID,
				ROW_NUMBER() OVER (ORDER BY NEWID()) AS JoinID
			FROM	@b
		) AS x ON x.joinID = t.ID
*/
INNER JOIN	(
			SELECT	ID,
				ROW_NUMBER() OVER (ORDER BY NEWID()) AS JoinID,
				Value
			FROM	@a
		) AS a ON a.JoinID = t.ID

But if I run your code, I don't get a shuffle, which boggles me. Aren't two shuffles combined supposed to result in another shuffle?

And the still remaining question is... why doesn't the optimiser simplify a TOP 1 to a scan for the minimum?

Bjoern

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-08-07 : 09:03:42

quote:
Originally posted by bjoerns

And the still remaining question is... why doesn't the optimiser simplify a TOP 1 to a scan for the minimum?

Beacuse of the ORDER BY clause.

E 12°55'05.25"
N 56°04'39.16"

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-08-07 : 09:05:18

quote:
Originally posted by bjoerns

As for the shuffle, it is sufficient to randomise once

I don't think so, because neither @a table nor @b table need to have their ID's in sequential order.
What I do is create a sequential order for each set of data, regardless of sequence in @a or @b.

DECLARE @a TABLE (ID INT PRIMARY KEY, Value INT)

INSERT	@a
SELECT	101, 4 UNION ALL
SELECT	202, 4 UNION ALL
SELECT	303, 4 UNION ALL
SELECT	404, 4 UNION ALL
SELECT	505, 2 UNION ALL
SELECT	606, 2

DECLARE	@b TABLE (ID INT PRIMARY KEY, Value INT)

INSERT	@b
SELECT	1001, NULL UNION ALL
SELECT	2002, NULL UNION ALL
SELECT	3003, NULL UNION ALL
SELECT	4004, NULL UNION ALL
SELECT	5005, NULL UNION ALL
SELECT	6006, NULL

UPDATE		t
SET		t.Value = a.Value
FROM		@b AS t
INNER JOIN	(
			SELECT	ID,
				ROW_NUMBER() OVER (ORDER BY NEWID()) AS JoinID
			FROM	@b
		) AS x ON x.ID = t.ID
INNER JOIN	(
			SELECT	ID,
				ROW_NUMBER() OVER (ORDER BY NEWID()) AS JoinID,
				Value
			FROM	@a
		) AS a ON a.JoinID = x.JoinID

SELECT	*
FROM	@b

E 12°55'05.25"
N 56°04'39.16"

bjoerns
Posting Yak Master

154 Posts

Posted - 2008-08-07 : 10:09:22

But when I run your code I still get results with five or six 4s... that scares me since it's not what I expect :-) I expect an one-to-one matching.

Bjoern

(54224 row(s) affected)

The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION.

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-08-07 : 10:21:12

It's because there is a distribution built-in.
Since there are only 2 2's the likelyness to get a 2 is 1/3 (33.33%) because there are also 4 4's and the likelyness to get a 4 is 66.67%. That is what the running total is for.

It's very easy to rewrite so you get a 50/50 chance instead.

E 12°55'05.25"
N 56°04'39.16"

bjoerns
Posting Yak Master

154 Posts

Posted - 2008-08-07 : 10:28:52

Sorry if I don't get it but you assign both tables a unique joinID between 1 and 6, join them 1:1 so I expect the result to have always four 4s and two 2s?!

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-08-07 : 10:31:25

Oh, yo meant the SHUFFLE routine?

E 12°55'05.25"
N 56°04'39.16"

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-08-07 : 10:37:32

Here is a work-around with a staging table for shuffle routine.

DECLARE @a TABLE (ID INT PRIMARY KEY, Value INT)

INSERT	@a
SELECT	101, 4 UNION ALL
SELECT	202, 4 UNION ALL
SELECT	303, 4 UNION ALL
SELECT	404, 4 UNION ALL
SELECT	505, 2 UNION ALL
SELECT	606, 2

DECLARE	@b TABLE (ID INT PRIMARY KEY, Value INT)

INSERT	@b
SELECT	1001, NULL UNION ALL
SELECT	2002, NULL UNION ALL
SELECT	3003, NULL UNION ALL
SELECT	4004, NULL UNION ALL
SELECT	5005, NULL UNION ALL
SELECT	6006, NULL

DECLARE	@Stage TABLE (ID INT, Value INT)

INSERT		@Stage
SELECT		t.ID,
		a.Value
FROM		@b AS t
INNER JOIN	(
			SELECT	ID,
				ROW_NUMBER() OVER (ORDER BY NEWID()) AS JoinID
			FROM	@b
		) AS x ON x.ID = t.ID
INNER JOIN	(
			SELECT	ROW_NUMBER() OVER (ORDER BY NEWID()) AS JoinID,
				Value
			FROM	@a
		) AS a ON a.JoinID = x.JoinID

UPDATE		b
SET		b.Value = s.Value
FROM		@b AS b
INNER JOIN	@Stage AS s ON s.ID = b.ID

SELECT	*
FROM	@b

E 12°55'05.25"
N 56°04'39.16"

bjoerns
Posting Yak Master

154 Posts

Posted - 2008-08-07 : 11:05:08

Yes, I'm referring to your post from 09:05:18.


  b.ID   x.joinID   a.joinID    a.ID
  6006          1          1     303
  2002          2          2     101
  1001          3          3     505
  4004          4          4     606
  3003          5          5     202
  5005          6          6     404

This implies a 1:1 relation between A and B so there should be always four 4s in table B. But there aren't?!

Next Page

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources