working with Shuffling of records in sqlserver

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

General SQL Server Forums

New to SQL Server Programming

working with Shuffling of records in sqlserver

Author

Topic

sreeni12345
Starting Member

14 Posts

Posted - 2014-05-02 : 08:45:48

Hi all

I have a table by name table1 it has following columns
FirstName LastName Amount
Atmuri Neelima 1000
Munagala Sreenivas 2000
Koppuravuri Satya 3000
Maradugu Jahnavi 4000
Kadem Rajesh 5000
Jonnala Omkar 6000
Mallela Subramanyam 7000

My Requirement is I wanto shuffle the records which are in FirstName(column)
Following is the scenario which I have followed
1.I have taken a temp table and inserted the shuffled records into it
SELECT t2.FirstName into #temptable1 FROM
(SELECT FirstName,ROW_NUMBER() OVER (ORDER BY NEWID()) AS recid FROM table1) AS t1 INNER JOIN (SELECT FirstName,ROW_NUMBER() OVER (ORDER BY NEWID()) AS recid FROM table1) AS t2 ON t2.recID = t1.recID
The above query generates the FirstNames in a shuffled manner.

2.Now I wanto update the records into Table1 exactly as it got shuffled in the above query. To achieve the same I have written the below query

UPDATE Table1 SET FirstName = t2.FirstName FROM Table1 t1 INNER JOIN # temptable1 t2 ON t1. FirstName = t2. FirstName

But this is not getting worked out.
Finally my goal is to shuffle the records which are in FirstName(Column).

Note : The above table consists of 75 lakhs of records in it. Please suggest me a best approach
Kindly Help

gbritton
Master Smack Fu Yak Hacker

2780 Posts

Posted - 2014-05-02 : 09:04:08

If you mean you want to control the physical order of the records, you have to build a clustered index. That index might be the shuffling order you're talking about. Note that, without a clustered index, the table is stored as a hash table on disk; the physical order is then determined by SQL according to its best judgement. You cannot influence or alter the physical order of a hash table in any way.

Abstractly, SQL works with sets. Mathematically, the elements of a set are not ordered in any way.

sreeni12345
Starting Member

14 Posts

Posted - 2014-05-02 : 09:36:15

i dont wanto keep the actual data . i wanto modify the FirstName(column) where the actual data is modified as shuffled. I wanto modify only the FirstName(column) but not the entire record. Please suggest me the answer in such a way where i wanto shuffle only the FirstName(Column) randomly.

gbritton
Master Smack Fu Yak Hacker

2780 Posts

Posted - 2014-05-02 : 10:25:09

This seemed to do it for me:


with t1 as (
	select FirstName, LastName, Amount,
		rn = row_number() over(order by newid())
	from #t
	),

	t2 as (
	select FirstName, LastName, Amount,
		rn = row_number() over(order by newid())
	from #t
	)
    

select t2.FirstName,  t1.LastName, t1.Amount from t2, t1
where t1.rn = t2.rn
order by t1.lastname asc

the trick is the ORDER BY NEWID() clause, which causes a random ordering for the computed column "rn"

TG
Master Smack Fu Yak Hacker

6065 Posts

Posted - 2014-05-02 : 11:18:18

I think the OP knows the random trick as they used it in the original post. Also, I think the question is how to update the real table with these randomized results.
this should work:


update t1 set
       firstName = (select top 1 firstname 
                     from table1 
                     where firstname != t1.firstname 
                     order by newid())
from   table1 t1

Be One with the Optimizer
TG

gbritton
Master Smack Fu Yak Hacker

2780 Posts

Posted - 2014-05-02 : 11:35:15

quote:
Originally posted by TG

I think the OP knows the random trick as they used it in the original post. Also, I think the question is how to update the real table with these randomized results.
this should work:

update t1 set
firstName = (select top 1 firstname
from table1
where firstname != t1.firstname
order by newid())
from table1 t1

Be One with the Optimizer
TG

Well...not quite. You can get duplicate firstnames in the result with that. e.g. when I first tried it:


select firstName = (select top 1 firstname 
                     from #t  
                     where firstname != t1.firstname 
                     order by newid())
from   #t t1

I got:

quote:

firstName
Jonnala
Mallela
Munagala
Munagala
Atmuri
Mallela
Jonnala

with repeated names.

TG
Master Smack Fu Yak Hacker

6065 Posts

Posted - 2014-05-02 : 12:05:59

that will happen if you want to randomize it - plus with "75 lakhs" rows i'm sure there are folks with the same first name anyway

. How much is a lakh anyway?

Be One with the Optimizer
TG

gbritton
Master Smack Fu Yak Hacker

2780 Posts

Posted - 2014-05-02 : 12:17:03

Remember, the OP wants to shuffle, not randomize!

Lamprey
Master Smack Fu Yak Hacker

4614 Posts

Posted - 2014-05-02 : 12:24:18

quote:
Originally posted by gbritton

If you mean you want to control the physical order of the records, you have to build a clustered index.

Sorry to get pedantic on you, but a clustered index doesn't control the physical order, rather, it controls the logical order.

That may be a distinction that isn't irrelevant in the case, but there is a difference. :)

gbritton
Master Smack Fu Yak Hacker

2780 Posts

Posted - 2014-05-02 : 12:34:51

well...as I understand it the rows in each cluster should be physically ordered. the clusters, not so much. so you can "control" the physical order (as opposed to hash table), but not completely determine it.

Microsoft emphasizes the stored order:

The only time the data rows in a table are stored in sorted order is when the table contains a clustered index

http://technet.microsoft.com/en-us/library/ms190457.aspx

but that may not result in a total physical ordering from row 1 to row n because of a variety of factors, including filegroups, RAID levels, fragmentation etc.

Lamprey
Master Smack Fu Yak Hacker

4614 Posts

Posted - 2014-05-02 : 12:40:22

Rows may not even be physically ordered on a page. yes the pages are linked in "physical" order. But you are right that given sql just hands off the disk request to the OS, there really is no telling where, physically, the data lives.

Here is a quote I ripped from Gail Shaw. I don't have a link to the original post (as it seems to have died), but maybe this will help explain:

quote:
Clustered index determines the logical order of the data, not the physical. Rows can be stored physically on a page in an order differing from the clustered index order and, if there is any fragmentation, the physical order and logical order of the leaf pages will not match (the definition of fragmentation).

Let's take an example of a clustered index on an int column (not an identity) and let's say there are 100 rows in the index at 10 rows per page and those int values are contiguous

The index will have rows 1..10 on page 1, 11..20 on page 2 and so on with 91..100 on page 10. That the clustered index guarantees. Within the physical page however, there's no guarantee that the rows will physically be written on the page in that order. Let's say, for some odd reason, first all the odd numbered rows were inserted and then all the even. If you went and read the page physically, you'd see this:

Page Header
Row 1
Row 3
Row 5
Row 7
Row 9
Row 2
Row 4
Row 6
Row 8
Slot index
Page footer

Because, when SQL goes to insert row 2, it doesn't want the cost of shifting all the rows down to make space, same as with page splits it doesn't want have to shift pages around to maintain order.
The slot index, at the bottom of the page, maintains pointers into the page that are in the order of the clustering key, so that the correct rows can be retrieved. So logically, the slot index maintains the 'order' of rows but physically the rows may not be stored in order.

You could argue that I'm been pedantic about it, but the prevalent myth that clustered index order = physical order on disk gives rise to lots of misunderstandings about indexes and ordering.

gbritton
Master Smack Fu Yak Hacker

2780 Posts

Posted - 2014-05-02 : 12:47:08

I believe Gail may be wrong about this. Other commentators agree with my position. Of course the only way to be sure would be to provide a sector dump as a counterexample. I wish I had the tools to do that at my fingertips!

gbritton
Master Smack Fu Yak Hacker

2780 Posts

Posted - 2014-05-02 : 13:08:08

quote:
Originally posted by gbritton

I believe Gail may be wrong about this. Other commentators agree with my position. Of course the only way to be sure would be to provide a sector dump as a counterexample. I wish I had the tools to do that at my fingertips!

Here's a great detailed description from the source:

http://technet.microsoft.com/en-us/library/ms177443(v=sql.105).aspx

Check out this section:

"The pages in the data chain and the rows in them are ordered on the value of the clustered index key. "

Though the sentence omits the word "physically", in context that is the only possible meaning.

Lamprey
Master Smack Fu Yak Hacker

4614 Posts

Posted - 2014-05-02 : 13:53:12

You are certainly free to believe what you like. :)

sreeni12345
Starting Member

14 Posts

Posted - 2014-05-03 : 06:26:16

I have used select firstName = (select top 1 firstname
from #t
where firstname != t1.firstname
order by newid())
from #t t1

but its taking lot of time . Currently it has been 3 hours and still the update is running. Kindly suggest me with an alternative. However can i use any kind of batch wise updates ie save points are some thing to go ahead

sreeni12345
Starting Member

14 Posts

Posted - 2014-05-05 : 03:19:48

can any body help me in improving the performance of the udpate

sreeni12345
Starting Member

14 Posts

Posted - 2014-05-05 : 03:30:01

I have used the below query as well, Its still running

update t1 set
firstName = (select top 1 firstname
from table1
where firstname != t1.firstname
order by newid())
from table1 t1

gbritton
Master Smack Fu Yak Hacker

2780 Posts

Posted - 2014-05-05 : 08:26:30

quote:
Originally posted by sreeni12345

I have used the below query as well, Its still running

update t1 set
firstName = (select top 1 firstname
from table1
where firstname != t1.firstname
order by newid())
from table1 t1

This is basically an O(n**2) algorithm. So if your table is large enough, yes it will run a while!

Did you try my earlier suggestion? I framed it as a query, but wrapping it in an update should be pretty simple.


with t1 as (
	select FirstName, LastName, Amount,
		rn = row_number() over(order by newid())
	from #t
	),

	t2 as (
	select FirstName, LastName, Amount,
		rn = row_number() over(order by newid())
	from #t
	)
    

select t2.FirstName,  t1.LastName, t1.Amount from t2, t1
where t1.rn = t2.rn
order by t1.lastname asc

sreeni12345
Starting Member

14 Posts

Posted - 2014-05-05 : 08:33:49

yaa i have tried it but can u let me know how to wrap it in an update , so sorry i am very much new to sql

gbritton
Master Smack Fu Yak Hacker

2780 Posts

Posted - 2014-05-05 : 09:20:14

This worked for me:


;with t1 as (
	select FirstName, LastName, Amount,
		rn = row_number() over(order by newid())
	from #t
	),

	t2 as (
	select FirstName, LastName, Amount,
		rn = row_number() over(order by newid())
	from #t
	)

update t1
set t1.Firstname = t2.Firstname
from t1, t2
where t1.rn = t2.rn

select * from #t

sreeni12345
Starting Member

14 Posts

Posted - 2014-05-05 : 10:51:45

Actually i have worked out the query given by you . But its updating the same value in all the records. However i have used the below query and currently it worked out

UPDATE table1
SET table.FirstName = #temptable1.FirstName
FROM table
INNER JOIN
#temptable1
ON table.tablekey = #temptable1.tablekey
WHERE
table.tablekey = #temptable1.tablekey or
(#temptable1.tablekey is not null and #temptable1.tablekey is null) or
(#temptable1.tablekey is not null and #temptable1.tablekey is null)

Correct me if i am wrong .

Next Page

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources