SELECT...INTO row order vs IDENTITY

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2000 Forums

Transact-SQL (2000)

SELECT...INTO row order vs IDENTITY

Author

Topic

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2002-04-17 : 06:31:18

I'm not convinced this is a bug, but it sure looks strange to me. I'd be interested to know what other people think.

A commonly used way of ranking rows in large tables is to use a temporary table with an identity column -- using self-joins gets very slow because SQL Server (currently) generates a query plan with a quadratic run time.
So I was a bit taken aback when using SELECT...INTO to find that the identity values are not always in the same order as the ORDER BY clause (using SQL Server 2000 sp 2). On further reflection, it doesn't seem to be a huge problem, but if it's a bug, it's difficult to predict when it will strike!

Examples here are difficult because it depends (I think) on the extent order of the index being used. This example may or may not work -- I've tested it on a couple of machines and it exhibited the symptoms on both the SQL Server 2000 sp 2 installations but not on the server with SQL Server 7 sp2. The table generated for testing is not small: 100000 rows.


DROP TABLE #ForRanking

CREATE TABLE #ForRanking (
  id int PRIMARY KEY,
  idcopy int NOT NULL
)

SET NOCOUNT ON

DECLARE @i int, @r int

SET @i = 0
SET @r = 0
WHILE @i < 100000
BEGIN
  INSERT INTO #ForRanking (id, idcopy) VALUES (@r, @r)
  SET @i = @i + 1
  SET @r = @r - FLOOR(RAND() * 10)-1
END

SET NOCOUNT OFF

Here we're generating the table with the values that need to be ranked. The primary keys are generated starting at 0 and going down by a random number between 1 and 10 -- we're inserting them into the clustered index in the 'wrong' order. The idcopy column is just to see what happens with the same table using an unindexed column.

Now to test it. It's worth turning on the execution plan display for this (but not the part above! 100000 inserts will make a big mess!). Using both SELECT...INTO and INSERT...SELECT, we're creating a table with an identity column and populating it with the rows from #ForRanking. The six tables are being populated with 3 different values: id (the primary key), id+0 (an order preserving, scalar calculation on the primary key) and idcopy (an unindexed value). Our assumption is that the identity values will be generated such that their order matches the row order of the ORDER BY.
If this is so, rows in the resulting tables with consecutive identity values should have ascending id values. This is tested by the self-join queries: if they return 0 there are no out-of-place values.


DROP TABLE #RankTest1
DROP TABLE #RankTest2
DROP TABLE #RankTest3
DROP TABLE #RankTest4
DROP TABLE #RankTest5
DROP TABLE #RankTest6

SELECT id, IDENTITY(int, 1, 1) AS rank
INTO #RankTest1
FROM #ForRanking
ORDER BY id

SELECT id+0 AS id, IDENTITY(int, 1, 1) AS rank
INTO #RankTest2
FROM #ForRanking
ORDER BY id+0

SELECT idcopy AS id, IDENTITY(int, 1, 1) AS rank
INTO #RankTest3
FROM #ForRanking
ORDER BY idcopy


SELECT COUNT(*)
FROM #RankTest1 R1
INNER JOIN #RankTest1 R2 ON R1.rank = R2.rank - 1
WHERE R1.id > R2.id

SELECT COUNT(*)
FROM #RankTest2 R1
INNER JOIN #RankTest2 R2 ON R1.rank = R2.rank - 1
WHERE R1.id > R2.id

SELECT COUNT(*)
FROM #RankTest3 R1
INNER JOIN #RankTest3 R2 ON R1.rank = R2.rank - 1
WHERE R1.id > R2.id



CREATE TABLE #RankTest4 (id int NOT NULL, rank int IDENTITY(1, 1))
CREATE TABLE #RankTest5 (id int NOT NULL, rank int IDENTITY(1, 1))
CREATE TABLE #RankTest6 (id int NOT NULL, rank int IDENTITY(1, 1))

INSERT INTO #RankTest4 (id)
SELECT id
FROM #ForRanking
ORDER BY id

INSERT INTO #RankTest5 (id)
SELECT id+0
FROM #ForRanking
ORDER BY id+0

INSERT INTO #RankTest6 (id)
SELECT idcopy
FROM #ForRanking
ORDER BY idcopy

SELECT COUNT(*)
FROM #RankTest4 R1
INNER JOIN #RankTest4 R2 ON R1.rank = R2.rank - 1
WHERE R1.id > R2.id

SELECT COUNT(*)
FROM #RankTest5 R1
INNER JOIN #RankTest5 R2 ON R1.rank = R2.rank - 1
WHERE R1.id > R2.id

SELECT COUNT(*)
FROM #RankTest6 R1
INNER JOIN #RankTest6 R2 ON R1.rank = R2.rank - 1
WHERE R1.id > R2.id

These are the results I get from the self-joins:
#RankTest1 0
#RankTest2 46866 (example: varies somewhat)
#RankTest3 0
#RankTest4 0
#RankTest5 0
#RankTest6 0

The scalar calculation on the primary key using SELECT...INTO doesn't work as expected!
Examining the execution plan:
#RankTest1 uses ORDERED FORWARD on the Clustered Index Scan, calls getidentity (and inserts the rows).
#RankTest3 uses gets the rows, sorts them and calls getidentity.
But #RankTest2 gets the rows (without ORDERED FORWARD), calls getidentity and then sorts them.

By contrast #RankTest5 (the equivalent of #RankTest2 using INSERT...INTO) does the getidentity after the sort.

Edited by - Arnold Fribble on 04/17/2002 06:36:42

VyasKN
SQL Server MVP & SQLTeam MVY

313 Posts

Posted - 2002-04-17 : 07:33:07

Have you seen this article at Microsoft KB:
Q273586
http://support.microsoft.com/default.aspx?scid=kb;en-us;Q273586

--
HTH,
Vyas
Check out my SQL Server site @
http://vyaskn.tripod.com

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2002-04-17 : 07:44:26

I hadn't. Thank you!

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources