
 SQL Server 2005 Forums
 Transact-SQL (2005)
 Most efficient "Find first missing gap" algorithm


SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-10-14 : 11:40:31
Hi.

I am interested in finding a generic efficient algorithm to find "first missing ID".
Here is my first stab at the problem. Feel free to suggest your own algorithm.
/*******************************************************************************
Create staging table for missing gap test
*******************************************************************************/

SET NOCOUNT ON

CREATE TABLE #Sample
(
ID INT PRIMARY KEY CLUSTERED
)

-- Define which ID is missing
DECLARE @ID INT

-- Use 1 for low missing value test
-- Use 16379 for high missing value test

SET @ID = ABS(CHECKSUM(NEWID())) % 16384

SELECT @ID AS theMissingID

-- Populate staging table
INSERT #Sample
SELECT v.Number + 2048 * x.i
FROM master..spt_values AS v
CROSS JOIN (
SELECT 0 AS i UNION ALL
SELECT 1 UNION ALL
SELECT 2 UNION ALL
SELECT 3 UNION ALL
SELECT 4 UNION ALL
SELECT 5 UNION ALL
SELECT 6 UNION ALL
SELECT 7
) AS x
WHERE v.Type = 'P'

DELETE
FROM #Sample
WHERE ID IN (@ID, @ID + 1, @ID + 2, @ID + 4, @ID + 5, @ID + 6)
OR ID IN (@ID + 9, @ID + 10, @ID + 11)

/*******************************************************************************
Set up participant queries
*******************************************************************************/


-- Peso 1 with {low 6, high 58} reads
SELECT TOP 1 s.ID + 1 AS [Peso 1 ID]
FROM #Sample AS s
WHERE NOT EXISTS (SELECT * FROM #Sample AS t WHERE t.ID = s.ID + 1)

-- Peso 2 with {low 29, high 29} reads
SELECT MAX(ID) + 1 AS [Peso 2 ID]
FROM (
SELECT TOP 1 WITH TIES
ID
FROM #Sample
ORDER BY ID - ROW_NUMBER() OVER (ORDER BY ID)
) AS d

-- Charlie 1 with {low 58, high 58} reads
SELECT MIN(s.ID) + 1 AS [Charlie 1 ID]
FROM #Sample AS s
LEFT JOIN #Sample AS s2 ON s2.ID = s.ID + 1
WHERE s2.ID IS NULL

-- Charlie 2 with {low 29, high 29} reads
SELECT MAX(rowNumber) + 1 AS [Charlie 2 ID]
FROM (
SELECT ROW_NUMBER() OVER (ORDER BY ID) - 1 AS rowNumber,
ID
FROM #Sample
) AS s
WHERE s.ID = s.rowNumber

-- Hanbingl 1 with {low 6, high 58} reads
SELECT TOP 1
ID AS [Hanbingl 1 ID]
FROM (
SELECT ID + 1 AS ID
FROM #Sample

EXCEPT

SELECT ID
FROM #Sample
) AS t

-- Hanbingl 2 with {low 27, high 163807} reads
;WITH findMe(ID)
AS (
SELECT MIN(ID)
FROM #Sample

UNION ALL

SELECT fm.ID + 1
FROM findMe AS fm
INNER JOIN #Sample AS s ON s.ID = fm.ID
)

SELECT MAX(ID) AS [Hanbingl 2 ID]
FROM findMe
OPTION (MAXRECURSION 0)

-- Fribble 1 with {low 4, high 31} reads
SELECT TOP 1 r AS [Fribble 1 ID]
FROM (
SELECT ID,
ROW_NUMBER() OVER (ORDER BY ID) + (SELECT MIN(ID) FROM #Sample) - 1 AS r
FROM #Sample
) AS a
WHERE ID <> r
ORDER BY ID

/*******************************************************************************
Clean up
*******************************************************************************/


DROP TABLE #Sample


NB:

When using @ID = 0 Charlie 2 returns NULL - should return 4
When using @ID = 16383 Fribble 1 returns nothing - should return 16383
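Both NB edge cases can be covered by giving the NOT EXISTS approach an explicit ORDER BY; this is only a sketch against the same #Sample table, not one of the measured participants:

-- Sketch: the ORDER BY guarantees the *first* gap is returned;
-- with no interior gap the last row qualifies and MAX(ID) + 1 comes back
SELECT TOP 1 s.ID + 1 AS [Sketch ID]
FROM #Sample AS s
WHERE NOT EXISTS (SELECT * FROM #Sample AS t WHERE t.ID = s.ID + 1)
ORDER BY s.ID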


E 12°55'05.63"
N 56°04'39.26"

Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2008-10-14 : 12:02:11
I guess my first stab would be logic like this.


SELECT MIN(s.[Id]) + 1
FROM
#sample s
LEFT JOIN #sample s2 ON s2.[Id] = s.[Id] + 1
WHERE
s2.[Id] IS NULL


-------------
Charlie

Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2008-10-14 : 12:08:34
Another method:

SELECT
MAX([rownumber]) + 1
FROM
(
SELECT
ROW_NUMBER() OVER(ORDER BY [Id] ASC) - 1 AS [rownumber]
, [Id] AS [Id]
FROM
#sample
)
s
WHERE
s.[ID] = s.[rownumber]


I think this one is probably not good for performance though.


Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2008-10-14 : 12:15:37
Actually, according to the execution plan I just looked at, the row-number method seems to be far faster than my LEFT JOIN and a little faster than your NOT EXISTS.

Looking at the cost relative to batch.

PESO 1 = 18%

LEFT JOIN = 38% (slow)

ROWNUMBER = 12%


Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2008-10-14 : 12:18:31
ROW_NUMBER doesn't cope with a break in the sequence right at the start, though it is efficient at finding one after that.

More logically, I think it should be

SELECT
MIN([rownumber]) AS [Charlie - ROWNUMBER]
FROM
(
SELECT
ROW_NUMBER() OVER(ORDER BY [Id] ASC) - 1 AS [rownumber]
, [Id] AS [Id]
FROM
#sample
)
s
WHERE
s.[ID] <> s.[rownumber]


Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2008-10-14 : 12:44:56
What about this for a modified row-number method?


-- Charlie -- Test ROWNUMBER

DECLARE @start INT
SELECT @start = MIN([Id]) FROM #sample

SELECT
MIN([rownumber]) AS [Charlie - ROWNUMBER]
FROM
(
SELECT
ROW_NUMBER() OVER(ORDER BY [Id] ASC) - 1 + @start AS [rownumber]
, [Id] AS [Id]
FROM
#sample
)
s
WHERE
s.[ID] <> s.[rownumber]



hanbingl
Aged Yak Warrior

652 Posts

Posted - 2008-10-14 : 15:10:23
select top 1 id from (
select id+1 as id from #sample
except
select id from #sample
)t

hanbingl
Aged Yak Warrior

652 Posts

Posted - 2008-10-14 : 15:23:48
I've been thinking about a CTE recursion. Anyone wanna go for it?

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-10-14 : 15:25:23
I have thought about that too.
It will be efficient for a low missing ID, but I have no idea how it will behave for a high one.
Please go ahead.




Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2008-10-14 : 17:05:40
quote:


-- Peso 2 with {low 6, high 6} reads
SELECT MAX(ID) + 1 AS [Peso 2 ID]
FROM (
SELECT TOP 1 WITH TIES
ID
FROM #Sample
ORDER BY ID - ROW_NUMBER() OVER (ORDER BY ID)
) AS d



Peso -- that's brilliant!

I was scratching my head after work trying to work out a way to avoid having to find the lowest starting number for sequencing in the #sample.

Hail to the chief!


hanbingl
Aged Yak Warrior

652 Posts

Posted - 2008-10-14 : 17:40:00
with findme as (
select (select min(id) from #sample) as a
union all
select a+1 from findme inner join #sample
on findme.a = #sample.id
)
select max(a) as RECURSIVE_ID from findme
option (maxrecursion 0)

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2008-10-14 : 17:58:17
Some comments:

Case: scripts need modification to work when default server collation is case-sensitive.

TOP: solutions that use TOP without an ORDER BY are not guaranteed to yield the first gap.

Sample data: too small!

Joins: Joining (either explicitly or with NOT EXISTS) on S.id = S0.id+1 is well optimized in SQL Server when id is a clustered key: you'll get a (not many-to-many) merge join which means that it won't use any extra storage and will only do O(n+m) comparisons. Admittedly it will access each row twice, but the rows will never be more than 1 page apart and usually on the same page.

This should cost 1 read to get the minimum, plus reads of each page up to the first gap, and no sorts.

SELECT TOP 1 r
FROM (
SELECT ID, ROW_NUMBER() OVER (ORDER BY ID) +
( SELECT MIN(ID) FROM #Sample ) - 1 AS r
FROM #Sample
) AS A
WHERE ID <> r
ORDER BY ID


SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-10-15 : 03:11:27
Points taken; the test script is updated to 16384 sample records, which is 9 pages.



Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2008-10-15 : 04:36:44
Which metrics are important here?

If you run the code with all the methods in, do you care about the cost relative to batch, or are you only concerned about reads?


SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-10-15 : 04:46:28
I am concerned about reads, because that is what hits the database.
In the beginning, missing gaps will be closer to 1 than to 16384.
When IDs are reused, the probability is higher that the table is already filled and the ID to return is the next in line.
And sometimes the ID will be found in the table as "missing".




Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2008-10-15 : 04:53:24
quote:
Originally posted by Peso
And sometimes the ID will be found in the table as "missing".


I'm not sure I understand this.

Are you saying that (if an identity seed is reset) when using up gaps in an Id sequence, sometimes the IDENTITY property will return a number that already exists? Or that it will miss out a number for no reason?


SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-10-15 : 05:02:13
The column does not have to be an IDENTITY; it could also be a datetime.
The common thing is that it is sequential and incrementing.

For any given set {1, 2, ... , n} the goal is to get the first unused number (gap), and if all numbers are taken, return the value n + 1.

This could be for allocating resort apartments, seat placement on a bus, train, or theater, or anything similar: get the first unused slot, or create the next in line for me.

I know you can have n records pre-stored in the table and use a status flag with values "occupied" or "free" to get what you want more easily, but I want to do this somewhat normalized.
I am happy with the algorithm I already have; I just wanted to know if there was a more efficient one.
The reason for asking is that I might have reached a threshold with the current system, because the n value is about 15 million.

And I think it is not efficient to have 15 million records pre-stored. We don't know whether this client will ever reach as many records (n) as 15 million; they may only use 15000 or 50000.
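For the {1, 2, ..., n} goal stated above (first unused number, or n + 1 when all are taken), here is a minimal sketch assuming the same #Sample table; the CASE covers a gap at position 1, and the inner TOP 1 ... ORDER BY otherwise returns the first interior gap, or n + 1 when the sequence is unbroken:

-- Sketch only; not one of the measured participants
SELECT CASE
       WHEN NOT EXISTS (SELECT * FROM #Sample WHERE ID = 1) THEN 1
       ELSE (SELECT TOP 1 s.ID + 1
             FROM #Sample AS s
             WHERE NOT EXISTS (SELECT * FROM #Sample AS t WHERE t.ID = s.ID + 1)
             ORDER BY s.ID)
       END AS [First unused ID]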



Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2008-10-15 : 05:23:16
Yes, I see.

So if your potential set is huge but the client may only use a small subset of it, you only want to store and work with that small subset rather than having a data item for every possible record in the set -- makes perfect sense.

So I take it you must be allocating one item at a time in whatever process you implement, since you only care about finding the *first* gap.


Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2008-10-15 : 05:35:28
What about returning the set of missing IDs rather than just the first one?

If you have multiple items to allocate at once, you can work from that set rather than having to find the first open entry every time. If the set returned is empty, you know there are no missing IDs.

Something like this:

-- Charlie 3 (Set based)
SELECT
[Rownumber]
FROM
(
SELECT
ROW_NUMBER() OVER (ORDER BY [Id] ASC) + 1 AS [Rownumber]
FROM
#sample
)
s
WHERE
[rowNumber] NOT IN (SELECT [Id] FROM #sample)


NB -- this assumes the set should be {1, 2, ..., n}; change as required if it can be {x, ..., n}. -- Charlie
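For the {x, ..., n} case the NB mentions, the row numbers can be offset by the minimum ID instead of starting at 1; a sketch along the same lines, not measured against the harness:

-- Sketch: set of missing IDs relative to the lowest stored ID
DECLARE @start INT
SELECT @start = MIN([Id]) FROM #sample

SELECT [Rownumber]
FROM
(
    SELECT ROW_NUMBER() OVER (ORDER BY [Id] ASC) - 1 + @start AS [Rownumber]
    FROM #sample
) s
WHERE [Rownumber] NOT IN (SELECT [Id] FROM #sample)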