
 SQL Server 2005 Forums
 Transact-SQL (2005)
 Most efficient "Find first missing gap" algorithm


SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-10-14 : 11:40:31
Hi.

I am interested in finding a generic efficient algorithm to find "first missing ID".
Here is my first stab at the problem. Feel free to suggest your own algorithm.
/*******************************************************************************
Create staging table for missing gap test
*******************************************************************************/

SET NOCOUNT ON

CREATE TABLE #Sample
(
ID INT PRIMARY KEY CLUSTERED
)

-- Define which ID is missing
DECLARE @ID INT

-- Use 1 for low missing value test
-- Use 16379 for high missing value test

SET @ID = ABS(CHECKSUM(NEWID())) % 16384

SELECT @ID AS theMissingID

-- Populate staging table
INSERT #Sample
SELECT v.Number + 2048 * x.i
FROM master..spt_values AS v
CROSS JOIN (
SELECT 0 AS i UNION ALL
SELECT 1 UNION ALL
SELECT 2 UNION ALL
SELECT 3 UNION ALL
SELECT 4 UNION ALL
SELECT 5 UNION ALL
SELECT 6 UNION ALL
SELECT 7
) AS x
WHERE v.Type = 'P'

DELETE
FROM #Sample
WHERE ID IN (@ID, @ID + 1, @ID + 2, @ID + 4, @ID + 5, @ID + 6)
OR ID IN (@ID + 9, @ID + 10, @ID + 11)

/*******************************************************************************
Set up participant queries
*******************************************************************************/


-- Peso 1 with {low 6, high 58} reads
SELECT TOP 1 s.ID + 1 AS [Peso 1 ID]
FROM #Sample AS s
WHERE NOT EXISTS (SELECT * FROM #Sample AS t WHERE t.ID = s.ID + 1)

-- Peso 2 with {low 29, high 29} reads
SELECT MAX(ID) + 1 AS [Peso 2 ID]
FROM (
SELECT TOP 1 WITH TIES
ID
FROM #Sample
ORDER BY ID - ROW_NUMBER() OVER (ORDER BY ID)
) AS d

-- Charlie 1 with {low 58, high 58} reads
SELECT MIN(s.ID) + 1 AS [Charlie 1 ID]
FROM #Sample AS s
LEFT JOIN #Sample AS s2 ON s2.ID = s.ID + 1
WHERE s2.ID IS NULL

-- Charlie 2 with {low 29, high 29} reads
SELECT MAX(rowNumber) + 1 AS [Charlie 2 ID]
FROM (
SELECT ROW_NUMBER() OVER (ORDER BY ID) - 1 AS rowNumber,
ID
FROM #Sample
) AS s
WHERE s.ID = s.rowNumber

-- Hanbingl 1 with {low 6, high 58} reads
SELECT TOP 1
ID AS [Hanbingl 1 ID]
FROM (
SELECT ID + 1 AS ID
FROM #Sample

EXCEPT

SELECT ID
FROM #Sample
) AS t

-- Hanbingl 2 with {low 27, high 163807} reads
;WITH findMe(ID)
AS (
SELECT MIN(ID)
FROM #Sample

UNION ALL

SELECT fm.ID + 1
FROM findMe AS fm
INNER JOIN #Sample AS s ON s.ID = fm.ID
)

SELECT MAX(ID) AS [Hanbingl 2 ID]
FROM findMe
OPTION (MAXRECURSION 0)

-- Fribble 1 with {low 4, high 31} reads
SELECT TOP 1 r AS [Fribble 1 ID]
FROM (
SELECT ID,
ROW_NUMBER() OVER (ORDER BY ID) + (SELECT MIN(ID) FROM #Sample) - 1 AS r
FROM #Sample
) AS a
WHERE ID <> r
ORDER BY ID

/*******************************************************************************
Clean up
*******************************************************************************/


DROP TABLE #Sample


NB:

When using @ID = 0 Charlie 2 returns NULL - should return 4
When using @ID = 16383 Fribble 1 returns nothing - should return 16383
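Both NB edge cases can be covered by giving the NOT EXISTS approach an explicit ORDER BY; this is only a sketch against the same #Sample table, not one of the measured participants:

-- Sketch: the ORDER BY guarantees the *first* gap is returned;
-- with no interior gap the last row qualifies and MAX(ID) + 1 comes back
SELECT TOP 1 s.ID + 1 AS [Sketch ID]
FROM #Sample AS s
WHERE NOT EXISTS (SELECT * FROM #Sample AS t WHERE t.ID = s.ID + 1)
ORDER BY s.ID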


E 12°55'05.63"
N 56°04'39.26"

Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2008-10-14 : 12:02:11
I guess my first stab would be logic like this.


SELECT MIN(s.[Id]) + 1
FROM
#sample s
LEFT JOIN #sample s2 ON s2.[Id] = s.[Id] + 1
WHERE
s2.[Id] IS NULL


-------------
Charlie

Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2008-10-14 : 12:08:34
Another method:

SELECT
MAX([rownumber]) + 1
FROM
(
SELECT
ROW_NUMBER() OVER(ORDER BY [Id] ASC) - 1 AS [rownumber]
, [Id] AS [Id]
FROM
#sample
)
s
WHERE
s.[ID] = s.[rownumber]


I think this one is probably not good for performance though.


Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2008-10-14 : 12:15:37
Actually, according to the execution plan I just looked at, the row-number method seems to be far faster than my LEFT JOIN and a little faster than your NOT EXISTS.

Looking at the cost relative to batch.

PESO 1 = 18%

LEFT JOIN = 38% (slow)

ROWNUMBER = 12%


Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2008-10-14 : 12:18:31
ROW_NUMBER doesn't cope with a break in the sequence right at the start, though it is efficient at finding one after that.

More logically, I think it should be

SELECT
MIN([rownumber]) AS [Charlie - ROWNUMBER]
FROM
(
SELECT
ROW_NUMBER() OVER(ORDER BY [Id] ASC) - 1 AS [rownumber]
, [Id] AS [Id]
FROM
#sample
)
s
WHERE
s.[ID] <> s.[rownumber]


Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2008-10-14 : 12:44:56
What about this for a modified row-number method?


-- Charlie -- Test ROWNUMBER

DECLARE @start INT
SELECT @start = MIN([Id]) FROM #sample

SELECT
MIN([rownumber]) AS [Charlie - ROWNUMBER]
FROM
(
SELECT
ROW_NUMBER() OVER(ORDER BY [Id] ASC) - 1 + @start AS [rownumber]
, [Id] AS [Id]
FROM
#sample
)
s
WHERE
s.[ID] <> s.[rownumber]



hanbingl
Aged Yak Warrior

652 Posts

Posted - 2008-10-14 : 15:10:23
select top 1 id from (
select id+1 as id from #sample
except
select id from #sample
)t

hanbingl
Aged Yak Warrior

652 Posts

Posted - 2008-10-14 : 15:23:48
I've been thinking about a CTE recursion. Anyone wanna go for it?

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-10-14 : 15:25:23
I have thought about that too.
It will be efficient for a low missing ID, but I have no idea how it will behave for a high one.
Please go ahead.




Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2008-10-14 : 17:05:40
quote:


-- Peso 2 with {low 6, high 6} reads
SELECT MAX(ID) + 1 AS [Peso 2 ID]
FROM (
SELECT TOP 1 WITH TIES
ID
FROM #Sample
ORDER BY ID - ROW_NUMBER() OVER (ORDER BY ID)
) AS d



Peso -- that's brilliant!

I was scratching my head after work trying to work out a way to avoid having to find the lowest starting number for sequencing in the #sample.

Hail to the chief!


hanbingl
Aged Yak Warrior

652 Posts

Posted - 2008-10-14 : 17:40:00
with findme as (
select (select min(id) from #sample) as a
union all
select a+1 from findme inner join #sample
on findme.a = #sample.id
)
select max(a) as RECURSIVE_ID from findme
option (maxrecursion 0)

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2008-10-14 : 17:58:17
Some comments:

Case: scripts need modification to work when default server collation is case-sensitive.

TOP: solutions that use TOP without an ORDER BY are not guaranteed to yield the first gap.

Sample data: too small!

Joins: Joining (either explicitly or with NOT EXISTS) on S.id = S0.id+1 is well optimized in SQL Server when id is a clustered key: you'll get a (not many-to-many) merge join which means that it won't use any extra storage and will only do O(n+m) comparisons. Admittedly it will access each row twice, but the rows will never be more than 1 page apart and usually on the same page.

This should cost 1 read to get the minimum, plus reads of each page up to the first gap, and no sorts.

SELECT TOP 1 r
FROM (
SELECT ID, ROW_NUMBER() OVER (ORDER BY ID) +
( SELECT MIN(ID) FROM #Sample ) - 1 AS r
FROM #Sample
) AS A
WHERE ID <> r
ORDER BY ID


SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-10-15 : 03:11:27
Points taken; the test script is updated to 16384 sample records, which is 9 pages.



Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2008-10-15 : 04:36:44
Which metrics are important here?

If you run the code with all the methods in, do you care about the cost relative to batch, or are you only concerned about reads?


SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-10-15 : 04:46:28
I am concerned about reads, because that is what hits the database.
In the beginning, missing gaps will be closer to 1 than to 16384.
When IDs are reused, the probability is higher that the table is already filled and the ID to return is the next in line.
And sometimes the ID will be found in the table as "missing".




Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2008-10-15 : 04:53:24
quote:
Originally posted by Peso
And sometimes the ID will be found in the table as "missing".


I'm not sure I understand this.

Are you saying that (if an identity seed is reset) when using up gaps in an Id sequence, sometimes the IDENTITY property will return a number that already exists? Or that it will miss out a number for no reason?


SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-10-15 : 05:02:13
The column does not have to be an IDENTITY; it could also be a datetime.
The common thing is that it is sequential and incrementing.

For any given set {1, 2, ... , n} the goal is to get the first unused number (gap), and if all numbers are taken, return the value n + 1.

This could be for allocating resort apartments, seat placement on a bus, train, or theater, or anything similar: get the first unused slot, or create the next in line for me.

I know you can have n records pre-stored in the table and use a status flag with values "occupied" or "free" to get what you want more easily, but I want to do this somewhat normalized.
I am happy with the algorithm I already have; I just wanted to know if there was a more efficient one.
The reason for asking is that I might have reached a threshold with the current system, because the n value is about 15 million.

And I think it is not efficient to have 15 million records pre-stored. We don't know whether this client will ever reach as many records (n) as 15 million; they may only use 15000 or 50000.
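For the {1, 2, ..., n} goal stated above (first unused number, or n + 1 when all are taken), here is a minimal sketch assuming the same #Sample table; the CASE covers a gap at position 1, and the inner TOP 1 ... ORDER BY otherwise returns the first interior gap, or n + 1 when the sequence is unbroken:

-- Sketch only; not one of the measured participants
SELECT CASE
       WHEN NOT EXISTS (SELECT * FROM #Sample WHERE ID = 1) THEN 1
       ELSE (SELECT TOP 1 s.ID + 1
             FROM #Sample AS s
             WHERE NOT EXISTS (SELECT * FROM #Sample AS t WHERE t.ID = s.ID + 1)
             ORDER BY s.ID)
       END AS [First unused ID]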



Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2008-10-15 : 05:23:16
Yes, I see.

So if your potential set is huge but the client may only use a small subset of it, you only want to store and work with that small subset rather than having a data item for every possible record in the set -- makes perfect sense.

So I take it you must be allocating one item at a time in whatever process you implement, since you only care about finding the *first* gap.


Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2008-10-15 : 05:35:28
What about returning the set of missing IDs rather than just the first one?

If you have multiple items to allocate at once, you can work from that set rather than having to find the first open entry every time. If the set returned is empty, you know there are no missing IDs.

Something like this:

-- Charlie 3 (Set based)
SELECT
[Rownumber]
FROM
(
SELECT
ROW_NUMBER() OVER (ORDER BY [Id] ASC) + 1 AS [Rownumber]
FROM
#sample
)
s
WHERE
[rowNumber] NOT IN (SELECT [Id] FROM #sample)


NB -- this assumes the set should be {1, 2, ..., n}; change as required if it can be {x, ..., n}. -- Charlie
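For the {x, ..., n} case the NB mentions, the row numbers can be offset by the minimum ID instead of starting at 1; a sketch along the same lines, not measured against the harness:

-- Sketch: set of missing IDs relative to the lowest stored ID
DECLARE @start INT
SELECT @start = MIN([Id]) FROM #sample

SELECT [Rownumber]
FROM
(
    SELECT ROW_NUMBER() OVER (ORDER BY [Id] ASC) - 1 + @start AS [Rownumber]
    FROM #sample
) s
WHERE [Rownumber] NOT IN (SELECT [Id] FROM #sample)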