| Author |
Topic |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-10-14 : 11:40:31
|
Hi.I am interested in finding a generic efficient algorithm to find "first missing ID".Here is my first stab at the problem. Feel free to suggest your own algorithm./******************************************************************************* Create staging table for missing gap test*******************************************************************************/SET NOCOUNT ONCREATE TABLE #Sample ( ID INT PRIMARY KEY CLUSTERED )-- Define which ID is missingDECLARE @ID INT-- Use 1 for low missing value test-- Use 16379 for high missing value testSET @ID = ABS(CHECKSUM(NEWID())) % 16384SELECT @ID AS theMissingID-- Populate staging tableINSERT #SampleSELECT v.Number + 2048 * x.iFROM master..spt_values AS vCROSS JOIN ( SELECT 0 AS i UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 ) AS xWHERE v.Type = 'P'DELETEFROM #SampleWHERE ID IN (@ID, @ID + 1, @ID + 2, @ID + 4, @ID + 5, @ID + 6) OR ID IN (@ID + 9, @ID + 10, @ID + 11)/******************************************************************************* Setup participants queries*******************************************************************************/-- Peso 1 with {low 6, high 58} readsSELECT TOP 1 s.ID + 1 AS [Peso 1 ID]FROM #Sample AS sWHERE NOT EXISTS (SELECT * FROM #Sample AS t WHERE t.ID = s.ID + 1)-- Peso 2 with {low 29, high 29} readsSELECT MAX(ID) + 1 AS [Peso 2 ID]FROM ( SELECT TOP 1 WITH TIES ID FROM #Sample ORDER BY ID - ROW_NUMBER() OVER (ORDER BY ID) ) AS d-- Charlie 1 with {low 58, high 58) readsSELECT MIN(s.ID) + 1 AS [Charlie 1 ID]FROM #Sample AS sLEFT JOIN #Sample AS s2 ON s2.ID = s.ID + 1WHERE s2.ID IS NULL-- Charlie 2 with {low 29, high 29) readsSELECT MAX(rowNumber) + 1 AS [Charlie 2 ID]FROM ( SELECT ROW_NUMBER() OVER (ORDER BY ID) - 1 AS rowNumber, ID FROM #Sample ) AS sWHERE s.ID = s.rowNumber-- Hanbingl 1 with {low 6, high 58} readsSELECT TOP 1 ID AS [Hanbingl 1 ID]FROM ( SELECT ID + 1 AS ID FROM #Sample EXCEPT SELECT ID FROM #Sample ) AS t-- Hanbingl 2 with {low 27, high 163807} reads;WITH findMe(ID)AS ( SELECT MIN(ID) FROM #Sample UNION ALL SELECT fm.ID + 1 FROM findMe AS fm INNER JOIN #Sample AS s ON s.ID = fm.ID)SELECT MAX(ID) AS [Hanbingl 2 ID]FROM findMe OPTION (MAXRECURSION 0)-- Fribble 1 with {low 4, high 31} readsSELECT TOP 1 r AS [Fribble 1 ID]FROM ( SELECT ID, ROW_NUMBER() OVER (ORDER BY ID) + (SELECT MIN(ID) FROM #Sample) - 1 AS r FROM #Sample ) AS aWHERE ID <> rORDER BY ID/******************************************************************************* Clean up*******************************************************************************/DROP TABLE #SampleNB:When using @ID = 0 Charlie 2 returns NULL - should return 4When using @ID = 16383 Fribble 1 returns nothing - should return 16383 E 12°55'05.63"N 56°04'39.26" |
|
|
Transact Charlie
Master Smack Fu Yak Hacker
3451 Posts |
Posted - 2008-10-14 : 12:02:11
|
I guess my first stab would be logic like this.SELECT MIN(s.[Id]) + 1FROM #sample s LEFT JOIN #sample s2 ON s2.[Id] = s.[Id] + 1WHERE s2.[Id] IS NULL -------------Charlie |
 |
|
|
Transact Charlie
Master Smack Fu Yak Hacker
3451 Posts |
Posted - 2008-10-14 : 12:08:34
|
Another method,SELECT MAX([rownumber]) + 1FROM ( SELECT ROW_NUMBER() OVER(ORDER BY [Id] ASC) - 1 AS [rownumber] , [Id] AS [Id] FROM #sample ) sWHERE s.[ID] = s.[rownumber] I think this one is probably not good for performance though.-------------Charlie |
 |
|
|
Transact Charlie
Master Smack Fu Yak Hacker
3451 Posts |
Posted - 2008-10-14 : 12:15:37
|
| Actually, According to the execution plan I just looked at the rownumber method seems to be far faster than my left join and a little faster than your NOT EXISTS.Looking at the cost relative to batch.PESO 1 = 18%LEFT JOIN = 38% (slow)ROWNUMBER = 12%-------------Charlie |
 |
|
|
Transact Charlie
Master Smack Fu Yak Hacker
3451 Posts |
Posted - 2008-10-14 : 12:18:31
|
ROWNUMBER.Doesn't cope with a break in sequence right at the start though it is efficiant at finding one after that.More logically I think it should beSELECT MIN([rownumber]) AS [Charlie - ROWNUMBER]FROM ( SELECT ROW_NUMBER() OVER(ORDER BY [Id] ASC) - 1 AS [rownumber] , [Id] AS [Id] FROM #sample ) sWHERE s.[ID] <> s.[rownumber] |
 |
|
|
Transact Charlie
Master Smack Fu Yak Hacker
3451 Posts |
Posted - 2008-10-14 : 12:44:56
|
What about this for a modified rownumbr method?-- Charlie -- Test ROWNUMBERDECLARE @start INTSELECT @start = MIN([Id]) FROM #sampleSELECT MIN([rownumber]) AS [Charlie - ROWNUMBER]FROM ( SELECT ROW_NUMBER() OVER(ORDER BY [Id] ASC) - 1 + @start AS [rownumber] , [Id] AS [Id] FROM #sample ) sWHERE s.[ID] <> s.[rownumber] -------------Charlie |
 |
|
|
hanbingl
Aged Yak Warrior
652 Posts |
Posted - 2008-10-14 : 15:10:23
|
| select top 1 id from (select id+1 as id from #sample exceptselect id from #sample)t |
 |
|
|
hanbingl
Aged Yak Warrior
652 Posts |
Posted - 2008-10-14 : 15:23:48
|
| I've been thinking about a CTE recursion. Anyone wanna go for it? |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-10-14 : 15:25:23
|
I have thought about that too.It will be efficient for a low missing ID, but I have no idea for high missing id.Please go ahead. E 12°55'05.63"N 56°04'39.26" |
 |
|
|
Transact Charlie
Master Smack Fu Yak Hacker
3451 Posts |
Posted - 2008-10-14 : 17:05:40
|
quote:
-- Peso 2 with {low 6, high 6} readsSELECT MAX(ID) + 1 AS [Peso 2 ID]FROM ( SELECT TOP 1 WITH TIES ID FROM #Sample ORDER BY ID - ROW_NUMBER() OVER (ORDER BY ID) ) AS d
Peso -- that's brilliant!I was scratching my head after work trying to work out a way to avoid having to find the lowest starting number for sequencing in the #sample.Hail to the chief!-------------Charlie |
 |
|
|
hanbingl
Aged Yak Warrior
652 Posts |
Posted - 2008-10-14 : 17:40:00
|
| [code]with findme as ( select (select min(id) from #sample) as aunion allselect a+1 from findme inner join #sample on findme.a = #sample.id)select max(a) as RECURSIVE_ID from findme option (maxrecursion 0)[/code] |
 |
|
|
Arnold Fribble
Yak-finder General
1961 Posts |
Posted - 2008-10-14 : 17:58:17
|
Some comments:Case: scripts need modification to work when default server collation is case-sensitive.TOP: solutions that use TOP without an ORDER BY are not guaranteed to yield the first gap.Sample data: too small!Joins: Joining (either explicitly or with NOT EXISTS) on S.id = S0.id+1 is well optimized in SQL Server when id is a clustered key: you'll get a (not many-to-many) merge join which means that it won't use any extra storage and will only do O(n+m) comparisons. Admittedly it will access each row twice, but the rows will never be more than 1 page apart and usually on the same page.This should cost 1 read to get the minimum, plus reads of each page up to the first gap, and no sorts.SELECT TOP 1 rFROM ( SELECT ID, ROW_NUMBER() OVER (ORDER BY ID) + ( SELECT MIN(ID) FROM #Sample ) - 1 AS r FROM #Sample ) AS AWHERE ID <> rORDER BY ID |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-10-15 : 03:11:27
|
Points taken and test script is updated to 16384 sample records which is 9 pages. E 12°55'05.63"N 56°04'39.26" |
 |
|
|
Transact Charlie
Master Smack Fu Yak Hacker
3451 Posts |
Posted - 2008-10-15 : 04:36:44
|
| Which metrics are important here?IF you run the code with all the methods in then do you care about the cost relative to batch? Or are you only concerned about reads?-------------Charlie |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-10-15 : 04:46:28
|
I am concerned about reads, because that is what is hitting the database.And in the beginning when finding missing gaps, they will be closer to 1 than 16384.When ID's are reused, they probability is higher that the table already is filled and the ID to return is the next in line.And sometimes the ID will be found in the table as "missing". E 12°55'05.63"N 56°04'39.26" |
 |
|
|
Transact Charlie
Master Smack Fu Yak Hacker
3451 Posts |
Posted - 2008-10-15 : 04:53:24
|
quote: Originally posted by PesoAnd sometimes the ID will be found in the table as "missing".
I'm not sure I understand this.Are you saying that (if an identity seed is reset) when using up gaps in an Id sequence, sometimes the IDENTITY property will return a number that already exists? Or that it will miss out a number for no reason?-------------Charlie |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-10-15 : 05:02:13
|
The column has not to be an IDENTITY. It could later be a datetime also.The common thing is that it is sequential and incremental.For any given set {1, 2, ... , n} the goal is to get the first unused number (gap), and if all numbers are taken, return the value n + 1.This can be for allocating resort apartments or anything. Get the first unused or create the next in line for me.It can be seat placement on a bus or train or theater or whatever.I know you can have n records prestored in table and use a status flag with values of "occupied" or "free" to easier get what you want, but I want to do this somewhat normalized.I am happy with the algorithm I already have, I just wanted to know if there was a more efficient one.The reason for asking is that I might have come to a threshold with current system because the n value is about 15 million.And I think it is not efficient to have 15 million records prestored. We don't know if this client even may get as many records (n) as 15 millions. This client may only use 15000 or 50000. E 12°55'05.63"N 56°04'39.26" |
 |
|
|
Transact Charlie
Master Smack Fu Yak Hacker
3451 Posts |
Posted - 2008-10-15 : 05:23:16
|
Yes I see.So if your potential set is huge but client may only need use a small subset of the set then you only want to store and work with that small sub-set rather than having a data item for every possible record in the set.-- makes perfect sense.So I take it you must be doing 1 item at a time in whatever process you implement as you only care about finding the *first* 1 gap-------------Charlie |
 |
|
|
Transact Charlie
Master Smack Fu Yak Hacker
3451 Posts |
Posted - 2008-10-15 : 05:35:28
|
What about returning the set of missing Id's rather than just the first such Id.If you have multiple items to allocate at once then you could use the set rather than having to find the first open entry every time. If the set returned is empty then you know there are no missing Id'sSomething like.-- Charlie 3 (Set based)SELECT [Rownumber]FROM ( SELECT ROW_NUMBER() OVER (ORDER BY [Id] ASC) + 1 AS [Rownumber] FROM #sample ) sWHERE [rowNumber] NOT IN (SELECT [Id] FROM #sample) NB -- This assumes that the set should be {1,2....n} change as required if it can be {x,y.....n} -- Charlie.-------------Charlie |
 |
|
|
|