Article: Spotting breaks in data sequences

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

Site Related Forums

Article Discussion

Article: Spotting breaks in data sequences

Author

Topic

AskSQLTeam
Ask SQLTeam Question

0 Posts

Posted - 2002-07-16 : 09:21:52

neil submitted "A TSQL tutorial on spotting breaks in data sequences"

Article Link.

Page47
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2002-07-16 : 09:33:07

In almost all cases the WHERE NOT EXISTS correlated sub-query will outperform the LEFT JOIN WHERE therightside IS NULL....

Run this in QA and Show Execution Plan...


create table testTable (i int) 
go

insert testTable 
    select 1
    union select 2
    union select 3
    union select 4
    union select 6
    union select 7
    union select 10
    union select 11
    union select 15
    union select 21
go

select  'there is a gap between ' 
+       convert(varchar, l.i) 
+       ' and ' 
+       (select convert(varchar, min(i)) from testTable where i > l.i)
from    testTable l
left    join (select i from testTable ) r
        on l.i + 1 = r.i 
where   r.i IS NULL
and     l.i < (select max(i) from testTable)
order   by l.i

select  'there is a gap between ' 
+       convert(varchar, l.i) 
+       ' and ' 
+       (select convert(varchar, min(i)) from testTable where i > l.i)
from    testTable l
where   not exists (
        select 47
        from testTable r
        where l.i + 1 = r.i )
and     l.i < (select max(i) from testTable)
order   by l.i

go
drop table testTable
go

SubTree cost of .266 vs. .196 . . .

[edit]Although, this is highly dependend on indexing. A clustered index on testTable(i) gives the LEFT JOIN query a .0578 and the NOT EXISTS query a .0747.[/edit]
<O>

Edited by - Page47 on 07/16/2002 10:09:01

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2002-07-16 : 13:13:08

If i is a clustered primary key (etc.) then (SQL Server 2000 certainly) is able to use a merge join between l and r, which is handy. This applies to both the LEFT JOIN and NOT EXISTS versions. Unfortunately, if you try r.i - l.i = 1 it is unable to rearrange this!

BTW, the number of gaps for unique values 1 .. n where each value has an independent probability of being present of p is n*p - n*p^2.

jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2002-11-08 : 16:43:23

OK, here's an alternate and shorter very quick way to spot gaps:

SELECT "Gap between " & x.Value & " and " & Max(y.value) AS Message
FROM Vals AS y, Vals AS x
WHERE y.Value < x.[value]
GROUP BY x.Value
HAVING Max(y.Value) <> x.value - 1

If you are using dates, just edit the "x.value - 1" as needed. For example, DateAdd(Day,-1,x.value).

I think this takes care of all cases ..... Have fun!

billsox
Yak Posting Veteran

74 Posts

Posted - 2002-11-08 : 22:30:07

jsmith8858, this code is golden! Can you explain exactly what it's doing?

Bill

quote:

OK, here's an alternate and shorter very quick way to spot gaps:

SELECT "Gap between " & x.Value & " and " & Max(y.value) AS Message
FROM Vals AS y, Vals AS x
WHERE y.Value < x.[value]
GROUP BY x.Value
HAVING Max(y.Value) <> x.value - 1

If you are using dates, just edit the "x.value - 1" as needed. For example, DateAdd(Day,-1,x.value).

I think this takes care of all cases ..... Have fun!

nr
SQLTeam MVY

12543 Posts

Posted - 2002-11-10 : 18:01:38

SELECT "Gap between " & x.Value & " and " & Max(y.value) AS Message
FROM Vals AS y, Vals AS x
WHERE y.Value < x.[value]
GROUP BY x.Value
HAVING Max(y.Value) <> x.value - 1

WHERE y.Value < x.[value]
get all y values that are less than the x in this group

HAVING Max(y.Value) <> x.value - 1
Exclude groups where there is a a y = x - 1 i.e. no gap

SELECT "Gap between " & x.Value & " and " & Max(y.value)
give the x and max(y) = greatest value less than the x where there is a gap.

same as (hope this works - removed some strange " and &)
SELECT 'Gap between ' + x.Value + ' and ' + y.value AS Message
FROM Vals AS y, Vals AS x
WHERE y.Value = (select max(z.Value) from Vals z where z.Value < x.Value)
and y.Value <> x.Value - 1

==========================================
Cursors are useful if you don't know sql.
DTS can be used in a similar way.
Beer is not cold and it isn't fizzy.

jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2002-11-11 : 15:04:49

sorry, my SQL was kind of a mix of SQL server and Access.

NR, you are pretty right on about explaining how my query works. However, it is very different from your example for mostly 1 reason: your example needs to enumerate through the "Values" table 3 times, and mine only twice. If that table is really a slow view or huge table, the performance difference can be very important. Also, I am not sure what your example is doing, or why it needs the subquery in the WHERE clause. All of the logic components are provided just by doing a cartesian of the table to itself 1 time.

Another way of wording my query in english:

"For each value in the table (call this "x"), get the highest value ("y") less than x. If x <> (y + 1), there must be a gap between x and y."

Examples like this illustrate the big difference between WHERE and HAVING clauses which confused me for quite some time as I was beginning SQL.

Edited by - jsmith8858 on 11/15/2002 13:50:59

BusmasterJones
Starting Member

4 Posts

Posted - 2004-01-15 : 11:42:30

I ran a comparision test between jsmith's HAVING version of the query and a slightly modified (to fix the datatype problem on concatenating the output) version of nr's MAX query. I ran these queries on a table with approx 14000 rows and 33 gaps. The column checked is an IDENTITY column functioning as the PK and has a nonclustered index on it. Here are the results:

nr query: 33 gaps found in < 1 sec execution time

jsmith query: 33 gaps found in 1 min 49 sec execution time

The query plans are radically different. The faster query really only uses the indexes to solve the problem. The slower query gets a cartesian product of the table being searched joined to itself (in this case 14000 x 14000 rows). Because of some filtering first due to the WHERE clause, it is actually more like 14000 x 7000 rows, but this still deals with almost 100,000,000 rows.

Clearly "small" differences in query construction can mean BIG differences in performance.

jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2004-01-15 : 12:36:21

you know, i think i wrote that one before i fully grasped the concepts of subqueries in SQL Server -- i was used to doing things more of the old Access way.

And, of course, as time goes on we get much wiser -- when I revisit this problem, I then realize that this might be the most efficient way:

select v1,v2 from
(
SELECT x.i v1, (select max(y.i) from vals y where y.i < x.i) as v2
from vals x
) a
where v1 <> v2+1

which is a little bit faster than NR's method, and requires a little less work.

And MUCH MUCH more efficient than my old way of doing it !!!!

Great job, Busmaster, thanks for the info and doing the actual test. FYI -- here's how I tested:


create table vals(i int primary key, d varchar(10))

GO
declare @i int;
set @i=0
set nocount on
while @i < 14000
	begin
	if @i % 424 <> 0 insert into vals(i, d) values(@i, 'dummy data')
	set @i = @i+1
	end
set nocount off
GO

-- my "new" way:

select v1,v2 from
(
SELECT x.i v1, (select max(y.i) from vals y where y.i < x.i) as v2
from vals x
) a
where v1 <> v2+1

-- NR's method:

SELECT x.i, y.i
FROM Vals y, Vals x 
WHERE y.i = (select max(z.i) from Vals z where z.i < x.i)
and y.i <> x.i - 1

GO

drop table vals

- Jeff

CactusJuice
Starting Member

46 Posts

Posted - 2005-04-15 : 12:25:39

What happened to the original article? I'd like to see it. Get a 404 error and Google's cache is empty so I assume it's been missing for a while.

Eejit
Starting Member

1 Post

Posted - 2005-07-14 : 10:33:48

Ditto Cactus Juice! Cannot find original article, which seems to cover the exact problem I've been struggling with all day. Googled it, tried different variations on DNS, etc. Any chance someone could fix the broken link please?

nr
SQLTeam MVY

12543 Posts

Posted - 2005-07-18 : 15:56:08

Don't know what the article said but have a look at
http://www.mindsdoor.net/SQLTsql/FindGapsInSequence.html

==========================================
Cursors are useful if you don't know sql.
DTS can be used in a similar way.
Beer is not cold and it isn't fizzy.

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources