Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2000 Forums
 Transact-SQL (2000)
 skip dupe records within n seconds of each other

Author  Topic 

blake
Starting Member

29 Posts

Posted - 2006-08-09 : 20:14:47
Hi,

We have an database that I need to query in order to report on, and one criteria is that duplicates that occur close enough to each other shouldn't be reported, but the other duplicates should.

I have a database of users who submit information and they provide their name, email, etc. with a dateadded column storing the timestamp of when it was added. I'd like to query for all the records but not include any where the name and email are the same and the dateadded time is less than 10 seconds apart, for example.

So if the database contained:


name dateadded
------------ --------------------
John Smith Aug 9, 2006 00:00:00
Neo Anderson Aug 9, 2006 00:00:03
Tom Jones Aug 9, 2006 00:00:05
John Smith Aug 9, 2006 00:00:07
Mary Jane Aug 9, 2006 00:00:09
John Smith Aug 9, 2006 00:00:11
Tom Jones Aug 9, 2006 00:00:14
Tom Jones Aug 9, 2006 00:00:25


Then the query should return:


name dateadded
------------ --------------------
John Smith Aug 9, 2006 00:00:00
Neo Anderson Aug 9, 2006 00:00:03
Tom Jones Aug 9, 2006 00:00:05
Mary Jane Aug 9, 2006 00:00:09
John Smith Aug 9, 2006 00:00:11
Tom Jones Aug 9, 2006 00:00:25


Some duplicates are there, just not the ones where the other entries from the same person are within 10 seconds of the last reported entry. It can be within 10 seconds of an omitted entry, however.

Any ideas if this is possible within a single query?

Thanks!

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2006-08-10 : 01:51:25
[code]-- prepare test data
declare @test table (name varchar(100), dateadded datetime)

insert @test
select 'John Smith', 'Aug 9, 2006 00:00:00' union all
select 'Neo Anderson', 'Aug 9, 2006 00:00:03' union all
select 'Tom Jones', 'Aug 9, 2006 00:00:05' union all
select 'John Smith ', 'Aug 9, 2006 00:00:07' union all
select 'Mary Jane', 'Aug 9, 2006 00:00:09' union all
select 'John Smith', 'Aug 9, 2006 00:00:11' union all
select 'Tom Jones', 'Aug 9, 2006 00:00:14' union all
select 'Tom Jones', 'Aug 9, 2006 00:00:25'

-- do the work
select t1.name,
t2.dateadded
from @test t1
inner join @test t2 on t2.name = t1.name
where datediff(second, t1.dateadded, t2.dateadded) > 10
union
select t3.name,
min(t3.dateadded)
from @test t3
group by t3.name
order by 2,
1[/code]Writing the query like this, makes it really easy for you to write a function that takes the 10 second as a parameter and compare to that instead.


Peter Larsson
Helsingborg, Sweden
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2006-08-10 : 01:57:57
[code]CREATE FUNCTION dbo.fnGetNameList
(
@SecondsApart INT
)
RETURNS @Table TABLE (Name VARCHAR(100), DateAdded DATETIME)
AS

BEGIN
INSERT @Table
(
Name,
DateAdded
)
select t1.name,
t2.dateadded
from yourtable t1
inner join yourtable t2 on t2.name = t1.name
where datediff(second, t1.dateadded, t2.dateadded) > @SecondsApart
union
select t3.name,
min(t3.dateadded)
from yourtable t3
group by t3.name
order by 2,
1

RETURN
END[/code]Call with SELECT * FROM dbo.fnNameList(10), SELECT * FROM dbo.fnNameList(15) or any integer value.


Peter Larsson
Helsingborg, Sweden
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2006-08-10 : 08:44:41
I have found a problem with this approach. If Tom Jones is 00:00:00, 00:00:14 and 00:00:15, 00:00:00 and 00:00:14 is valid, not 00:00:15.

Peter Larsson
Helsingborg, Sweden
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2006-08-10 : 14:02:37
This is the code i came up with. It is no "one-liner" but it covers all bases.
-- prepare test data
declare @test table (name varchar(100), dateadded datetime)

insert @test
select 'John Smith', 'Aug 9, 2006 00:00:00' union all
select 'Neo Anderson', 'Aug 9, 2006 00:00:03' union all
select 'Tom Jones', 'Aug 9, 2006 00:00:05' union all
select 'John Smith ', 'Aug 9, 2006 00:00:07' union all
select 'Mary Jane', 'Aug 9, 2006 00:00:09' union all
select 'John Smith', 'Aug 9, 2006 00:00:11' union all
select 'Tom Jones', 'Aug 9, 2006 00:00:14' union all
select 'Tom Jones', 'Aug 9, 2006 00:00:25'

-- do the work
declare @output table (generation int, name varchar(100), dateadded datetime)

insert @output
select 0,
name,
min(dateadded)
from @test
group by name

while @@rowcount > 0
insert @output
select 1 + max(o.generation),
t.name,
min(t.dateadded)
from @output o
inner join @test t on t.name = o.name and t.dateadded >= dateadd(second, 10, o.dateadded)
where o.generation = (select max(o.generation) from @output o)
group by t.name

-- show the output
select o.name,
o.dateadded
from @output o
order by o.dateadded,
o.name


Peter Larsson
Helsingborg, Sweden
Go to Top of Page

jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2006-08-10 : 16:13:06
Wouldn't this work?


select *
from @test a
where not exists (select * from @test b where a.name = b.name and b.dateadded > dateadd(s,-10,a.dateadded) and b.dateadded < a.dateadded)

which translates to:

"return all rows except for those where a matching name exists within 10 seconds before"

Keep in mind that your requirements may need to be clarified. For example, what happens if one person has 100 entries, all 9 seconds apart? Should only the first be returned? Or should every other one be skipped?

If the answer is only the first should be returned, then my solution above should work. If the answer is every other one, then you might want to re-work your requirements to be something like this:

"If a name appears more than once in a 10-second interval (0-9,10-19,20-29,30-39,40-49,50-59) only show it once."

In which case, you'd be Grouping by (Seconds/10):

select name, min(dateadded)
from @test
group by name, datediff(s,'1/1/2006',dateadded)/10

(note that your sample data doesn't work too well with that scenerio since all dup names are in seperate 10-second 'blocks' except for the first and second John Smith, which the above code filters out).

I hope this makes sense. the key is to define the alogorithm fully first, and then write the code.


- Jeff
Go to Top of Page

blake
Starting Member

29 Posts

Posted - 2006-08-10 : 21:13:05
Thanks Peter! I'm going to have to wrap my head around how your query works, but it seems to do the trick. I thought this would have required the use of cursors to iterate over each one and compare.

Thanks again!
Go to Top of Page

blake
Starting Member

29 Posts

Posted - 2006-08-10 : 21:24:48
quote:
Keep in mind that your requirements may need to be clarified. For example, what happens if one person has 100 entries, all 9 seconds apart? Should only the first be returned? Or should every other one be skipped?

If the answer is only the first should be returned, then my solution above should work. If the answer is every other one, then you might want to re-work your requirements to be something like this:

"If a name appears more than once in a 10-second interval (0-9,10-19,20-29,30-39,40-49,50-59) only show it once."

In which case, you'd be Grouping by (Seconds/10):

select name, min(dateadded)
from @test
group by name, datediff(s,'1/1/2006',dateadded)/10

(note that your sample data doesn't work too well with that scenerio since all dup names are in seperate 10-second 'blocks' except for the first and second John Smith, which the above code filters out).

I hope this makes sense. the key is to define the alogorithm fully first, and then write the code.


- Jeff



I suppose I wasn't clear enough. It's not the case where I can simply group by 10 second intervals and only output the first one. If Tom Jones is output at the 9 second mark, I don't want him also output at the 11 second mark even though that's in a new 10 second window. So the Tom Jones record gets output at the 9 second mark, and the next earliest time he should be output is at the 19 second mark.

If you can imagine all the records being from the same person, the algorithm is like this:

- while not at end:
- output current record
- skip ahead to first record 10+ seconds after the current one


Of course, there's all different people mixed in with that, not just one person.
Go to Top of Page
   

- Advertisement -