Please start any new threads on our new
site at https://forums.sqlteam.com. We've got lots of great SQL Server
experts to answer whatever question you can come up with.
Author |
Topic |
blake
Starting Member
29 Posts |
Posted - 2006-08-09 : 20:14:47
|
Hi,We have an database that I need to query in order to report on, and one criteria is that duplicates that occur close enough to each other shouldn't be reported, but the other duplicates should.I have a database of users who submit information and they provide their name, email, etc. with a dateadded column storing the timestamp of when it was added. I'd like to query for all the records but not include any where the name and email are the same and the dateadded time is less than 10 seconds apart, for example.So if the database contained:name dateadded------------ --------------------John Smith Aug 9, 2006 00:00:00Neo Anderson Aug 9, 2006 00:00:03Tom Jones Aug 9, 2006 00:00:05John Smith Aug 9, 2006 00:00:07Mary Jane Aug 9, 2006 00:00:09John Smith Aug 9, 2006 00:00:11Tom Jones Aug 9, 2006 00:00:14Tom Jones Aug 9, 2006 00:00:25 Then the query should return:name dateadded------------ --------------------John Smith Aug 9, 2006 00:00:00Neo Anderson Aug 9, 2006 00:00:03Tom Jones Aug 9, 2006 00:00:05Mary Jane Aug 9, 2006 00:00:09John Smith Aug 9, 2006 00:00:11Tom Jones Aug 9, 2006 00:00:25 Some duplicates are there, just not the ones where the other entries from the same person are within 10 seconds of the last reported entry. It can be within 10 seconds of an omitted entry, however.Any ideas if this is possible within a single query?Thanks! |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2006-08-10 : 01:51:25
|
[code]-- prepare test datadeclare @test table (name varchar(100), dateadded datetime)insert @testselect 'John Smith', 'Aug 9, 2006 00:00:00' union allselect 'Neo Anderson', 'Aug 9, 2006 00:00:03' union allselect 'Tom Jones', 'Aug 9, 2006 00:00:05' union allselect 'John Smith ', 'Aug 9, 2006 00:00:07' union allselect 'Mary Jane', 'Aug 9, 2006 00:00:09' union allselect 'John Smith', 'Aug 9, 2006 00:00:11' union allselect 'Tom Jones', 'Aug 9, 2006 00:00:14' union allselect 'Tom Jones', 'Aug 9, 2006 00:00:25'-- do the workselect t1.name, t2.dateaddedfrom @test t1inner join @test t2 on t2.name = t1.namewhere datediff(second, t1.dateadded, t2.dateadded) > 10unionselect t3.name, min(t3.dateadded)from @test t3group by t3.nameorder by 2, 1[/code]Writing the query like this, makes it really easy for you to write a function that takes the 10 second as a parameter and compare to that instead.Peter LarssonHelsingborg, Sweden |
 |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2006-08-10 : 01:57:57
|
[code]CREATE FUNCTION dbo.fnGetNameList( @SecondsApart INT)RETURNS @Table TABLE (Name VARCHAR(100), DateAdded DATETIME)ASBEGIN INSERT @Table ( Name, DateAdded ) select t1.name, t2.dateadded from yourtable t1 inner join yourtable t2 on t2.name = t1.name where datediff(second, t1.dateadded, t2.dateadded) > @SecondsApart union select t3.name, min(t3.dateadded) from yourtable t3 group by t3.name order by 2, 1 RETURNEND[/code]Call with SELECT * FROM dbo.fnNameList(10), SELECT * FROM dbo.fnNameList(15) or any integer value.Peter LarssonHelsingborg, Sweden |
 |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2006-08-10 : 08:44:41
|
I have found a problem with this approach. If Tom Jones is 00:00:00, 00:00:14 and 00:00:15, 00:00:00 and 00:00:14 is valid, not 00:00:15.Peter LarssonHelsingborg, Sweden |
 |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2006-08-10 : 14:02:37
|
This is the code i came up with. It is no "one-liner" but it covers all bases.-- prepare test datadeclare @test table (name varchar(100), dateadded datetime)insert @testselect 'John Smith', 'Aug 9, 2006 00:00:00' union allselect 'Neo Anderson', 'Aug 9, 2006 00:00:03' union allselect 'Tom Jones', 'Aug 9, 2006 00:00:05' union allselect 'John Smith ', 'Aug 9, 2006 00:00:07' union allselect 'Mary Jane', 'Aug 9, 2006 00:00:09' union allselect 'John Smith', 'Aug 9, 2006 00:00:11' union allselect 'Tom Jones', 'Aug 9, 2006 00:00:14' union allselect 'Tom Jones', 'Aug 9, 2006 00:00:25'-- do the workdeclare @output table (generation int, name varchar(100), dateadded datetime)insert @outputselect 0, name, min(dateadded)from @testgroup by namewhile @@rowcount > 0 insert @output select 1 + max(o.generation), t.name, min(t.dateadded) from @output o inner join @test t on t.name = o.name and t.dateadded >= dateadd(second, 10, o.dateadded) where o.generation = (select max(o.generation) from @output o) group by t.name-- show the outputselect o.name, o.dateaddedfrom @output oorder by o.dateadded, o.name Peter LarssonHelsingborg, Sweden |
 |
|
jsmith8858
Dr. Cross Join
7423 Posts |
Posted - 2006-08-10 : 16:13:06
|
Wouldn't this work?select * from @test awhere not exists (select * from @test b where a.name = b.name and b.dateadded > dateadd(s,-10,a.dateadded) and b.dateadded < a.dateadded)which translates to:"return all rows except for those where a matching name exists within 10 seconds before"Keep in mind that your requirements may need to be clarified. For example, what happens if one person has 100 entries, all 9 seconds apart? Should only the first be returned? Or should every other one be skipped?If the answer is only the first should be returned, then my solution above should work. If the answer is every other one, then you might want to re-work your requirements to be something like this:"If a name appears more than once in a 10-second interval (0-9,10-19,20-29,30-39,40-49,50-59) only show it once."In which case, you'd be Grouping by (Seconds/10):select name, min(dateadded)from @testgroup by name, datediff(s,'1/1/2006',dateadded)/10(note that your sample data doesn't work too well with that scenerio since all dup names are in seperate 10-second 'blocks' except for the first and second John Smith, which the above code filters out).I hope this makes sense. the key is to define the alogorithm fully first, and then write the code.- Jeff |
 |
|
blake
Starting Member
29 Posts |
Posted - 2006-08-10 : 21:13:05
|
Thanks Peter! I'm going to have to wrap my head around how your query works, but it seems to do the trick. I thought this would have required the use of cursors to iterate over each one and compare.Thanks again! |
 |
|
blake
Starting Member
29 Posts |
Posted - 2006-08-10 : 21:24:48
|
quote: Keep in mind that your requirements may need to be clarified. For example, what happens if one person has 100 entries, all 9 seconds apart? Should only the first be returned? Or should every other one be skipped?If the answer is only the first should be returned, then my solution above should work. If the answer is every other one, then you might want to re-work your requirements to be something like this:"If a name appears more than once in a 10-second interval (0-9,10-19,20-29,30-39,40-49,50-59) only show it once."In which case, you'd be Grouping by (Seconds/10):select name, min(dateadded)from @testgroup by name, datediff(s,'1/1/2006',dateadded)/10(note that your sample data doesn't work too well with that scenerio since all dup names are in seperate 10-second 'blocks' except for the first and second John Smith, which the above code filters out).I hope this makes sense. the key is to define the alogorithm fully first, and then write the code.- Jeff
I suppose I wasn't clear enough. It's not the case where I can simply group by 10 second intervals and only output the first one. If Tom Jones is output at the 9 second mark, I don't want him also output at the 11 second mark even though that's in a new 10 second window. So the Tom Jones record gets output at the 9 second mark, and the next earliest time he should be output is at the 19 second mark.If you can imagine all the records being from the same person, the algorithm is like this:- while not at end: - output current record - skip ahead to first record 10+ seconds after the current one Of course, there's all different people mixed in with that, not just one person. |
 |
|
|
|
|
|
|