Please start any new threads on our new
site at https://forums.sqlteam.com. We've got lots of great SQL Server
experts to answer whatever question you can come up with.
Author |
Topic |
awolfe
Starting Member
8 Posts |
Posted - 2007-04-13 : 13:58:22
|
I have the need to compare records in a table and combine records where possible.The input would be as follows:ID Start Date End Date 1111 01/01/2006 01/31/20061111 02/01/2006 12/31/20061111 04/01/2007 12/31/99992222 01/01/2007 12/31/99993333 01/01/2007 03/31/20074444 01/01/2006 03/31/20064444 06/01/2006 09/30/20064444 01/01/2007 12/31/99995555 01/01/2006 05/30/20065555 06/01/2006 12/31/20065555 01/01/2006 12/31/9999I need to combine records where the end date of record 1 is one day before the start date of record 2. The date of 12/31/9999 means no End Date. So the output to the above would be as follows:ID Start Date End Date 1111 01/01/2006 12/31/20061111 04/01/2007 12/31/99992222 01/01/2007 12/31/99993333 01/01/2007 03/31/20074444 01/01/2006 03/31/20064444 06/01/2006 09/30/20064444 01/01/2007 12/31/99995555 01/01/2006 12/31/9999Currently I am using a while loop to compare the current record and previous record and combine when possible. Can anyone suggest a set based solution?Thank you for your help |
|
dinakar
Master Smack Fu Yak Hacker
2507 Posts |
|
awolfe
Starting Member
8 Posts |
Posted - 2007-04-13 : 14:51:25
|
The solution suggested there is not working. |
 |
|
jsmith8858
Dr. Cross Join
7423 Posts |
Posted - 2007-04-13 : 15:05:38
|
if your data looks like this, what should be returned:ID Start Date End Date1111 01/01/2006 01/31/20061111 02/01/2006 03/31/20061111 04/01/2007 12/31/9999??Also, can it ever be like this?ID Start Date End Date1111 01/01/2006 02/31/20061111 02/01/2006 03/31/20061111 04/01/2007 12/31/9999(note that the first two rows overlap)What happens then?Finally, what is the primary key of this data? Can there be two rows with same ID/StartDate? What happens then?You are not getting good answers because your specification are not clearly explained. And it doesn't help to simply say "it isn't working" without explaining why. Do you drop your car off at the mechanic and simply say "it isn't working" and leave it at that and expect him to fix it?- Jeffhttp://weblogs.sqlteam.com/JeffS |
 |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2007-04-13 : 15:22:30
|
Try this!First install the function found here http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=61519-- Prepare sample dataDECLARE @Sample TABLE (ID INT, StartDate DATETIME, EndDate DATETIME)INSERT @SampleSELECT 1111, '01/01/2006', '01/31/2006' UNION ALLSELECT 1111, '02/01/2006', '12/31/2006' UNION ALLSELECT 1111, '04/01/2007', '12/31/9999' UNION ALLSELECT 2222, '01/01/2007', '12/31/9999' UNION ALLSELECT 3333, '01/01/2007', '03/31/2007' UNION ALLSELECT 4444, '01/01/2006', '03/31/2006' UNION ALLSELECT 4444, '06/01/2006', '09/30/2006' UNION ALLSELECT 4444, '01/01/2007', '12/31/9999' UNION ALLSELECT 5555, '01/01/2006', '05/30/2006' UNION ALLSELECT 5555, '06/01/2006', '12/31/2006' UNION ALLSELECT 5555, '01/01/2006', '12/31/9999'-- Stage the dataDECLARE @Stage TABLE (RecID INT IDENTITY(1, 1) PRIMARY KEY, ID INT, theDate DATETIME, grp INT)DECLARE @Control TABLE (ID INT, StartDate DATETIME, EndDate DATETIME)INSERT @ControlSELECT ID, MIN(theDate), MAX(theDate)FROM ( SELECT ID, MIN(StartDate) AS theDate FROM @Sample WHERE StartDate < '99991231' GROUP BY ID UNION ALL SELECT ID, MAX(StartDate) FROM @Sample WHERE StartDate < '99991231' GROUP BY ID UNION ALL SELECT ID, MIN(EndDate) FROM @Sample WHERE EndDate < '99991231' GROUP BY ID UNION ALL SELECT ID, MAX(EndDate) FROM @Sample WHERE EndDate < '99991231' GROUP BY ID ) AS xGROUP BY IDDECLARE @ID INT, @MaxID INT, @StartDate DATETIME, @EndDate DATETIMESELECT @ID = MIN(ID), @MaxID = MAX(ID)FROM @ControlWHILE @ID <= @MaxID BEGIN SELECT @StartDate = StartDate, @EndDate = EndDate FROM @Control WHERE ID = @ID INSERT @Stage SELECT DISTINCT @ID, x.DATE, 0 FROM F_TABLE_DATE(@StartDate, @EndDate) AS x INNER JOIN @Sample AS s ON x.DATE BETWEEN s.StartDate AND s.EndDate AND s.ID = @ID ORDER BY DATE SELECT @ID = MIN(ID) FROM @Control WHERE ID > @ID ENDSELECT @EndDate = NULL, @StartDate = NULLUPDATE @StageSET @StartDate = theDate, @ID = CASE WHEN @EndDate IS NULL THEN 1 WHEN DATEDIFF(DAY, @EndDate, @StartDate) = 1 THEN @ID ELSE @ID + 1 END, grp = @ID, @EndDate = @StartDateINSERT @StageSELECT e.ID, k.EndDate, e.grpFROM ( SELECT ID, MAX(grp) AS grp FROM @Stage GROUP BY ID ) AS eINNER JOIN ( SELECT ID, EndDate FROM @Sample WHERE EndDate = '99991231' ) AS k ON k.ID = e.ID-- Show the expected outputSELECT ID, MIN(theDate) AS StartDate, MAX(theDate) AS EndDateFROM @StageGROUP BY ID, grpORDER BY ID, grp Peter LarssonHelsingborg, Sweden |
 |
|
awolfe
Starting Member
8 Posts |
Posted - 2007-04-13 : 16:08:19
|
PeterThank you for your suggestion. When I try your example I get the following messageServer: Msg 8101, Level 16, State 1, Line 50An explicit value for the identity column in table '@Stage' can only be specified when a column list is used and IDENTITY_INSERT is ON.I have tried to use the SET IDENTITY INSERT @Stage ON but to no avail. Can you help me get past this error so that I can test your suggestion. |
 |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2007-04-13 : 16:10:04
|
<removed due to bug>Use first suggestion. |
 |
|
jsmith8858
Dr. Cross Join
7423 Posts |
Posted - 2007-04-13 : 16:17:41
|
it helps to clarify and define the specs before writing the code, you know ... try running your code for this data:INSERT @SampleSELECT 1111, '01/01/2006', '01/31/2006' UNION ALLSELECT 1111, '02/01/2006', '03/31/2006' UNION ALLSELECT 1111, '04/01/2007', '12/31/9999' UNION ALLSELECT 2222, '01/01/2007', '12/31/9999' UNION ALLSELECT 3333, '01/01/2007', '03/31/2007' UNION ALLSELECT 4444, '01/01/2006', '03/31/2006' UNION ALLSELECT 4444, '06/01/2006', '09/30/2006' UNION ALLSELECT 4444, '01/01/2007', '12/31/9999' UNION ALLSELECT 5555, '01/01/2006', '05/30/2006' UNION ALLSELECT 5555, '06/01/2006', '12/31/2006' UNION ALLSELECT 5555, '01/01/2006', '12/31/9999'for 111, the second row is completely ignored. Now, maybe that's what you want to do (who knows? the specs are incomplete) but I doubt it.Just some general advice: Gather and state the specifications and requirments CLEARLY first, and THEN write the code.- Jeffhttp://weblogs.sqlteam.com/JeffS |
 |
|
awolfe
Starting Member
8 Posts |
Posted - 2007-04-13 : 16:18:46
|
Jeff,To answer the questions that you presented. if your data looks like this, what should be returned:ID Start Date End Date1111 01/01/2006 01/31/20061111 02/01/2006 03/31/20061111 04/01/2007 12/31/9999The output expected would beID Start Date End Date1111 01/01/2006 03/31/20061111 04/01/2007 12/31/9999??Also, can it ever be like this?Currently I have no records that would fit your example of overlapping dates.ID Start Date End Date1111 01/01/2006 02/31/20061111 02/01/2006 03/31/20061111 04/01/2007 12/31/9999(note that the first two rows overlap)What happens then? However, if this did happen in the data then the current process would provide the following outputID Start Date End Date1111 01/01/2006 03/31/20061111 04/01/2007 12/31/9999Finally, what is the primary key of this data? Can there be two rows with same ID/StartDate? What happens then?The primary key on this data is ID, Startdate and will not allow two rows to have the same ID/StartDate.You are not getting good answers because your specification are not clearly explained. And it doesn't help to simply say "it isn't working" without explaining why. Do you drop your car off at the mechanic and simply say "it isn't working" and leave it at that and expect him to fix it?The reason that I didn't provide more details when I said this wasn't working is because I provided those details on the other message board. |
 |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2007-04-13 : 16:21:22
|
<removed due to bug>Use first suggestion.awolfe, take this as a lesson.Next time, provide sample for all possible combinations for sample/source data, so there will be little room for misunderstandings.Peter LarssonHelsingborg, Sweden |
 |
|
awolfe
Starting Member
8 Posts |
Posted - 2007-04-13 : 16:25:49
|
To further clarify my original question. I am trying to change some existing code. The code was first built using a cursor which looped through all of the records in a table and combined records where possible. It was requested that the original code be changed to remove the cursor. This was because the input file has over 12 million records and the code was taking to long to run. So the code was then changed to use a while loop instead of a cursor. There were also some additional modifications that were made including the addition of indexes to some tables. We are now being requested to remove the while loop from the code if possible. This is why I am trying to determine if there is a set based solution or if a cursor or while loop is the only solution.I do appreciate all of the suggestions I am receiving.Thank you. |
 |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2007-04-13 : 16:29:54
|
Depending on the quality of your source data, a SET based solution might not be the best or fastest solution with whis problem.Also if you change the @Stage table variable to #Stage temporary table and add a proper covering index over ID, DateStart and DateEnd, this might be the fastest and most complete solution.Peter LarssonHelsingborg, Sweden |
 |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2007-04-13 : 16:36:16
|
<removed due to bug>Use first suggestion. |
 |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2007-04-13 : 16:39:27
|
How long do the suggestion right above take on your 12 million records?How long time did it take before?Peter LarssonHelsingborg, Sweden |
 |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2007-04-13 : 18:19:52
|
This code takes 105 seconds for 1,000 sample records. And 791 seconds for 100,000 records.Which means 100 times more records and 8 times more in time. Seems to scale well.-- Stage the dataDECLARE @Stage TABLE (RecID INT IDENTITY(1, 1) PRIMARY KEY, ID INT, theDate DATETIME, grp INT)DECLARE @Control TABLE (ID INT, StartDate DATETIME, EndDate DATETIME)INSERT @Control (ID, StartDate, EndDate)SELECT ID, MIN(theDate), MAX(theDate)FROM ( SELECT ID, MIN(StartDate) AS theDate FROM @Sample WHERE StartDate < '99991231' GROUP BY ID UNION ALL SELECT ID, MAX(StartDate) FROM @Sample WHERE StartDate < '99991231' GROUP BY ID UNION ALL SELECT ID, MIN(EndDate) FROM @Sample WHERE EndDate < '99991231' GROUP BY ID UNION ALL SELECT ID, MAX(EndDate) FROM @Sample WHERE EndDate < '99991231' GROUP BY ID ) AS xGROUP BY IDDECLARE @ID INT, @MaxID INT, @StartDate DATETIME, @EndDate DATETIMESELECT @ID = MIN(ID), @MaxID = MAX(ID)FROM @ControlWHILE @ID <= @MaxID BEGIN SELECT @StartDate = StartDate, @EndDate = EndDate FROM @Control WHERE ID = @ID INSERT @Stage (ID, theDate, grp) SELECT @ID, x.DATE, 0 FROM @Sample AS s INNER JOIN F_TABLE_DATE(@StartDate, @EndDate) AS x ON x.DATE BETWEEN s.StartDate AND s.EndDate WHERE s.ID = @ID GROUP BY x.DATE ORDER BY x.DATE SELECT @ID = MIN(ID) FROM @Control WHERE ID > @ID ENDSELECT @EndDate = NULL, @StartDate = NULLUPDATE @StageSET @StartDate = theDate, @ID = CASE WHEN @EndDate IS NULL THEN 1 WHEN DATEDIFF(DAY, @EndDate, @StartDate) = 1 THEN @ID ELSE @ID + 1 END, grp = @ID, @EndDate = @StartDateINSERT @StageSELECT e.ID, k.EndDate, e.grpFROM ( SELECT ID, MAX(grp) AS grp FROM @Stage GROUP BY ID ) AS eINNER JOIN ( SELECT ID, EndDate FROM @Sample WHERE EndDate = '99991231' ) AS k ON k.ID = e.ID-- Show the expected outputSELECT ID, MIN(theDate) AS StartDate, MAX(theDate) AS EndDateFROM @StageGROUP BY ID, grpORDER BY ID, grp Peter LarssonHelsingborg, Sweden |
 |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2007-04-14 : 10:56:04
|
This code should be fast!My tests show that 100,000 records take 6 seconds1,000,000 records take 41 seconds Here is the code-- Prepare sample dataSET DATEFORMAT MDYDECLARE @Sample TABLE ( ID INT, StartDate DATETIME, EndDate DATETIME )INSERT @Sample ( ID, StartDate, EndDate )SELECT 1111, '01/01/2006', '01/31/2006' UNION ALLSELECT 1111, '02/01/2006', '12/31/2006' UNION ALLSELECT 1111, '04/01/2007', '12/31/9999' UNION ALLSELECT 2222, '01/01/2007', '12/31/9999' UNION ALLSELECT 3333, '01/01/2007', '03/31/2007' UNION ALLSELECT 4444, '01/01/2006', '03/31/2006' UNION ALLSELECT 4444, '06/01/2006', '09/30/2006' UNION ALLSELECT 4444, '01/01/2007', '12/31/9999' UNION ALLSELECT 5555, '01/01/2006', '05/30/2006' UNION ALLSELECT 5555, '06/01/2006', '12/31/2006' UNION ALLSELECT 5555, '01/01/2006', '12/31/9999'-- Stage the dataDECLARE curStage CURSOR FORWARD_ONLY READ_ONLY FOR SELECT ID, CASE WHEN StartDate <= EndDate THEN StartDate ELSE EndDate END, CASE WHEN StartDate <= EndDate THEN EndDate ELSE StartDate END FROM @Sample ORDER BY ID, 2, 3 DESCDECLARE @Stage TABLE ( ID INT, StartDate DATETIME, EndDate DATETIME )DECLARE @LastID INT, @CurrID INT, @LastStartDate DATETIME, @CurrStartDate DATETIME, @LastEndDate DATETIME, @CurrEndDate DATETIMEOPEN curStageFETCH NEXTFROM curStageINTO @CurrID, @CurrStartDate, @CurrEndDateSELECT @LastID = @CurrID, @LastStartDate = @CurrStartDate, @LastEndDate = @CurrEndDateWHILE @@FETCH_STATUS = 0 BEGIN IF @LastID <> @CurrID OR @CurrStartDate > DATEADD(DAY, CASE WHEN @LastEndDate = '99991231' THEN 0 ELSE 1 END, @LastEndDate) BEGIN INSERT @Stage ( ID, StartDate, EndDate ) VALUES ( @LastID, @LastStartDate, @LastEndDate ) SELECT @LastID = @CurrID, @LastStartDate = @CurrStartDate, @LastEndDate = @CurrEndDate END ELSE IF @CurrEndDate > @LastEndDate SET @LastEndDate = @CurrEndDate FETCH NEXT FROM curStage INTO @CurrID, @CurrStartDate, @CurrEndDate ENDCLOSE curStageDEALLOCATE curStageINSERT @Stage ( ID, StartDate, EndDate )VALUES ( @LastID, @LastStartDate, @LastEndDate )-- Show the expected outputSELECT ID, StartDate, EndDateFROM @StageORDER BY ID, StartDate Peter LarssonHelsingborg, Sweden |
 |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2007-04-15 : 07:24:42
|
With the method posted above, I got these result with a table of 12 million sample recordsTest # Index Constraint Time------ ---------------------------------------------- ------------------------------ ----1 No index at all, records randimized Allow EndDate < StartDate 5:392 Clustered (ID, StartDate) Allow EndDate < StartDate 5:133 Clustered (ID, StartDate) Always StartDate <= EndDate 5:274 Clustered (ID, StartDate, EndDate DESC) Always StartDate <= EndDate 4:30 The result seems to take about 5 minutes, regardless of index and sample data set up, for 12 million records.Peter LarssonHelsingborg, Sweden |
 |
|
awolfe
Starting Member
8 Posts |
Posted - 2007-04-16 : 11:22:58
|
Peter,The solution that you provided on 04/14/2007 at 10:56:04 works. It is providing much better results than my current solution. Thank you for all of your help. |
 |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2007-04-16 : 11:27:43
|
Thanks!What is the time difference? The code above takes what seconds to run, and your old code took what seconds to run?Peter LarssonHelsingborg, Sweden |
 |
|
awolfe
Starting Member
8 Posts |
Posted - 2007-04-16 : 11:30:54
|
I am unable to provide exact numbers for you because of the development environment that we have. The best information that I can provide is that the old way would take at least 6 hours to run and your solution took 1 hour and 21 minutes. Sorry I cannot provide more detailed numbers for you. |
 |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2007-04-16 : 11:35:43
|
Well, it is at least 4 times better.Peter LarssonHelsingborg, Sweden |
 |
|
Next Page
|
|
|
|
|