Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2000 Forums
 Transact-SQL (2000)
 replace while loop with set based query

Author  Topic 

awolfe
Starting Member

8 Posts

Posted - 2007-04-13 : 13:58:22
I have the need to compare records in a table and combine records where possible.

The input would be as follows:

ID Start Date End Date
1111 01/01/2006 01/31/2006
1111 02/01/2006 12/31/2006
1111 04/01/2007 12/31/9999
2222 01/01/2007 12/31/9999
3333 01/01/2007 03/31/2007
4444 01/01/2006 03/31/2006
4444 06/01/2006 09/30/2006
4444 01/01/2007 12/31/9999
5555 01/01/2006 05/30/2006
5555 06/01/2006 12/31/2006
5555 01/01/2006 12/31/9999

I need to combine records where the end date of record 1 is one day before the start date of record 2. The date of 12/31/9999 means no End Date. So the output to the above would be as follows:

ID Start Date End Date
1111 01/01/2006 12/31/2006
1111 04/01/2007 12/31/9999
2222 01/01/2007 12/31/9999
3333 01/01/2007 03/31/2007
4444 01/01/2006 03/31/2006
4444 06/01/2006 09/30/2006
4444 01/01/2007 12/31/9999
5555 01/01/2006 12/31/9999

Currently I am using a while loop to compare the current record and previous record and combine when possible. Can anyone suggest a set based solution?

Thank you for your help

dinakar
Master Smack Fu Yak Hacker

2507 Posts

Posted - 2007-04-13 : 14:33:27
Question already answered here: http://www.sql-server-performance.com/forum/topic.asp?TOPIC_ID=20709


************************
Life is short. Enjoy it.
************************
http://weblogs.sqlteam.com/dinakar/
Go to Top of Page

awolfe
Starting Member

8 Posts

Posted - 2007-04-13 : 14:51:25
The solution suggested there is not working.
Go to Top of Page

jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2007-04-13 : 15:05:38
if your data looks like this, what should be returned:

ID Start Date End Date
1111 01/01/2006 01/31/2006
1111 02/01/2006 03/31/2006
1111 04/01/2007 12/31/9999

??

Also, can it ever be like this?

ID Start Date End Date
1111 01/01/2006 02/31/2006
1111 02/01/2006 03/31/2006
1111 04/01/2007 12/31/9999

(note that the first two rows overlap)

What happens then?

Finally, what is the primary key of this data? Can there be two rows with same ID/StartDate? What happens then?

You are not getting good answers because your specification are not clearly explained. And it doesn't help to simply say "it isn't working" without explaining why. Do you drop your car off at the mechanic and simply say "it isn't working" and leave it at that and expect him to fix it?


- Jeff
http://weblogs.sqlteam.com/JeffS
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-04-13 : 15:22:30
Try this!
First install the function found here http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=61519
-- Prepare sample data
DECLARE @Sample TABLE (ID INT, StartDate DATETIME, EndDate DATETIME)

INSERT @Sample
SELECT 1111, '01/01/2006', '01/31/2006' UNION ALL
SELECT 1111, '02/01/2006', '12/31/2006' UNION ALL
SELECT 1111, '04/01/2007', '12/31/9999' UNION ALL
SELECT 2222, '01/01/2007', '12/31/9999' UNION ALL
SELECT 3333, '01/01/2007', '03/31/2007' UNION ALL
SELECT 4444, '01/01/2006', '03/31/2006' UNION ALL
SELECT 4444, '06/01/2006', '09/30/2006' UNION ALL
SELECT 4444, '01/01/2007', '12/31/9999' UNION ALL
SELECT 5555, '01/01/2006', '05/30/2006' UNION ALL
SELECT 5555, '06/01/2006', '12/31/2006' UNION ALL
SELECT 5555, '01/01/2006', '12/31/9999'

-- Stage the data
DECLARE @Stage TABLE (RecID INT IDENTITY(1, 1) PRIMARY KEY, ID INT, theDate DATETIME, grp INT)

DECLARE @Control TABLE (ID INT, StartDate DATETIME, EndDate DATETIME)

INSERT @Control
SELECT ID,
MIN(theDate),
MAX(theDate)
FROM (
SELECT ID, MIN(StartDate) AS theDate FROM @Sample WHERE StartDate < '99991231' GROUP BY ID UNION ALL
SELECT ID, MAX(StartDate) FROM @Sample WHERE StartDate < '99991231' GROUP BY ID UNION ALL
SELECT ID, MIN(EndDate) FROM @Sample WHERE EndDate < '99991231' GROUP BY ID UNION ALL
SELECT ID, MAX(EndDate) FROM @Sample WHERE EndDate < '99991231' GROUP BY ID
) AS x
GROUP BY ID

DECLARE @ID INT,
@MaxID INT,
@StartDate DATETIME,
@EndDate DATETIME

SELECT @ID = MIN(ID),
@MaxID = MAX(ID)
FROM @Control

WHILE @ID <= @MaxID
BEGIN
SELECT @StartDate = StartDate,
@EndDate = EndDate
FROM @Control
WHERE ID = @ID

INSERT @Stage
SELECT DISTINCT @ID,
x.DATE,
0
FROM F_TABLE_DATE(@StartDate, @EndDate) AS x
INNER JOIN @Sample AS s ON x.DATE BETWEEN s.StartDate AND s.EndDate AND s.ID = @ID
ORDER BY DATE

SELECT @ID = MIN(ID)
FROM @Control
WHERE ID > @ID
END

SELECT @EndDate = NULL,
@StartDate = NULL

UPDATE @Stage
SET @StartDate = theDate,
@ID = CASE
WHEN @EndDate IS NULL THEN 1
WHEN DATEDIFF(DAY, @EndDate, @StartDate) = 1 THEN @ID
ELSE @ID + 1
END,
grp = @ID,
@EndDate = @StartDate

INSERT @Stage
SELECT e.ID,
k.EndDate,
e.grp
FROM (
SELECT ID,
MAX(grp) AS grp
FROM @Stage
GROUP BY ID
) AS e
INNER JOIN (
SELECT ID,
EndDate
FROM @Sample
WHERE EndDate = '99991231'
) AS k ON k.ID = e.ID

-- Show the expected output
SELECT ID,
MIN(theDate) AS StartDate,
MAX(theDate) AS EndDate
FROM @Stage
GROUP BY ID,
grp
ORDER BY ID,
grp


Peter Larsson
Helsingborg, Sweden
Go to Top of Page

awolfe
Starting Member

8 Posts

Posted - 2007-04-13 : 16:08:19
Peter

Thank you for your suggestion. When I try your example I get the following message

Server: Msg 8101, Level 16, State 1, Line 50
An explicit value for the identity column in table '@Stage' can only be specified when a column list is used and IDENTITY_INSERT is ON.

I have tried to use the SET IDENTITY INSERT @Stage ON but to no avail. Can you help me get past this error so that I can test your suggestion.
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-04-13 : 16:10:04
<removed due to bug>

Use first suggestion.
Go to Top of Page

jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2007-04-13 : 16:17:41
it helps to clarify and define the specs before writing the code, you know ... try running your code for this data:

INSERT @Sample
SELECT 1111, '01/01/2006', '01/31/2006' UNION ALL
SELECT 1111, '02/01/2006', '03/31/2006' UNION ALL
SELECT 1111, '04/01/2007', '12/31/9999' UNION ALL
SELECT 2222, '01/01/2007', '12/31/9999' UNION ALL
SELECT 3333, '01/01/2007', '03/31/2007' UNION ALL
SELECT 4444, '01/01/2006', '03/31/2006' UNION ALL
SELECT 4444, '06/01/2006', '09/30/2006' UNION ALL
SELECT 4444, '01/01/2007', '12/31/9999' UNION ALL
SELECT 5555, '01/01/2006', '05/30/2006' UNION ALL
SELECT 5555, '06/01/2006', '12/31/2006' UNION ALL
SELECT 5555, '01/01/2006', '12/31/9999'


for 111, the second row is completely ignored. Now, maybe that's what you want to do (who knows? the specs are incomplete) but I doubt it.

Just some general advice: Gather and state the specifications and requirments CLEARLY first, and THEN write the code.

- Jeff
http://weblogs.sqlteam.com/JeffS
Go to Top of Page

awolfe
Starting Member

8 Posts

Posted - 2007-04-13 : 16:18:46
Jeff,

To answer the questions that you presented.

if your data looks like this, what should be returned:

ID Start Date End Date
1111 01/01/2006 01/31/2006
1111 02/01/2006 03/31/2006
1111 04/01/2007 12/31/9999


The output expected would be
ID Start Date End Date
1111 01/01/2006 03/31/2006
1111 04/01/2007 12/31/9999

??

Also, can it ever be like this?

Currently I have no records that would fit your example of overlapping dates.
ID Start Date End Date
1111 01/01/2006 02/31/2006
1111 02/01/2006 03/31/2006
1111 04/01/2007 12/31/9999

(note that the first two rows overlap)

What happens then? However, if this did happen in the data then the current process would provide the following output
ID Start Date End Date
1111 01/01/2006 03/31/2006
1111 04/01/2007 12/31/9999


Finally, what is the primary key of this data? Can there be two rows with same ID/StartDate? What happens then?
The primary key on this data is ID, Startdate and will not allow two rows to have the same ID/StartDate.

You are not getting good answers because your specification are not clearly explained. And it doesn't help to simply say "it isn't working" without explaining why. Do you drop your car off at the mechanic and simply say "it isn't working" and leave it at that and expect him to fix it?

The reason that I didn't provide more details when I said this wasn't working is because I provided those details on the other message board.


Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-04-13 : 16:21:22
<removed due to bug>

Use first suggestion.

awolfe, take this as a lesson.
Next time, provide sample for all possible combinations for sample/source data, so there will be little room for misunderstandings.


Peter Larsson
Helsingborg, Sweden
Go to Top of Page

awolfe
Starting Member

8 Posts

Posted - 2007-04-13 : 16:25:49
To further clarify my original question. I am trying to change some existing code. The code was first built using a cursor which looped through all of the records in a table and combined records where possible. It was requested that the original code be changed to remove the cursor. This was because the input file has over 12 million records and the code was taking to long to run. So the code was then changed to use a while loop instead of a cursor. There were also some additional modifications that were made including the addition of indexes to some tables. We are now being requested to remove the while loop from the code if possible. This is why I am trying to determine if there is a set based solution or if a cursor or while loop is the only solution.

I do appreciate all of the suggestions I am receiving.

Thank you.
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-04-13 : 16:29:54
Depending on the quality of your source data, a SET based solution might not be the best or fastest solution with whis problem.
Also if you change the @Stage table variable to #Stage temporary table and add a proper covering index over ID, DateStart and DateEnd, this might be the fastest and most complete solution.


Peter Larsson
Helsingborg, Sweden
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-04-13 : 16:36:16
<removed due to bug>

Use first suggestion.
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-04-13 : 16:39:27
How long do the suggestion right above take on your 12 million records?
How long time did it take before?


Peter Larsson
Helsingborg, Sweden
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-04-13 : 18:19:52
This code takes 105 seconds for 1,000 sample records. And 791 seconds for 100,000 records.
Which means 100 times more records and 8 times more in time. Seems to scale well.
-- Stage the data
DECLARE @Stage TABLE (RecID INT IDENTITY(1, 1) PRIMARY KEY, ID INT, theDate DATETIME, grp INT)

DECLARE @Control TABLE (ID INT, StartDate DATETIME, EndDate DATETIME)

INSERT @Control (ID, StartDate, EndDate)
SELECT ID,
MIN(theDate),
MAX(theDate)
FROM (
SELECT ID, MIN(StartDate) AS theDate FROM @Sample WHERE StartDate < '99991231' GROUP BY ID UNION ALL
SELECT ID, MAX(StartDate) FROM @Sample WHERE StartDate < '99991231' GROUP BY ID UNION ALL
SELECT ID, MIN(EndDate) FROM @Sample WHERE EndDate < '99991231' GROUP BY ID UNION ALL
SELECT ID, MAX(EndDate) FROM @Sample WHERE EndDate < '99991231' GROUP BY ID
) AS x
GROUP BY ID

DECLARE @ID INT,
@MaxID INT,
@StartDate DATETIME,
@EndDate DATETIME

SELECT @ID = MIN(ID),
@MaxID = MAX(ID)
FROM @Control

WHILE @ID <= @MaxID
BEGIN

SELECT @StartDate = StartDate,
@EndDate = EndDate
FROM @Control
WHERE ID = @ID

INSERT @Stage (ID, theDate, grp)
SELECT @ID,
x.DATE,
0
FROM @Sample AS s
INNER JOIN F_TABLE_DATE(@StartDate, @EndDate) AS x ON x.DATE BETWEEN s.StartDate AND s.EndDate
WHERE s.ID = @ID
GROUP BY x.DATE
ORDER BY x.DATE

SELECT @ID = MIN(ID)
FROM @Control
WHERE ID > @ID
END

SELECT @EndDate = NULL,
@StartDate = NULL

UPDATE @Stage
SET @StartDate = theDate,
@ID = CASE
WHEN @EndDate IS NULL THEN 1
WHEN DATEDIFF(DAY, @EndDate, @StartDate) = 1 THEN @ID
ELSE @ID + 1
END,
grp = @ID,
@EndDate = @StartDate

INSERT @Stage
SELECT e.ID,
k.EndDate,
e.grp
FROM (
SELECT ID,
MAX(grp) AS grp
FROM @Stage
GROUP BY ID
) AS e
INNER JOIN (
SELECT ID,
EndDate
FROM @Sample
WHERE EndDate = '99991231'
) AS k ON k.ID = e.ID

-- Show the expected output
SELECT ID,
MIN(theDate) AS StartDate,
MAX(theDate) AS EndDate
FROM @Stage
GROUP BY ID,
grp
ORDER BY ID,
grp


Peter Larsson
Helsingborg, Sweden
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-04-14 : 10:56:04
This code should be fast!

My tests show that

100,000 records take 6 seconds
1,000,000 records take 41 seconds
Here is the code
-- Prepare sample data
SET DATEFORMAT MDY

DECLARE @Sample TABLE
(
ID INT,
StartDate DATETIME,
EndDate DATETIME
)

INSERT @Sample
(
ID,
StartDate,
EndDate
)
SELECT 1111, '01/01/2006', '01/31/2006' UNION ALL
SELECT 1111, '02/01/2006', '12/31/2006' UNION ALL
SELECT 1111, '04/01/2007', '12/31/9999' UNION ALL
SELECT 2222, '01/01/2007', '12/31/9999' UNION ALL
SELECT 3333, '01/01/2007', '03/31/2007' UNION ALL
SELECT 4444, '01/01/2006', '03/31/2006' UNION ALL
SELECT 4444, '06/01/2006', '09/30/2006' UNION ALL
SELECT 4444, '01/01/2007', '12/31/9999' UNION ALL
SELECT 5555, '01/01/2006', '05/30/2006' UNION ALL
SELECT 5555, '06/01/2006', '12/31/2006' UNION ALL
SELECT 5555, '01/01/2006', '12/31/9999'

-- Stage the data
DECLARE curStage CURSOR FORWARD_ONLY READ_ONLY FOR
SELECT ID,
CASE
WHEN StartDate <= EndDate THEN StartDate
ELSE EndDate
END,
CASE
WHEN StartDate <= EndDate THEN EndDate
ELSE StartDate
END
FROM @Sample
ORDER BY ID,
2,
3 DESC

DECLARE @Stage TABLE
(
ID INT,
StartDate DATETIME,
EndDate DATETIME
)

DECLARE @LastID INT,
@CurrID INT,
@LastStartDate DATETIME,
@CurrStartDate DATETIME,
@LastEndDate DATETIME,
@CurrEndDate DATETIME

OPEN curStage

FETCH NEXT
FROM curStage
INTO @CurrID,
@CurrStartDate,
@CurrEndDate

SELECT @LastID = @CurrID,
@LastStartDate = @CurrStartDate,
@LastEndDate = @CurrEndDate

WHILE @@FETCH_STATUS = 0
BEGIN
IF @LastID <> @CurrID OR @CurrStartDate > DATEADD(DAY, CASE WHEN @LastEndDate = '99991231' THEN 0 ELSE 1 END, @LastEndDate)
BEGIN
INSERT @Stage
(
ID,
StartDate,
EndDate
)
VALUES (
@LastID,
@LastStartDate,
@LastEndDate
)

SELECT @LastID = @CurrID,
@LastStartDate = @CurrStartDate,
@LastEndDate = @CurrEndDate
END
ELSE
IF @CurrEndDate > @LastEndDate
SET @LastEndDate = @CurrEndDate

FETCH NEXT
FROM curStage
INTO @CurrID,
@CurrStartDate,
@CurrEndDate
END

CLOSE curStage
DEALLOCATE curStage

INSERT @Stage
(
ID,
StartDate,
EndDate
)
VALUES (
@LastID,
@LastStartDate,
@LastEndDate
)

-- Show the expected output
SELECT ID,
StartDate,
EndDate
FROM @Stage
ORDER BY ID,
StartDate


Peter Larsson
Helsingborg, Sweden
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-04-15 : 07:24:42
With the method posted above, I got these result with a table of 12 million sample records

Test # Index Constraint Time
------ ---------------------------------------------- ------------------------------ ----
1 No index at all, records randimized Allow EndDate < StartDate 5:39
2 Clustered (ID, StartDate) Allow EndDate < StartDate 5:13
3 Clustered (ID, StartDate) Always StartDate <= EndDate 5:27
4 Clustered (ID, StartDate, EndDate DESC) Always StartDate <= EndDate 4:30
The result seems to take about 5 minutes, regardless of index and sample data set up, for 12 million records.


Peter Larsson
Helsingborg, Sweden
Go to Top of Page

awolfe
Starting Member

8 Posts

Posted - 2007-04-16 : 11:22:58
Peter,

The solution that you provided on 04/14/2007 at 10:56:04 works. It is providing much better results than my current solution. Thank you for all of your help.
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-04-16 : 11:27:43
Thanks!

What is the time difference? The code above takes what seconds to run, and your old code took what seconds to run?


Peter Larsson
Helsingborg, Sweden
Go to Top of Page

awolfe
Starting Member

8 Posts

Posted - 2007-04-16 : 11:30:54
I am unable to provide exact numbers for you because of the development environment that we have. The best information that I can provide is that the old way would take at least 6 hours to run and your solution took 1 hour and 21 minutes. Sorry I cannot provide more detailed numbers for you.
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-04-16 : 11:35:43
Well, it is at least 4 times better.


Peter Larsson
Helsingborg, Sweden
Go to Top of Page
    Next Page

- Advertisement -