replace while loop with set based query

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2000 Forums

Transact-SQL (2000)

replace while loop with set based query

Author

Topic

awolfe
Starting Member

8 Posts

Posted - 2007-04-13 : 13:58:22

I have the need to compare records in a table and combine records where possible.

The input would be as follows:

ID Start Date End Date
1111 01/01/2006 01/31/2006
1111 02/01/2006 12/31/2006
1111 04/01/2007 12/31/9999
2222 01/01/2007 12/31/9999
3333 01/01/2007 03/31/2007
4444 01/01/2006 03/31/2006
4444 06/01/2006 09/30/2006
4444 01/01/2007 12/31/9999
5555 01/01/2006 05/30/2006
5555 06/01/2006 12/31/2006
5555 01/01/2006 12/31/9999

I need to combine records where the end date of record 1 is one day before the start date of record 2. The date of 12/31/9999 means no End Date. So the output to the above would be as follows:

ID Start Date End Date
1111 01/01/2006 12/31/2006
1111 04/01/2007 12/31/9999
2222 01/01/2007 12/31/9999
3333 01/01/2007 03/31/2007
4444 01/01/2006 03/31/2006
4444 06/01/2006 09/30/2006
4444 01/01/2007 12/31/9999
5555 01/01/2006 12/31/9999

Currently I am using a while loop to compare the current record and previous record and combine when possible. Can anyone suggest a set based solution?

Thank you for your help

dinakar
Master Smack Fu Yak Hacker

2507 Posts

Posted - 2007-04-13 : 14:33:27

Question already answered here: http://www.sql-server-performance.com/forum/topic.asp?TOPIC_ID=20709

************************
Life is short. Enjoy it.
************************
http://weblogs.sqlteam.com/dinakar/

awolfe
Starting Member

8 Posts

Posted - 2007-04-13 : 14:51:25

The solution suggested there is not working.

jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2007-04-13 : 15:05:38

if your data looks like this, what should be returned:

ID Start Date End Date
1111 01/01/2006 01/31/2006
1111 02/01/2006 03/31/2006
1111 04/01/2007 12/31/9999

??

Also, can it ever be like this?

ID Start Date End Date
1111 01/01/2006 02/31/2006
1111 02/01/2006 03/31/2006
1111 04/01/2007 12/31/9999

(note that the first two rows overlap)

What happens then?

Finally, what is the primary key of this data? Can there be two rows with same ID/StartDate? What happens then?

You are not getting good answers because your specification are not clearly explained. And it doesn't help to simply say "it isn't working" without explaining why. Do you drop your car off at the mechanic and simply say "it isn't working" and leave it at that and expect him to fix it?

- Jeff
http://weblogs.sqlteam.com/JeffS

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-04-13 : 15:22:30

Try this!
First install the function found here http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=61519

-- Prepare sample data
DECLARE	@Sample TABLE (ID INT, StartDate DATETIME, EndDate DATETIME)

INSERT	@Sample
SELECT	1111, '01/01/2006', '01/31/2006' UNION ALL
SELECT	1111, '02/01/2006', '12/31/2006' UNION ALL
SELECT	1111, '04/01/2007', '12/31/9999' UNION ALL
SELECT	2222, '01/01/2007', '12/31/9999' UNION ALL
SELECT	3333, '01/01/2007', '03/31/2007' UNION ALL
SELECT	4444, '01/01/2006', '03/31/2006' UNION ALL
SELECT	4444, '06/01/2006', '09/30/2006' UNION ALL
SELECT	4444, '01/01/2007', '12/31/9999' UNION ALL
SELECT	5555, '01/01/2006', '05/30/2006' UNION ALL
SELECT	5555, '06/01/2006', '12/31/2006' UNION ALL
SELECT	5555, '01/01/2006', '12/31/9999'

-- Stage the data
DECLARE	@Stage TABLE (RecID INT IDENTITY(1, 1) PRIMARY KEY, ID INT, theDate DATETIME, grp INT)

DECLARE	@Control TABLE (ID INT, StartDate DATETIME, EndDate DATETIME)

INSERT		@Control
SELECT		ID,
		MIN(theDate),
		MAX(theDate)
FROM		(
			SELECT ID, MIN(StartDate) AS theDate FROM @Sample WHERE StartDate < '99991231' GROUP BY ID UNION ALL
			SELECT ID, MAX(StartDate) FROM @Sample WHERE StartDate < '99991231' GROUP BY ID UNION ALL
			SELECT ID, MIN(EndDate) FROM @Sample WHERE EndDate < '99991231' GROUP BY ID UNION ALL
			SELECT ID, MAX(EndDate) FROM @Sample WHERE EndDate < '99991231' GROUP BY ID
		) AS x
GROUP BY	ID

DECLARE	@ID INT,
	@MaxID INT,
	@StartDate DATETIME,
	@EndDate DATETIME

SELECT	@ID = MIN(ID),
	@MaxID = MAX(ID)
FROM	@Control

WHILE @ID <= @MaxID
	BEGIN
		SELECT	@StartDate = StartDate,
			@EndDate = EndDate
		FROM	@Control
		WHERE	ID = @ID

		INSERT		@Stage
		SELECT DISTINCT	@ID,
				x.DATE,
				0
		FROM		F_TABLE_DATE(@StartDate, @EndDate) AS x
		INNER JOIN	@Sample AS s ON x.DATE BETWEEN s.StartDate AND s.EndDate AND s.ID = @ID
		ORDER BY	DATE

		SELECT	@ID = MIN(ID)
		FROM	@Control
		WHERE	ID > @ID
	END

SELECT	@EndDate = NULL,
	@StartDate = NULL

UPDATE	@Stage
SET	@StartDate = theDate,
	@ID =	CASE
			WHEN @EndDate IS NULL THEN 1
			WHEN DATEDIFF(DAY, @EndDate, @StartDate) = 1 THEN @ID
			ELSE @ID + 1
		END,
	grp = @ID,
	@EndDate = @StartDate

INSERT		@Stage
SELECT		e.ID,
		k.EndDate,
		e.grp
FROM		(
			SELECT		ID,
					MAX(grp) AS grp
			FROM		@Stage
			GROUP BY	ID
		) AS e
INNER JOIN	(
			SELECT	ID,
				EndDate
			FROM	@Sample
			WHERE	EndDate = '99991231'
		) AS k ON k.ID = e.ID

-- Show the expected output
SELECT		ID,
		MIN(theDate) AS StartDate,
		MAX(theDate) AS EndDate
FROM		@Stage
GROUP BY	ID,
		grp
ORDER BY	ID,
		grp

Peter Larsson
Helsingborg, Sweden

awolfe
Starting Member

8 Posts

Posted - 2007-04-13 : 16:08:19

Peter

Thank you for your suggestion. When I try your example I get the following message

Server: Msg 8101, Level 16, State 1, Line 50
An explicit value for the identity column in table '@Stage' can only be specified when a column list is used and IDENTITY_INSERT is ON.

I have tried to use the SET IDENTITY INSERT @Stage ON but to no avail. Can you help me get past this error so that I can test your suggestion.

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-04-13 : 16:10:04

<removed due to bug>

Use first suggestion.

jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2007-04-13 : 16:17:41

it helps to clarify and define the specs before writing the code, you know ... try running your code for this data:

INSERT @Sample
SELECT 1111, '01/01/2006', '01/31/2006' UNION ALL
SELECT 1111, '02/01/2006', '03/31/2006' UNION ALL
SELECT 1111, '04/01/2007', '12/31/9999' UNION ALL
SELECT 2222, '01/01/2007', '12/31/9999' UNION ALL
SELECT 3333, '01/01/2007', '03/31/2007' UNION ALL
SELECT 4444, '01/01/2006', '03/31/2006' UNION ALL
SELECT 4444, '06/01/2006', '09/30/2006' UNION ALL
SELECT 4444, '01/01/2007', '12/31/9999' UNION ALL
SELECT 5555, '01/01/2006', '05/30/2006' UNION ALL
SELECT 5555, '06/01/2006', '12/31/2006' UNION ALL
SELECT 5555, '01/01/2006', '12/31/9999'

for 111, the second row is completely ignored. Now, maybe that's what you want to do (who knows? the specs are incomplete) but I doubt it.

Just some general advice: Gather and state the specifications and requirments CLEARLY first, and THEN write the code.

- Jeff
http://weblogs.sqlteam.com/JeffS

awolfe
Starting Member

8 Posts

Posted - 2007-04-13 : 16:18:46

Jeff,

To answer the questions that you presented.
if your data looks like this, what should be returned:

ID Start Date End Date
1111 01/01/2006 01/31/2006
1111 02/01/2006 03/31/2006
1111 04/01/2007 12/31/9999

The output expected would be
ID Start Date End Date
1111 01/01/2006 03/31/2006
1111 04/01/2007 12/31/9999

??

Also, can it ever be like this?

Currently I have no records that would fit your example of overlapping dates.
ID Start Date End Date
1111 01/01/2006 02/31/2006
1111 02/01/2006 03/31/2006
1111 04/01/2007 12/31/9999

(note that the first two rows overlap)

What happens then? However, if this did happen in the data then the current process would provide the following output
ID Start Date End Date
1111 01/01/2006 03/31/2006
1111 04/01/2007 12/31/9999

Finally, what is the primary key of this data? Can there be two rows with same ID/StartDate? What happens then?
The primary key on this data is ID, Startdate and will not allow two rows to have the same ID/StartDate.

You are not getting good answers because your specification are not clearly explained. And it doesn't help to simply say "it isn't working" without explaining why. Do you drop your car off at the mechanic and simply say "it isn't working" and leave it at that and expect him to fix it?

The reason that I didn't provide more details when I said this wasn't working is because I provided those details on the other message board.

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-04-13 : 16:21:22

<removed due to bug>

Use first suggestion.

awolfe, take this as a lesson.
Next time, provide sample for all possible combinations for sample/source data, so there will be little room for misunderstandings.

Peter Larsson
Helsingborg, Sweden

awolfe
Starting Member

8 Posts

Posted - 2007-04-13 : 16:25:49

To further clarify my original question. I am trying to change some existing code. The code was first built using a cursor which looped through all of the records in a table and combined records where possible. It was requested that the original code be changed to remove the cursor. This was because the input file has over 12 million records and the code was taking to long to run. So the code was then changed to use a while loop instead of a cursor. There were also some additional modifications that were made including the addition of indexes to some tables. We are now being requested to remove the while loop from the code if possible. This is why I am trying to determine if there is a set based solution or if a cursor or while loop is the only solution.

I do appreciate all of the suggestions I am receiving.

Thank you.

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-04-13 : 16:29:54

Depending on the quality of your source data, a SET based solution might not be the best or fastest solution with whis problem.
Also if you change the @Stage table variable to #Stage temporary table and add a proper covering index over ID, DateStart and DateEnd, this might be the fastest and most complete solution.

Peter Larsson
Helsingborg, Sweden

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-04-13 : 16:36:16

<removed due to bug>

Use first suggestion.

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-04-13 : 16:39:27

How long do the suggestion right above take on your 12 million records?
How long time did it take before?

Peter Larsson
Helsingborg, Sweden

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-04-13 : 18:19:52

This code takes 105 seconds for 1,000 sample records. And 791 seconds for 100,000 records.
Which means 100 times more records and 8 times more in time. Seems to scale well.

-- Stage the data
DECLARE	@Stage TABLE (RecID INT IDENTITY(1, 1) PRIMARY KEY, ID INT, theDate DATETIME, grp INT)

DECLARE	@Control TABLE (ID INT, StartDate DATETIME, EndDate DATETIME)

INSERT		@Control (ID, StartDate, EndDate)
SELECT		ID,
		MIN(theDate),
		MAX(theDate)
FROM		(
			SELECT ID, MIN(StartDate) AS theDate FROM @Sample WHERE StartDate < '99991231' GROUP BY ID UNION ALL
			SELECT ID, MAX(StartDate) FROM @Sample WHERE StartDate < '99991231' GROUP BY ID UNION ALL
			SELECT ID, MIN(EndDate) FROM @Sample WHERE EndDate < '99991231' GROUP BY ID UNION ALL
			SELECT ID, MAX(EndDate) FROM @Sample WHERE EndDate < '99991231' GROUP BY ID
		) AS x
GROUP BY	ID

DECLARE	@ID INT,
	@MaxID INT,
	@StartDate DATETIME,
	@EndDate DATETIME

SELECT	@ID = MIN(ID),
	@MaxID = MAX(ID)
FROM	@Control

WHILE @ID <= @MaxID
	BEGIN

		SELECT	@StartDate = StartDate,
			@EndDate = EndDate
		FROM	@Control
		WHERE	ID = @ID

		INSERT		@Stage (ID, theDate, grp)
		SELECT		@ID,
				x.DATE,
				0
		FROM		@Sample AS s
		INNER JOIN	F_TABLE_DATE(@StartDate, @EndDate) AS x ON x.DATE BETWEEN s.StartDate AND s.EndDate
		WHERE		s.ID = @ID
		GROUP BY	x.DATE
		ORDER BY	x.DATE

		SELECT	@ID = MIN(ID)
		FROM	@Control
		WHERE	ID > @ID
	END

SELECT	@EndDate = NULL,
	@StartDate = NULL

UPDATE	@Stage
SET	@StartDate = theDate,
	@ID =	CASE
			WHEN @EndDate IS NULL THEN 1
			WHEN DATEDIFF(DAY, @EndDate, @StartDate) = 1 THEN @ID
			ELSE @ID + 1
		END,
	grp = @ID,
	@EndDate = @StartDate

INSERT		@Stage
SELECT		e.ID,
		k.EndDate,
		e.grp
FROM		(
			SELECT		ID,
					MAX(grp) AS grp
			FROM		@Stage
			GROUP BY	ID
		) AS e
INNER JOIN	(
			SELECT	ID,
				EndDate
			FROM	@Sample
			WHERE	EndDate = '99991231'
		) AS k ON k.ID = e.ID

-- Show the expected output
SELECT		ID,
		MIN(theDate) AS StartDate,
		MAX(theDate) AS EndDate
FROM		@Stage
GROUP BY	ID,
		grp
ORDER BY	ID,
		grp

Peter Larsson
Helsingborg, Sweden

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-04-14 : 10:56:04

This code should be fast!

My tests show that


  100,000 records take   6 seconds
1,000,000 records take  41 seconds

Here is the code

-- Prepare sample data
SET DATEFORMAT MDY

DECLARE	@Sample	TABLE
		(
			ID INT,
			StartDate DATETIME,
			EndDate DATETIME
		)

INSERT	@Sample
	(
		ID,
		StartDate,
		EndDate
	)
SELECT	1111, '01/01/2006', '01/31/2006' UNION ALL
SELECT	1111, '02/01/2006', '12/31/2006' UNION ALL
SELECT	1111, '04/01/2007', '12/31/9999' UNION ALL
SELECT	2222, '01/01/2007', '12/31/9999' UNION ALL
SELECT	3333, '01/01/2007', '03/31/2007' UNION ALL
SELECT	4444, '01/01/2006', '03/31/2006' UNION ALL
SELECT	4444, '06/01/2006', '09/30/2006' UNION ALL
SELECT	4444, '01/01/2007', '12/31/9999' UNION ALL
SELECT	5555, '01/01/2006', '05/30/2006' UNION ALL
SELECT	5555, '06/01/2006', '12/31/2006' UNION ALL
SELECT	5555, '01/01/2006', '12/31/9999'

-- Stage the data
DECLARE	curStage CURSOR	FORWARD_ONLY READ_ONLY FOR
			SELECT		ID,
					CASE
						WHEN StartDate <= EndDate THEN StartDate
						ELSE EndDate
					END,
					CASE
						WHEN StartDate <= EndDate THEN EndDate
						ELSE StartDate
					END
			FROM		@Sample
			ORDER BY	ID,
					2,
					3 DESC

DECLARE	@Stage	TABLE
		(
			ID INT,
			StartDate DATETIME,
			EndDate DATETIME
		)

DECLARE	@LastID INT,
	@CurrID INT,
	@LastStartDate DATETIME,
	@CurrStartDate DATETIME,
	@LastEndDate DATETIME,
	@CurrEndDate DATETIME

OPEN	curStage

FETCH	NEXT
FROM	curStage
INTO	@CurrID,
	@CurrStartDate,
	@CurrEndDate

SELECT	@LastID = @CurrID,
	@LastStartDate = @CurrStartDate,
	@LastEndDate = @CurrEndDate

WHILE @@FETCH_STATUS = 0
	BEGIN
		IF @LastID <> @CurrID OR @CurrStartDate > DATEADD(DAY, CASE WHEN @LastEndDate = '99991231' THEN 0 ELSE 1 END, @LastEndDate)
			BEGIN
				INSERT	@Stage
					(
						ID,
						StartDate,
						EndDate
					)
				VALUES	(
						@LastID,
						@LastStartDate,
						@LastEndDate
					)

				SELECT	@LastID = @CurrID,
					@LastStartDate = @CurrStartDate,
					@LastEndDate = @CurrEndDate
			END
		ELSE
			IF @CurrEndDate > @LastEndDate
				SET @LastEndDate = @CurrEndDate

		FETCH	NEXT
		FROM	curStage
		INTO	@CurrID,
			@CurrStartDate,
			@CurrEndDate
	END

CLOSE curStage
DEALLOCATE curStage

INSERT	@Stage
	(
		ID,
		StartDate,
		EndDate
	)
VALUES	(
		@LastID,
		@LastStartDate,
		@LastEndDate
	)

-- Show the expected output
SELECT		ID,
		StartDate,
		EndDate
FROM		@Stage
ORDER BY	ID,
		StartDate

Peter Larsson
Helsingborg, Sweden

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-04-15 : 07:24:42

With the method posted above, I got these result with a table of 12 million sample records


Test #	Index						Constraint			Time
------	----------------------------------------------	------------------------------	----
1	No index at all, records randimized		Allow EndDate < StartDate	5:39
2	Clustered (ID, StartDate)			Allow EndDate < StartDate	5:13
3	Clustered (ID, StartDate)			Always StartDate <= EndDate	5:27
4	Clustered (ID, StartDate, EndDate DESC)		Always StartDate <= EndDate	4:30

The result seems to take about 5 minutes, regardless of index and sample data set up, for 12 million records.

Peter Larsson
Helsingborg, Sweden

awolfe
Starting Member

8 Posts

Posted - 2007-04-16 : 11:22:58

Peter,

The solution that you provided on 04/14/2007 at 10:56:04 works. It is providing much better results than my current solution. Thank you for all of your help.

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-04-16 : 11:27:43

Thanks!

What is the time difference? The code above takes what seconds to run, and your old code took what seconds to run?

Peter Larsson
Helsingborg, Sweden

awolfe
Starting Member

8 Posts

Posted - 2007-04-16 : 11:30:54

I am unable to provide exact numbers for you because of the development environment that we have. The best information that I can provide is that the old way would take at least 6 hours to run and your solution took 1 hour and 21 minutes. Sorry I cannot provide more detailed numbers for you.

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-04-16 : 11:35:43

Well, it is at least 4 times better.

Peter Larsson
Helsingborg, Sweden

Next Page

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources