Process per group/batch - SQL Server Forums

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2005 Forums

Transact-SQL (2005)

Process per group/batch

Author

Topic

dsegalles80
Starting Member

3 Posts

Posted - 2008-10-16 : 13:15:54

Hi Everyone,

I have sample code below Item#1 which will create the sample data for testing and Item #2 which is the code to get the interleaving rows and range of unique data.

Item #1: This will create a sample table and data


use tempdb;

SET NOCOUNT ON;
GO

IF OBJECT_ID ('#TempTable','U') IS NOT NULL 
BEGIN
	DROP TABLE #TempTable
END

CREATE TABLE #TempTable
(
	RowID int PRIMARY KEY CLUSTERED,
	RowKey varchar(1000),
	Locale varchar(10),
	Code char(1)
)
GO

INSERT #TempTable
SELECT 1,'1','en','C' UNION ALL
SELECT 2,'1','de','C' UNION ALL
SELECT 3,'5','zh','A' UNION ALL --/first(1) batch of A
SELECT 4,'4','en','A' UNION ALL --/first(1) batch of A
SELECT 5,'3','en','A' UNION ALL --/first(1) batch of A
SELECT 10,'2','zh','A' UNION ALL --/first(1) batch of A
SELECT 11,'2','zh','A' UNION ALL --/first(1) batch of A
SELECT 12,'1','zh','C' UNION ALL
SELECT 13,'1','de','C' UNION ALL
SELECT 14,'1','en','A' UNION ALL --/second(2) batch of A
SELECT 15,'3','en','A' UNION ALL --/second(2) batch of A
SELECT 16,'3','de','A' UNION ALL --/second(2) batch of A
SELECT 18,'1','en','A' UNION ALL --/second(2) batch of A
SELECT 19,'3','de','A' UNION ALL --/second(2) batch of A
SELECT 20,'1','en','C' UNION ALL
SELECT 21,'1','en','C'
GO

Item #2: This will get the interleaving rows of duplicates resulting to a unique range result.


SELECT * FROM #TempTable


DECLARE @RecordCount int;
DECLARE @MaxRowID int; -- Stores the overall maximum row id at the beginning.

-- Initialize the MaxROwID to the max row id number + 1
-- So we can set this value as the last value in the range
SET @MaxRowID = (SELECT MAX(RowID) FROM #TempTable) +1; -- Store this before we delete rows

SET @RecordCount = 0;

SELECT @RecordCount = COUNT(*) FROM #TempTable

IF @RecordCount > 0 
BEGIN;
	DECLARE @Results TABLE(RowID int); -- Here we assemble the range numbers (result)
	DECLARE @RowID int; -- Stores the first rowid which is a duplicate
	DECLARE @OptRows int; -- variable which records the optimum number of rows to scan
	DECLARE @DeletedRowID int; -- Track the deleted rowids.

	SET @DeletedRowID = -1
	SET @OptRows = 100;

	WHILE 1=1 
	BEGIN;
		SET @RowID = (
			SELECT MIN(RowID) FROM (
				SELECT RowID,RowKey, 
				  row_number() OVER (PARTITION BY RowKey,Locale ORDER BY RowID) AS DupRank 
				FROM   (SELECT TOP (@OptRows) *
						FROM #TempTable
						WHERE RowID > @DeletedRowID
						ORDER BY RowID ASC
				) SubsetRows
			) [Aggregate]
			WHERE DupRank > 1
		);
		
		RAISERROR('RowID: %d, DeletedRowID: %d, OptRows: %d', 10, 1, @RowID, @DeletedRowID, @OptRows);

		IF @RowID IS NULL BEGIN;
			IF @OptRows >= 2*@MaxRowID BEGIN; 
				INSERT INTO @Results SELECT @MaxRowID;
				BREAK;
			END; 
			ELSE
				BEGIN;
				SET @OptRows = @OptRows + @OptRows;
				CONTINUE;
			END; 
		END;

		ELSE BEGIN;
			INSERT INTO @Results SELECT @RowID;
			
			SET @DeletedRowID = @RowID - 1;

			SET @OptRows = @OptRows - 1;
		END; 
	END;
END;

SELECT * FROM @Results;
GO

My Objectives:
1.) Using Item #2, I need to process data for every batch of A (see Item #1 table data per batch), the problem with Item #2 is it will process the whole data without looking on every A batch.

2.) Aside from processing data for every batch of A, I need to make (Item #2) more faster on certain patterns of data for every batch of A. Since it is much slower when there are consecutive concentrated duplicates in the data like for example pattern (Item #3) below:

Item#3:


INSERT #TempTable
SELECT 1,'1','en','C' UNION ALL
SELECT 2,'1','de','C' UNION ALL
SELECT 3,'1','en','A' UNION ALL --/first(1) batch of A
SELECT 4,'1','en','A' UNION ALL --/first(1) batch of A
SELECT 5,'1','en','A' UNION ALL --/first(1) batch of A
SELECT 10,'1','en','A' UNION ALL --/first(1) batch of A
SELECT 11,'1','en','A' UNION ALL --/first(1) batch of A
SELECT 12,'1','zh','C' UNION ALL
SELECT 13,'1','de','C' UNION ALL
SELECT 14,'2','en','A' UNION ALL --/second(2) batch of A
SELECT 15,'2','en','A' UNION ALL --/second(2) batch of A
SELECT 16,'2','en','A' UNION ALL --/second(2) batch of A
SELECT 18,'2','en','A' UNION ALL --/second(2) batch of A
SELECT 19,'2','en','A' UNION ALL --/second(2) batch of A
SELECT 20,'1','en','C' UNION ALL
SELECT 21,'1','en','C'

Low performance numbers will be seen if tested on 750,000 records or 1 Million records using the pattern above. It will actually take about 2 hours+ by actually processing only 200k using the pattern (Item#3) above but If there aren't that much consecutive duplicates like (Item #1), It would only take 28 minutes to process the 1 million data.

Thank you for your time and help.

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-10-17 : 02:05:07

I have no idea what you are asking for!
Do you want the RowID for the last A in every batch of a?

This will get you the batches for million of records in just second or two

DECLARE	@Sample TABLE
	(
		rowID INT PRIMARY KEY CLUSTERED,
		rowKey VARCHAR(1),
		locale varchar(2),
		code char(1),
		grp INT
	)

INSERT	@Sample
	(
		rowID,
		rowKey,
		locale,
		code
	)
SELECT	 1, '1', 'en', 'C' UNION ALL
SELECT	 2, '1', 'de', 'C' UNION ALL
SELECT	 3, '5', 'zh', 'A' UNION ALL
SELECT	 4, '4', 'en', 'A' UNION ALL
SELECT	 5, '3', 'en', 'A' UNION ALL
SELECT	10, '2', 'zh', 'A' UNION ALL
SELECT	11, '2', 'zh', 'A' UNION ALL
SELECT	12, '1', 'zh', 'C' UNION ALL
SELECT	13, '1', 'de', 'C' UNION ALL
SELECT	14, '1', 'en', 'A' UNION ALL
SELECT	15, '3', 'en', 'A' UNION ALL
SELECT	16, '3', 'de', 'A' UNION ALL
SELECT	18, '1', 'en', 'A' UNION ALL
SELECT	19, '3', 'de', 'A' UNION ALL
SELECT	20, '1', 'en', 'C' UNION ALL
SELECT	21, '1', 'en', 'C'

DECLARE	@grp INT

SET	@grp = 0

UPDATE	@Sample
SET	@grp = grp = CASE WHEN code = 'A' THEN @grp + 1 ELSE @grp + 2 END,
	@grp = CASE WHEN code = 'A' THEN @grp - 1 ELSE @grp END

SELECT		MIN(rowID) AS minRowID,
		MAX(rowID) AS maxRowID
FROM		@Sample
WHERE		code = 'A'
GROUP BY	grp
ORDER BY	grp

SELECT		MIN(rowID) + 1 AS minRowID
FROM		@Sample
WHERE		code = 'A'
GROUP BY	grp
HAVING		COUNT(*) > 1
ORDER BY	grp

E 12°55'05.63"
N 56°04'39.26"

dsegalles80
Starting Member

3 Posts

Posted - 2008-10-17 : 08:36:03

Thank you for your prompt response Peso. Sorry I forgot to show the output for the following data sample and how to get the final output.

For this data sample, Item #1:

INSERT #TempTable
SELECT 1,'1','en','C' UNION ALL
SELECT 2,'1','de','C' UNION ALL
SELECT 3,'5','zh','A' UNION ALL --/first(1) batch of A
SELECT 4,'4','en','A' UNION ALL --/first(1) batch of A 
SELECT 5,'3','en','A' UNION ALL --/first(1) batch of A
SELECT 10,'2','zh','A' UNION ALL --/first(1) batch of A
SELECT 11,'2','zh','A' UNION ALL --/first(1) batch of A <- dup on RowID 10 (checked)
SELECT 12,'1','zh','C' UNION ALL
SELECT 13,'1','de','C' UNION ALL
SELECT 14,'1','en','A' UNION ALL --/second(2) batch of A
SELECT 15,'3','en','A' UNION ALL --/second(2) batch of A
SELECT 16,'3','de','A' UNION ALL --/second(2) batch of A
SELECT 18,'1','en','A' UNION ALL --/second(2) batch of A <- dup on RowID 14 (checked)
SELECT 19,'3','de','A' UNION ALL --/second(2) batch of A <- I don't need to include this even if there is dup on RowID 16 since it is already covered in the range from RowID 14-18
SELECT 20,'1','en','C' UNION ALL
SELECT 21,'1','en','C'

Final Output on Item #1 should be:


SELECT 3 as RowID, 'A' as Code UNION ALL
SELECT 11 as RowID, 'A' as Code UNION ALL
SELECT 12 as RowID, 'C' as Code UNION ALL
SELECT 14 as RowID, 'A' as Code UNION ALL
SELECT 18 as RowID, 'A' as Code UNION ALL
SELECT 20 as RowID, 'C'

To explain on how did I come up with the final output on Item #1.
First, I need to get the interleaving rows per Code so it should give me the output of:


SELECT 3 as RowID, 'A' as Code UNION ALL
SELECT 12 as RowID, 'C' as Code UNION ALL
SELECT 14 as RowID, 'A' as Code UNION ALL
SELECT 20 as RowID, 'C'

Next is I need to process the interleaving rows of duplicates per batch of A's that would result on a unique range result. So for the first batch of A, it should give me the output.


SELECT 11 as RowID, 'A'

and the second batch of A will have an output:


SELECT 18 as RowID, 'A'

So that is how I came up with the final output. I put some comments on Item #1 hoping it would help understand the process.

Thank you for your time and help.

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources