Author |
Topic |
dsegalles80
Starting Member
3 Posts |
Posted - 2008-10-16 : 13:15:54
|
Hi Everyone,I have sample code below Item#1 which will create the sample data for testing and Item #2 which is the code to get the interleaving rows and range of unique data.Item #1: This will create a sample table and datause tempdb;SET NOCOUNT ON;GOIF OBJECT_ID ('#TempTable','U') IS NOT NULL BEGIN DROP TABLE #TempTableENDCREATE TABLE #TempTable( RowID int PRIMARY KEY CLUSTERED, RowKey varchar(1000), Locale varchar(10), Code char(1))GOINSERT #TempTableSELECT 1,'1','en','C' UNION ALLSELECT 2,'1','de','C' UNION ALLSELECT 3,'5','zh','A' UNION ALL --/first(1) batch of ASELECT 4,'4','en','A' UNION ALL --/first(1) batch of ASELECT 5,'3','en','A' UNION ALL --/first(1) batch of ASELECT 10,'2','zh','A' UNION ALL --/first(1) batch of ASELECT 11,'2','zh','A' UNION ALL --/first(1) batch of ASELECT 12,'1','zh','C' UNION ALLSELECT 13,'1','de','C' UNION ALLSELECT 14,'1','en','A' UNION ALL --/second(2) batch of ASELECT 15,'3','en','A' UNION ALL --/second(2) batch of ASELECT 16,'3','de','A' UNION ALL --/second(2) batch of ASELECT 18,'1','en','A' UNION ALL --/second(2) batch of ASELECT 19,'3','de','A' UNION ALL --/second(2) batch of ASELECT 20,'1','en','C' UNION ALLSELECT 21,'1','en','C'GO Item #2: This will get the interleaving rows of duplicates resulting to a unique range result.SELECT * FROM #TempTableDECLARE @RecordCount int;DECLARE @MaxRowID int; -- Stores the overall maximum row id at the beginning.-- Initialize the MaxROwID to the max row id number + 1-- So we can set this value as the last value in the rangeSET @MaxRowID = (SELECT MAX(RowID) FROM #TempTable) +1; -- Store this before we delete rowsSET @RecordCount = 0;SELECT @RecordCount = COUNT(*) FROM #TempTableIF @RecordCount > 0 BEGIN; DECLARE @Results TABLE(RowID int); -- Here we assemble the range numbers (result) DECLARE @RowID int; -- Stores the first rowid which is a duplicate DECLARE @OptRows int; -- variable which records the optimum number of rows to scan DECLARE @DeletedRowID int; -- Track the deleted rowids. SET @DeletedRowID = -1 SET @OptRows = 100; WHILE 1=1 BEGIN; SET @RowID = ( SELECT MIN(RowID) FROM ( SELECT RowID,RowKey, row_number() OVER (PARTITION BY RowKey,Locale ORDER BY RowID) AS DupRank FROM (SELECT TOP (@OptRows) * FROM #TempTable WHERE RowID > @DeletedRowID ORDER BY RowID ASC ) SubsetRows ) [Aggregate] WHERE DupRank > 1 ); RAISERROR('RowID: %d, DeletedRowID: %d, OptRows: %d', 10, 1, @RowID, @DeletedRowID, @OptRows); IF @RowID IS NULL BEGIN; IF @OptRows >= 2*@MaxRowID BEGIN; INSERT INTO @Results SELECT @MaxRowID; BREAK; END; ELSE BEGIN; SET @OptRows = @OptRows + @OptRows; CONTINUE; END; END; ELSE BEGIN; INSERT INTO @Results SELECT @RowID; SET @DeletedRowID = @RowID - 1; SET @OptRows = @OptRows - 1; END; END;END;SELECT * FROM @Results;GO My Objectives:1.) Using Item #2, I need to process data for every batch of A (see Item #1 table data per batch), the problem with Item #2 is it will process the whole data without looking on every A batch.2.) Aside from processing data for every batch of A, I need to make (Item #2) more faster on certain patterns of data for every batch of A. Since it is much slower when there are consecutive concentrated duplicates in the data like for example pattern (Item #3) below:Item#3:INSERT #TempTableSELECT 1,'1','en','C' UNION ALLSELECT 2,'1','de','C' UNION ALLSELECT 3,'1','en','A' UNION ALL --/first(1) batch of ASELECT 4,'1','en','A' UNION ALL --/first(1) batch of ASELECT 5,'1','en','A' UNION ALL --/first(1) batch of ASELECT 10,'1','en','A' UNION ALL --/first(1) batch of ASELECT 11,'1','en','A' UNION ALL --/first(1) batch of ASELECT 12,'1','zh','C' UNION ALLSELECT 13,'1','de','C' UNION ALLSELECT 14,'2','en','A' UNION ALL --/second(2) batch of ASELECT 15,'2','en','A' UNION ALL --/second(2) batch of ASELECT 16,'2','en','A' UNION ALL --/second(2) batch of ASELECT 18,'2','en','A' UNION ALL --/second(2) batch of ASELECT 19,'2','en','A' UNION ALL --/second(2) batch of ASELECT 20,'1','en','C' UNION ALLSELECT 21,'1','en','C' Low performance numbers will be seen if tested on 750,000 records or 1 Million records using the pattern above. It will actually take about 2 hours+ by actually processing only 200k using the pattern (Item#3) above but If there aren't that much consecutive duplicates like (Item #1), It would only take 28 minutes to process the 1 million data.Thank you for your time and help. |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-10-17 : 02:05:07
|
I have no idea what you are asking for!Do you want the RowID for the last A in every batch of a?This will get you the batches for million of records in just second or twoDECLARE @Sample TABLE ( rowID INT PRIMARY KEY CLUSTERED, rowKey VARCHAR(1), locale varchar(2), code char(1), grp INT )INSERT @Sample ( rowID, rowKey, locale, code )SELECT 1, '1', 'en', 'C' UNION ALLSELECT 2, '1', 'de', 'C' UNION ALLSELECT 3, '5', 'zh', 'A' UNION ALLSELECT 4, '4', 'en', 'A' UNION ALLSELECT 5, '3', 'en', 'A' UNION ALLSELECT 10, '2', 'zh', 'A' UNION ALLSELECT 11, '2', 'zh', 'A' UNION ALLSELECT 12, '1', 'zh', 'C' UNION ALLSELECT 13, '1', 'de', 'C' UNION ALLSELECT 14, '1', 'en', 'A' UNION ALLSELECT 15, '3', 'en', 'A' UNION ALLSELECT 16, '3', 'de', 'A' UNION ALLSELECT 18, '1', 'en', 'A' UNION ALLSELECT 19, '3', 'de', 'A' UNION ALLSELECT 20, '1', 'en', 'C' UNION ALLSELECT 21, '1', 'en', 'C'DECLARE @grp INTSET @grp = 0UPDATE @SampleSET @grp = grp = CASE WHEN code = 'A' THEN @grp + 1 ELSE @grp + 2 END, @grp = CASE WHEN code = 'A' THEN @grp - 1 ELSE @grp ENDSELECT MIN(rowID) AS minRowID, MAX(rowID) AS maxRowIDFROM @SampleWHERE code = 'A'GROUP BY grpORDER BY grpSELECT MIN(rowID) + 1 AS minRowIDFROM @SampleWHERE code = 'A'GROUP BY grpHAVING COUNT(*) > 1ORDER BY grp E 12°55'05.63"N 56°04'39.26" |
 |
|
dsegalles80
Starting Member
3 Posts |
Posted - 2008-10-17 : 08:36:03
|
Thank you for your prompt response Peso. Sorry I forgot to show the output for the following data sample and how to get the final output. For this data sample, Item #1:INSERT #TempTableSELECT 1,'1','en','C' UNION ALLSELECT 2,'1','de','C' UNION ALLSELECT 3,'5','zh','A' UNION ALL --/first(1) batch of ASELECT 4,'4','en','A' UNION ALL --/first(1) batch of A SELECT 5,'3','en','A' UNION ALL --/first(1) batch of ASELECT 10,'2','zh','A' UNION ALL --/first(1) batch of ASELECT 11,'2','zh','A' UNION ALL --/first(1) batch of A <- dup on RowID 10 (checked)SELECT 12,'1','zh','C' UNION ALLSELECT 13,'1','de','C' UNION ALLSELECT 14,'1','en','A' UNION ALL --/second(2) batch of ASELECT 15,'3','en','A' UNION ALL --/second(2) batch of ASELECT 16,'3','de','A' UNION ALL --/second(2) batch of ASELECT 18,'1','en','A' UNION ALL --/second(2) batch of A <- dup on RowID 14 (checked)SELECT 19,'3','de','A' UNION ALL --/second(2) batch of A <- I don't need to include this even if there is dup on RowID 16 since it is already covered in the range from RowID 14-18SELECT 20,'1','en','C' UNION ALLSELECT 21,'1','en','C' Final Output on Item #1 should be:SELECT 3 as RowID, 'A' as Code UNION ALLSELECT 11 as RowID, 'A' as Code UNION ALLSELECT 12 as RowID, 'C' as Code UNION ALLSELECT 14 as RowID, 'A' as Code UNION ALLSELECT 18 as RowID, 'A' as Code UNION ALLSELECT 20 as RowID, 'C' To explain on how did I come up with the final output on Item #1. First, I need to get the interleaving rows per Code so it should give me the output of:SELECT 3 as RowID, 'A' as Code UNION ALLSELECT 12 as RowID, 'C' as Code UNION ALLSELECT 14 as RowID, 'A' as Code UNION ALLSELECT 20 as RowID, 'C' Next is I need to process the interleaving rows of duplicates per batch of A's that would result on a unique range result. So for the first batch of A, it should give me the output.SELECT 11 as RowID, 'A' and the second batch of A will have an output:SELECT 18 as RowID, 'A' So that is how I came up with the final output. I put some comments on Item #1 hoping it would help understand the process.Thank you for your time and help. |
 |
|
|
|
|