TomNewYork
Starting Member
2 Posts

Posted - 2010-09-04 : 12:10:05
I need to normalize (if that's the right word) a very wide table with more than 100 million records, turning it into a pair of relational tables. For many variables in the table, there are 20 fields that could be filled in. For simplicity's sake, the table looks like this:

JoinID varchar(10)
Value1 varchar(8)
Value2 varchar(8)
Value3 varchar(8)
(and so on, up to Value20)

The way the table is currently set up, if I want to find records where "Value" equals 'x', I need to search across 20 fields. I want to create a table that has one record for each of the 20 Value fields that is filled in, ignoring those that are NULL.

The only way I know to do that is something like the script below, which can take 20 hours on my table. Is there a more efficient (faster) way to do this? Thanks in advance for any advice.

Tom

```sql
USE SANDBOX

DECLARE @sql   varchar(max)
DECLARE @table varchar(500)
DECLARE @step  int

SET @table = 'Receptacle'
SET @step  = 1

WHILE (@step < 21)
BEGIN
    SET @sql = 'INSERT INTO ' + @table + ' (JoinID, Step, Value)
                SELECT JoinID,
                       ' + CONVERT(varchar(2), @step) + ' AS Step,
                       Value' + CONVERT(varchar(2), @step) + ' AS HCFPCL
                FROM SANDBOX.dbo.[TABLE]
                WHERE Value' + CONVERT(varchar(2), @step) + ' IS NOT NULL'
    EXEC (@sql)
    SET @step += 1
END
```
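For reference, a set-based alternative: on SQL Server 2005 or later the 20 per-column inserts can be collapsed into one static UNPIVOT statement, which drops NULL values automatically. This is a sketch using the placeholder names from the post ([TABLE] as the wide source, Receptacle as the target); note it still writes all rows in a single transaction, so the log-space caveat in the reply below the original dynamic loop still applies.

```sql
-- One static statement; UNPIVOT skips rows where the unpivoted value is NULL.
-- [TABLE] and Receptacle are the placeholder names from the post.
INSERT INTO SANDBOX.dbo.Receptacle (JoinID, Step, Value)
SELECT  u.JoinID,
        CONVERT(int, SUBSTRING(u.ValueName, 6, 2)) AS Step,  -- 'Value7' -> 7
        u.Value
FROM    SANDBOX.dbo.[TABLE]
        UNPIVOT (Value FOR ValueName IN
                 (Value1,  Value2,  Value3,  Value4,  Value5,
                  Value6,  Value7,  Value8,  Value9,  Value10,
                  Value11, Value12, Value13, Value14, Value15,
                  Value16, Value17, Value18, Value19, Value20)) AS u
```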
Kristen
Test
22859 Posts

Posted - 2010-09-05 : 03:04:55
I don't see why you need dynamic SQL. I don't suppose it makes much difference, but as complexity increases it will make it harder to debug - given there are only 20 columns I would "unravel" it into a static SQL script.

Doing an insert of 100 million records in one go will a) take a while and b) use a huge amount of log space - although if many of the rows have NULL for a value then the number of inserts will be smaller / maybe very small.

My approach for batch-processing large numbers of records is to get all the PK IDs into a temporary table, and then process the data in reasonably sized batches. (On a live system I will also monitor the time that each batch takes and adjust the batch size accordingly - so that if the system gets busy and batches take longer, the batch size will reduce - and conversely grow when the system is not busy.)

However, given that you are only inserting the JoinID and Value (plus a "type" indicator [Step]), that is much the same as the temp table size anyway. I would still be inclined to do it in batches.

If you are NOT using the Simple recovery model then make TLog backups every so often - e.g. every 10 loops.

```sql
SET NOCOUNT ON

DECLARE @intRowCount  int,
        @intLoop      int,
        @intTotalRows int,
        @intRowsDone  int,
        @dtStart      datetime

SELECT  @intRowCount = 1,   -- Force first loop
        @intLoop = 0,       -- Initialise
        @intRowsDone = 0,
        @dtStart = GetDate()

SELECT  @intTotalRows = COUNT(*)
FROM    dbo.SourceTable AS S
WHERE   S.Value1 IS NOT NULL

WHILE @intRowCount >= 1
BEGIN
    SELECT @intLoop = @intLoop + 1

    PRINT 'Loop: ' + CONVERT(varchar(20), @intLoop)
        + ', Processed: ' + CONVERT(varchar(20), @intRowsDone)
        + ' of ' + CONVERT(varchar(20), @intTotalRows)
        + '. Elapsed: ' + CONVERT(varchar(20), DATEDIFF(Second, @dtStart, GetDate()))
        + ' seconds'

    INSERT INTO dbo.TargetTable (JoinID, Step, Value)
    SELECT TOP 100000        -- Adjust batch size here
            S.JoinID, 1, S.Value1
    FROM    dbo.SourceTable AS S
    WHERE   S.Value1 IS NOT NULL
            AND S.JoinID > COALESCE((SELECT MAX(D.JoinID)
                                     FROM dbo.TargetTable AS D
                                     WHERE D.Step = 1), '')
    ORDER BY S.JoinID

    SELECT  @intRowCount = @@ROWCOUNT,
            @intRowsDone = @intRowsDone + @@ROWCOUNT

    -- Make optional TLog backup every N loops
END

SET NOCOUNT OFF
```

The downside with this is that after some rows have been added to [TargetTable], finding the rows that have not ALREADY been created, for the next batch, will become slower. Presumably you have a unique clustered index on [TargetTable] (JoinID, Step) and an index on [SourceTable] (JoinID) - ideally the clustered index. On that basis I have added an ORDER BY on [SourceTable] (JoinID) so that rows will be added monotonically at the end of the clustered index, and a WHERE clause which provides the start JoinID; I think this will be faster than using NOT EXISTS.

An alternative would be to BCP out the data, pre-sorted into clustered index order of [TargetTable], and then BCP the data back into the new table. You could drop the SourceTable in between - in order to reclaim that space if you wished.
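The "Make optional TLog backup every N loops" placeholder could be filled in along these lines - a sketch only, with a hypothetical backup path; BACKUP LOG applies only under the Full or Bulk-logged recovery model:

```sql
-- Hypothetical example: back up the transaction log every 10 loops
-- so that log space can be reused between batches.
-- 'D:\Backup\SANDBOX_log.trn' is a placeholder path.
IF @intLoop % 10 = 0
    BACKUP LOG SANDBOX TO DISK = 'D:\Backup\SANDBOX_log.trn'
```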
TomNewYork
Starting Member
2 Posts

Posted - 2010-09-05 : 09:59:46
Thank you for the advice, Kristen. I'm going to try this out.