
 Normalizing Large Database


TomNewYork
Starting Member

2 Posts

Posted - 2010-09-04 : 12:10:05
I need to normalize (if that's the right word) a very wide table with more than 100 million records, turning it into a pair of relational tables. For each record there are 20 Value fields, any of which may be filled in. For simplicity's sake, the table looks like this:

JoinID varchar(10)
Value1 varchar(8)
Value2 varchar(8)
Value3 varchar(8)
(so on, up to Value20)

The way the table is currently set up, if I want to find records where "Value" equals 'x', I need to search across 20 fields. I want to create a table that has one record for each of the 20 Value fields that are filled in, ignoring those that are null.
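In table form, the result I'm after would be something like this (a sketch - the names come from my script below, and the primary key is just my guess at what makes a row unique):

```sql
-- Sketch of the narrow target table (names are illustrative)
CREATE TABLE dbo.Receptacle
(
    JoinID varchar(10) NOT NULL,
    Step   tinyint     NOT NULL,  -- which of the 20 ValueN columns the row came from
    Value  varchar(8)  NOT NULL,  -- NULL source values are simply never inserted
    CONSTRAINT PK_Receptacle PRIMARY KEY CLUSTERED (JoinID, Step)
)
```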

The only way I know to do that is something like below, which can take 20 hours on my table. Is there a more efficient (faster) way to do this? Thanks in advance for any advice.

Tom

USE SANDBOX

DECLARE @sql varchar(max)
DECLARE @table varchar(500)
DECLARE @step int

SET @table = 'Receptacle'
SET @step = 1

WHILE (@step < 21)
BEGIN
    SET @sql = '
    INSERT INTO ' + @table + ' (JoinID, Step, Value)
    SELECT JoinID,
           ' + CONVERT(varchar(2), @step) + ' AS Step,
           Value' + CONVERT(varchar(2), @step) + ' AS HCFPCL
    FROM SANDBOX.dbo.[TABLE]
    WHERE Value' + CONVERT(varchar(2), @step) + ' IS NOT NULL
    '

    EXEC (@sql)

    SET @step += 1
END

Kristen
Test

22859 Posts

Posted - 2010-09-05 : 03:04:55
I don't see why you need dynamic SQL. It probably doesn't make much difference to speed, but as complexity increases it will make the script harder to debug - given there are only 20 columns I would "unravel" it into a static SQL script.

Doing an insert of 100 million records in one go will a) take a while and b) use a huge amount of log space - although if many of the rows have NULL for a value then the number of inserts will be smaller, maybe very small.

My approach for batch-processing large numbers of records is to get all the PK IDs into a temporary table, and then process the data in reasonably sized batches. (On a live system I will also monitor the time that each batch takes and adjust the batch size accordingly - so that if the system gets busy and batches take longer, then the batch size will reduce - and conversely grow when the system is not busy.)

However, given that you are only inserting the JoinID and Value (plus a "type" indicator [Step]) then that is much the same as the Temp table size anyway. I would still be inclined to do it in batches.

If you are NOT using Simple Recovery Model then make TLog backups every so often - e.g. every 10 loops

SET NOCOUNT ON
DECLARE @intRowCount int,
        @intLoop int,
        @intTotalRows int,
        @intRowsDone int,
        @dtStart datetime

SELECT @intRowCount = 1,    -- Force first loop
       @intLoop = 0,        -- Initialise
       @intRowsDone = 0,
       @dtStart = GetDate()

SELECT @intTotalRows = COUNT(*)
FROM dbo.SourceTable AS S
WHERE S.Value1 IS NOT NULL

WHILE @intRowCount >= 1
BEGIN
    SELECT @intLoop = @intLoop + 1
    PRINT 'Loop: ' + CONVERT(varchar(20), @intLoop)
        + ', Processed: ' + CONVERT(varchar(20), @intRowsDone) + ' of ' + CONVERT(varchar(20), @intTotalRows)
        + '. Elapsed: ' + CONVERT(varchar(20), DATEDIFF(Second, @dtStart, GetDate())) + ' seconds'

    INSERT INTO dbo.TargetTable (JoinID, Step, Value)
    SELECT TOP 100000       -- Adjust batch size here
           S.JoinID, 1, S.Value1
    FROM dbo.SourceTable AS S
    WHERE S.Value1 IS NOT NULL
      AND S.JoinID > COALESCE((SELECT MAX(D.JoinID) FROM dbo.TargetTable AS D WHERE D.Step = 1), '')
    ORDER BY S.JoinID

    SELECT @intRowCount = @@ROWCOUNT, @intRowsDone = @intRowsDone + @@ROWCOUNT

    -- Make optional TLog backup every N loops
END
SET NOCOUNT OFF
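The optional TLog backup in the loop above might look like this (a sketch - the database name and backup path are placeholders you would replace with your own):

```sql
-- Sketch: every 10th loop, back up the transaction log
-- (only relevant under FULL or BULK_LOGGED recovery; path is hypothetical)
IF @intLoop % 10 = 0
    BACKUP LOG SANDBOX
        TO DISK = N'X:\Backup\SANDBOX_log.trn'
```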

The downside with this is that after some rows have been added to [TargetTable], finding the rows for the next batch which are not ALREADY inserted will become slower. Presumably you have a Unique Clustered Index on [TargetTable] (JoinID, Step) and an index on [SourceTable] (JoinID) - ideally the Clustered Index. On that basis I have added an ORDER BY on [SourceTable] (JoinID) so that rows will be added monotonically at the end of the clustered index, and a WHERE clause which provides the start JoinID; I think this will be faster than using NOT EXISTS.

An alternative would be to BCP out the data, pre-sorted into clustered index order of [TargetTable], and then BCP the data back into the new table. You could drop the SourceTable in between - in order to reclaim that space if you wished.

TomNewYork
Starting Member

2 Posts

Posted - 2010-09-05 : 09:59:46
Thank you for the advice, Kristen. I'm going to try this out.

