duanecwilson
Constraint Violating Yak Guru
273 Posts
Posted - 2010-11-18 : 10:43:40
I have some simple inserts that are an attempt to copy data from one table to another, but the tables are huge, sometimes over 100 million rows with 10 or so columns on average. They run as high as 10GB to 20GB. There is enough room in the database to do this, but not in the logs. I have been told I need to find a way to use a counter to limit the number of rows per transaction, or to set a batch size. How would I do this? These are standard SQL Server tables on the same server in the same database. Here is a sample query I have been using:

INSERT INTO dbo.Add_Remove_Programs_DATA
    ([MachineID]
    ,[InstanceKey]
    ,[RevisionID]
    ,[AgentID]
    ,[TimeKey]
    ,[ProdID00]
    ,[DisplayName00]
    ,[InstallDate00]
    ,[Publisher00]
    ,[Version00])
SELECT [MachineID]
    ,[InstanceKey]
    ,[RevisionID]
    ,[AgentID]
    ,[TimeKey]
    ,[ProdID00]
    ,[DisplayName00]
    ,[InstallDate00]
    ,[Publisher00]
    ,[Version00]
FROM [SAM].[staging].[Add_Remove_Programs_DATA]

Also, we are set to SIMPLE recovery. I have researched minimal logging, but I fear I have missed something. I am sure this operation is done all the time by big companies. I have been told to sift through the data to create filters and make 10 or 20 queries out of one. Do I really have to resort to that? I have also done some SELECT INTO with minimal logging, but I don't want to have to drop the table, SELECT INTO a new one, and rebuild the indexes every time. Thank you for any input.

Duane

tkizer
Almighty SQL Goddess
38200 Posts
Posted - 2010-11-18 : 12:57:53
And for the batch size, I use a variable, as my code is wrapped in stored procedures and that variable is an input parameter. By using an input parameter, I can easily and quickly change the batch size without having to check a change into source control. The callers of these stored procedures are SQL Agent jobs. We use this batch approach for our nightly purges.

Doing it in batches will slow your process down, which is actually a good thing in most cases. By slowing it down, you are reducing the impact on the server, thus allowing other queries to run while your process is running. And by doing it in batches, you are able to control the transaction log size.

Also, there is no way to not log this. The SIMPLE recovery model just means that the work gets cleared from the log after the transaction completes. But while the transaction is running, it is stored in the log. This is necessary for data consistency, in case of a rollback.

Tara Kizer
Microsoft MVP for Windows Server System - SQL Server
http://weblogs.sqlteam.com/tarad/
Subscribe to my blog
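To make the parameter idea concrete, a SQL Agent job step calling such a procedure might look something like this (the procedure name and values below are placeholders, not Tara's actual code):

    -- job step: purge in 5,000-row batches, keeping 90 days of data
    EXEC dbo.Purge_BigTable @RetainDays = 90, @Batch = 5000;

Changing the batch size later is then just an edit to the job step, with no change to the procedure itself.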
 |
tkizer
Almighty SQL Goddess
38200 Posts
Posted - 2010-11-18 : 13:06:05
Here's an example of what I've got deployed to production for one of our purge processes:

CREATE proc [dbo].[Table1_Purge] (@RetainDays int = 200, @Batch int = 1000)
AS

SET NOCOUNT ON

DECLARE @rowcnt int, @retaindate datetime, @maxRetries int, @retries tinyint

SELECT @rowcnt = 1, @retries = 0, @maxRetries = 10
SELECT @retaindate = DATEADD(dd, @RetainDays * -1, GETUTCDATE())

WHILE @rowcnt > 0 AND @retries <= @maxRetries
BEGIN
    BEGIN TRY
        BEGIN TRAN

        -- delete one batch and remember how many rows it touched
        DELETE TOP (@Batch)
        FROM Table1
        WHERE DateColumn < @retaindate

        SET @rowcnt = @@ROWCOUNT

        COMMIT TRAN
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0
            ROLLBACK TRAN

        IF ERROR_NUMBER() = 1205 AND @retries < @maxRetries -- 1205 is deadlock error
        BEGIN
            SET @retries = @retries + 1
            WAITFOR DELAY '00:00:10' -- back off, then retry the batch
        END
        ELSE -- some other error or done retrying
        BEGIN
            RETURN
        END
    END CATCH
END

RETURN

Note: I edited it down to remove all proprietary information, but hopefully it still compiles and shows you a "controlled delete". The same concept can be used for a "controlled insert", but you'll need to add a little bit more logic to keep track of where you are at.

Tara Kizer
Microsoft MVP for Windows Server System - SQL Server
http://weblogs.sqlteam.com/tarad/
Subscribe to my blog
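For the "keep track of where you are" part of a controlled insert, one low-bookkeeping sketch (an illustration, not Tara's code) is to drain the staging table in batches with DELETE ... OUTPUT INTO, so the staging table itself records your position: whatever is still in it has not been copied yet. Note that this consumes the staging rows as they are copied, and OUTPUT ... INTO won't accept a destination that has enabled triggers or participates in a foreign key:

    DECLARE @Batch int = 50000  -- tune to what your log can absorb per transaction

    WHILE 1 = 1
    BEGIN
        -- each pass moves up to @Batch rows in its own transaction,
        -- so the log can truncate between batches under SIMPLE recovery
        DELETE TOP (@Batch)
        FROM [SAM].[staging].[Add_Remove_Programs_DATA]
        OUTPUT deleted.MachineID, deleted.InstanceKey, deleted.RevisionID,
               deleted.AgentID, deleted.TimeKey, deleted.ProdID00,
               deleted.DisplayName00, deleted.InstallDate00,
               deleted.Publisher00, deleted.Version00
        INTO dbo.Add_Remove_Programs_DATA
               (MachineID, InstanceKey, RevisionID, AgentID, TimeKey,
                ProdID00, DisplayName00, InstallDate00, Publisher00, Version00)

        IF @@ROWCOUNT = 0 BREAK  -- staging table is empty: done

        WAITFOR DELAY '00:00:02' -- breathing room for other queries
    END

If the staging rows need to be kept, the alternative is a WHERE clause on an ascending unique key plus a variable remembering the last key copied.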
 |
duanecwilson
Constraint Violating Yak Guru
273 Posts
Posted - 2010-11-18 : 18:04:23
Thank you - I didn't realize TRY/CATCH was part of T-SQL. I'll be making use of this. I'm still not sure how to keep track of where I am in a 200 million row table, but this is a good start. It gives me some things to think about.

Duane
 |
robvolk
Most Valuable Yak
15732 Posts
Posted - 2010-11-18 : 18:15:48
quote: I am sure this operation is done all the time by big companies
For the row counts you're describing, they'll usually use table partitioning. It's an Enterprise Edition feature, and therefore a bit more expensive. If you're not already on Enterprise and can't upgrade, you may be able to get reasonable performance from partitioned views, without the hassle of inserting rows and the log growth. There are other hassles with either approach, but they may be worth investigating.
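For reference, a local partitioned view is just a set of member tables with mutually exclusive CHECK constraints on the partitioning column, glued together by a UNION ALL view. A bare-bones sketch (table names, column types, and date ranges below are made up for illustration):

    -- member tables covering disjoint TimeKey ranges, enforced by CHECK constraints
    CREATE TABLE dbo.ARP_DATA_2009
    (
        MachineID     int           NOT NULL,
        TimeKey       datetime      NOT NULL
            CHECK (TimeKey >= '20090101' AND TimeKey < '20100101'),
        DisplayName00 nvarchar(255) NULL
        -- ...remaining columns...
    )

    CREATE TABLE dbo.ARP_DATA_2010
    (
        MachineID     int           NOT NULL,
        TimeKey       datetime      NOT NULL
            CHECK (TimeKey >= '20100101' AND TimeKey < '20110101'),
        DisplayName00 nvarchar(255) NULL
        -- ...remaining columns...
    )
    GO

    -- the view presents the members as one table; when a query filters on TimeKey,
    -- the optimizer uses the CHECK constraints to touch only the members it needs
    CREATE VIEW dbo.Add_Remove_Programs_ALL
    AS
    SELECT MachineID, TimeKey, DisplayName00 FROM dbo.ARP_DATA_2009
    UNION ALL
    SELECT MachineID, TimeKey, DisplayName00 FROM dbo.ARP_DATA_2010

Loading can then target the individual member tables, which keeps each insert smaller; making the view itself updatable has extra requirements (the partitioning column must be part of the primary key, among others).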
 |
Kristen
Test
22859 Posts
Posted - 2010-11-19 : 06:07:07
To add to what Tara said: we put a

    WAITFOR DELAY '00:00:02'

in the loop to give other processes time to run. (Tara has this in the deadlock CATCH, which is a nice idea.) For loops like this that have a dire impact on other users, we also dynamically adjust the batch size variable according to how long the loop took: more than N seconds and we halve the batch size (if not below some minimum size - you don't want it auto-adjusting down to one row per loop!), less than M seconds and we increase the batch size by 10%.

I agree with Brett that BCP [native format] out and back in is easiest, has the smallest impact on logging, and may actually be the fastest route. If you generate the BCP file in clustered index order [of the receiving table] (which is probably your PK), then you can provide a hint on import that the file is pre-sorted, which will speed things up.

An alternative is to use SSIS, which comes equipped to do such transfers in batches.

Not sure what I would do for a 100 million row transfer. We usually put the IDs / PK of the source rows into a temp table and then "page" that for the batches we transfer, because repeating:

    SELECT ...
    FROM MyTable
    WHERE reasonably complex criteria

is relatively slow to perform, as is adding:

    AND NOT EXISTS (SELECT * FROM TargetTable WHERE PK-matches)

which (I think) will get slower and slower as more and more batches are run and more rows exist on the target.
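A rough sketch of the "put the PKs in a temp table and page it" approach, assuming MachineID plus InstanceKey uniquely identify a row (swap in your real key, and add the dynamic batch-size adjustment if you want it):

    DECLARE @Batch int = 50000, @from bigint = 1, @maxRn bigint

    -- number the source keys once, in clustered-index order
    SELECT ROW_NUMBER() OVER (ORDER BY MachineID, InstanceKey) AS rn,
           MachineID, InstanceKey            -- assumed unique pair; use your real PK
    INTO #keys
    FROM [SAM].[staging].[Add_Remove_Programs_DATA]

    CREATE UNIQUE CLUSTERED INDEX IX_keys ON #keys (rn)
    SELECT @maxRn = MAX(rn) FROM #keys

    -- copy one numbered slice per loop
    WHILE @from <= @maxRn
    BEGIN
        INSERT INTO dbo.Add_Remove_Programs_DATA
            (MachineID, InstanceKey, RevisionID, AgentID, TimeKey,
             ProdID00, DisplayName00, InstallDate00, Publisher00, Version00)
        SELECT s.MachineID, s.InstanceKey, s.RevisionID, s.AgentID, s.TimeKey,
               s.ProdID00, s.DisplayName00, s.InstallDate00, s.Publisher00, s.Version00
        FROM #keys AS k
        JOIN [SAM].[staging].[Add_Remove_Programs_DATA] AS s
          ON s.MachineID = k.MachineID
         AND s.InstanceKey = k.InstanceKey
        WHERE k.rn >= @from AND k.rn < @from + @Batch

        SET @from = @from + @Batch

        WAITFOR DELAY '00:00:02'  -- give other work a chance to run
    END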
 |
jeffw8713
Aged Yak Warrior
819 Posts
Posted - 2010-11-19 : 13:57:31
Another option would be to use SSIS. You can define the commit batch size on the OLE DB destination, which will batch the load for you.

Jeff
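For anyone who would rather stay in T-SQL than build a package, BULK INSERT offers the same commit-batch idea via its BATCHSIZE option, assuming the staging data has first been exported to a native-format file with bcp (the file path and sort key below are hypothetical):

    BULK INSERT dbo.Add_Remove_Programs_DATA
    FROM 'D:\export\add_remove_programs.dat'   -- hypothetical path to the bcp native export
    WITH (
        DATAFILETYPE = 'native',
        BATCHSIZE    = 50000,                  -- commit every 50,000 rows
        ORDER (MachineID, InstanceKey),        -- file pre-sorted in clustered index order (assumed key)
        TABLOCK
    )

Each batch commits as its own transaction, so under SIMPLE recovery the log can truncate between batches.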
 |
duanecwilson
Constraint Violating Yak Guru
273 Posts
Posted - 2010-11-19 : 17:37:02
Thank you all for these answers. I am going to try these out. I have been distracted for a while, but should be able to get back to this soon.

Duane
 |