CharlesLS
Starting Member
6 Posts |
Posted - 2010-03-17 : 21:24:27
|
| I have a table with about 500 million rows (total size about 50 GB). The table requires three indexes, which total around 30 GB, as well as an indexed view joining each entry with a few additional columns from other tables. We need to run a process every few minutes that adds about another 25,000 rows to the table. It currently takes about 3 minutes to add 25,000 rows; however, each index that I delete on a test copy makes this much shorter.

I was hoping that the solution would be to partition the table: most of it can be archived, while the new rows are inserted into the "working" portion. I can't completely split the table, so there do need to be indexes on the entire thing. I hoped that by splitting the index, inserting into the working portion would be quick, but it takes about the same three minutes. As an alternative, I did the insert first into a temporary table and then inserted its entire contents in one statement. This provided a slight improvement, but it still takes an average of over two minutes.

How can I do this insert much more quickly? I'm sure I'm not the first person who has had to insert a small number of rows into a large table that needs several indexes; what's the standard solution to this problem? Is there at least a query hint that would help by telling SQL Server that it only needs to update the small, working piece of the partitioned indexes? |
|
|
russell
Pyro-ma-ni-yak
5072 Posts |
Posted - 2010-03-17 : 22:36:46
|
| Partition the table. Create the index(es) on the partition function.

May we see the current table and index DDL? |
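A minimal sketch of what partition-aligned indexing looks like (the function, scheme, boundary values, and column names here are illustrative assumptions, not taken from the actual schema):

```sql
-- Hypothetical sketch: partition on a datetime column so new rows land in
-- the newest ("working") partition. All names here are assumptions.
CREATE PARTITION FUNCTION pf_txnMonth (datetime)
AS RANGE RIGHT FOR VALUES ('2010-01-01', '2010-02-01', '2010-03-01');

CREATE PARTITION SCHEME ps_txnMonth
AS PARTITION pf_txnMonth ALL TO ([PRIMARY]);

-- Building the clustered index on the partition scheme partitions the table.
-- Nonclustered indexes created ON the same scheme are partition-aligned, so
-- inserts into the working partition only touch that partition's B-trees.
CREATE CLUSTERED INDEX cix_transactions
    ON dbo.transactions (txnTime)
    ON ps_txnMonth (txnTime);
```

By contrast, an index created with a plain ON [PRIMARY] clause stays unpartitioned and is touched by every insert, which matches the behaviour the original poster describes.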
 |
|
|
tkizer
Almighty SQL Goddess
38200 Posts |
|
| To me it sounds like you need to adjust the clustered index so that it matches the order on which the inserts are happening. |
|
|
CSteinhardt
Starting Member
2 Posts |
Posted - 2010-03-18 : 00:42:39
|
quote: To me it sounds like you need to adjust the clustered index so that it matches the order on which the inserts are happening.

| I don't think this is possible. There's a set of operations with a unique identifier, and a bunch of different things can add operations. The clustered index is on the unique identifier, but things run in roughly time order rather than unique-identifier order, and the time is not kept in the table.

Do you think the problem is the clustered index, though? Removing the nonclustered indexes also provides a substantial improvement when we test it, so it seems like all of the indexes are contributing. |
 |
|
|
tkizer
Almighty SQL Goddess
38200 Posts |
Posted - 2010-03-18 : 00:48:11
|
| Just how much improvement do you get when you remove just one non-clustered index?

How are the 25,000 rows being added? Is it through BULK INSERT, bcp, ...? Is it through INSERT statements? If it's through INSERT statements, is it in one large batch, meaning just one transaction?

Do you get any improvement if you remove the indexed view and just make it a regular view?

Tara Kizer
Microsoft MVP for Windows Server System - SQL Server
http://weblogs.sqlteam.com/tarad/
Subscribe to my blog
"Let's begin with the premise that everything you've done up until this point is wrong." |
 |
|
|
russell
Pyro-ma-ni-yak
5072 Posts |
Posted - 2010-03-18 : 00:52:07
|
| Clustering on a uniqueidentifier? Why? That's almost always a bad idea. As Tara said, you're not inserting in clustered index order in this case.

An identity int is a far better cluster key for inserts. |
 |
|
|
tkizer
Almighty SQL Goddess
38200 Posts |
|
|
CharlesLS
Starting Member
6 Posts |
Posted - 2010-03-18 : 01:16:18
|
| Just to be sure I understand your advice: I am adding an identity column and building a clustered index on the identity. The theory is that this means it's always adding to a known part of the index, so the insert should be quick, right? I'll try it and let you know whether it works - thanks for the advice! |
 |
|
|
russell
Pyro-ma-ni-yak
5072 Posts |
Posted - 2010-03-18 : 01:21:16
|
| Right. Identity fields are always last value + increment (typically 1). GUIDs (or uniqueidentifiers) are random.

When you insert in the middle, SQL Server has to shuffle pages around on disk to make room. When you insert at the end, it doesn't have to. |
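The contrast being described can be sketched as two table definitions (table and column names are illustrative):

```sql
-- Identity key: values are ever-increasing, so each insert appends to the
-- last page of the clustered index.
CREATE TABLE dbo.append_friendly (
    id      int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    payload varchar(100) NOT NULL
);

-- Random GUID key: NEWID() values land anywhere in the key range, so
-- inserts hit arbitrary pages and cause page splits.
CREATE TABLE dbo.split_prone (
    id      uniqueidentifier NOT NULL DEFAULT NEWID() PRIMARY KEY CLUSTERED,
    payload varchar(100) NOT NULL
);
```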
 |
|
|
Kristen
Test
22859 Posts |
Posted - 2010-03-18 : 03:56:09
|
| It doesn't sound like you are using a GUID (I think your term "unique identifier" was being interpreted imprecisely!!). However, if it IS a GUID, then you could perhaps use a "sequential GUID", which would help the index inserts.

I think Tara's question about how the inserts are being made is key; understanding the process a bit more may well suggest different approaches. |
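If the key really does have to be a GUID, the sequential variant mentioned here can be sketched like this (table name is illustrative):

```sql
-- NEWSEQUENTIALID() produces roughly ascending GUIDs on a given machine,
-- so inserts mostly append instead of splitting pages. Note it is only
-- usable in a column DEFAULT, not as a free-standing function call.
CREATE TABLE dbo.seq_guid_demo (
    id      uniqueidentifier NOT NULL DEFAULT NEWSEQUENTIALID()
            PRIMARY KEY CLUSTERED,
    payload varchar(100) NOT NULL
);
```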
 |
|
|
CharlesLS
Starting Member
6 Posts |
Posted - 2010-03-18 : 04:23:17
|
| The first attempt to add an identity on the table failed because it took several hours and a few hundred GB of space, and eventually there wasn't enough room on the disk. Is there a way to add the identity without using up so much space? For example, can I avoid logging? Or, can I add the identity piece by piece somehow and only call it an identity at the end? |
 |
|
|
Kristen
Test
22859 Posts |
Posted - 2010-03-18 : 05:05:34
|
| Hmmm ... interesting problem :)

My best guess is to:

1. Export the table using BCP / native format
2. Truncate the table
3. Alter the table to add the IDENTITY column
4. Reimport the data

but I don't know if the reimport will get confused by the additional identity column in the table. Presumably the reimport could be minimally logged.

Splitting the data, for partitioning, and then adding the IDENTITY to each partition table would reduce the batch size. |
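An alternative to the bcp round trip, sketched under two assumptions: the database can temporarily run in BULK_LOGGED or SIMPLE recovery, and there is disk room for a second copy of the table. Table and column names are borrowed loosely from the DDL posted later in the thread:

```sql
-- Build a new table that already has the IDENTITY column.
CREATE TABLE dbo.transactions_new (
    txnID   int IDENTITY(1,1) NOT NULL,
    txnTime datetime NOT NULL
    -- plus the remaining columns of the original table
);

-- On SQL Server 2008, INSERT...SELECT WITH (TABLOCK) into an empty heap
-- can be minimally logged under BULK_LOGGED or SIMPLE recovery.
INSERT INTO dbo.transactions_new WITH (TABLOCK) (txnTime)
SELECT txnTime FROM dbo.transactions;

-- Swap names, then create the indexes on the new table.
EXEC sp_rename 'dbo.transactions', 'transactions_old';
EXEC sp_rename 'dbo.transactions_new', 'transactions';
```

This avoids the export/import format issues entirely, at the price of needing space for both copies until the old table is dropped.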
 |
|
|
CharlesLS
Starting Member
6 Posts |
Posted - 2010-03-18 : 13:07:40
|
| I tried bcp into a text file, adding an identity column, then reimporting. It failed for formatting reasons, though I'm still reading some documentation.

What would happen if there were no clustered index on the table? Would that still make the inserts quick because it just puts them at the end, and therefore be a potential solution? |
 |
|
|
Kristen
Test
22859 Posts |
Posted - 2010-03-18 : 13:34:38
|
" It failed for formatting reasons, though I'm still reading some documentation"Have a look at the native format (-N parameter I think)"What would happen if there were no clustered index on the table? Would that still make the inserts quick because it just puts them at the end, and therefore be a potential solution?"I think your problem would be when you put the clustered index back - it would have to physically reorganise the file into order, so probably going to do just as much damage to the TLog at that time And don't drop the Clustered Index if you have Non-clustered indexes as well - drop all the non-clustered indexes first, and then re-create them after re-creating the clustered index |
 |
|
|
Kristen
Test
22859 Posts |
Posted - 2010-03-18 : 13:36:38
|
| Hmmm ... not something I know about, but maybe you could re-create the clustered index using an option to do the sorting etc. in TEMPDB (SORT_IN_TEMPDB). That might be light on TLog space, and maybe your TEMPDB has acres of space and is on a different spindle, and thus offers itself to that approach? |
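The option being alluded to is presumably SORT_IN_TEMPDB, which moves the index build's intermediate sort runs into tempdb. A sketch against the PK name from the DDL posted later in the thread (note it relieves space pressure in the user database, but the final index pages are still written and logged there):

```sql
-- Rebuild the clustered index, doing the intermediate sorting in tempdb.
ALTER INDEX PK_transactions ON dbo.transactions
REBUILD WITH (SORT_IN_TEMPDB = ON);
```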
 |
|
|
CharlesLS
Starting Member
6 Posts |
Posted - 2010-03-18 : 20:48:45
|
| I am having this problem on multiple tables. While I was unable to find a way with existing disk space to put an identity on the biggest table, I could on the second-largest. It did not provide any improvement.

Here is the table structure:

CREATE TABLE [dbo].[transactions](
    [txnID] [int] IDENTITY(1,1) NOT NULL,
    [teamid] [int] NOT NULL,
    [txnType] [tinyint] NOT NULL,
    [flag1] [int] NULL,
    [flag2] [int] NULL,
    [amount] [int] NOT NULL,
    [oldBalance] [int] NOT NULL,
    [txnTime] [datetime] NOT NULL,
    CONSTRAINT [PK_transactions] PRIMARY KEY CLUSTERED ([txnID] ASC)
    WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

and the indexes:

CREATE NONCLUSTERED INDEX [ix_teamid] ON [dbo].[transactions]
    ([teamid] ASC, [txnTime] ASC, [txnType] ASC)
INCLUDE ([amount])
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

CREATE NONCLUSTERED INDEX [IX_txnTime] ON [dbo].[transactions]
    ([txnTime] ASC, [txnType] ASC)
INCLUDE ([teamid], [amount])
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = OFF) ON [PRIMARY]

Each of these indexes is used fairly heavily, so I believe they are all necessary. When inserting, we are using a C++ program connecting to an ODBC data source using the OTL library. This executes, in this case, about 2000 inserts run as inline SQL but all in one batch. For comparison, I tried doing the 2000 inserts into a temporary table in the same manner (it was nearly immediate) and then:

INSERT INTO transactions WITH (ROWLOCK)
SELECT * FROM tmp_transactions

This provided about a factor of 2-3 improvement. Doing something similar provides a similar improvement for the larger table, but that improvement is still insufficient (it still means taking around a minute to add 25,000 rows). Originally the table also had these two indexes:

CREATE NONCLUSTERED INDEX [DISABLED_tid] ON [dbo].[transactions] ([teamid] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

CREATE NONCLUSTERED INDEX [DISABLED_ttime] ON [dbo].[transactions] ([txnTime] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = OFF) ON [PRIMARY]

While these two indexes were enabled (and everything else was identical), the inserts took about 30 seconds. I mention this mainly because these indexes actually *were* causing about 2/3 of the problem even though they are not clustered and there is a different clustered index on the identity. So my inexperienced intuition here is that taking our large table and successfully adding an identity and clustered index on it (something I'm still trying to do given our disk space limitations) is unlikely to be a solution.

Anyway, I'd like to work through and understand this problem, because it seems fairly basic and like the sort of thing that I can probably figure out with some help, but I should also ask if you have any recommendations for somebody I could hire to take a look at our database on a very small budget. Given what I'm learning from this experience about our design, I suspect we'd get some major improvement from having an expert look at what we've done and teach us how to structure things better.

As for this specific problem, does this structure give you any ideas? |
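To pin down which index is absorbing the write cost, rather than dropping them one at a time, the operational-stats DMV reports leaf-level inserts and page allocations per index. A diagnostic sketch:

```sql
-- Per-index write activity for dbo.transactions (SQL Server 2005+).
-- leaf_allocation_count approximates page splits / new pages at the leaf.
SELECT  i.name,
        os.leaf_insert_count,
        os.leaf_allocation_count
FROM    sys.dm_db_index_operational_stats(
            DB_ID(), OBJECT_ID('dbo.transactions'), NULL, NULL) AS os
JOIN    sys.indexes AS i
    ON  i.object_id = os.object_id
    AND i.index_id  = os.index_id;
```

Running this before and after one 25,000-row batch shows exactly how much each index contributes.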
 |
|
|
tkizer
Almighty SQL Goddess
38200 Posts |
|
|
CharlesLS
Starting Member
6 Posts |
Posted - 2010-03-18 : 23:57:52
|
| Yes, I did remove the two indexes that were duplicates. My point was that removing them reduced it from 30 seconds to 10, which means that the slow insert is being affected by indexes and not just the primary key (which in this case is indeed on an identity).For the really large table, there aren't any such useless indexes -- there are three and we need all three. |
 |
|
|
erwin.zeez
Starting Member
1 Post |
Posted - 2010-05-06 : 03:31:22
|
| CharlesLS, I have learned so much from this thread. I didn't know about this approach, but when I read it I tried it myself, and yes, it was a great help.

Erwin |
 |
|
|
lazerath
Constraint Violating Yak Guru
343 Posts |
Posted - 2010-05-07 : 12:55:53
|
| SQL Server still has to maintain those non-clustered indexes on disk. Think of it as a hidden trigger that keeps them in sync with the data table (which is physically arranged by the clustered index). Therefore, each non-clustered index will negatively impact write speed due to this maintenance (though hopefully that is offset by increased read speed). Keep in mind that all of this disk I/O happens on a single insert, so the more NC indexes you have, the more I/O pressure you will have.

Also, keep in mind that each non-clustered index contains, by necessity, the clustered index key as part of its physical storage. Aside from the reasons mentioned above, changing the key from a 16-byte GUID to a 4-byte INT is desirable because each row of each non-clustered index stores 12 fewer bytes. This should also translate into increased write speed, which is surprisingly absent from your results. |
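The key-width point is easy to put in rough numbers for the table sizes in this thread (illustrative arithmetic only; 500 million rows, three nonclustered indexes):

```sql
-- 12 bytes saved per row, per nonclustered index, by an int key over a GUID.
SELECT CAST(500000000 AS bigint) * 12 * 3 / 1024 / 1024 / 1024 AS saved_GB;
-- on the order of 16 GB less index storage, plus smaller, shallower B-trees
```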
 |
|
|
Asken
Starting Member
38 Posts |
Posted - 2010-05-11 : 07:56:58
|
| I would just remove all indexes except a primary key on an identity field and work from there. I can't really believe the data is queried in real time, as the queries would run slow with 500M rows of data however we index - if not when reading, then when inserting/updating. Creating partitions is good, but adding more full-table indexes will cause it to be slow, since an index that is not part of the partition will almost certainly be fully updated. You can try to add a filtered index that is also contained in the partition to see if that helps performance. Also, don't drop indexes; just disable them and re-enable all of them again with the REBUILD clause. That makes it easy to script the indexes on and off without any table-specific code.

This is how I'd do it using SQL Server:

1. Work out what you need from the data. Can you denormalize the data to satisfy your queries, especially the old stuff? What is the archiving strategy? Is anyone even looking at the old stuff?

2. Create a work table and load the 25,000 rows into that, then dump them into the 500M-row table. From now on, the 500M-row detail table should only be used for changing the level of denormalization, either because requirements change the level of detail of the output or because the denormalization script has an error in it. In other words, it is only used to create a new denormalized table.

3. Write a denormalization script that takes the current 500M rows of data and puts them into a less detailed table. Maybe stats per month instead of per day (roughly data/30 in size)? Run the script to create the initial denormalized table.

4. Write a denormalization script that processes the 25,000 new rows and inserts into/updates the denormalized table, to be run every time you get new rows. It should be fast to generate the rows, and the smaller number of rows should be faster to add to the denormalized table.

The much better option when dealing with this size of data is to use partitioned cubes in Analysis Services, which would probably eliminate your problems altogether _if_ real time is not vital.

Reporting & Analysis Specialist
Helping others helps me get better... |
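The disable/re-enable pattern described above, sketched with the index names from the DDL earlier in the thread. Note that only nonclustered indexes should be disabled; disabling the clustered index takes the whole table offline:

```sql
-- Disable the nonclustered indexes; their definitions stay in the catalog,
-- so no table-specific scripting is needed to bring them back.
ALTER INDEX ix_teamid  ON dbo.transactions DISABLE;
ALTER INDEX IX_txnTime ON dbo.transactions DISABLE;

-- ... run the bulk load here ...

-- Re-enable everything by rebuilding.
ALTER INDEX ALL ON dbo.transactions REBUILD;
```

Rebuilding tens of GB of indexes every few minutes is unlikely to be practical for this particular load cadence, so this pattern fits larger, less frequent loads better.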
 |
|
|