| Author |
Topic  |
|
|
mciasco
Starting Member
5 Posts |
Posted - 03/10/2009 : 07:49:20
|
I have to load data from a txt file, so I use the BULK INSERT command. But the question is more general.
I have to create the table from scratch and then populate it. I think there are only 2 options: 1) create table - create indexes - load data 2) create table - load data - create indexes
Which is the best command sequence?
Is different if there is a primary key in the table? And if so, when it must be created for better performance of the loading operation?
|
|
|
sodeep
Flowing Fount of Yak Knowledge
USA
7173 Posts |
Posted - 03/10/2009 : 08:44:32
|
| Load the data and then implement index so that even fragmentation and statistics is taken care. |
 |
|
|
mciasco
Starting Member
5 Posts |
Posted - 03/10/2009 : 12:44:55
|
quote: Originally posted by sodeep
Load the data and then implement index so that even fragmentation and statistics is taken care.
Mmm, I'm thinking about the "exact" behaviour of the 3 options...
1) create table - create indexes - load data What happens when I create indexes in an empty table? I think that the DBMS just create the root node for each B-tree index. Right? So there are no pages allocated for data. Then, during the data loading, for each row the DBMS must allocates page memory space, store phisically the row and update all the indexes.
2) create table - load data - create indexes Here, loading data in an empty-not-indexed table, each row is stored sequentially, in the reading order. Again, for each row must be allocated some page space. But each insertion does not require indexes updates. But at the next step, when creating indexes on the populated table, each index is created reading the entire set of rows in sequential order, so the index must be ordered again, as a normal update.
Well, why a solution is better than the other? In both cases, there are no pages pre-allocated for data, and in both cases there's the cost of updating all the indexes for each row. |
 |
|
|
sodeep
Flowing Fount of Yak Knowledge
USA
7173 Posts |
Posted - 03/10/2009 : 13:07:00
|
| The question is how big data you are inserting? |
 |
|
|
mfemenel
Professor Frink
USA
1414 Posts |
Posted - 03/10/2009 : 13:19:13
|
Since you're creating the table from scratch, Option 2. If you create the index first you incur overhead by writing the data to the table and the index. Loading just the table writes it once to disk and it's loaded. You can then create the index after the load has completed.
Mike "oh, that monkey is going to pay" |
 |
|
|
mciasco
Starting Member
5 Posts |
Posted - 03/10/2009 : 13:22:42
|
quote: Originally posted by sodeep
The question is how big data you are inserting?
Actually I can't answer. I'm testing with a data file of 40 MB, but data file dimension could go from a few KB to hundreds of MB.
However, at this stage, I'm not really interested in finding a particular solution that works for me, but instead, I'm interested in the correct approach (even if it's just theory), cause I need to understand also the mechanisms behind the scene. |
 |
|
|
mfemenel
Professor Frink
USA
1414 Posts |
Posted - 03/10/2009 : 13:26:32
|
Will it always be the case that you're loading a clean table or will you be appending. The 2 appraoches are somewhat different.
Mike "oh, that monkey is going to pay" |
 |
|
|
mciasco
Starting Member
5 Posts |
Posted - 03/10/2009 : 14:01:17
|
quote: Originally posted by mfemenel
Since you're creating the table from scratch, Option 2. If you create the index first you incur overhead by writing the data to the table and the index. Loading just the table writes it once to disk and it's loaded. You can then create the index after the load has completed.
Mmm, can you tell me why this happens? I'm not saying it's wrong, I really want to understand how it works! If I create indexes after the load, I think that DBMS needs to write data to the index too, so the storing time should be the same. I can't understand the difference of behaviour between pre-creation and post-creation of indexes.
quote: Originally posted by mfemenel
Will it always be the case that you're loading a clean table or will you be appending. The 2 appraoches are somewhat different.
Tables are always created from scratch and populated reading rows from a non-empty file. The appending of single row during the normal table's life is not a problem (for the moment). |
 |
|
|
sodeep
Flowing Fount of Yak Knowledge
USA
7173 Posts |
Posted - 03/10/2009 : 14:24:36
|
quote: Originally posted by mciasco
quote: Originally posted by mfemenel
Since you're creating the table from scratch, Option 2. If you create the index first you incur overhead by writing the data to the table and the index. Loading just the table writes it once to disk and it's loaded. You can then create the index after the load has completed.
Mmm, can you tell me why this happens? I'm not saying it's wrong, I really want to understand how it works! If I create indexes after the load, I think that DBMS needs to write data to the index too, so the storing time should be the same. I can't understand the difference of behaviour between pre-creation and post-creation of indexes.
quote: Originally posted by mfemenel
Will it always be the case that you're loading a clean table or will you be appending. The 2 appraoches are somewhat different.
Tables are always created from scratch and populated reading rows from a non-empty file. The appending of single row during the normal table's life is not a problem (for the moment).
Option 1-
You are dealing with double work with data pages and index page.This will be slow because data being inserted has to go in data pages as well map in index pages Which will lead to term called Fragmentation.It deals with External and internal fragmentation and when query on it, it has to do more work due to page splits/more page scans and not good statistics. you might again have to Rebuild or Reorganize index.
Option 2-
You are directly dealing with Data pages and it is faster as you are not taking overhead of index pages.when you apply indexes later on,you are mapping correctly to index pages as well as removing fragmentation,keeping statistics upto date.
|
Edited by - sodeep on 03/10/2009 14:25:09 |
 |
|
|
mciasco
Starting Member
5 Posts |
Posted - 03/11/2009 : 06:07:05
|
Ok I made some tests.
I tested 3 DBMS: MySQL, PostgreSQL and SQLServer with the same data bulk file. 1) Each DBMS is installed locally (no network delays) 2) The bulk file is 50MB and contains 120.000 rows 3) each opeation (create table, load bulk and create indexes) is done separately and require a new connection, so for each time value is included the connection time 4) the values showed are result of 3 runs of the same test, so they are average values.
Sequence: CREATE TABLE - LOAD BULK - CREATE INDEXES MYSQL - create table (s): 7 - load bulk (s): 26 - create indexes(s): 3667 - total (s): 3700
POSTGRES - create table (s): 5 - load bulk (s): 50 - create indexes(s): 83 - total (s): 509
SQLSERVER - create table (s): 12 - load bulk (s): 21 - create indexes(s): 24 - total (s): 426
---------------- Sequence: CREATE TABLE - CREATE INDEXES - LOAD BULK MYSQL - create table (s): 1 - load bulk (s): 0 - create indexes(s): 565 - total (s): 893
POSTGRES - create table (s): 1 - load bulk (s): 1 - create indexes(s): 499 - total (s): 883
SQLSERVER - create table (s): 11 - load bulk (s): 2 - create indexes(s): 100 - total (s): 462
So, it seems that creating indexes after the bulk load is better for Postgres and SQLServer (Postgres takes more advantages from this solution, while SQLServer performances are almost the same in both cases). But this approach in MySQL is totally wrong! The time needed for create indexes after the bulk load grows enormously! For MySQL is better to create indexes first, and then load data. And the difference between the two approaches is huge! This seems strange if compared to the behaviour of Postgres and SqlServer.
Overall, SqlServer is the fastest DBMS in both cases, while MySQL is the slowest one.
What do you think about? |
Edited by - mciasco on 03/11/2009 07:09:34 |
 |
|
|
darkdusky
Aged Yak Warrior
591 Posts |
Posted - 03/11/2009 : 07:52:42
|
The "correct" solution depends on the situation. For instance while the overall time might be the same for the two approaches using SQL Server (using this file size), there may be cases where you want the load to occur as fast as possible and are happy to build the indexes overnight where the overhead is less noticed. Also datatype of indexes will also affect time to build index. Also I ran a test of creating a large table with no indexes and then tried to create indexes. The create index failed due to memory error (I ran this on a my local PC - so it may succeed on better spec server). So there are limits to how large of a table indexes can be created afterwards.
CREATE TABLE [dbo].[testA]( [id] [int] IDENTITY(1,1) NOT NULL, [t] [varchar](2) NULL
) ON [PRIMARY]
INSERT testA DEFAULT VALUES GO 20000000
|
 |
|
| |
Topic  |
|