sqlserver_newbee
Starting Member
11 Posts
Posted - 2007-07-19 : 23:39:36
Hi All,

We get several flat data files in excess of 3-4 GB each month, each with more than 5 million records. I am working on a project to transfer these data files into a SQL Server database. I have designed the database and established the constraints on the tables, and I have also created a DTS package with a Transform Data Task for reading these fixed-length files and loading them into the tables. It all works fine up to here.

Now each month we get new files with new records plus updates to the old data. Because of the primary key constraints I obviously can't do a simple insert/append into the tables. I assume I need to load the new data into temp tables and then use T-SQL queries to check the records against the actual table: if a primary key already exists, update that row of the actual table, otherwise insert the new record from the temp table. Because of the size of the files it takes about 3 hours to insert the data from a text file into one table using the Transform Data Task. This additional check plus the table transfers may take much more time, and I have 20 such tables.

Can someone please suggest a suitable way of doing this? I am looking for a solution that is time-efficient and simple to implement, as I am a beginner. Any help is appreciated. Thanks a lot!
pootle_flump
1064 Posts
Posted - 2007-07-20 : 03:56:05
Personally I don't like DTS - I find it unwieldy and inflexible. I prefer to use T-SQL to import data using BULK INSERT (this is essentially a wrapper for BCP - the Bulk Copy Program). You can look both of these up in BoL for detailed explanations. They are very fast methods of getting data into the database, especially if you can define a clustered index that matches the order of the records in the file. Worth an experiment with one file to see if it is quicker, eh?

I load into staging tables with no constraints (or maybe a PK if I trust the data source). The data is then cleansed and upserted (insert new records, update existing ones) into the database proper.

It sounds, however, like you are working through these tables one record at a time, in a loop or cursor. Is this correct? If so, you will get an immense performance gain by doing this in a set-based manner.
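For example, a minimal sketch of the staging-table load might look like the following. Every name here (table, columns, file and format file paths) is a placeholder rather than anything from the thread, and a fixed-length file needs a bcp format file describing the column widths:

-- staging table with no constraints, reloaded each month
CREATE TABLE dbo.Staging_Table
(
    pk1  INT         NOT NULL,
    col1 VARCHAR(50) NULL,
    col2 VARCHAR(50) NULL
)

-- bulk load the fixed-length file; the format file maps the fixed column widths
BULK INSERT dbo.Staging_Table
FROM 'C:\data\monthly_file.txt'
WITH
(
    FORMATFILE = 'C:\data\monthly_file.fmt',
    TABLOCK,             -- table lock can allow a faster, minimally logged load
    BATCHSIZE = 100000   -- commit in batches instead of one huge transaction
)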
 |
sqlserver_newbee
Starting Member
11 Posts
Posted - 2007-07-20 : 10:18:12
Can you tell me where I can find BoL, so that I can look it up? I am aware of the BULK INSERT T-SQL query and I am going to try it out to check the time efficiency.

Yes, I am doing one record at a time. Can you tell me how to achieve this in a "set based manner"? If there is any article, please direct me to it as well.

Thanks a lot!
 |
rmiao
Master Smack Fu Yak Hacker
7266 Posts
Posted - 2007-07-20 : 11:09:35
BoL is Books Online, the SQL Server documentation. It is installed on the SQL Server and client machines by default.
 |
pootle_flump
1064 Posts
Posted - 2007-07-20 : 17:22:46
quote: Originally posted by sqlserver_newbee

Yes I am doing one record at a time. Can you tell me how to achieve this in a "set based manner"? If there is any article, please direct me to it as well.
Ok - I presume you are not totally familiar with SQL and perhaps have a background in procedural programming? See how you get on with this tutorial. The short of it is that SQL deals in sets. As such, rather than performing an action n times on n rows, you perform the action once on a set of rows. One of the MVPs here likens it to putting sugar in your tea either a grain at a time with tweezers or a lump at a time.

See how you get on. If you already know all this, please let me know and we'll move on from there.

http://www.w3schools.com/sql/default.asp
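To make the difference concrete, here is a rough sketch (table and column names are made up, not taken from the thread) of the same update done row by row with a cursor and then once as a set:

-- row by row: one UPDATE per key inside a cursor loop (slow for millions of rows)
DECLARE @pk1 INT
DECLARE key_cursor CURSOR FOR
    SELECT pk1 FROM dbo.Staging_Table
OPEN key_cursor
FETCH NEXT FROM key_cursor INTO @pk1
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE dbo.Actual_Table
    SET col1 = (SELECT col1 FROM dbo.Staging_Table WHERE pk1 = @pk1)
    WHERE pk1 = @pk1
    FETCH NEXT FROM key_cursor INTO @pk1
END
CLOSE key_cursor
DEALLOCATE key_cursor

-- set based: one UPDATE handles every matching row in a single statement
UPDATE a
SET a.col1 = s.col1
FROM dbo.Actual_Table a
INNER JOIN dbo.Staging_Table s ON a.pk1 = s.pk1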
 |
sqlserver_newbee
Starting Member
11 Posts
Posted - 2007-07-23 : 11:52:53
Hey Pootle_flump.. Thanks a lot for your help. I have figured out the solution to my problem and I am halfway done putting my data into the tables :)

I just performed 4 very simple queries which did the job:

1. Put the data into staging tables with just the primary key constraints.

2. Update query (the inner join got me all the common rows and updated them in the actual table):

Update Actual_Table
Set ()
From Actual_Table
Inner Join Staging_Table ON Actual_Table.pk1 = Staging_Table.pk1

3. Delete the results of the inner join from the staging table, leaving just the unique rows for insert:

Delete Staging_Table
From Staging_Table
Inner Join Actual_Table ON Actual_Table.pk1 = Staging_Table.pk1

4. Insert all the unique rows:

Insert INTO Actual_Table ()
Select ()
From Staging_Table

What are your views on this solution? Would you suggest looking for a more optimal solution?

Once again, thanks for the help!
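For readers who want to see the pattern with the column lists filled in, a hypothetical version might look like the one below. The columns pk1, col1 and col2 are purely illustrative, since the real column lists were omitted from the post:

-- 2. update rows that already exist in the actual table (column names are illustrative only)
UPDATE a
SET a.col1 = s.col1,
    a.col2 = s.col2
FROM Actual_Table a
INNER JOIN Staging_Table s ON a.pk1 = s.pk1

-- 3. remove the matched rows from staging, leaving only the new keys
DELETE s
FROM Staging_Table s
INNER JOIN Actual_Table a ON a.pk1 = s.pk1

-- 4. insert the remaining (new) rows
INSERT INTO Actual_Table (pk1, col1, col2)
SELECT pk1, col1, col2
FROM Staging_Table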
 |
pootle_flump
1064 Posts
Posted - 2007-07-23 : 15:03:47
That's a good job. Typically you would combine 3 & 4:

INSERT INTO dbo.actual_table ()
SELECT ()
FROM dbo.staging_table
WHERE NOT EXISTS
    (SELECT NULL FROM dbo.actual_table WHERE dbo.actual_table.pk1 = dbo.staging_table.pk1)

This just inserts the records from the staging table that do not yet exist in the actual table.
 |
X002548
Not Just a Number
15586 Posts
sqlserver_newbee
Starting Member
11 Posts
Posted - 2007-07-23 : 16:04:25
Yeah, I could combine 3 and 4, Pootle_flump. That might save me considerable execution time given the size of the data files in my case.

Thanks for the blog link, x002548. It was excellent information on what I am trying to achieve.

Thank you very much guys! I will post back if I run into problems.
 |