Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 General SQL Server Forums
 Database Design and Application Architecture
 How to create staging table to handle incremental

Author  Topic 

jim123456789jim
Starting Member

13 Posts

Posted - 2014-01-02 : 04:18:57
Hi Team,

We are designing a Staging layer to handle incremental load. I want to start with a simple scenario to design the staging.

In the source database There are two tables ex, tbl_Department, tbl_Employee. Both this table is loading a single table at destination database ex, tbl_EmployeRecord.

The query which is loading tbl_EmployeRecord is, SELECT EMPID,EMPNAME,DEPTNAME FROM tbl_Department D INNER JOIN tbl_Employee E ON D.DEPARTMENTID=E.DEPARTMENTID.

Now, we need to identify incremental load in tbl_Department, tbl_Employee and store it in staging and load only the incremental load to the destination.

The columns of the tables are,

tbl_Department : DEPARTMENTID,DEPTNAME

tbl_Employee : EMPID,EMPNAME,DEPARTMENTID

tbl_EmployeRecord : EMPID,EMPNAME,DEPTNAME

Kindly suggest how to design the staging for this to handle Insert, Update and Delete.

Regards

Jim

Kristen
Test

22859 Posts

Posted - 2014-01-02 : 07:49:29
We build staging tables with a Date (when the row was last changed) and a flag to indicate if the row has been deleted in the source.

We UPDATE any rows in the staging table if ANY COLUMN (which we are interested in) has changed, and set the Date and Flag=1 (** see below)

We INSERT any rows that do not exist (based on Unique Primary Key fields), and set the Date and Flag=1

We UPDATE any rows that no longer exist in the source data (based on Unique Primary Key fields), and set the Date and Flag=0

Then a routine runs to update the actual tables. That processes anything with a Date more recent than the last time it ran. We store that date in a table in the target database - so if that gets restored so does the cutoff date, and then next scheduled task runs FROM that date.

** The UPDATE of existing rows compared every column using code similar to:

...
FROM MySourceDatabase.dbo.MySourceTable as S
JOIN MyDestinationDatabase.dbo.MyDestinationTable AS D
ON D.MyPKey1 = S.MyPKey1
...
WHERE (S.NumericColumn1 <> D.NumericColumn1
OR (S.NumericColumn1 IS NULL AND D.NumericColumn1 IS NOT NULL)
OR (S.NumericColumn1 IS NOT NULL AND D.NumericColumn1 IS NULL))
OR (S.StringColumn1 <> D.StringColumn1 COLLATE Latin1_General_BIN2
OR (S.StringColumn1 IS NULL AND D.StringColumn1 IS NOT NULL)
OR (S.StringColumn1 IS NOT NULL AND D.StringColumn1 IS NULL))
OR ...

this takes care of rows that have changed even in capitalisation, and where the column is NULL in one source and not in destination, and vice versa

If the Source system has a Changed Date then that speeds up the comparison considerably, as you only need to "process" rows marked as changed, on the source system, since the previous run.
Go to Top of Page

Kristen
Test

22859 Posts

Posted - 2014-01-02 : 07:53:17
P.S. You may also need to also compare DATALENGTH() of Source and Destination string fields, particularly those with (n)VARCHAR datatype and if you store trailing spaces.
Go to Top of Page

jim123456789jim
Starting Member

13 Posts

Posted - 2014-01-02 : 10:30:33
Hi Kristen,

I think this will work for direct mapping from one to one table mapping between source and destination. Please suggest can we extend this approach like the one I mentioned where there is a join between two tables and that result set need to be mapped to destination.

Thanks..
Go to Top of Page

Kristen
Test

22859 Posts

Posted - 2014-01-02 : 12:32:56
I find it easier to diagnose problems with the process if there is a staging table in the middle. It is then easy to see the Date when the data last changed. Thus if the user says "We changed XXX on the source system yesterday, but it isn't on the destination system yet" then you can look at the XXX row in the staging table to see if it has a date more recent than "yesterday". If not then the PULL from source is bust. If yes then perhaps the UPDATE of the target table, from Staging table, is bust.

It also helps if there is manipulation of the Source data before it is applied to the Destination table. For example one table on Source updates two tables on Destination, of some source data is modified based on yet more data in other tables, and then applied to Destination tables.

The other issue is where Source and Target systems are on different servers. If you just run a JOIN between the Source and Target tables it will run like a dog, more so if the connection is Wide-Area rather than Local-Network. What you want to avoid is pulling ALL rows from Source Server to compare with Destination Table - and then finding that 99% of them haven't changed

In that situation you can create the STAGING table on the Source server. Populate it there. Have an identical staging table on Target server. Populate that by pulling only rows that are newer [i.e. newer than MAX(ChangeDate) on Target server's staging table]. That pulls the minimum amount of rows, and no cross-server JOINs required:

SELECT @MyMaxLocalChangeDate = MAX(ChangeDate)
FROM MyLocalStagingDatabase.dbo.MyStagingTable

-- Get newer records from remote. Use OPENQUERY if that is more efficient/easier than Linked Server
SELECT *
INTO #TempTable
FROM SourceServer.SourceStagingDatabase.dbo.MyStagingTable AS S
WHERE S.ChangedDate >= @MyMaxLocalChangeDate

-- Pre-delete any existing rows in the local staging table (which are also in the PULLED data)
DELETE D
FROM MyLocalStagingDatabase.dbo.MyStagingTable AS D
JOIN #TempTable AS S
ON S.MyPKey1 = D.MyPKey1
AND ...

-- Insert the PULLED data
INSERT INTO MyLocalStagingDatabase.dbo.MyStagingTable
SELECT *
FROM #TempTable

note that this is relying solely on the ChangeDate as set up on the Remote/Source server. Doesn't matter whether the clocks are synchronised or not (but DO watch out for change from Daylight Saving Time in the Autumn )

This method is also robust if the Source or Destination databases are restored (to a backup from an earlier date). Although Destination will show the newer version of records UNTIL they are next modified on source server. If that is important to you then you would need a means of re-dating records on Source Staging table that had newer dates on Target Staging table. They would then re-transfer (the earlier versions of the data)

This scales really well. IME Projects have a habit of "growing", so I prefer to start with Staging Tables, even for relatively trivial requirements, because it is then easy to scale them, and saves having to rewrite (and completely test & debug) them when the need for scaling arises later.
Go to Top of Page

jim123456789jim
Starting Member

13 Posts

Posted - 2014-01-06 : 01:28:20
Thanks a lot Kristen for the detailed explanation. I understood that a staging layer need to be build with date column. So in this senario(tbl_Department, tbl_Employee are the source tables and tbl_EmployeRecord is the target table), what I'm going to do is, in the staging layer, there will be two tables tbl_Department_staging, tbl_Employee_staging. This two table will have excatly same schema of the two source tables with two additional columns (Date and Flag). Date column will hold when the row in the source got inserted or updated or deleted and Flag will say whether it is an insert, update or delete.

Now, lets take case by case senario,

Case 1 : Insert in tbl_Department - This row will be inserted in tbl_Department_staging table. In the staging layer the query SELECT EMPID,EMPNAME,DEPTNAME FROM tbl_Department D INNER JOIN tbl_Employee E ON D.DEPARTMENTID=E.DEPARTMENTID will be executed and the result set will be inserted into the target tbl_EmployeRecord .

Case 2 : Insert into tbl_Employee - Same as case 1

Case 3 : Update in tbl_Department - A row will be inserted in tbl_Department_staging table with date and flag as update. SELECT EMPID,EMPNAME,DEPTNAME FROM tbl_Department D INNER JOIN tbl_Employee E ON D.DEPARTMENTID=E.DEPARTMENTID will be executed and the result set will be compared with the target tbl_EmployeRecord and update will happen.

Case 4 : Update in tbl_Employee - Same as Case 3

Case 5 : Delete in tbl_Department - A row will be inserted in tbl_Department_staging table with date and flag as delete. The key will be used to delete the rows in the target table tbl_EmployeRecord.

Case 6 : Delete in tbl_Employee - A row will be inserted in tbl_Emplyee_Staging table with date and flag as delete. This will be joined with source table tbl_Department to get the keys and will be an updated in the target table tbl_EmployeRecord(columns will be null).

Kindly let me know is the approach is fine. Also, is it scable when there are more tables involved in the join.

Regards
Jim
Go to Top of Page

Kristen
Test

22859 Posts

Posted - 2014-01-06 : 02:36:35
"Case 5 : Delete in tbl_Department - A row will be inserted updated in tbl_Department_staging table with date and flag as delete. The key will be used to delete the rows in the target table tbl_EmployeRecord."

Note that a Deleted row will ONLY be recorded if that record previously existed in the staging table. Thus it can only get there by an UPDATE, not by an insert.

If you freshen up your staging table once a day (say) and then during the day you add a new row to the Source system, and delete it again later the same day, then nothing will be recorded, for that key, in the staging table.

Similarly for your point (6)
Go to Top of Page
   

- Advertisement -