
 Incremental processing


Dev Null
Starting Member

39 Posts

Posted - 2008-12-04 : 17:40:08
Hi folks,

I've got a problem where I want to process say, 10,000 rows at a time, and then delete the ones I've processed.

Basically, I've got a table full of data, and that data gets copied to another table with a trigger on it for the processing (done that way for historical reasons I won't go into.) The processing is all per-row (i.e. I can copy the rows across one-at-a-time and get the same result as copying them all at once - the data in different rows does not combine.) If I _do_ process them one at a time it takes much longer than doing it all at once - presumably the overhead for the various inserts, transactions, etc. - but if I process them all at once then on larger datasets it can take a very long time to run, I get timeouts, no user feedback, etc. So I'd like to copy the first 10,000 rows across in a block, commit the changes, delete them from the original table, and then move on to the next block.

I've investigated using TOP (ordering on the clustered primary key, to be certain that I delete the same rows I've processed), but the only way I could work out requires a self-join to find the matching rows for the delete (simplified example):
insert into table2
select top 1 * from table1 order by pk_1, pk_2, pk_3

delete from table1
from ( select top 1 * from table1 order by pk_1, pk_2, pk_3 ) as del
where table1.pk_1 = del.pk_1
and table1.pk_2 = del.pk_2
and table1.pk_3 = del.pk_3


Which seems to work, but is terribly inefficient (takes almost a minute to delete a single row from the table this way, even though just selecting the top row is pretty much instantaneous.) So I wondered if you guys would know of a better way to do the delete (or a whole different better approach to the basic problem.)
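(For comparison, a variant that avoids the self-join entirely: SQL Server 2005 lets you delete through a CTE, and the delete then targets the underlying table1 rows of the TOP slice directly. This is a sketch assuming pk_1..pk_3 really are the clustered primary key, and both statements should run inside one transaction so they see the same slice:)

```sql
BEGIN TRAN

INSERT INTO table2
SELECT TOP (10000) * FROM table1 ORDER BY pk_1, pk_2, pk_3

-- deleting through the CTE deletes those same rows from table1 itself
WITH batch AS
(
    SELECT TOP (10000) *
    FROM table1
    ORDER BY pk_1, pk_2, pk_3
)
DELETE FROM batch

COMMIT TRAN
```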

Any advice appreciated,

- rob.

Dev Null
Starting Member

39 Posts

Posted - 2008-12-04 : 17:44:28
Ack. Pardon the stutter. I was editing in another window as I tried a few things out, and apparently I fumbled the cut-paste...

darkdusky
Aged Yak Warrior

591 Posts

Posted - 2008-12-05 : 10:17:45
Doing the processing row by row is not the way you should be doing it.
You need to think in "set based" terms. Can you elaborate on what the processing is, and we'll see if it can be done to the set rather than looping through each row.

http://www.simple-talk.com/sql/t-sql-programming/rbar--row-by-agonizing-row/

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-12-05 : 10:31:23
You can use the new OUTPUT operator.
See http://weblogs.sqlteam.com/peterl/archive/2007/10/03/New-OUTPUT-operator.aspx
DELETE     src
OUTPUT deleted.i
INTO Target
FROM Source AS src
WHERE src.i = 2
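
Combined with TOP, this can also do the 10,000-row batching from the original post in one atomic statement per batch. A sketch, assuming table1 and table2 have identical column lists:

```sql
WHILE EXISTS (SELECT * FROM table1)
BEGIN
    -- TOP without ORDER BY takes an arbitrary 10,000 rows, which is fine
    -- here because every row is independent and all rows eventually move
    DELETE TOP (10000) FROM table1
    OUTPUT deleted.* INTO table2
END
```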



E 12°55'05.63"
N 56°04'39.26"

Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2008-12-05 : 11:47:30
Hi Peso,

Keep seeing these little things you post and think "**** me, why didn't I know about this?"

Anyway,

If you do a

DELETE a
OUTPUT deleted.b
INTO #foo
FROM #bar a


does the INTO part do an INSERT as a set or as a loop? (I'm right about SELECT INTO doing things 1 at a time aren't I?)

I await enlightenment.



Charlie
===============================================================
Msg 3903, Level 16, State 1, Line 1736
The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-12-05 : 11:57:15
It's a set.



E 12°55'05.63"
N 56°04'39.26"

Dev Null
Starting Member

39 Posts

Posted - 2008-12-05 : 17:06:18
Peso: Thanks! That's exactly what I needed, and I'd never heard of it before. Unfortunately I also have to be compatible with SQL Server 2000... which means I posted this query in the wrong forum; sorry about that. But I learned something new and useful, so I'm not complaining.

Darkdusky: Sorry, I wasn't being very clear. I don't do the processing row-by-row, it's just that the processing comes up with the same answer for a given set of rows, regardless of how I break those rows up. Where I'm grouping the data by, say, time, the processing has already been set up to correctly recalculate the values if I later get a new row that refers to a time already calculated. Is there a term for that? Grouping independent? So storing a sum or a maximum over time is safe, because you can always add to the sum or check the max of a new addition, but storing the average is not, because you can't recalculate the new average from the old average and a new value. Basically, I was just trying to reassure you folks that changing from doing my processing all at once to doing it in sets of 10,000 or so wouldn't change my answers and invalidate my processing.

Thanks for the offer of help though!
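
Since SQL Server 2000 has no OUTPUT clause, one possible fallback is the classic SET ROWCOUNT batching loop. A sketch, assuming table2 keeps the copied rows and is indexed on the key columns (so the delete can match exactly the rows just copied):

```sql
-- cap every statement at 10,000 affected rows
SET ROWCOUNT 10000

WHILE EXISTS (SELECT * FROM table1)
BEGIN
    BEGIN TRAN

    -- copies at most 10,000 rows; order doesn't matter since rows are independent
    INSERT INTO table2
    SELECT * FROM table1

    -- deletes exactly the rows just copied (earlier batches are already gone
    -- from table1, so only the current batch matches the join)
    DELETE table1
    FROM table1 t1
    JOIN table2 t2
      ON  t1.pk_1 = t2.pk_1
      AND t1.pk_2 = t2.pk_2
      AND t1.pk_3 = t2.pk_3

    COMMIT TRAN
END

-- restore the default (no limit)
SET ROWCOUNT 0
```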
   
