40 million records, archiving and field addition


Andre412
Starting Member

11 Posts

Posted - 2009-05-05 : 10:32:15
Hi

I'm new to this forum; this is my first post. I wanted to run something past you experts before I go ahead with what I have planned.

We have a few tables in our DB that are filling with too much data. We are sitting at 48 million records in one of them as I write this.
The tables contain data pertaining to who is browsing our website; it is the basis of our pay-per-click engine that identifies valid "human traffic" as opposed to bots, crawlers etc.

My plan is to archive the data relating to non-human traffic. The database is fully normalised with good referential integrity. I can flag the data I need to archive by running each record in the base table against another table that identifies whether the record reflects a human or a bot. Once flagged, I plan to iterate through the records, copying the information across to the "archive db" and removing the record from the production db table, which in turn will cascade and remove references from other tables.
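
Roughly speaking, the archive step would look something like this (the database, table and column names here are just placeholders for illustration, not our real schema):

-- Placeholder names throughout; the real schema is larger.
-- 1. Copy the flagged non-human rows into the archive database.
INSERT INTO ArchiveDB.dbo.TrafficArchive (TrafficId, UserAgent, VisitDate)
SELECT t.TrafficId, t.UserAgent, t.VisitDate
FROM dbo.Traffic AS t
WHERE t.IsBot = 1

-- 2. Remove them from production; foreign keys declared with
--    ON DELETE CASCADE clean up the referencing tables.
DELETE FROM dbo.Traffic
WHERE IsBot = 1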

My question is: is this a good approach to controlling the bloat of data that will never be used (archive then delete), or is there a preferred industry-standard approach?
I would also like to know the impact on the server of adding a "bit" field to a table containing 40M records.

Hope that's enough detail for you all.

Looking forward to your responses.

pootle_flump

1064 Posts

Posted - 2009-05-06 : 04:30:53
This practice is broadly fine. The only two things I would say are:
Why "iterate through the records"? Why not perform this as a set-based operation?
Why flag the rows and then act on them? Why bother with the bit column? Why not drive the actions off joins to this other table, rather than writing bit column values (fewer processes, less space required, no normalisation problems)? See the sketch at the end of this post.

Also, what version of SQL Server?
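
Purely as a sketch, with invented table and column names, the join-driven version would be something along these lines:

-- Invented names; the point is that the join to the bot list drives
-- both the archive insert and the delete, so no bit column is needed.
INSERT INTO ArchiveDB.dbo.TrafficArchive (TrafficId, UserAgent, VisitDate)
SELECT t.TrafficId, t.UserAgent, t.VisitDate
FROM dbo.Traffic AS t
INNER JOIN dbo.KnownBots AS b ON b.UserAgent = t.UserAgent

DELETE t
FROM dbo.Traffic AS t
INNER JOIN dbo.KnownBots AS b ON b.UserAgent = t.UserAgent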

Andre412
Starting Member

11 Posts

Posted - 2009-05-06 : 05:25:03
Hi P

Thanks for responding.
I was going to use the bit field as a flag. The reason is the high level of transactions on this table: it is constantly being inserted into, so any lock must be held for the minimum amount of time. The best way I could think of was to create an insert trigger that cross-references the "browser details" against our list of ~5000 records and sets the bit. I was then going to run the "archive" SP periodically as a job to move the records across to the archive db and delete them from production.
The iteration would take place through a cursor, since running a cursor has minimal impact on table locking, whereas running one large copy-and-delete would lock the table until the transaction was complete.
We are running SQL 2005 in our dev and production environments.

Unless you can confirm that a large "batch" operation using joins on nvarchar(max) fields on a table with 40 million records is faster than a cursor iterating one row at a time over a fully indexed bit field?
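
For reference, the insert trigger I have in mind would look roughly like this (the table, column and trigger names are made up for the example; the real matching logic is more involved):

-- Rough sketch only; placeholder names.
CREATE TRIGGER trg_Traffic_FlagBots
ON dbo.Traffic
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON

    -- Flag only the rows just inserted whose user agent appears
    -- in the ~5000-row bot list.
    UPDATE t
    SET t.IsBot = 1
    FROM dbo.Traffic AS t
    INNER JOIN inserted AS i ON i.TrafficId = t.TrafficId
    INNER JOIN dbo.KnownBots AS b ON b.UserAgent = i.UserAgent
END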

Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2009-05-06 : 06:27:00
Why delete the records at all during busy times?

If you have a bit field that marks the rows you don't care about, can't you index that field and then use it in your queries to cut down execution times?
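
Something like this, just as a sketch (the table and column names are placeholders, not your real schema):

-- Placeholder names: index the flag so the queries and the cleanup job
-- can seek on it. An index on a bit column mostly pays off when the
-- flagged rows are a small fraction of the table.
CREATE NONCLUSTERED INDEX IX_Traffic_IsBot ON dbo.Traffic (IsBot)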

Then set up a job, scheduled during your "quiet" times (if you have any), to remove the unneeded rows.

If that's just not possible, how about using a paged delete instead of row by bloody row?

Do a loop and delete x rows at a time rather than 1 at a time.

With some experimentation you will find a "sweet spot" for the maximum number of rows to remove per batch.


Charlie
===============================================================
Msg 3903, Level 16, State 1, Line 1736
The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION

Andre412
Starting Member

11 Posts

Posted - 2009-05-06 : 06:51:28
Hi Charlie

Unfortunately there is no quiet period; this is a global website, and when Europe goes quiet the US wakes up.
So deleting during a quiet time, although ideal, is not an option in this case.
Paged delete sounds like a good idea; I will take a look now.

Just to update you all: a little investigation revealed that no browser in the world presents an "http user agent" string longer than 450 characters when making a request (99.9% of the time it doesn't exceed 300). An nvarchar(450) can still be indexed, hence 450 is the preferred length.
So I can now operate on a fully indexed set of fields, improving performance even further in the trigger.
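
In practice that just means something like this (placeholder names again); SQL Server's index key limit is 900 bytes, which is exactly what an nvarchar(450) column can occupy:

-- Placeholder names. nvarchar(450) = 900 bytes maximum, so the user agent
-- column stays within the 900-byte index key limit and can be indexed.
CREATE NONCLUSTERED INDEX IX_Traffic_UserAgent ON dbo.Traffic (UserAgent)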

Thanks for your responses so far, been very helpful

Andre412
Starting Member

11 Posts

Posted - 2009-05-11 : 06:17:46
Hi Charlie

I was taken off this issue last week in favour of a more pressing problem. I've just returned to it and cannot find much in the way of information on paged deletes.

Can you briefly explain how I can go about a paged delete?

Thanks in advance.

me http://drayblog.gotdns.com
company http://www.lowcarboneconomy.com

Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2009-05-11 : 07:11:05
Oh -- I just meant that instead of performing a huge set-based delete in one go (for thousands of rows and up), you could do a loop and just do something like this:

-- Set up the demo tables

IF OBJECT_ID('tempdb..#seed') IS NOT NULL DROP TABLE #seed
IF OBJECT_ID('tempdb..#del') IS NOT NULL DROP TABLE #del

-- Ten seed rows; cross joining the seed against itself six times
-- below gives 10^6 = 1,000,000 rows to play with.
CREATE TABLE #seed ([a] INT)
INSERT #seed ([a])
SELECT 1
UNION SELECT 2
UNION SELECT 3
UNION SELECT 4
UNION SELECT 5
UNION SELECT 6
UNION SELECT 7
UNION SELECT 8
UNION SELECT 9
UNION SELECT 10

CREATE TABLE #del (
      [a] INT IDENTITY(1,1) PRIMARY KEY
    , [b] VARCHAR(50)
)

-- Fill #del with a million rows of junk data
INSERT #del ([b])
SELECT CAST(CAST(NEWID() AS VARBINARY(16)) AS VARCHAR(16))
FROM
    #seed a
    CROSS JOIN #seed b
    CROSS JOIN #seed c
    CROSS JOIN #seed d
    CROSS JOIN #seed e
    CROSS JOIN #seed f
GO

-- Paged delete: remove the qualifying rows @num at a time, each batch
-- in its own short transaction, until nothing is left to delete.
DECLARE @num INT
DECLARE @rows INT
DECLARE @error INT

SET @rows = 1
SET @num = 500

WHILE @rows > 0 BEGIN

    BEGIN TRANSACTION

    DELETE TOP (@num) FROM #del WHERE [a] > 500000
    SELECT @error = @@ERROR, @rows = @@ROWCOUNT

    IF @error = 0 BEGIN
        COMMIT TRANSACTION
    END
    ELSE BEGIN
        ROLLBACK TRANSACTION
        RETURN    -- abort the batch on error
    END
END


Here there is a table with 1 million rows and we want to remove 500,000 of them. If you delete them all at once it takes a long time (and holds the locks for the whole transaction). If you delete them one at a time it also takes a long time. However, if you remove a page at a time in its own transaction (500 rows in this example), it doesn't.





Charlie
===============================================================
Msg 3903, Level 16, State 1, Line 1736
The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION

Andre412
Starting Member

11 Posts

Posted - 2009-05-13 : 20:37:35
Hi Charlie

Thanks for your response, though for some reason I didn't see it until now.
I went about it a slightly different way, based on this post:
[url]http://www.sqlusa.com/bestpractices2005/hugeupdate/[/url]

It was such a performance increase I was shocked: 17 million records updated in 35 minutes, with minimal impact on the traffic (a rough sketch of the pattern is below).

I was so impressed I wrote a quick blog post about it here: [url]http://drayblog.gotdns.com/index.php/2009/05/14/sql-update-updating-millions-of-records-nicely/[/url]
I'm sure there is further optimising to be done, but it was very late and I needed to finish.
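
The pattern from that link is, roughly, a batched update in a loop rather than one giant statement. A simplified sketch, again with placeholder names rather than our real schema:

-- Simplified sketch with placeholder names; each small batch commits on
-- its own, so locks are held only briefly and live traffic is barely affected.
DECLARE @batch INT
DECLARE @rows INT

SET @batch = 10000
SET @rows = 1

WHILE @rows > 0 BEGIN

    UPDATE TOP (@batch) t
    SET t.IsBot = 1
    FROM dbo.Traffic AS t
    INNER JOIN dbo.KnownBots AS b ON b.UserAgent = t.UserAgent
    WHERE t.IsBot IS NULL    -- only rows not yet flagged

    SET @rows = @@ROWCOUNT

END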

Your code snippet has given me some more ideas regarding automation. Thanks again.

me http://drayblog.gotdns.com
company http://www.lowcarboneconomy.com