Disaster Procedure Checklist - SQL Server Forums

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2000 Forums

SQL Server Administration (2000)

Disaster Procedure Checklist

Author

Topic

Kristen
Test

22859 Posts

Posted - 2006-05-08 : 13:51:53

A client's RAID had a disk fail last night, which should be a no-brainer, but the disk failure caused a corruption of a couple of databases - see http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=65707

Here's what I did, I'd appreciate any suggestions for what else I might have done, or what you would have done differently etc.

Basically there were errors in two databases (data page, rather than just index page).

I had a full backup from previous day, and intervening TLog backups.

There are two disk channels:
1) LDF Log Files
2) MDF Data files and backup files

Channel 2 had the failure. Thus I decided that the backup files MIGHT be damaged.

The steps I took were:

A1) Immediately disable all scheduled backup processes - these might truncate the log, and if the backup is to a duff disk then it might fail

A2) Take all web sites offline (we have a Holding Page procedure which disables all access to the site)

B) Prevent access to the databases (by scheduled jobs / persons unknown!) by changing the databases to DBO/READ-ONLY:

ALTER DATABASE 'MyDatabase' SET RESTRICTED_USER, READ_ONLY WITH ROLLBACK IMMEDIATE

I achieved this with:


SELECT 'PRINT ''' + name + '''' 
	+ CASE WHEN name IN ('MyAdminDB', 'MyAdminDB2') THEN ' -- *** DO THIS ONE AFTER LAST BACKUP MADE ***'  ELSE '' END
	+ CHAR(13) + CHAR(10) + 'GO' 
	+ CHAR(13) + CHAR(10) +  'ALTER DATABASE ' + name + ' SET RESTRICTED_USER, READ_ONLY WITH ROLLBACK IMMEDIATE' 
	+ CHAR(13) + CHAR(10) + 'GO'
FROM	master.dbo.sysdatabases
WHERE	name NOT IN ('master', 'msdb', 'model', 'pubs', 'northwind', 'tempdb')
ORDER BY CASE WHEN name IN ('MyAdminDB', 'MyAdminDB2') THEN 2 ELSE 1 END, name

I hand-checked the list of databases and ran the script. I excluded our Admin DB [MyAdminDB] as this is required by our backup process (to log backup filenames etc.)

C) I ran a final TLog backup (to catch the latest database transactions) ~~[See note below]~~ and then marked the MyAdminDB as DBO/Read-Only. Use WITH NO_TRUNCATE in case the target backup medium is flakey, as it will allow another backup to a different disk.

D) With everything "disabled" I then stopped the SQL Server services and copied all the MDF & LDF files. I copied them first to the "other disk channel" and then, again, to a Share [pre-created for just such a requirement] to another machine on the LAN which has SQL Server installed.

E) I also copied the most recent FULL backups and all subsequent TLog/DIFF backups to the Share.

F) I then restarted the SQL Service, but not SQL Agent.

G) I then did a CHECKDB of all DBs (similar code to above to make a script for all DBs) along the lines of:


DBCC CHECKDB ('MyDatabase') WITH ALL_ERRORMSGS, NO_INFOMSGS

H) Whilst that was running I checked the Event Log and SQL Log.

SQL Log started recording database corruptions during DIFF backups made at midnight. The Event log had a "RAID Disk failed" 5 minutes before the midnight DIFF backup was made.

I) On the Shared Machine on the LAN restored the FULL backup, and all subsequent TLog backups of the main (corrupted) database.

This takes a while of course because the database did not pre-exist. Perhaps better to pre-restore a backup on that server periodically so that the files were "close to right size" (a failover/log-shipping server would be better of course, I was thinking for a recovery process one-step removed from that)

J) Restored in STANDBY mode [See note below]

K) Ran a CHECKDB:

DBCC CHECKDB ('MyDatabase') WITH ALL_ERRORMSGS, NO_INFOMSGS

which had no errors - so it appears that the TLog backups were immune to the corruption which occurred on the MDF/Data file channel.

As the CHECKDB was OK:

L) Activate database and Reinstated Database Users (sp_change_users_login etc.)


RESTORE DATABASE MyDatabase WITH RECOVERY

M) Because the final database TLog backup was from a DBO/READ_ONLY database reversed that state to put the DB back to normal:

ALTER DATABASE MyDatabase SET READ_WRITE, MULTI_USER

N) Reconfigured the web site to point to the new database server/database.

Then I set about dealing with the original server. The guy with the spare disk had arrived at the data centre by this time, and the drive was busy synchronising with the rest of the RAID.

So what is damaged?

We took the view that the errors were probably "soft"

O) we did a surface scan of the disks

P) At this point I'm worried about the corruption of the database MDF file. If I just restore will some of the file not be written to, and that might be a problem in the future? Perhaps I should DROP the database, and restore to a brand-new-database so that the new database is fully pre-initialised.

Q) We are going to take the opportunity to defrag the whole machine whilst the database is "on the spare machine" and then backup/restore the database back onto the Main server in a day or two.

That's as far as I have got!

Questions:

C] Should I have used "WITH NO_TRUNCATE" as part of that final TLog backup? (BoL suggests this in "Using Standby Servers", but I just used my normal backup routine which would have truncated the log)

J] I only restored the LAST Tlog backup [which was made with the DB in SINGLE USER mode and no pending transactions] in STANDBY mode - it didn't seem to complain, and allowed me to then check the database. Was that a bad move? Clearly I could just restore the whole lot and then do WITH RECOVERY, and not use STANDBY at all. I wasn't thinking, but what WAS in my mind was to restore each TLog in turn, WITH STANDBY, and do a CHECKDB before restoring the next TLog, to locate where any corruption of TLogs existed - I would presumably then have to start-over the restore operation up to the one-before-that TLog backup?.

Z] Should the TLog backups be made to the MDF disk channel, and the FULL/DIFF backups to the TLog channel - to avoid propagating errors on one or other? (This is assuming that there is NO disk channel JUST for backups)

Thanks

Kristen

mcrowley
Aged Yak Warrior

771 Posts

Posted - 2006-05-09 : 16:09:52

Whew! Quite the story there. I have personally seen DBCC errors caused by currently bad disks. In my case I decided to leave the affected system alone until the disks were fixed up, as the DBCC checks reported corruptions in different areas each time they were run. In the end, there was only about 8 corrupted entires of a text column, and I got back all but 1.

quote:

Z] Should the TLog backups be made to the MDF disk channel, and the FULL/DIFF backups to the TLog channel - to avoid propagating errors on one or other? (This is assuming that there is NO disk channel JUST for backups)

I would do this as a general practicew, just to cut down on the disk activity. You are reading from one channel, and writing to the other.

quote:

J] I only restored the LAST Tlog backup [which was made with the DB in SINGLE USER mode and no pending transactions] in STANDBY mode - it didn't seem to complain, and allowed me to then check the database. Was that a bad move? Clearly I could just restore the whole lot and then do WITH RECOVERY, and not use STANDBY at all. I wasn't thinking, but what WAS in my mind was to restore each TLog in turn, WITH STANDBY, and do a CHECKDB before restoring the next TLog, to locate where any corruption of TLogs existed - I would presumably then have to start-over the restore operation up to the one-before-that TLog backup?.

If you are unsure where/when the corruption happened, it would be best to do periodical DBCCs to make absolutely sure. I have seen (oddly enough on the same system as above) a transaction log backup happily pass over a bit of corruption with no error.

quote:

C] Should I have used "WITH NO_TRUNCATE" as part of that final TLog backup? (BoL suggests this in "Using Standby Servers", but I just used my normal backup routine which would have truncated the log)

Probably yes. You would be writing the backup out to a possibly bad/finicky drive, and second chances are a positive boon in this business.

Overall, though, it looks like you did an excellent job of it. A very thourough writeup, too. I may have to copy this and use it as a starting point for procedures here.

Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)

7020 Posts

Posted - 2006-05-09 : 17:45:39

Have you verified that the problem was actually the disk, and not the disk array controller?

CODO ERGO SUM

Kristen
Test

22859 Posts

Posted - 2006-05-10 : 01:23:00

"You would be writing the backup out to a possibly bad/finicky drive"

Ah, good point. "Second Chance" - Yup, I like that option!

"Have you verified that the problem was actually the disk, and not the disk array controller?"

I think we have ... but I will get confirmation from the hardware boys that they are sure, thanks for that.

Kristen

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources