 RAID5 v. RAID1+0

Kristen
Test

22859 Posts

Posted - 2006-06-08 : 05:20:11
We had a failure a while back on a RAID5 system. A disk failed and we got a "Torn Page" error in the database.

I assume this was because the drive failed during a write cycle, and all the other drives only got a half-written sector.

Maybe it's the type of RAID controller - it would seem sensible to me to retry the write operation if that happens.

Anyway, is RAID1+0 guaranteed NOT to do this, such that a write failure on one drive still ensures that the other drive gets properly written to? (I assume that RAID1+0 writes to each drive one after another, so if the first one fails during the write it gets locked out and the write to the second drive carries on regardless?)

Thanks

Kristen

nr
SQLTeam MVY

12543 Posts

Posted - 2006-06-08 : 05:37:26
Nope.
A torn page means that SQL Server could not complete the write to the disk subsystem.
I think a disk failure in a RAID 5 configuration shouldn't cause this. The RAID set should carry on as though nothing has happened, using the other disks and waiting for you to replace the failed disk.

Sounds like the disk subsystem went down while the server was trying to write - nothing that can be done about that.
There are some things about write caching and battery backups that make it less likely to happen, but it's something that you should always allow for.


==========================================
Cursors are useful if you don't know sql.
DTS can be used in a similar way.
Beer is not cold and it isn't fizzy.

Kristen
Test

22859 Posts

Posted - 2006-06-08 : 05:55:47
"A torn page means that sql server could not complete the write to the disk subsystem."

But in RAID1+0 if the writes [to each drive] are Sequential won't they be independent?

If First Drive fails then either the write is not made at all (and presumably O/S reports error to SQL [dunno what happens then!]), or it is able to successfully go on to write to the second drive.

If Second Drive fails and the write is already completed to the First Drive then everything is fine, isn't it?

I suppose if the disk controller is going to give up then the rest of the stripe is at risk ... so what I want is a controller that won't give up! (it strikes me that something as simple as a retry would fix the problem - by then the bad drive will be locked out).

I guess I'm struggling to believe that we can't have a system that is immune to this type of problem (i.e. if there are still the minimum live drives left in the RAID)
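The behaviour being hoped for here can be put as a small sketch. This is only a toy model of one mirrored pair, not how any real controller is implemented: the `Drive` class, the `fail_on_write` flag and `mirrored_write` are all made-up names for illustration, and each block write is assumed to be all-or-nothing, which real hardware does not promise.

```python
class Drive:
    """Toy drive: a dict of block number -> bytes, with an optional injected fault."""

    def __init__(self, name, fail_on_write=False):
        self.name = name
        self.blocks = {}
        self.fail_on_write = fail_on_write  # hypothetical fault-injection switch
        self.online = True

    def write(self, block_no, data):
        if self.fail_on_write:
            raise IOError(f"{self.name}: write failed")
        self.blocks[block_no] = data


def mirrored_write(drives, block_no, data):
    """Write the block to every online mirror, one after another.

    A drive that errors is taken offline and the loop carries on to the
    next drive. The write counts as good if at least one mirror took it.
    """
    succeeded = 0
    for drive in drives:
        if not drive.online:
            continue
        try:
            drive.write(block_no, data)
            succeeded += 1
        except IOError:
            drive.online = False  # lock the bad drive out of the mirror
    if succeeded == 0:
        raise IOError("no mirror accepted the write")
    return succeeded


first = Drive("first", fail_on_write=True)
second = Drive("second")
copies = mirrored_write([first, second], 7, b"x" * 8192)
print(copies, first.online, second.blocks[7] == b"x" * 8192)
```

In this model the failed first drive is locked out and the second drive still ends up with the whole block. The catch is the all-or-nothing assumption: if a single block write can itself stop part way through, this logic alone does not protect the page.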

Kristen

MichaelP
Jedi Yak

2489 Posts

Posted - 2006-06-08 : 15:34:01
Kristen,
Do you have write caching enabled on the drive in Windows? Sometimes, for a particular RAID setup, you need to turn that off.
What RAID card are you using? Does it have a battery-backed cache? Are the card and the drives in your SQL Server box (DAS), or are they on a SAN?

Michael

<Yoda>Use the Search page you must. Find the answer you will. Cursors, path to the Dark Side they are. Avoid them, you must. Use Order By NewID() to get a random record you will.</Yoda>

Opinions expressed in this post are not necessarily those of TeleVox Software, inc. All information is provided "AS IS" with no warranties and confers no rights.

nr
SQLTeam MVY

12543 Posts

Posted - 2006-06-08 : 16:26:29
>> But in RAID1+0 if the writes [to each drive] are Sequential won't they be independent?
Probably, but you are assuming that the disk treats a block write request from SQL Server as a unit of work.
The write will probably be split up by the controller into smaller batches for efficiency. These batches will be independent, but a failure will still mean that the block write has failed unless it happened on the last batch.
Have a look at torn I/O in
http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/sqlIObasics.mspx
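To make the split-write point concrete, here is a rough sketch of how an 8 KB page written as 512-byte sectors can end up torn. It is a simplified model, not SQL Server's actual mechanism: the per-sector version stamp and the function names (`stamp_page`, `write_page`, `is_torn`) are assumptions made up for this example.

```python
PAGE_SIZE = 8192                              # SQL Server data page, in bytes
SECTOR_SIZE = 512                             # traditional disk sector, in bytes
SECTORS_PER_PAGE = PAGE_SIZE // SECTOR_SIZE   # 16 sectors per page


def stamp_page(payload, version):
    """Split a page into sectors and tag each one with the same version stamp."""
    assert len(payload) == PAGE_SIZE
    return [
        (version, payload[i * SECTOR_SIZE:(i + 1) * SECTOR_SIZE])
        for i in range(SECTORS_PER_PAGE)
    ]


def write_page(disk, new_sectors, fail_after=None):
    """Overwrite the page held in 'disk' one sector at a time.

    If fail_after is set, the write stops once that many sectors are
    written, modelling the disk subsystem going down mid-write.
    Returns True only if every sector was written.
    """
    for i, sector in enumerate(new_sectors):
        if fail_after is not None and i >= fail_after:
            return False
        disk[i] = sector
    return True


def is_torn(disk):
    """A page is torn if its sectors do not all carry the same stamp."""
    return len({version for version, _ in disk}) > 1


disk = stamp_page(b"\x00" * PAGE_SIZE, version=1)
completed = write_page(disk, stamp_page(b"\xff" * PAGE_SIZE, version=2), fail_after=5)
print(completed, is_torn(disk))  # prints: False True
```

Five sectors carry the new stamp and eleven still carry the old one, so the page on disk is neither the old page nor the new one. A failure on any batch except the last leaves it in that state.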


Kristen
Test

22859 Posts

Posted - 2006-06-08 : 16:52:44
I hadn't considered the 8K page being written in 512-byte sectors, and failing part way through that lot.

I still think the RAID manufacturers should be able to create a guaranteed-write controller (assuming minimum number of drives still alive) ...

... but if I have backups on separate (or at least different) channels, as we had when our RAID5 went South, then we should be able to restore without loss.

We shut down all user access, restored the last Full and all TLog backups, and the DB was fine - so the data had made it OK to the LOG [separate channel] when the RAID5 disk went down ...

Would a fail-over system detect that there was a torn page, as it occurred [or on the first attempted read] and trigger a fail-over?

Kristen

tkizer
Almighty SQL Goddess

38200 Posts

Posted - 2006-06-08 : 17:10:47
quote:
Originally posted by Kristen


Would a fail-over system detect that there was a torn page, as it occurred [or on the first attempted read] and trigger a fail-over?



Do you mean a clustered system? If so, then no. The nodes in the cluster use the same disks anyway.

Tara Kizer
aka tduggan

cmdr_skywalker
Posting Yak Master

159 Posts

Posted - 2006-06-08 : 17:14:16
RAID 5 will fail if you have more than one drive failure (or bad sectors) affecting the same data block at the same time. RAID 1+0 is also subject to the hotspot problem. If you have a very large database and the budget, try RAID 100 (RAID 10+0).

I think the fail-over is on a system/application level, not just on a batch transaction. However, you may configure the system to alert you :).


May the Almighty God bless us all!

Kristen
Test

22859 Posts

Posted - 2006-06-08 : 17:17:37
"RAID 5 will fail if you have more than one drive failure"

Indeed, but I have laboured under the misapprehension that RAID5 would just "carry on" if a [single] drive failed. That's not the case, as it turned out.

I'll just have to invent some hardware to solve the problem!

Kristen

MichaelP
Jedi Yak

2489 Posts

Posted - 2006-06-08 : 17:21:27
Personally, I've never seen a RAID 5 array that "died" when one drive died. I suspect that in this case a RAID driver or RAID card really caused your issue since it didn't handle the write properly.

Does your RAID card have the latest firmware?
Do all of your HDDs have the latest firmware, and are all of the drives the same model with matching firmware?
What sort of drives are we talking about here? IDE, SATA, SCSI, Fibre Channel?

The firmware thing is really really important, esp. on larger systems.

Michael


Kristen
Test

22859 Posts

Posted - 2006-06-08 : 17:50:14
OK, thanks Michael I'll get that checked and report back.

Kristen

nr
SQLTeam MVY

12543 Posts

Posted - 2006-06-13 : 12:43:50
One of the ideas of RAID 5 is that it should tolerate the loss of a disk. You lose the redundancy while that disk is out, but the system should still carry on (might even be faster).
Then you just replace the disk and everything slows down while the data is being copied to the new disk.
Often though no one notices a disk has failed so the system carries on until another disk fails and you lose the data.
(Yes that disk always has a red light - we don't know why but it doesn't seem to matter).

Could be that your controller doesn't tolerate the loss of a disk or it failed in such a way as to bring the subsystem down. Sounds like more than a simple disk error though.

Anyway, it shouldn't need to retry the write as it can carry on with the other disks and repair the failed one later.
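The reason a RAID 5 set can carry on and repair the failed disk later is parity: any one missing block in a stripe can be recomputed from the others. A minimal sketch of the idea follows, with made-up helper names (`xor_blocks`, `make_stripe`, `rebuild`) and the parity block kept in a fixed position for simplicity, whereas real RAID 5 rotates parity across the disks.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe, lost_index):
    """Recompute the block at lost_index from all the surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


stripe = make_stripe([b"AAAA", b"BBBB", b"CCCC"])  # three data disks plus parity
print(rebuild(stripe, 1) == stripe[1])             # prints: True
```

Losing any single block, data or parity, is recoverable this way. Losing two blocks from the same stripe is not, which is why an unnoticed first failure is so dangerous.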



Kristen
Test

22859 Posts

Posted - 2006-06-13 : 13:44:18
Yup, that's always been my understanding Nigel, but it clearly trashed the rest of the block on the other drives (i.e. we got a TORN PAGE error), so I have made the assumption that the drive failed during the write cycle, and that in turn prevented the write from being completed to the rest of the drives in the array. Very annoying!

Kristen