Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 General SQL Server Forums
 Data Corruption Issues
 DBCC CHECKDB

Author  Topic 

coolerbob
Aged Yak Warrior

841 Posts

Posted - 2007-05-17 : 04:42:04
We had a bit of an accident last night.

My nightly maintenance task runs fine normally. Last night it didn't.
When it got to DBCC CHECKDB for an audit database (it gets populated from triggers in other databases; one of the tables has over a million records), the live database server shut down and did not come back up. DBCC CHECKDB completed in 6 seconds with no errors. The next task in the sequence that would have run (but the log has no message of it completing or failing) was DBCC INDEXDEFRAG on that audit table with over a million records.

Even the OFF button on the 64-bit Dell server (SQL2005, Windows Server Enterprise x64 2003 SP1, 16 gig ram) did not work. There wasn't even the blue screen of death. It was just a blue screen. A plug out and plug back in was required.

The only recent configuration change has been that we have had Dell install a piece of disk-level replication software called Double Take.

Could CheckDB have contributed to this problem?

Kristen
Test

22859 Posts

Posted - 2007-05-17 : 09:03:58
"Could CheckDB have contributed to this problem?"

Well, it probably would have read every used sector in the file ... that would have added some "stress" to anything already flaky.

But I would have expected something to "break" during CHECKDB, not afterwards.

INDEXDEFRAG would cause some Stress too ...

Kristen
Go to Top of Page

paulrandal
Yak with Vast SQL Skills

899 Posts

Posted - 2007-05-17 : 10:25:32
No - it wasn't caused by CHECKDB - nothing that SQL Server (or any other software AFAIK) can stop a power switch working. You've got problems with your hardware - I would call Dell first.

Paul Randal
Principal Lead Program Manager, Microsoft SQL Server Core Storage Engine
(Legalese: This posting is provided "AS IS" with no warranties, and confers no rights.)
http://blogs.msdn.com/sqlserverstorageengine/default.aspx
Go to Top of Page

coolerbob
Aged Yak Warrior

841 Posts

Posted - 2007-05-17 : 10:36:14
quote:
Originally posted by paulrandal

No - it wasn't caused by CHECKDB - nothing that SQL Server (or any other software AFAIK) can stop a power switch working. You've got problems with your hardware - I would call Dell first.

Paul Randal
Principal Lead Program Manager, Microsoft SQL Server Core Storage Engine
(Legalese: This posting is provided "AS IS" with no warranties, and confers no rights.)
http://blogs.msdn.com/sqlserverstorageengine/default.aspx



Thanks Paul. That makes sense.
Go to Top of Page

coolerbob
Aged Yak Warrior

841 Posts

Posted - 2007-05-23 : 05:58:15
Dell have thusfar said that there are no problems. The problem is escalating. It happened last night again - at exactly the same point and time! Namely, strait after the DBCC CHECKDB for the audit database. I changed the DB on which DBCC INDEXDEFRAG ran in the next step so I know that's not the problem.

One of the things I don't understand is why it crashed the server once last week and then worked fine for a few nights and then crashed the server again last night. Why is it not crashing the server every night?

It's worth mentioning also that two weekends ago I applied SS SP2 and the problem has only occured these two times after I did that.

Needless to say, I and management are getting really concerned. Here are the sequence of events in the log (read from bottom up):


05/23/2007 04:48:36,spid2s,Unknown,SQL Server has encountered 745 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:\MSSQL\data\MyDB_Data.MDF] in database [Backoffice] (19). The OS file handle is 0x00000000000007DC. The offset of the latest long I/O is: 0x000000b6ec0000
05/23/2007 04:40:08,Server,Unknown,Resource Monitor (0xd34) Worker 0x000000008000C1C0 appears to be non-yielding on Node 0. Memory freed: 40488 KB. Approx CPU Used: kernel 546 ms<c/> user 62 ms<c/> Interval: 60000.
05/23/2007 04:25:03,spid23s,Unknown,This instance of SQL Server has been using a process ID of 2484 since 17/05/2007 08:00:33 (local) 17/05/2007 07:00:33 (UTC). This is an informational message only; no user action is required.
05/22/2007 19:36:56,Server,Unknown,Resource Monitor (0xd34) Worker 0x000000008000C1C0 appears to be non-yielding on Node 0. Memory freed: 81744 KB. Approx CPU Used: kernel 1250 ms<c/> user 265 ms<c/> Interval: 132204.
05/22/2007 19:35:56,Server,Unknown,Resource Monitor (0xd34) Worker 0x000000008000C1C0 appears to be non-yielding on Node 0. Memory freed: 52016 KB. Approx CPU Used: kernel 500 ms<c/> user 171 ms<c/> Interval: 60000.
05/22/2007 19:35:56,Server,Unknown,External dump process return code 0x20000001.<nl/>External dump process returned no errors.
05/22/2007 19:35:46,Server,Unknown,Stack Signature for the dump is 0x0000000000000344
05/22/2007 19:35:45,Server,Unknown,* Short Stack Dump
05/22/2007 19:35:45,Server,Unknown,* -------------------------------------------------------------------------------
05/22/2007 19:35:45,Server,Unknown,* *******************************************************************************
05/22/2007 19:35:45,Server,Unknown,*
05/22/2007 19:35:45,Server,Unknown,* Non-yielding Resource Monitor
05/22/2007 19:35:45,Server,Unknown,*
05/22/2007 19:35:45,Server,Unknown,* 05/22/07 19:35:45 spid 0
05/22/2007 19:35:45,Server,Unknown,* BEGIN STACK DUMP:
05/22/2007 19:35:45,Server,Unknown,*
05/22/2007 19:35:45,Server,Unknown,* *******************************************************************************
05/22/2007 19:35:45,Server,Unknown,***Stack Dump being sent to C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0026.txt
05/22/2007 19:35:45,Server,Unknown,**Dump thread - spid = 0<c/> PSS = 0x0000000000000000<c/> EC = 0x0000000000000000
05/22/2007 19:35:44,Server,Unknown,Using 'dbghelp.dll' version '4.0.5'
05/22/2007 19:33:57,spid74,Unknown,DBCC CHECKDB (MyAuditDB) executed by MyDomain\Administrator found 0 errors and repaired 0 errors. Elapsed time: 0 hours 0 minutes 7 seconds.

So it looks like the server restarted itself and came up 9 hours later. It wasn't very happy when it did come back and a manual reboot was performed. Everything fine after that. Here is the log file mentioned:

=====================================================================
BugCheck Dump
=====================================================================

This file is generated by Microsoft SQL Server
version 9.00.3054.00
upon detection of fatal unexpected error. Please return this file,
the query or program that produced the bugcheck, the database and
the error log, and any other pertinent information with a Service Request.


Computer type is AT/AT COMPATIBLE.
Current time is 19:35:45 05/22/07.
16 Unknown CPU 9., 2660 Mhz processor (s).
Windows NT 5.2 Build 3790 CSD Service Pack 1.

Memory
MemoryLoad = 98%
Total Physical = 16379 MB
Available Physical = 326 MB
Total Page File = 18770 MB
Available Page File = 1966 MB
Total Virtual = 8388607 MB
Available Virtual = 8371221 MB
**Dump thread - spid = 0, PSS = 0x0000000000000000, EC = 0x0000000000000000
***Stack Dump being sent to C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0026.txt
* *******************************************************************************
*
* BEGIN STACK DUMP:
* 05/22/07 19:35:45 spid 0
*
* Non-yielding Resource Monitor
*
* *******************************************************************************
* -------------------------------------------------------------------------------
* Short Stack Dump


I accept that a DBCC may not cause the Hardware to lock up totally so that the power button doesn't work. But I think that was just a ripple effect of this problem. I say that because this time our KVM switch let us go to all servers other than this one - so we have to Terminal Services in at the moment. I think when something so unexpected happens, you can have any number of strange side-effects down the line. But it doesn't change the fact that the problem started at the same point. That is where our problem lies. I don't think that can be argued now. The question is, exactly what is the problem? (I would open a ticket at MS but the last time I did that for a problem I had, I didn't have much joy.)
Go to Top of Page

Kristen
Test

22859 Posts

Posted - 2007-05-23 : 07:00:53
Restore a backup from just before the crash to another machine, with same version of SQL Server but "different" hardware, repeat the scheduled tasks steps, does that crash?

Possibly restore the database to the SAME machine, but a Temporary database, repeat the steps on that temporary database, can you crash that one? (This may not be very desirable, if its a production machine!, but may be a necessary-evil too ...)

Kristen
Go to Top of Page

coolerbob
Aged Yak Warrior

841 Posts

Posted - 2007-05-23 : 07:51:39
quote:
Originally posted by Kristen

Restore a backup from just before the crash to another machine, with same version of SQL Server but "different" hardware, repeat the scheduled tasks steps, does that crash?

Possibly restore the database to the SAME machine, but a Temporary database, repeat the steps on that temporary database, can you crash that one? (This may not be very desirable, if its a production machine!, but may be a necessary-evil too ...)

Kristen



Because the problem is not occuring every night, there is no guarantee that I will be able to re-produce the problem. So while I deeply appreciate your suggestion, I don't feel it will give the certain answer I am after.
For now I have switched off CHECKDB against that Audit DB. Hopefully that buys me some time...
I'm kinda hoping something there in those log files tells Paul what he needs to know.
Go to Top of Page

paulrandal
Yak with Vast SQL Skills

899 Posts

Posted - 2007-05-23 : 10:36:33
I'm wondering whether you're experiencing something to do with the OS paging out SQL's working set - see http://support.microsoft.com/kb/918483 for details.

I recommend you contact SQL Product Support to assist you with this - it's not something we can debug over this forum.

Thanks

Paul Randal
Principal Lead Program Manager, Microsoft SQL Server Core Storage Engine
(Legalese: This posting is provided "AS IS" with no warranties, and confers no rights.)
http://blogs.msdn.com/sqlserverstorageengine/default.aspx
Go to Top of Page

coolerbob
Aged Yak Warrior

841 Posts

Posted - 2007-05-31 : 12:19:29
Before I contact MS (which I presume will cost), here's one more example (as it happend last night again - on a different DB). There's a bit more detail this time:

=====================================================================
BugCheck Dump
=====================================================================

This file is generated by Microsoft SQL Server
version 9.00.3054.00
upon detection of fatal unexpected error. Please return this file,
the query or program that produced the bugcheck, the database and
the error log, and any other pertinent information with a Service Request.


Computer type is AT/AT COMPATIBLE.
Current time is 02:43:47 05/31/07.
16 Unknown CPU 9., 2660 Mhz processor (s).
Windows NT 5.2 Build 3790 CSD Service Pack 1.

Memory
MemoryLoad = 98%
Total Physical = 16379 MB
Available Physical = 192 MB
Total Page File = 18396 MB
Available Page File = 2493 MB
Total Virtual = 8388607 MB
Available Virtual = 8371386 MB
**Dump thread - spid = 53, PSS = 0x000000024580FC20, EC = 0x000000031BC661C0
***Stack Dump being sent to C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0030.txt
* *******************************************************************************
*
* BEGIN STACK DUMP:
* 05/31/07 02:43:47 spid 53
*
* Latch timeout
*
* Input Buffer 86 bytes -
* DBCC CHECKDB use myDB2
*
* *******************************************************************************
* -------------------------------------------------------------------------------
* Short Stack Dump


Something like this seems to lead up to it quite often:

2007-05-28 02:11:00.30 spid2s SQL Server has encountered 6 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\tempdb.mdf] in database [tempdb] (2). The OS file handle is 0x0000000000000A6C. The offset of the latest long I/O is: 0x00000010760000
2007-05-28 02:11:00.30 spid2s SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\templog.ldf] in database [tempdb] (2). The OS file handle is 0x0000000000000A64. The offset of the latest long I/O is: 0x00000001098000
Go to Top of Page

paulrandal
Yak with Vast SQL Skills

899 Posts

Posted - 2007-06-03 : 13:42:57
ok - this is an IO subsystem problem - either its overloaded while CHECKDB is running or there's something wrong with it. It's not a SQL Server problem. I'd download and run the SQLIOSim tool (see http://blogs.msdn.com/sqlserverstorageengine/archive/2006/10/06/SQLIOSim-available-for-download.aspx) to stress the IO subsystem and see what it throws up.

Paul Randal
Principal Lead Program Manager, Microsoft SQL Server Core Storage Engine
(Legalese: This posting is provided "AS IS" with no warranties, and confers no rights.)
http://blogs.msdn.com/sqlserverstorageengine/default.aspx
Go to Top of Page

coolerbob
Aged Yak Warrior

841 Posts

Posted - 2007-06-04 : 07:32:08
Thanks Paul, I'll try that.
Go to Top of Page

coolerbob
Aged Yak Warrior

841 Posts

Posted - 2007-06-07 : 10:10:39
Thanks Paul, I tried that on my test server before doing it on the production one tomorrow. But I ended up having errors on my test server too!

Where does the data that you see in the Test Results get stored? Is there a log file somewhere? sqliosim.log.xml talks about e:\yukon\sosbranch\... But it doesn't actually exist.

The errors I got on my test system were:
4 x "Unable to get disk cache info"

&
3 x "(x) IO requests are outstanding for more than 15 sec"

&
2 x "(x) long IO requests are outstanding for file E:\sqliosim.mdx Handle 0x318. Offset of latest long IO (y), duration (z) ms"

In case I get this on the production server too (which I think I will), what does it mean?


----------------
Also, with 4 hours to go now before I do the strain test on the production server, we had it misbehave just 2 hours ago during lunch. It encountered 189 occurrences of I/O requests taking longer than 15 seconds on the log file of our main production database. Quite a few people got timeout error messages as a result. There was no DBCC command running at the time.
I'm pretty sure the strain test tonight will cause the same thing.
So how does a problem like this get solved? Why are these I/O requests taking too long?
Go to Top of Page

Kristen
Test

22859 Posts

Posted - 2007-06-08 : 09:56:05
Could it be a database expansion that ran? I imagine that would add to the normal I/O load, but under SQL2k5 I believe it is much less than was under SQL2k.

(10% expansion o0n a multi-GB database is a pretty big single extension. I favour fixed size extensions for multi-GB databases ...)

Just thinking out loud ...

Kristen
Go to Top of Page

coolerbob
Aged Yak Warrior

841 Posts

Posted - 2007-06-11 : 03:50:46
So I ran it on the live system. Here are the results:
I got the "Unable to get disk cache info" messages on the production system too - right after it tried to expand the ldx files.

I also got the "(x) IO requests are outstanding for more than 15 sec" & the "(x) long IO requests are outstanding for file E:\sqliosim.mdx Handle 0x318. Offset of latest long IO (y), duration (z) ms" messages strait after it performed the Checkpointing on the initial Update Scan.

I'm kinda thinking this is normal behaviour? Which would mean I'm no closer to finding the source of my problem - other than that I have discounted another possible source to the problem. Can we still say that the I/O subsystem has problems?
Go to Top of Page

paulrandal
Yak with Vast SQL Skills

899 Posts

Posted - 2007-06-11 : 17:40:22
Sounds to me like you have a bottleneck somewhere. I'll ask one of the team to take a look at this thread.

Paul Randal
Principal Lead Program Manager, Microsoft SQL Server Core Storage Engine
(Legalese: This posting is provided "AS IS" with no warranties, and confers no rights.)
http://blogs.msdn.com/sqlserverstorageengine/default.aspx
Go to Top of Page

Jhalmans
Microsoft SQL Server Product Team

12 Posts

Posted - 2007-06-11 : 19:41:39
SQLIOSim will generate sufficient IO requests to overwhelm almost any disk subsystem. The long IO message from the simulator are normal. Although this does tell you that at some point the disks won't keep up.

Long IO messages from the server are much more concerning. There are 2 reasons to get this message from the server, neither is "normal"
1. The disk subsystem simply cannot keep up with the load.
2. There is a problem somewhere in the subsystem causing delays in dealing with IOs.

See http://blogs.msdn.com/sqlserverstorageengine/archive/2006/06/21/642314.aspx for more information.



This posting is provided "AS IS" with no warranties, and confers no rights
Go to Top of Page

coolerbob
Aged Yak Warrior

841 Posts

Posted - 2007-06-12 : 09:39:50
Thank you Mr Halmans. That has further confirmed my understanding of the problem.

Just ran the PhysicalDisk\Avg Disk sec./read, write and transfer counters for 5 minutes and this is what it brought back (I presume I should probably run this for a 24-hour period):

read: 0.008avg ; 0.036max

write: 0.004avg ; 0.047max

transfer: 0.007avg ; 0.020max

But these values are referring to the amount of bytes during the operation - as you know. It doesn't help me establish whether I am accomplishing the 10ms completion time (that you referred to on the blog) server-wide. Is there a way of monitoring that?

Having bought quite a high spec server and having updated it with the latest drivers, is the inevitable answer going to be "throw more hardware at it, buy a bigger server"? Or at least, faster drives?
Go to Top of Page

Kristen
Test

22859 Posts

Posted - 2007-06-12 : 12:20:48
"is the inevitable answer going to be ..."

First off I would ask the supplier why the server isn't doing a decent job
Go to Top of Page

coolerbob
Aged Yak Warrior

841 Posts

Posted - 2007-06-12 : 16:16:12
Fine idea, I am already in communication with them. So I will put this question to them also.
Currently, the best answer Dell can come back with is "you should have seperate RAID arrays so that your TempDB can be on a different RAID".

Talk about stating the obvious...
So it will be a while before they come back with something useful by the looks of it - if ever.
Go to Top of Page

Kristen
Test

22859 Posts

Posted - 2007-06-18 : 10:12:36
"you should have seperate RAID arrays so that your TempDB can be on a different RAID"

Well, yes; ideally.

We use Dell kit, and we don't have a separate RAID for TempDB ... but we do have separate RAID for O/S, Data and Index (but not for Backup).

But I'm not sure that actually tells you anything useful!

Kristen
Go to Top of Page

coolerbob
Aged Yak Warrior

841 Posts

Posted - 2007-06-18 : 10:57:19
So here's a good story for you. Server went down twice today. With Dell Support being as useful as a chocolate fire guard, we tried troubleshooting things a bit more. Dell Server Management Software (which we have the latest version of) hasn't been alerting us to any problems. So we sent them the diagnostics report and all they came back with was the RAID idea.

So today we actually went over to the Comms room and had a look at the pretty lights on the drives. One of the drives' lights was flashing at a constant rate - it's rebuilding. Had a look at the Dell Server Management Software and it just didn't list that drive. There was no info about what was happening. We are guessing at this stage that it has been sporadically rebuilding itself because of some problems (maybe bad sectors) and that is what has led to the disk I/O issues. I presume being one disk down will mean that the rest of the disks will have to pick up the slack. Plus you have the extra load of the rebuild taking place.
Let's take a moment and reflect on what Paul said at the beginning: "You've got problems with your hardware". Yes Sir indeed I have.
We have a 4 hour support deal. Hopefully we'll have a new disk soon...
Go to Top of Page
  Previous Page&nsp;  Next Page

- Advertisement -