SQL Server Forums
Profile | Register | Active Topics | Members | Search | Forum FAQ
 
Register Now and get your question answered!
Username:
Password:
Save Password
Forgot your Password?

 All Forums
 General SQL Server Forums
 Data Corruption Issues
 DB Maintenance corrupts data
 New Topic  Reply to Topic
 Printer Friendly
Author Previous Topic Topic Next Topic  

Fat Bob
Starting Member

3 Posts

Posted - 05/30/2006 :  07:02:29  Show Profile  Reply with Quote
Hi all,

I'm new to the forum, and fairly new to SQL, so please forgive me if I say something that sounds stupid at any time. I have been looking around this forum (and others) while trying to diagnose my current issues, and there seems to be a really good amount of knowledge about here so I thought I'd try my luck with a post.

We've been running SQL 2000 for around 2 years now, and had very few problems until we suffered a torn page error around 6 months ago (just before the time I started looking after the server).

Since then we've been getting intermittent database corruption (I implemented a nightly DBCC job as part of DB maintenance) and it seems to occur after our weekly index maintenance has run (also a scheduled DBMaint job). I have just confirmed this by restoring nightly backups to a test database - the current round of corruption has occured sometime between the Saturday and Sunday night backups (indexing runs at 1 am Sunday morning, and is the only activity between these two backups)

We have 9 small (all less than 200MB) databases which are used faily lightly by our salesforce. The corruption is occurring on different databases each time.

We are running SQL 2000 (SP4) on a Windows 2003 clustered server (2 x DL380s and an MSA500 G1) running in failover mode (active-passive). The raid 5 array does utilise disk caching, and I have wondered if this is involved somehow.

Our hardware support comany have checked the server thoroughly for HW issues, but have come up blank.


Here is the DBCC output from one of our corrupt databases from this morning:

dbcc checkdb (casestatus) WITH ALL_ERRORMSGS, NO_INFOMSGS
-----------------------
Server: Msg 2533, Level 16, State 1, Line 1
Table error: Page (1:20040) allocated to object ID 117575457, index ID 0 was not seen. Page may be invalid or have incorrect object ID information in its header.
Server: Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 117575457, index ID 1. Page (1:20040) was not seen in the scan although its parent (1:19808) and previous (1:20039) refer to it. Check any previous errors.
Server: Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 117575457, index ID 1. Page (1:20041) is missing a reference from previous page (1:20040). Possible chain linkage problem.
Server: Msg 8964, Level 16, State 1, Line 1
Table error: Object ID 117575457. The text, ntext, or image node at page (1:12449), slot 1, text ID 1190656606208 is not referenced.
Server: Msg 8964, Level 16, State 1, Line 1
Table error: Object ID 117575457. The text, ntext, or image node at page (1:12449), slot 3, text ID 1190656671744 is not referenced.
Server: Msg 8964, Level 16, State 1, Line 1
Table error: Object ID 117575457. The text, ntext, or image node at page (1:12449), slot 5, text ID 1190656737280 is not referenced.
Server: Msg 8964, Level 16, State 1, Line 1
Table error: Object ID 117575457. The text, ntext, or image node at page (1:12449), slot 7, text ID 1190656802816 is not referenced.
Server: Msg 8964, Level 16, State 1, Line 1
Table error: Object ID 117575457. The text, ntext, or image node at page (1:12449), slot 9, text ID 1190656868352 is not referenced.
Server: Msg 8964, Level 16, State 1, Line 1
Table error: Object ID 117575457. The text, ntext, or image node at page (1:12449), slot 11, text ID 1190656933888 is not referenced.
Server: Msg 8964, Level 16, State 1, Line 1
Table error: Object ID 117575457. The text, ntext, or image node at page (1:12449), slot 13, text ID 1190656999424 is not referenced.
Server: Msg 8964, Level 16, State 1, Line 1
Table error: Object ID 117575457. The text, ntext, or image node at page (1:12449), slot 15, text ID 1190657064960 is not referenced.
CHECKDB found 0 allocation errors and 11 consistency errors in table 'Raw:LifeStatusFP' (object ID 117575457).
CHECKDB found 0 allocation errors and 11 consistency errors in database 'CaseStatus'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (CaseStatus ).
-----------------------


The funny thing is that this particular database isn't even in use any more - I've had it in single user mode for about 2 months (and the users don't have any access to the back end stuff).

If anyone can shed any light on what's happening, I'd really appreciate this.

Thanks

Dave

paulrandal
Yak with Vast SQL Skills

USA
899 Posts

Posted - 05/31/2006 :  15:35:50  Show Profile  Visit paulrandal's Homepage  Reply with Quote
Hi Dave,

Some questions for you:
1) are the corruptions in the same table all the time or in different tables?
2) what maintenance are you doing on table 'Raw:LifeStatusFP' and what determines what maintenance is done (e.g. fragmentation level)?
3) what was the cause of the torn-page error 6 months ago?
4) (unrelated to corruption) why are you running maintenance jobs on a database that's in single-user mode and thus can't be getting more fragmented?

From what you've said, my theory is that your hardware performs fine except when under load, which index maintenance will introduce. Have you run any stress tests?

Paul Randal
Lead Program Manager, Microsoft SQL Server Storage Engine + SQL Express
(Legalese: This posting is provided "AS IS" with no warranties, and confers no rights.)
Go to Top of Page

Fat Bob
Starting Member

3 Posts

Posted - 06/01/2006 :  06:56:36  Show Profile  Reply with Quote
Hi Paul,

Thanks for your time :)

1) are the corruptions in the same table all the time or in different tables?
The corruptions seem to be not just in different tables, but in different databases each time.

2) what maintenance are you doing on table 'Raw:LifeStatusFP' and what determines what maintenance is done (e.g. fragmentation level)?
As I said, I'm new to SQL: When I realised that no backups or maintenace had occured on the SQL instance since it was originally built, I implemented nightly backup maintenance job schedules accross all user databases and also selected the option to 'reorganise data and indexes' (weekly) as part of this plan. At the time I didn't realise this could have a negative effect.

3) what was the cause of the torn-page error 6 months ago?
Never discovered - Our hardware support company looked at the server after the torn page error and gave us the all clear from their viewpoint, so I just repaired the database and put it down to experience.

4) (unrelated to corruption) why are you running maintenance jobs on a database that's in single-user mode and thus can't be getting more fragmented?
The database maintenance plan I mentioned earlier is set to 'all user databases' because (until I took control) my programmers had been in the habit of randomly creating additional databases to meet new requirements (something I have now put a stop to). This particular database is in the process of being decommissioned, and to prove that it is no longer needed I put it in single-user mode for a few weeks rather than delete it straight away.

From what you've said, my theory is that your hardware performs fine except when under load, which index maintenance will introduce. Have you run any stress tests?
The problem is that we haven't even begun to load this server yet - we're in the process of migrating away from Access databases, and so far only one major application has gone fully to SQL (the current problem started even before we implemented this).
Our combined production SQL data only adds up to 500MB in size - Most of it is static data (written once, and not re-used much). we orginally sized the hardware to allow us to do much more than we currently use it for.
We have around 700 users (max) connected to our main system at any one time, the most connections to SQL I've seen is around 100. Admittedly, our programmers don't know much about SQL server either, so I had to go through a big indexing excercise to bring processor and disk utilisation down to a normal level a few months ago (they are pretty hot on normailization though, so I didn't need to worry about that side too much). Since then, all the monitoring stats I've produced show that the server is not very stressed at all.
The indexing job takes place at 7am on a Sunday when no other jobs are running (and no user access is permitted).

I'm not familiar with methods of stess testing SQL, although I have recently started to make our programmers provide modified applications which emulate 100 users performing the same SQL tasks simultaneously (on our test servers) as a routine part of release testing. I'd be happy to run some formal tests if you can point me in the right direction.
I'd also be happy to add any SQL performance counters you can recommend to our existing server monitoring jobs. I do keep an eye on some of the usual stats that were recommended in the SQL admin course, but so far I haven't been writing these stats to disk.

If there is anything else you can recommend I'd be happy to try it - I'd be happy just to definately rule anything out at the moment.

Dave
Go to Top of Page

paulrandal
Yak with Vast SQL Skills

USA
899 Posts

Posted - 06/01/2006 :  15:47:26  Show Profile  Visit paulrandal's Homepage  Reply with Quote
Sounds like you're doing a great job getting a handle on things.

Check out the whitepaper below for more details on when to do index maintenance - it doesn't make sense to blindly do it all the time.

http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/ss2kidbp.mspx

For your backup strategy, you need to make sure that you've practises doing a disaster recovery. Also, does the combination of the type/frequency of backups you're taking, along with the frequency of consistency checks allow you to meet your company's downtime and data-loss SLAs?

Are all the databases on the same hardware? That would point even more strongly towards hardware being the issue.

Checkout this link http://support.microsoft.com/default.aspx?scid=kb;en-us;231619 for details of how to run SQLIOStress.

What diagnostics did the hardware support company run? Are all the drivers up-to-date?

Thanks



Paul Randal
Lead Program Manager, Microsoft SQL Server Storage Engine + SQL Express
(Legalese: This posting is provided "AS IS" with no warranties, and confers no rights.)
Go to Top of Page

Fat Bob
Starting Member

3 Posts

Posted - 06/07/2006 :  09:48:48  Show Profile  Reply with Quote
Hi Paul,

I read the index maintenance paper just before I posted on here, so it's good to know that I'm checking in the right places.

I have now amended my scheduled maintenance to stop running the indexing jobs, and I'll check this manually from time to time as per your recommendation.

DR is not a problem for us, as we run a dual-site (with redundant servers) operation with transaction log shipping to take care of the SQL data. In the 2 1/2 years since I set the system up I've had to switch primary operations between sites a few times (due to our hosting providers A/C and power failures) and I'm pretty happy with the means and methods of everything - except for the SQL side (which I intend to look into at my earliest opportunity).

I take note of your comments re:SLAs - we tend to perform everything on best endeavours, but with SQL still being new to me I think I'll get the company to aknowledge that if something goes wrong with our databases during production hours it might be a wee while before I can get things back to 'normal'. That said, our 5-minute transaction log backups (for log shipping) should go a long way to help if we do get a data problem :)

To be honest, even when our hardware service provider gave us the all clear, I was still a little suspicious that there might be a fault somewhere. Since then, HP have checked our servers themselves (at my insistence) and also a large and reputable HP partner is due to produce a report from some IO monitoring we asked them to do (which included a hardware check), and I'd be surprised if they had all overlooked something.

I'm afraid I don't know what particular diagnostics any of the 3 companies used to diagnose the servers, as they are physically located 400 miles away from me. All I know is that they used some HP 'you-need-to-have-been-on-the-course' type tools.

Drivers are all up to date (checked these myself, but our HW provider keeps an eye on this for us also).

Thanks for IOstress tool info - I'll put this to good use as soon as I get a bit more time to look at it properly.


Thanks once again for your replies - you've given me a few good pointers, and a little bit of confidence that what I have done so far is headed in the right direction.


Cheers

Dave
Go to Top of Page

paulrandal
Yak with Vast SQL Skills

USA
899 Posts

Posted - 06/07/2006 :  12:14:31  Show Profile  Visit paulrandal's Homepage  Reply with Quote
You're welcome. If the IOStress tool doesn't find anything, all other h/w checks come back clean, and its still happening, let me know and we'll dig in a bit deeper.

Thanks

Paul Randal
Lead Program Manager, Microsoft SQL Server Storage Engine + SQL Express
(Legalese: This posting is provided "AS IS" with no warranties, and confers no rights.)
Go to Top of Page
  Previous Topic Topic Next Topic  
 New Topic  Reply to Topic
 Printer Friendly
Jump To:
SQL Server Forums © 2000-2009 SQLTeam Publishing, LLC Go To Top Of Page
This page was generated in 0.1 seconds. Powered By: Snitz Forums 2000