SQL Server Forums
Profile | Register | Active Topics | Members | Search | Forum FAQ
 
Register Now and get your question answered!
Username:
Password:
Save Password
Forgot your Password?

 All Forums
 SQL Server 2000 Forums
 SQL Server Administration (2000)
 sp4 - indexdefrag blocking and HUGE perf problems
 New Topic  Reply to Topic
 Printer Friendly
Previous Page | Next Page
Author Previous Topic Topic Next Topic
Page: of 3

avihay
Starting Member

Israel
3 Posts

Posted - 11/30/2005 :  06:29:03  Show Profile  Reply with Quote
we also got lots of problem with sp4.
page share lock not released.
optimiser work the incorrectly (no doubt).
we also consider to rollback tp sp3.

avihay
Go to Top of Page

Kristen
Test

United Kingdom
22403 Posts

Posted - 11/30/2005 :  08:22:39  Show Profile  Reply with Quote
avihay : "optimiser work the incorrectly (no doubt)"

There is a "known issue" with JOINs between datatypes that are different - e.g. an INT joined to a VARCHAR - could this be your problem?

ThreePea: "If anyone has any good suggestions on accurate stress-testing for a heavy OLTP system, I'm all ears"

Well I'm all ears too!

We had a chat around the [dev] office about this today.

One thing we do is to log the elapsed time for each Sproc call (we do this with logging code within each SProc). We also have an automated test process (no load testing ... just regression testing).

We are [now] looking at comparing the SProc execution time from one test run to the next. In particular we could run "identical" tests before and after installing, say, SP4.

A change of even 50ms to a single SProc call might be sufficient to ring alarm bells! (i.e. a change of 16ms is probably just how you hit a clock-cycle event, but bigger than that could be real). There are external influences that might make a difference to some things, so I don't know how useful this will be, but a slight change to the average elapsed time for a given Sproc that is called several times within the test might indicate a problem which would be exaserpated on a load-stressed system.

Just thinking out loud, really, we haven't had a chance to see if there is actually any real value in this line of thinking.

Kristen

Edited by - Kristen on 11/30/2005 13:22:52
Go to Top of Page

paulrandal
Yak with Vast SQL Skills

USA
899 Posts

Posted - 11/30/2005 :  12:26:26  Show Profile  Visit paulrandal's Homepage  Reply with Quote
Just as an FYI - PSS mgmt took the correct action with this case and have a (very good) US Escalation Engineer driving the team assisting Jerry - AFAIK from PSS Jerry's OK with the current action plan.

Please let me know about anything else like this in future.

Thanks



Paul Randal
Lead Program Manager, Microsoft SQL Server Storage Engine
(Legalese: This posting is provided "AS IS" with no warranties, and confers no rights.)
Go to Top of Page

ThreePea
Yak Posting Veteran

USA
83 Posts

Posted - 11/30/2005 :  13:11:16  Show Profile  Visit ThreePea's Homepage  Reply with Quote
quote:
Originally posted by paulrandal

Just as an FYI - PSS mgmt took the correct action with this case and have a (very good) US Escalation Engineer driving the team assisting Jerry - AFAIK from PSS Jerry's OK with the current action plan.



I am extremely happy now, thank you. I got a number of calls last night and this morning from the escalation engineer and also one of the department heads. We put a great plan together and I'm working on it right now. I feel like we are moving forward now, which is a much better feeling than banging your head against the wall.

Thanks again.

Jerry
Go to Top of Page

avihay
Starting Member

Israel
3 Posts

Posted - 11/30/2005 :  14:13:03  Show Profile  Reply with Quote
thanks:)
There is a "known issue" with JOINs between datatypes that are different - e.g. an INT joined to a VARCHAR - could this be your problem?

we have cluster index and regular one (on identity column).
the select is
select max(identity_col)
where cluster_index_col = xxx

it goes by the non cluster index (identity col)
this is wrong!!!

the lock problem is select on large table
select XXX where column>(inner select)
this leave tousands of pages on shared lock after command ends.

*** escalation flag is set to off
if changing escalation flag to on it leavs only 2 pages lock
(still select command when ends should not leave lockes)
Go to Top of Page

Kristen
Test

United Kingdom
22403 Posts

Posted - 11/30/2005 :  14:56:33  Show Profile  Reply with Quote
"select max(identity_col)
where cluster_index_col = xxx
"

is "xxx" same datatype as "cluster_index_col" ?

How selective is "cluster_index_col = xxx" - i.e.
What is
SELECT COUNT(*) FROM MyTable
and
SELECT COUNT(*) FROM MyTable WHERE cluster_index_col = xxx

The other issues I don't know about.

Kristen
Go to Top of Page

avihay
Starting Member

Israel
3 Posts

Posted - 12/01/2005 :  02:05:46  Show Profile  Reply with Quote
thank for the response .

xxx is same data type ,
the cluster index is selective enough.
the cluster index is complex one (few columns) and include the identity column as well.
it did not happen before sp4.
we solve this by hint due to locking and performance problem's.

for the second problem of pages shared lock anyone has aclue ?

select bla , bla , bla
from a
where a.b =
and b.c =
and a.identity > (select max(bla) from x)

this leaves almost all pages of table lock with shared lock until
end of transaction (commit or rollback).

database configure is no escalation.
for testing we changed to allow escalation it leaves 2 pages with
shared lock.




we are using::

Microsoft SQL Server 2000 - 8.00.2040 (Intel X86)
May 13 2005 18:33:17
Copyright (c) 1988-2003 Microsoft Corporation
Enterprise Edition on Windows NT 5.0 (Build 2195: Service Pack 4)

This is SQL 2000 Sp4 + Hotfix 2040.


Go to Top of Page

lewisdm
Starting Member

USA
1 Posts

Posted - 12/28/2005 :  17:49:13  Show Profile  Reply with Quote
We also put sp4 on a new server cluster we were moving production to. We already had sp4 on in development and test and had no major issues with it. We are using 8.00.2148. The server was set up as an active active cluster running windows 2003 sp1.

Within a few hours we were experiencing terrible response time and deadlocks that had increased 900%. We went from 30 deadlocks the day before to 45 deadlocks within an hour. The types of locks being held were now table and page (tab and pag) instead of what we had seen previously on sp3a as index and row locks (key and rid).

We unfortunately had to take the 4 hour down time the next day and back off. Since we backed off sp4 and rebuilt the server as sp3a we have seen 0 deadlocks and have had no problems.

We are a heavily OLTP system running OLTP on one side of the cluster and DSS (reporting) on the other side of the cluster replicating data in between.

We did open a ticket with Microsoft and are still awaiting an answer from them --- although we did deluge them with information the last 2 days.


Any information that you can share would be greatly appreciated. Does it matter what OS you are on? Are you running in a clustered environment? Has the optimizer been changed?



Thanks...

dml

DML
Go to Top of Page

graz
Chief SQLTeam Crack Dealer

USA
4137 Posts

Posted - 12/28/2005 :  20:05:25  Show Profile  Visit graz's Homepage  Reply with Quote
DML,

Can you please start a new thread for this?

-Bill

===============================================
Creating tomorrow's legacy systems today.
One crisis at a time.
Go to Top of Page

dietcokebreak
Starting Member

United Kingdom
1 Posts

Posted - 01/07/2006 :  18:15:10  Show Profile  Reply with Quote
Hi All,

Just thought I'd share some very similiar experiences and observations after reading through your experiences. We are in a fortunate position now, after seriously pushing the powers that be, to have two absolutely identical clusters - which enable us to perform side by side comparisons of absolutely any system change; whether that be firmware, drivers, os, sql or our own applications. Our performance and availability SLA's essentially dictated this, if we were going to maintain our position as the number #1 service provider in our industry sector.

So, as always, we started testing the latest version of our application on the pre-production system mid-November. At the same time, we decided to updgrade SQL to SP#4 for the test programme; as our strategy next year is to move to 64bit and ultimately SQL 2005 towards the end of the year. This would mean running SQL2000 32bit on 64bit Enterprise Edition, which requires SP4; thus giving us a reasonably straight forward upgrade path when it came to migrating to 64bit SQL 2005.

Well.... after an the expenditure of enormous amounts of test/senior designer/ development/techinical resource, we disovered that the significant degragtion in application performance and stability was completely attributable to SP4 not the latest version of our application!

We started with build 2039, moved to 2040 then finally to 2162 - all of which exhibited the same performance degradation. We have of course now moved back to SP3a and applied hotfixes to bring us up to build 871. We have been running this configuration for the last 15 months and checking with our service delivery manager, have achieved application availability of just over 99.9% for the last 12 months.

Some examples of the performance degradation we experienced;

- delete process: SP3a 400,000 records in 30mins; SP4 100,000 records in 30mins (these deletes actually cascade)
- insert process: SP3a 100,000 XML records per hour; SP4 60,000 XML records per hour

- we also had continuous problems with deadlocks, timeouts and failures to checkpoint the log under simulated operational system stress.

From monitoring the system we noticed the following differences;

- the number of AWE write maps/sec increased significantly with SP4
- the buffer cache hit ratio, although > 99% was still consistently lower than achieved with SP3a
- the number of disk IO operations increased for the same workload

Look forward to hearing any further discoveries you may have made since last posting and would be happy to share further details of our experiences. Needless to say, we are stuck with build 871 for the time being, which although has delivered great performance for our customers so far this month (which is a peak time for us) on the new version of the application; our longer strategy has a hugh hole in at the moment!

Cheers, James.

NOTE: Biggest problem we've had rolling forward/back SQL on W2K3SP1, is the installer's nasty habit of clearing the named pipe alias for the machine from 'cliconfg'; which is required to get around the problem caused by the OS blocking TCP access to SQL server until SP3a is applied. I've found that if you put it in just after you've finished working through the installer dialogues, this usually works ok. If you put it in before you start the install process, you will find that it has disappeared at this point!
Go to Top of Page

Kristen
Test

United Kingdom
22403 Posts

Posted - 01/08/2006 :  05:44:17  Show Profile  Reply with Quote
Thanks for sharing this with us dietcokebreak.

Kristen
Go to Top of Page

graz
Chief SQLTeam Crack Dealer

USA
4137 Posts

Posted - 01/16/2006 :  18:53:47  Show Profile  Visit graz's Homepage  Reply with Quote
So this threads been quiet for a while. I'm curious if there are any other reports of problems with SP4. I have a client that wants to install SP4 on a server so they can run it on 64-bit hardware. And I'm a little spooked.

===============================================
Creating tomorrow's legacy systems today.
One crisis at a time.
Go to Top of Page

eyechart
Flowing Fount of Yak Knowledge

USA
3575 Posts

Posted - 01/16/2006 :  19:56:58  Show Profile  Reply with Quote
quote:
Originally posted by graz

So this threads been quiet for a while. I'm curious if there are any other reports of problems with SP4. I have a client that wants to install SP4 on a server so they can run it on 64-bit hardware. And I'm a little spooked.



we are running 32bit sql w/SP4 on our opteron based 64bit boxes without any major problems.

We have the HP DL585 boxes w/ 4 dual core CPUs and 32GB RAM. really fast machines and so far (knock on wood) very stable.

We have run into a little bit of a problem with perfmon and vbscript/jscript. Because of the way the 32bit code works on these 64bit machines, it cannot reference 64bit registry entries. This means that if you run perfmon (the 64bit perfmon) you will not see the counters for SQL server because they are 32bit. you will need to run the 32bit perfmon explicitely instead. I think this stuff is referenced in the SP4 readme, but it can cause problems especially if you point your workstation's perfmon at a remote server. you will actually connect to the 64bit perfmon and not see the SQL Server stuff.

This issue with perfmon has also caused some issues with monitoring. If you use a 3rd party monitoring package be sure to ask the vendor if they support this configuration. Chances are they do not. btw, we use NetIQ for monitoring and it does not yet support monitoring 32bit SQL server running on 64bit windows.

Other than that issue we have been really happy with how things have worked out. We have had no issues with problems because of SP4, just weirdness with how 32bit code works in the 64bit environment.



-ec

Edited by - eyechart on 01/16/2006 19:58:17
Go to Top of Page

jen
Flowing Fount of Yak Knowledge

Sweden
4110 Posts

Posted - 01/16/2006 :  21:27:32  Show Profile  Send jen a Yahoo! Message  Reply with Quote
A standby server is always critical to have, especially for those 24x7 systems.

Should something happens, you can always shift to the standby server for a couple of days while rebuilding the affected server, then revert back after the rebuild or uninstallation is done. Less stressful because you don't get to think of "customers" withdrawing contract or fines accumulating.

It's an investment rather than additional expense because it's worth it especially if the company is being fined for n-seconds of downtime.

I personally learned this the hard way, thousands (thank goodness it's not more) of good money down the drain (less shouting during meetings too )

There is only so much anyone can do and plan and prevent, but things happen...

--------------------
keeping it simple...
Go to Top of Page

Kristen
Test

United Kingdom
22403 Posts

Posted - 01/17/2006 :  02:40:13  Show Profile  Reply with Quote
They want to run it on 32bit system, so parallel a 64bit system, or they only want to put it on a 64bit system?

There is a hot fix for this:

"Microsoft has found an issue with the final build of SP4 that impacts customers who run SQL Server with AWE support enabled. This problem does not affect ia64 platforms. On x86 and x64 systems, the problem only impacts customers with more than 2 gigabytes (GB) of memory (available only with the Enterprise, Developer, and Evaluation editions) where AWE has been enabled and more than half of the total system memory is being allocated to a single SQL Server instance"

http://support.microsoft.com/default.aspx?kbid=899761

The list of fixes in SQL 2000 SP4 is long, and some of them are scary!, but I am not aware that our application has fallen over any of them, so have fought shy of installing it.

I am worried about the reported slowdown for JOINs on mixed-datatypes - I'm sure we have lots that are not explicitly cast (it would be nice to have a "pedantic", or Lint, mode that announces any such loose coding so it could be tidied up during development.

These are the things in the SP ReadMe:

Full-Text Catalogs Are Rebuilt After Setup Completes

"All full-text catalogs are rebuilt as part of the installation of SP4 when upgrading from SP2 or earlier. The rebuild is automatic and resource-intensive. Queries against full-text catalogs may return partial results or no results until the rebuild process is complete. After SP4 is installed, the system event logs contain messages stating that the catalogs were corrupt, of an older version, and had to be rebuilt"

Change to Maximum Network Packet Size

maximum value for the network packet size option reduced from 65536 to 32767.

Optimization of Queries with Large IN Lists or Many OR Clauses

Effects queries with:

"More than 10,000 elements in the IN list
Two IN lists, each containing over 100 elements
Over 10,000 disjuncts in the OR clause
A combination of OR clauses and IN lists such that their equivalent expression will contain more than 10,000 disjuncts
"

Monitoring an Instance of SQL Server Running on Windows-on-Windows 64 Mode

As eyechart mentioned

Replication

Several points to note, if replication used

Changes to Master/Target Server Configurations

".... Before you apply SP4, you must complete several steps to upgrade your SQL Server 2000 master/target server configuration. The changes that are introduced with SP4 are not compatible with SQL Server 7.0 target servers, or with any servers not running SP3 or later. This is a change from the original SQL Server 2000 functionality."

Meta Data Browser Exports in Unicode

"... As of this SP4 release, exported data is always expressed as Unicode. You can still export in ANSI code by setting the value of the registry key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Repository\Engine\XMLExport to 0"

XML : Improved Validation of XPath Expressions

"OPENXML is updated to use a custom-built XML parsing technology designed to be backward compatible with MSXML 2.6."

Some compatibility issues are noted.

Virtual Backup Device API Enhancements

Possibly change to "VDC_PrepareTo Freeze" which I interpret to mean that 3rd party backup programs may be impacted.

Serviceability Enhancements

"new serviceability functionality that enables you to uninstall hotfixes applied to SP4 and later versions of SQL Server 2000 running on Windows XP and Windows Server 2003"

Sounds like good news for 24/7 shops?

DB-Library and Embedded SQL for C

Updated in SP4

Kristen
Go to Top of Page

ThreePea
Yak Posting Veteran

USA
83 Posts

Posted - 01/17/2006 :  07:49:01  Show Profile  Visit ThreePea's Homepage  Reply with Quote
Just to bring some closure to our situation, I thought I would post the end of the story for those interested in reading a short novel. Going back a few posts, after Paul Randall stepped in things did get moving, as I stated. The response from MS seemed promising, but faltered soon thereafter. Our support rep was agonizingly slow. He wanted to run pssdiag (which would kill our performance even further) and send him hundreds and hundreds of megabytes of information for him to analyze, which would take a couple days. He would then call back and inform us of a particular sproc or two that was causing high CPU and ask us to analyze that one. We dutifully did, and that would fix that particular sproc, but no change to the system. I strongly suggested that we needed to take a more systemic approach to the problem, as tackling our thousands of individual procedures wasn't going to do the trick. It appeared the engineer was following a set procedure for analyzing performance problems. And of course I understand that, but the way it was going was very slow, and I felt the urgency was still lacking.

After much cajoling and a bit more complaining, we got escalated to another engineer. He was a bit better. But he still wanted to go through much of the same process, more pssdiags, more information sent to Microsoft.

Then our support rep went for "training" for a week, so we were switched to yet another engineer. Even though the case had all of the previous engineer's work and notes there is still an element of starting over with each new rep, which wasted much time. The best thing about engineer #3 was that he was the first one to admit there was a systemic problem triggered by sp4, so we started to make a bit of progress.

Meanwhile our customers were about ready to tear us apart. They were missing shipments and deadlines and really unable to do their job, and risked being fined thousands of dollars per day by their customers due to our performance problems. You can imagine things were a bit tense around here.

Microsoft had us set up an sp3 system in order for us to run comparative tests between sp3 and sp4 to see what was so different in the execution plans. Unfortunately, we could not come up with identical hardware to run the sp3/sp4 comparisons, and therefore the test results were inconclusive.

We were still working very hard on our own. We battled various stored procedures into submission, which gave us the tiniest bit relief. But I mean tiny. We are pretty good with performance tuning, but there was no real pattern that we could see. We would simply apply the various tools from our performance toolbox until we found what worked for a particular sproc.

However, we started to notice a few things. Clue #1 - any stored procedure hitting our 5 core tables were puking more than others. Clue #2 - we noticed that Compile Lock time and Average Latch Wait Time had jumped drastically. Clue #3 - the actual number of compiles and recompiles did not increase, but stayed the same or even dropped after sp4. This means that the compile lock increase was not the result of an increase in the number of recompiles, but was in fact due to the compile itself taking much, much longer. And finally, Clue #4 - after one of the worse days we rebuilt some of our core indexes during the early morning. We normally can't do this because it locks tables and shuts down our customers, but we ran it for as long as we could (about 2 hours) before stopping it. We had an immediate improvement for the remainder of that day.

So now some things were starting to fall into place - here we were with a 100-fold increase in compile lock time. And compiling is very CPU intensive, which would account for the doubling of CPU usage. Plus, updating stats and/or rebuilding indexes helped significantly.

So, after almost a month after Black Sunday, we took the whole system down during our Sunday morning "slower" period and did a full DBREINDEX on all our core tables. This took most of the day, and we simply told our customers we had to do it, and they would simply have to wait. BAM - immediately thereafter the system became useable again. I say useable - it was still 50% worse than pre-sp4, but 100% better than the last few weeks. Now, we could not have continued in this fashion for long, we were still getting way too many timeouts and too high CPU utilization. But it bought us time for our secret weapon. Something we had been planning all along, but when it became apparent this problem wasn't going to go away we moved up the schedule - we migrated to SQL Server 2005 on a brand new 64-bit machine with 3x the ram.

We didn't tell Microsoft we were doing this, we didn't want them to stop working on the problem until we were safely over and done with the migration to the new server. I think you can understand we had to hedge our bets in case we had to back out of the migration to 2005. The last two weeks of December were brutal - trying to keep the performance problems at bay while doing everything necessary to migrate our huge 24/7 OLTP system to SQL Server 2005. But we pulled it off, baby, and we did it in beautiful fashion with nary a hiccup.

SQL Server 2005 is wonderful. The new server is incredible. And our customers are happy again.

Our final theory is this: We know there were a number of significant changes to the compiler and execution plan generation in sp4. These changes did not generate efficient plans with the majority of our major sprocs and tables. They seemed to be much more sensitive to fragmentation and out-of-date statistics then previously, at least in our environment. Compilation time itself also increased, whether due to internal changes and/or the complexity of our stored procedures, we do not know. This huge increase led to a huge jump in CPU time and overall battle for resources - CPU, locks, disk i/o etc. Rebuilding our major tables would give us considerable improvement, which would then gradually deteriorate over time. The data seems to fit this theory, but we never got the chance to prove it before abandoning ship.

Microsoft was getting ready to request our full database. I have no idea how they planned on getting gigabytes of data to Redmond, but they were going to set up identical sp3 and sp4 systems in their lab and run our application to figure out what was so radically different in the execution plans. I would have loved to been able to do that, but it all became moot and there was no reason to continue.

I'm just glad to be going home these days after only 9-10 hours instead of 15-16. :)

-- ThreePea
Go to Top of Page

ThreePea
Yak Posting Veteran

USA
83 Posts

Posted - 01/17/2006 :  07:50:12  Show Profile  Visit ThreePea's Homepage  Reply with Quote
quote:
Originally posted by jen

A standby server is always critical to have, especially for those 24x7 systems.


Jen - What do you use to get your data over to the standby server real-time? We've investigated various replication technologies, but haven't found anything that is fast, low overhead, no data loss, and no micro-management of what is being replicated.

ThreePea
Go to Top of Page

Kristen
Test

United Kingdom
22403 Posts

Posted - 01/17/2006 :  08:38:37  Show Profile  Reply with Quote
Thanks for the update ThreePea - seems like a brave move jumping to SQL2k5 under pressure, but I hope you'll look back and be glad to have had the opportunity forced upon you!!

Kristen
Go to Top of Page

lamarshj
Starting Member

Canada
2 Posts

Posted - 02/10/2006 :  12:41:59  Show Profile  Visit lamarshj's Homepage  Reply with Quote
Hi - we've got a JAVA document management product that multiplexes to either an Oracle (sorry, but we have no issues using the big "O") or SQL Server database, and we to have been simply creamed by SP4. We've seen deadlocks on concurrent processes (for which we've coded around), and very slow I/O delivery on single query search results (we can't code around this, of course) via MS JDBC drivers (and others UNA/SPRINTA). I felt it necessary to thank this forum and offer more information on the topic as we've used this as evidence to our clients that it is not our mature and stable code base.

We too have formalized this issue through MS support. The issue number is "SRX060201604591", and I've forwarded links to this forum to our support contact that they might see those at MS that have been involved. We've also jointly run diagnostics on the aforementioned single query scenario and the data's in MS's hands. I will quote our support contact from our email correspondence: "I have also raised your concern regarding performance on sp4 and was told that there are incidences where upgrade to sp4 has degraded performance. In some of those incidences, the sql was re-installed and sp3 was applied."

I will keep this forum up to date on our findings.


JohnLamarsh
Go to Top of Page

jen
Flowing Fount of Yak Knowledge

Sweden
4110 Posts

Posted - 02/13/2006 :  03:42:57  Show Profile  Send jen a Yahoo! Message  Reply with Quote
We actually did 'manual' log shipping through jobs (simple backup and restore of log files), the last ~15 minutes prior to the shift will need to be done invoked (15 minutes is actually good news compared to half day downtime). Data transfer was no problem, the major problem is how to shift n-number of applications to shift to the standby server

solution was to have an application server that propagates the change in ini file across all applications after client reboot

for servers involving replication, only critical databases are backed up and restored on a standby publisher




quote:
Originally posted by ThreePea

quote:
Originally posted by jen

A standby server is always critical to have, especially for those 24x7 systems.


Jen - What do you use to get your data over to the standby server real-time? We've investigated various replication technologies, but haven't found anything that is fast, low overhead, no data loss, and no micro-management of what is being replicated.

ThreePea



--------------------
keeping it simple...
Go to Top of Page
Page: of 3 Previous Topic Topic Next Topic  
Previous Page | Next Page
 New Topic  Reply to Topic
 Printer Friendly
Jump To:
SQL Server Forums © 2000-2009 SQLTeam Publishing, LLC Go To Top Of Page
This page was generated in 0.23 seconds. Powered By: Snitz Forums 2000