Hi folks. We have a two-server setup where Server1 acts as both publisher and distributor for transactional replication of our production DB (which is also on Server1). Server2 is the subscriber and pulls from the distributor into the replicated DB on Server2, which is used only for reporting. The DB is just over 4 GB, if that is relevant, and we replicate just about every table and view. Both servers have plenty of resources and are running the latest service packs of SQL Server 2005 and Windows Server 2003.
Yesterday, for the second time in 6 months, the subscription simply stopped pulling transactions from the distributor for no apparent reason. I could see that the publisher was sending transactions to the distributor, but the subscriber was not applying them to the replicated DB. This happened in the middle of the day, so no maintenance jobs or updates were running... just normal business. The ONLY thing I saw that might explain it was a request to the replicated DB that took over 2 minutes and was timed out by the application server. That happens occasionally with large reports, but it doesn't impact replication.
I looked at every log file I could find, as well as Replication Monitor, and could see no reason why replication had stopped... but I confirmed that changes were not carrying over and that tempdb on Server2 was growing to about 1 GB. I finally killed and recreated the replication from scratch, and everything is fine now.
Can anyone provide any tips on troubleshooting this type of thing should it happen again? Is there any place I should look that I haven't already? I looked at the SQL Server and SQL Server Agent logs, as well as the Application and System logs on both DB servers plus our application web server, and saw nothing out of the ordinary.
The distributor is on the publisher. I did not know to look in that table... this is the kind of info I'm looking for, so thanks for that.
Unfortunately, I see no errors that explicitly correspond to this problem. I became aware of the issue at 3:02 PM (via a scheduled task that manually tests replication every hour). The first error I see is at 3:22 PM: "if @@trancount > 0 rollback tran", followed immediately by "Query timeout expired" with an error code of "HYT00". So these may be related, but they were logged well after the problem started and may date from around the time I started troubleshooting.
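
A check like the hourly one mentioned above could be sketched roughly as follows. This is only an illustration, not the actual scheduled task: the `repl_heartbeat` table, the function names, and the lag threshold are all hypothetical, and in-memory sqlite3 databases stand in for Server1 and Server2 so the sketch runs on its own. The idea is to stamp a marker row on the publisher and then see how old the newest marker on the subscriber is.

```python
import sqlite3
from datetime import datetime, timedelta


def make_demo_db():
    """Create an in-memory stand-in database with the hypothetical heartbeat table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE repl_heartbeat (stamp TEXT NOT NULL)")
    conn.commit()
    return conn


def write_heartbeat(conn, now):
    """Replace the marker row with the current timestamp.

    On a real setup this would run against the published DB, and replication
    would be expected to carry the row over to the subscriber.
    """
    cur = conn.cursor()
    cur.execute("DELETE FROM repl_heartbeat")
    cur.execute("INSERT INTO repl_heartbeat (stamp) VALUES (?)", (now.isoformat(),))
    conn.commit()


def read_heartbeat(conn):
    """Return the newest marker timestamp, or None if no marker has arrived."""
    cur = conn.cursor()
    cur.execute("SELECT MAX(stamp) FROM repl_heartbeat")
    row = cur.fetchone()
    return datetime.fromisoformat(row[0]) if row and row[0] else None


def is_stalled(subscriber_conn, now, max_lag):
    """Report True when the subscriber's newest marker is missing or older than max_lag."""
    last = read_heartbeat(subscriber_conn)
    return last is None or (now - last) > max_lag
```

Running something like this more often than once an hour, and logging the time of each result, would narrow down when the subscriber actually stopped applying changes, which makes it easier to line the stall up against entries in the logs.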