Bug #14949 Replication Problems
Submitted: 15 Nov 2005 15:14 Modified: 3 Jan 2006 16:43
Reporter: Brad Smith
Status: No Feedback
Category: MySQL Server: Replication Severity: S2 (Serious)
Version: 4.0.18 OS: Windows (Windows Server 2003)
Assigned to: CPU Architecture: Any

[15 Nov 2005 15:14] Brad Smith
Description:
We have been trying to get replication to work for several weeks.  As part of the implementation we have been running batch PHP scripts that do very rudimentary checks to validate the health of replication.  One of the checks gets a list of the tables in the database and then does a record count for each table.  Each time the counts have drifted apart, we have reset the slave.  Within a week of resetting the slave (most of the time within a couple of days), the counts get out of sync again.
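
For readers who want to reproduce this kind of check, the comparison step might look roughly like the following. This is only a sketch in Python, not the reporter's PHP script (which was never attached to the report); the function names `count_queries` and `diff_counts` are hypothetical, and the per-table counts are assumed to have already been fetched from each server into plain dictionaries.

```python
from typing import Dict, List, Optional, Tuple


def count_queries(tables: List[str]) -> Dict[str, str]:
    """Build one COUNT(*) statement per table name (hypothetical helper)."""
    return {t: f"SELECT COUNT(*) FROM `{t}`" for t in tables}


def diff_counts(
    master: Dict[str, int], slave: Dict[str, int]
) -> List[Tuple[str, Optional[int], Optional[int]]]:
    """Return (table, master_count, slave_count) for every table whose
    counts differ. A table missing on one side is reported with None."""
    mismatches = []
    for table in sorted(set(master) | set(slave)):
        m = master.get(table)
        s = slave.get(table)
        if m != s:
            mismatches.append((table, m, s))
    return mismatches


if __name__ == "__main__":
    # Made-up numbers, purely to show the shape of the output.
    for table, m, s in diff_counts({"orders": 120, "users": 40},
                                   {"orders": 118, "users": 40}):
        print(f"{table}: master={m} slave={s}")
```

A comparison like this is only meaningful if both sets of counts describe the same point in the replication stream.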

How to repeat:
There is no noticeable pattern to this, other than that the same set of tables generally gets out of sync.  We are not sure whether it is because of the type of queries being run against the tables.
[15 Nov 2005 15:32] Brad Smith
Status File 11/13/2005 07:00 a.m.

Attachment: status0511130700.zip (application/x-zip-compressed, text), 3.19 KiB.

[22 Nov 2005 17:59] Valeriy Kravchuk
Thank you for a problem report. 

I have several initial questions for you. What engine is used for your tables? MyISAM? Can you upgrade to some version newer than 4.0.18? 4.0.26, for example. We'll have to verify your problem on the latest version in any case.

Can you please upload the PHP scripts you use for the check? You may use the File tab and upload them as private files, if you want.
[22 Nov 2005 20:36] Brad Smith
The tables in the database are MyISAM.  We tried to upgrade the system to version 5.0 but experienced severe performance problems.
[23 Nov 2005 18:03] Valeriy Kravchuk
Thank you for the additional information. So, you got different counts when the binlogs were in sync? What was the largest difference in row counts among the tables? Could it be that while (or just after) your PHP script is "counting" on the master, some rows are inserted on the master and the slave? You simply cannot get guaranteed consistent results without locking all MyISAM tables, I think...

Is there any other evidence of replication problems besides your scripts' results?

Your configuration files (my.ini) from both master and slave may be of some interest too.

As for the upgrade, if you are satisfied with 4.0 features, I'd recommend upgrading to 4.0.26 in any case.
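
To illustrate the locking point above: one way to count at the same replication position on both servers is to hold a global read lock on the master, wait for the slave to reach the master's binlog position, count on both sides, and only then release the lock. The sketch below just builds that statement sequence as (server, SQL) pairs. The function names are hypothetical, the binlog file and position are assumed to come from the `SHOW MASTER STATUS` result, and the timeout argument of `MASTER_POS_WAIT()` is assumed to be available in the 4.0 release in use.

```python
from typing import List, Tuple

Step = Tuple[str, str]  # (server, SQL statement)


def lock_phase() -> List[Step]:
    """Statements to run first, on a single master connection.
    The read lock belongs to that session, so the connection must stay open."""
    return [
        ("master", "FLUSH TABLES WITH READ LOCK"),
        ("master", "SHOW MASTER STATUS"),  # yields binlog file name and position
    ]


def count_phase(tables: List[str], log_file: str, log_pos: int,
                timeout_s: int = 60) -> List[Step]:
    """Statements to run once the master position is known.
    The master stays locked until the final UNLOCK TABLES."""
    steps: List[Step] = [
        # Blocks until the slave has applied events up to the given position.
        ("slave", f"SELECT MASTER_POS_WAIT('{log_file}', {log_pos}, {timeout_s})"),
    ]
    for table in tables:
        steps.append(("master", f"SELECT COUNT(*) FROM `{table}`"))
        steps.append(("slave", f"SELECT COUNT(*) FROM `{table}`"))
    steps.append(("master", "UNLOCK TABLES"))
    return steps


if __name__ == "__main__":
    # The file name and position below are placeholders, not values from this report.
    for server, sql in lock_phase() + count_phase(["orders"], "master-bin.001", 4):
        print(f"{server}: {sql}")
```

Holding the lock blocks all writes on the master for the duration of the check, so a check like this would be run in a quiet period.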
[23 Nov 2005 20:49] Brad Smith
I don't think the problem is that the slave still has queries to run.  I think that for some reason some of the queries are not getting run, but replication continues.  When one of the counts gets off, the difference in record counts for that table either stays the same or goes up, which indicates to me that some queries were missed.  I am going to upload the slave and master ini files as well.  I have the bin logs from each of the servers but cannot upload the zip files, as it takes too long to upload them.
[25 Nov 2005 16:10] Valeriy Kravchuk
So, if I understood you right, counts on the master are always >= counts on the slave? Or is it sometimes the other way around?

Is there any other evidence of replication problems besides these counts?

Can you provoke a case where some action is not replicated (insert a row, see that all is OK in both logs, then select from the slave and not get the row)? Have you analyzed the binlogs from master and slave to find any differences between them?
[30 Nov 2005 13:39] Brad Smith
Yes, the counts are off.

No.  This is the only evidence of replication problems.

I was hoping that I could provide you with the bin logs and you could identify the queries that were not being replicated.  We know which tables are causing the problem from the listed status.  I was hoping that you all would have some good tools for pulling specific queries out of the log files.
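
As a rough idea of what pulling specific queries out of the logs could involve: the `mysqlbinlog` utility shipped with the server converts a binary log to text, and that text can then be filtered per table. The sketch below assumes the text dump consists of SQL statements terminated by `;` at end of line, with `#` comment lines in between; the function name `statements_touching` is hypothetical, and the plain word match would also pick up statements that merely mention the name.

```python
import re
from typing import List


def statements_touching(dump_text: str, table: str) -> List[str]:
    """Return the statements in a text binlog dump that mention `table`.
    Comment lines (starting with '#') are dropped before matching."""
    pattern = re.compile(r"\b" + re.escape(table) + r"\b", re.IGNORECASE)
    hits = []
    for chunk in dump_text.split(";\n"):
        body = "\n".join(
            line for line in chunk.splitlines() if not line.startswith("#")
        ).strip()
        if body and pattern.search(body):
            hits.append(body)
    return hits


if __name__ == "__main__":
    # Invented sample text, only to show the filtering.
    sample = (
        "# at 4\n"
        "INSERT INTO orders VALUES (1);\n"
        "# at 90\n"
        "UPDATE users SET a=1;\n"
    )
    print(statements_touching(sample, "orders"))
```

Running the same filter over the dumps from master and slave and comparing the two lists would narrow down which statements for a problem table appear on one side only.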
[30 Nov 2005 15:37] Valeriy Kravchuk
Surely we can analyze your logs, if they are not too large to be uploaded and they include the problematic period. How large is each individual log?

But even a corrupted or misinterpreted log is not enough to call this a bug (it may be a hardware failure). That is why I asked for a repeatable sequence of actions that provokes the problem each and every time. At least, on your machine.
[2 Dec 2005 15:01] Brad Smith
I have the zipped files from the master and slave but am not able to upload them.  The master zipped file is 167 MB while the slave is 11 MB.  Is there an FTP site that I can upload them to, as opposed to using the web interface?

As for repeatable events: if I knew which queries were not being run on the slave, then I would be able to repeat the problem.  However, I am not able to identify the problematic queries.

It is unlikely that the problem is a hardware failure.
[3 Dec 2005 16:43] Valeriy Kravchuk
Analysis of a 100 MB file is an option, but the last one I'd want to use... So, please, check a similar bug report, http://bugs.mysql.com/bug.php?id=15318, and, because you are also using Windows, try setting KeepAliveTime and KeepAliveInterval in the registry of both master and slave to something like:

"KeepAliveTime"=dword:000927c0
"KeepAliveInterval"=dword:000003e8

See
http://www.microsoft.com/resources/documentation/Windows/2000/server/reskit/en-us/Default.... and
http://www.microsoft.com/resources/documentation/Windows/2000/server/reskit/en-us/Default....
for the details. 

Try to work with these parameters explicitly set and report the results (whether or not you still see any difference in the counts).
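
For reference, the two `dword:` values suggested above are hexadecimal, and both TCP/IP parameters are expressed in milliseconds. A small sketch to decode them (the helper name `dword_to_ms` is hypothetical):

```python
def dword_to_ms(reg_value: str) -> int:
    """Convert a .reg-style 'dword:XXXXXXXX' value to an integer.
    For KeepAliveTime and KeepAliveInterval the unit is milliseconds."""
    prefix = "dword:"
    if not reg_value.lower().startswith(prefix):
        raise ValueError(f"not a dword value: {reg_value!r}")
    return int(reg_value[len(prefix):], 16)


if __name__ == "__main__":
    print(dword_to_ms("dword:000927c0"))  # 600000 ms, i.e. 10 minutes
    print(dword_to_ms("dword:000003e8"))  # 1000 ms, i.e. 1 second
```

So the suggestion amounts to sending the first keep-alive probe after 10 minutes of idle time and retrying unanswered probes at 1-second intervals.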
[4 Jan 2006 0:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".