Bug #25917 MySQL master won't update mysql-bin.index.
Submitted: 29 Jan 2007 14:25 Modified: 13 Apr 2007 6:25
Reporter: Jason Williams Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.0.27-debug OS:Solaris (Solaris 10 (X64 & SPARC))
Assigned to: CPU Architecture:Any

[29 Jan 2007 14:25] Jason Williams
Description:
I have a master that's been running since the 13th of January. Since
the same date I've had a slave running against it, without any errors
or serious lag. Suddenly, this morning replication broke with
duplicate entry errors. My master claimed to be on mysql-bin.000014
with a significant LOG_POS. The slave however, claimed to be back on
mysql-bin.000001. Trying to skip forward, the slave would skip forward
as far as mysql-bin.0000003, still having duplicate entry errors
(assumedly because it had already been this far). I then tried to set
it to something much further forward such as mysql-bin.000010 at which
it gave me this error:

070129 14:00:12 [ERROR] Error reading packet from server: Could not
find first log file name in binary log index file ( server_errno=1236)
070129 14:00:12 [ERROR] Got fatal error 1236: 'Could not find first
log file name in binary log index file' from master when reading data
from binary log

On the master, the file does in fact exist, and the permissions are
correct. However, I did note that the mysql-bin.index has not been
updated since mysql-bin.000001, and the time stamp on the .index file
is on the 14th of January. I tried to manually add the missing
entries, but to no avail, the slave still couldn't find them. Then I
restarted the slave, and still the same problem.

Its as if the slave was moving right along, and then suddenly decided
to forget where it was and go back to mysql-bin.000001. 

BTW, the MySQL on both boxes is 5.0.27-debug on Solaris 10. The
binaries are the MySQL-built binaries. The master is a SunFire X4500 with 2x dual-core Opteron 2.6GHz CPUs and 16GB of RAM. The slave is a SunFire T2000 with a UltraSPARC T1 chip and 8GB of RAM.

The master .err log shows no errors at all. Also, the only major change in behavior recently, is that on the 22nd we started running FLUSH LOGS twice a day on the master. This was to change our backup procedure.

To pull a backup we run on the master:

SET AUTOCOMMIT=0;
FLUSH TABLES WITH READ LOCK;
FLUSH LOGS; <---Added command.
<run filesystem snapshot of DB and log volumes>
COMMIT;
UNLOCK TABLES;

How to repeat:
Not clear. Restarting the slave does not clear the issue. Not able to repeat, because not able to clear condition.
[29 Jan 2007 14:45] Jason Williams
Tried to CHANGE MASTER back to "mysql-bin.000001", and now the slave is reporting it can't find that log either (same 1236 error). Master report no problems to its own error log.
[29 Jan 2007 14:53] Jason Williams
Doing a RESET SLAVE, followed by CHANGE MASTER with MASTER_LOG_FILE specified to mysql-bin.000001 cause same 1236 error. 

However, doing a RESET SLAVE, followed by CHANGE MASTER without MASTER_LOG_FILE specified, causes slave to start replication at mysql-bin.000001. This causes slave to break replication because of duplicate entry errors. Using SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1000000000; to skip forward seems to move through logs files. Seeing how far this can go.
[29 Jan 2007 14:59] Jason Williams
Purged MASTER LOGS to 'mysql-bin.000014' on master, and then ran RESET SLAVE, followed by CHANGE MASTER without MASTER_LOG_FILE on slave; This caused slave to start replication at 'mysql-bin.000014'.

PURGE MASTER LOGS on master DID NOT update mysql-bin.index on master. Still has all logs files (as edited) up through mysql-bin.000015.
[29 Jan 2007 15:12] Jason Williams
Figured out what happened at time of initial replication break. We have been doing "FLUSH LOGS" once a day since the 22nd, except on the weekends. The last log created when we flushed was mysql-bin.000013 on Friday. Today at the time of the replication break (where the slave reset to mysql-bin.000001 and since couldn't find logs IF specified in CHANGE MASTER), the master rolled the log on its own to mysql-bin.000014 because .000013 had hit the max size for an individual log.

So the reproduction sequence appears to be do a FLUSH LOGS on the master, and then let the new log file grow large enough such that MySQL will roll it on its own. At this point, the slave will lose its brain and reset to .000001 or whatever it thinks is the first log in the sequence.
[30 Jan 2007 11:28] Valeriy Kravchuk
Thank you for a detailed bug report. Please, send your my.cnf from both master and slave, just for completeness.
[30 Jan 2007 22:18] Jason Williams
Master my.cnf

Attachment: my.cnf (application/octet-stream, text), 20.63 KiB.

[30 Jan 2007 22:20] Jason Williams
Slave my.cnf

Attachment: my.cnf (application/octet-stream, text), 20.61 KiB.

[30 Jan 2007 22:21] Jason Williams
Hi Valeriy,

Please find attached to the case master and slave my.cnf files. Thank you for reviewing this!
[12 Mar 2007 11:50] Valeriy Kravchuk
Sorry for a delay with this bug report. Please, try to use a newer version, 5.0.37, and check if you can repeat the same incorrect behaviour.
[12 Mar 2007 18:54] Jason Williams
Hi Valeriy,

I will attempt to check this. Thank you for the response.
[13 Mar 2007 6:25] Valeriy Kravchuk
Please, inform about any results.
[13 Apr 2007 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".