MySQL Bugs: #25917: MySQL master won't update mysql-bin.index.

Bug #25917	MySQL master won't update mysql-bin.index.
Submitted:	29 Jan 2007 14:25	Modified:	13 Apr 2007 6:25
Reporter:	Jason Williams	Email Updates:
Status:	No Feedback	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.0.27-debug	OS:	Solaris (Solaris 10 (X64 & SPARC))
Assigned to:		CPU Architecture:	Any

Description:
I have a master that's been running since the 13th of January. Since
the same date I've had a slave running against it, without any errors
or serious lag. Suddenly, this morning replication broke with
duplicate entry errors. My master claimed to be on mysql-bin.000014
with a significant LOG_POS. The slave however, claimed to be back on
mysql-bin.000001. Trying to skip forward, the slave would skip forward
as far as mysql-bin.0000003, still having duplicate entry errors
(assumedly because it had already been this far). I then tried to set
it to something much further forward such as mysql-bin.000010 at which
it gave me this error:

070129 14:00:12 [ERROR] Error reading packet from server: Could not
find first log file name in binary log index file ( server_errno=1236)
070129 14:00:12 [ERROR] Got fatal error 1236: 'Could not find first
log file name in binary log index file' from master when reading data
from binary log

On the master, the file does in fact exist, and the permissions are
correct. However, I did note that the mysql-bin.index has not been
updated since mysql-bin.000001, and the time stamp on the .index file
is on the 14th of January. I tried to manually add the missing
entries, but to no avail, the slave still couldn't find them. Then I
restarted the slave, and still the same problem.

Its as if the slave was moving right along, and then suddenly decided
to forget where it was and go back to mysql-bin.000001.

BTW, the MySQL on both boxes is 5.0.27-debug on Solaris 10. The
binaries are the MySQL-built binaries. The master is a SunFire X4500 with 2x dual-core Opteron 2.6GHz CPUs and 16GB of RAM. The slave is a SunFire T2000 with a UltraSPARC T1 chip and 8GB of RAM.

The master .err log shows no errors at all. Also, the only major change in behavior recently, is that on the 22nd we started running FLUSH LOGS twice a day on the master. This was to change our backup procedure.

To pull a backup we run on the master:

SET AUTOCOMMIT=0;
FLUSH TABLES WITH READ LOCK;
FLUSH LOGS; <---Added command.
<run filesystem snapshot of DB and log volumes>
COMMIT;
UNLOCK TABLES;

How to repeat:
Not clear. Restarting the slave does not clear the issue. Not able to repeat, because not able to clear condition.

Tried to CHANGE MASTER back to "mysql-bin.000001", and now the slave is reporting it can't find that log either (same 1236 error). Master report no problems to its own error log.

Doing a RESET SLAVE, followed by CHANGE MASTER with MASTER_LOG_FILE specified to mysql-bin.000001 cause same 1236 error. 

However, doing a RESET SLAVE, followed by CHANGE MASTER without MASTER_LOG_FILE specified, causes slave to start replication at mysql-bin.000001. This causes slave to break replication because of duplicate entry errors. Using SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1000000000; to skip forward seems to move through logs files. Seeing how far this can go.

Purged MASTER LOGS to 'mysql-bin.000014' on master, and then ran RESET SLAVE, followed by CHANGE MASTER without MASTER_LOG_FILE on slave; This caused slave to start replication at 'mysql-bin.000014'.

PURGE MASTER LOGS on master DID NOT update mysql-bin.index on master. Still has all logs files (as edited) up through mysql-bin.000015.

Figured out what happened at time of initial replication break. We have been doing "FLUSH LOGS" once a day since the 22nd, except on the weekends. The last log created when we flushed was mysql-bin.000013 on Friday. Today at the time of the replication break (where the slave reset to mysql-bin.000001 and since couldn't find logs IF specified in CHANGE MASTER), the master rolled the log on its own to mysql-bin.000014 because .000013 had hit the max size for an individual log.

So the reproduction sequence appears to be do a FLUSH LOGS on the master, and then let the new log file grow large enough such that MySQL will roll it on its own. At this point, the slave will lose its brain and reset to .000001 or whatever it thinks is the first log in the sequence.

Thank you for a detailed bug report. Please, send your my.cnf from both master and slave, just for completeness.

Master my.cnf

Attachment: my.cnf (application/octet-stream, text), 20.63 KiB.

Slave my.cnf

Attachment: my.cnf (application/octet-stream, text), 20.61 KiB.

Hi Valeriy,

Please find attached to the case master and slave my.cnf files. Thank you for reviewing this!

Sorry for a delay with this bug report. Please, try to use a newer version, 5.0.37, and check if you can repeat the same incorrect behaviour.

Hi Valeriy,

I will attempt to check this. Thank you for the response.

Please, inform about any results.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".