MySQL Bugs: #11370: Slave should retry on error 1236 rather than kill the IO thread

Bug #11370	Slave should retry on error 1236 rather than kill the IO thread
Submitted:	16 Jun 2005 2:15	Modified:	11 Dec 2008 0:22
Reporter:	Kolbe Kegel
Status:	No Feedback
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:		OS:	Any
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
When a slave encounters error number 1236 (ER_MASTER_FATAL_ERROR_READING_BINLOG), the IO thread exits, and the slave has to be manually restarted with START SLAVE.

This excerpt from the slave's error log shows an incident that stopped the slave and required a manual restart:

050610  6:30:52 Error reading packet from server: binlog truncated in the middle of event (server_errno=1236)
050610  6:30:52 Got fatal error 1236: 'binlog truncated in the middle of event' from master when reading data from binary log
050610  6:30:52 Slave I/O thread exiting, read up to log 'binary-log.1676', position 820347988

In the incident above, no action had to be taken on the master: executing START SLAVE on the slave was enough to resume replication.

How to repeat:
The cause of the stopped replication is unknown, but repeating the observed behavior is not necessary to make the suggested changes.

Suggested fix:
Instead of exiting, the slave should pause and then attempt to continue replication. It could follow the same procedure that is used when a master is unreachable (i.e. consult master_connect_retry). There may be other errors that could be treated in the same way.
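
The suggested behavior could be sketched roughly as follows. This is only an illustration of the retry idea, not MySQL server code: MasterError, run_io_thread, RETRYABLE_ERRNOS, and the connect_retry and max_retries parameters are hypothetical names chosen for this sketch, loosely modeled on master_connect_retry.

```python
import time
from typing import Callable, List, Optional

# Error number from this report (ER_MASTER_FATAL_ERROR_READING_BINLOG).
ER_MASTER_FATAL_ERROR_READING_BINLOG = 1236

# Assumption for this sketch: which errors are treated as transient.
RETRYABLE_ERRNOS = {ER_MASTER_FATAL_ERROR_READING_BINLOG}


class MasterError(Exception):
    """Hypothetical exception carrying the server_errno sent by the master."""

    def __init__(self, errno: int, message: str) -> None:
        super().__init__(message)
        self.errno = errno


def run_io_thread(
    read_event: Callable[[], Optional[bytes]],
    connect_retry: float = 60.0,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> List[bytes]:
    """Collect events until read_event returns None.

    On a retryable error, wait connect_retry seconds and try again
    instead of exiting. Give up after max_retries consecutive failures.
    """
    events: List[bytes] = []
    failures = 0
    while True:
        try:
            event = read_event()
        except MasterError as err:
            if err.errno not in RETRYABLE_ERRNOS or failures >= max_retries:
                raise  # not transient, or retried too often: stop as today
            failures += 1
            sleep(connect_retry)
            continue
        failures = 0  # a successful read resets the failure counter
        if event is None:
            return events
        events.append(event)
```

Capping the number of consecutive retries keeps a genuinely corrupted binlog from causing an endless loop, while a transient failure like the one in the log excerpt would be retried without manual intervention.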

This error usually indicates that either the master binlog got corrupted, or the slave thread lost track of the position. If the slave manages to successfully restart with START SLAVE and no changes to the position, this probably means we have a race condition on the master where the slave connection handler tries to read the binlog after an updating thread started writing to it, but before it finished.
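
The race described above can be illustrated with a toy model. The sketch below does not use the real binlog format: it assumes a simplified record layout (a 4-byte little-endian length followed by the payload), and read_event, TruncatedEvent, and demo are hypothetical names. It only shows why a reader that looks at the log between two partial writes sees a "truncated" event, and why retrying from the same position later succeeds.

```python
import io
import struct
from typing import Tuple


class TruncatedEvent(Exception):
    """Raised when the log ends in the middle of an event."""


def read_event(log: io.BytesIO, position: int) -> Tuple[bytes, int]:
    """Read one length-prefixed event; return (payload, next_position)."""
    log.seek(position)
    header = log.read(4)
    if len(header) < 4:
        raise TruncatedEvent("log ends inside the event header")
    (length,) = struct.unpack("<I", header)
    payload = log.read(length)
    if len(payload) < length:
        raise TruncatedEvent("log ends inside the event body")
    return payload, position + 4 + length


def demo() -> Tuple[bool, bytes, int]:
    log = io.BytesIO()
    event = b"UPDATE t SET x = 1"
    record = struct.pack("<I", len(event)) + event

    # The writer has flushed only part of the record so far.
    log.write(record[:10])

    # A reader arriving now sees a truncated event.
    try:
        read_event(log, 0)
        saw_truncation = False
    except TruncatedEvent:
        saw_truncation = True

    # The writer finishes the record.
    log.seek(0, io.SEEK_END)
    log.write(record[10:])

    # Retrying from the same position now succeeds.
    payload, next_position = read_event(log, 0)
    return saw_truncation, payload, next_position
```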

In our case, the server ignored a function return value indicating a failure during memory allocation. The confusion from this resulted in the 1236 error sent to the slave with no indication of an error on the master.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
