Bug #11370 Slave should retry on error 1236 rather than kill the IO thread
Submitted: 16 Jun 2005 2:15 Modified: 11 Dec 2008 0:22
Reporter: Kolbe Kegel
Status: No Feedback
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version: OS:Any
Assigned to: Assigned Account CPU Architecture:Any

[16 Jun 2005 2:15] Kolbe Kegel
Description:
When a slave encounters error number 1236 (ER_MASTER_FATAL_ERROR_READING_BINLOG), the IO thread exits, and the slave has to be manually restarted with START SLAVE.

This is an excerpt of an event that resulted in a stopped slave that required a manual restart:

050610  6:30:52 Error reading packet from server: binlog truncated in the middle of event (server_errno=1236)
050610  6:30:52 Got fatal error 1236: 'binlog truncated in the middle of event' from master when reading data from binary log
050610  6:30:52 Slave I/O thread exiting, read up to log 'binary-log.1676', position 820347988

No action had to be taken on the master before the slave was able to resume replication with the execution of START SLAVE in the incident outlined above.

How to repeat:
The cause of the stopped replication is unknown, but repeating the observed behavior is not necessary to make the suggested changes.

Suggested fix:
Instead of exiting, the slave should pause and attempt to continue replication. The same procedure could be followed that is used when a master is unreachable (i.e. consult master_connect_retry). There may be other errors that could be treated in this same way.
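
A minimal sketch of the suggested behavior, in Python rather than the server's own code. Everything here is hypothetical and for illustration only: MasterError, read_with_retry, make_flaky_reader and the read_event callback are stand-ins, not MySQL APIs, and the 60-second delay is only an example value for master_connect_retry. The retry cap is also an assumption, added so that a genuinely corrupted binlog still stops the thread eventually.

```python
import time

ER_MASTER_FATAL_ERROR_READING_BINLOG = 1236  # error number quoted in this report


class MasterError(Exception):
    """Hypothetical exception carrying the server_errno reported by the master."""

    def __init__(self, errno, message):
        super().__init__(message)
        self.errno = errno


def read_with_retry(read_event, position, connect_retry=60, max_retries=3,
                    sleep=time.sleep):
    """Read one event at `position`, pausing and retrying on error 1236.

    `read_event` stands in for the request sent to the master (assumption).
    Any other error, or error 1236 after `max_retries` retries, is re-raised,
    which corresponds to the I/O thread exiting as it does today.
    """
    attempts = 0
    while True:
        try:
            return read_event(position)
        except MasterError as err:
            retryable = err.errno == ER_MASTER_FATAL_ERROR_READING_BINLOG
            if not retryable or attempts >= max_retries:
                raise
            attempts += 1
            sleep(connect_retry)  # same pause as for an unreachable master


def make_flaky_reader(failures):
    """Build a fake reader that fails `failures` times with 1236, then succeeds."""
    remaining = {"count": failures}

    def read_event(position):
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise MasterError(ER_MASTER_FATAL_ERROR_READING_BINLOG,
                              "binlog truncated in the middle of event")
        return ("event", position)

    return read_event
```

With a reader that fails twice and then succeeds, read_with_retry(make_flaky_reader(2), 820347988) pauses twice and returns the event from the same position, which matches the incident above where START SLAVE succeeded without any change on the master.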
[13 Nov 2005 1:15] Alexander Pachev
This error usually indicates that either the master binlog got corrupted, or the slave thread lost track of the position. If the slave manages to successfully restart with START SLAVE and no changes to the position, this probably means we have a race condition on the master where the slave connection handler tries to read the binlog after an updating thread started writing to it, but before it finished.
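
To illustrate the suspected race, here is a small Python sketch. It assumes a simplified event layout (a 4-byte little-endian length followed by the payload), which is not the real binlog format, and encode_event and read_event are hypothetical helpers. A reader that looks at the log after the writer has appended only part of an event sees a truncated event; the same read from the same position succeeds once the write has finished.

```python
import struct

HEADER_SIZE = 4  # assumed layout: 4-byte length prefix, then the payload


def encode_event(payload):
    """Serialize one event in the simplified, made-up format."""
    return struct.pack("<I", len(payload)) + payload


def read_event(log, position):
    """Return (payload, next_position), or raise if the event is incomplete."""
    header_end = position + HEADER_SIZE
    if len(log) < header_end:
        raise EOFError("no complete event header at position %d" % position)
    (length,) = struct.unpack_from("<I", log, position)
    end = header_end + length
    if len(log) < end:
        # The header promises more bytes than the log currently holds.
        raise ValueError("binlog truncated in the middle of event")
    return bytes(log[header_end:end]), end


log = bytearray()
event = encode_event(b"UPDATE t SET a = 1")

log += event[:10]              # writer has appended only part of the event
try:
    read_event(log, 0)         # reader arrives too early
    first_read_failed = False
except ValueError:
    first_read_failed = True   # truncated event, as in error 1236

log += event[10:]              # writer finishes the event
payload, next_position = read_event(log, 0)  # same position now succeeds
```

The sketch only shows why a retry from an unchanged position can succeed; it does not model how the master actually synchronizes binlog writers and readers.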
[22 Jan 2008 5:50] Mark Callaghan
In our case, the server ignored a function return value indicating a failure during memory allocation. The confusion from this resulted in the 1236 error sent to the slave with no indication of an error on the master.
[11 Nov 2008 0:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".