MySQL Bugs: #70669: Slave can't continue replication after master's crash recovery

Bug #70669	Slave can't continue replication after master's crash recovery
Submitted:	20 Oct 2013 4:25	Modified:	27 Feb 2014 13:16
Reporter:	Yoshinori Matsunobu (OCA)
Status:	Closed
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.6.14	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
In MySQL 5.6, the following sequence may happen. In MySQL 5.1, this doesn't happen if sync_binlog=1, but in 5.6 this may happen regardless of sync_binlog settings.

1. master writes to binlog (writing to kernel buffer)
2. binlog dump threads read the binlog events and send to slaves
3. master flushes to binlog (fsync to binlog file)

If an OS crash happens between #2 and #3 on the master, then after the master's crash recovery the slaves can't continue replication, because the binlog events generated at #1 exist on the slaves but no longer exist on the master (server_errno=1236).

Another problem with this scenario is that applications get errors when committing these transactions, yet the transactions have already been replicated to the slaves.

This problem is expected on less durable settings (sync_binlog!=1), but should not happen on durable settings (sync_binlog=1).
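
The three-step sequence above can be illustrated with a toy model. This is only a sketch, not MySQL code: the class `ToyMaster` and all of its methods are hypothetical names, and "kernel buffer" and "disk" are modeled as two Python lists.

```python
class ToyMaster:
    """Toy model of a binlog: buffered events are lost on an OS crash."""

    def __init__(self):
        self.durable = []    # events that have been fsynced to disk
        self.buffered = []   # events written to the kernel buffer only

    def write(self, event):
        # Step 1: write to the kernel buffer, no fsync yet.
        self.buffered.append(event)

    def readable_events(self):
        # Step 2: a dump thread can read everything that was written,
        # whether or not it has reached disk.
        return self.durable + self.buffered

    def fsync(self):
        # Step 3: buffered events become durable.
        self.durable.extend(self.buffered)
        self.buffered.clear()

    def os_crash(self):
        # Unsynced data does not survive an OS crash.
        self.buffered.clear()


def simulate_crash_between_dump_and_fsync():
    master = ToyMaster()
    master.write("event-1")
    master.fsync()
    master.write("event-2")                        # step 1
    slave_events = list(master.readable_events())  # step 2
    master.os_crash()                              # crash before step 3
    return slave_events, master.readable_events()


slave_log, master_log = simulate_crash_between_dump_and_fsync()
print("slave has: ", slave_log)
print("master has:", master_log)
```

In this model the slave ends up holding an event that the master no longer has after recovery, which is the situation described above.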

How to repeat:
Set up normal master/slave replication, and set sync_binlog=1 on the master.

Set a breakpoint at MYSQL_BIN_LOG::sync_binlog_file (using gdb).
Run any insert statement on the master; gdb hits the breakpoint. Then check whether the insert has been replicated to the slaves. I could reproduce this: the insert was replicated, which means binlog events are sent to slaves before fsync is called on the master.

Another simple approach: run heavy auto-committed inserts/updates on the master. Kill the OS on the master while the benchmark is running. Restart the master and check whether the slaves can continue replication.
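
For the final check, a small helper along these lines may be convenient. It is a sketch under assumptions: it expects one SHOW SLAVE STATUS row already fetched as a dict, it assumes error 1236 is reported in the Last_IO_Errno column with Slave_IO_Running set to "No", and the function name and the sample rows are hypothetical (they are not output from a real server).

```python
# Error number quoted in the report (server_errno=1236).
MASTER_FATAL_ERROR_READING_BINLOG = 1236


def slave_lost_master_events(status_row):
    """Return True if a SHOW SLAVE STATUS row shows the I/O thread
    stopped with error 1236."""
    io_running = status_row.get("Slave_IO_Running") == "Yes"
    io_errno = int(status_row.get("Last_IO_Errno", 0))
    return (not io_running) and io_errno == MASTER_FATAL_ERROR_READING_BINLOG


# Hypothetical rows for illustration only.
healthy = {"Slave_IO_Running": "Yes", "Last_IO_Errno": "0"}
broken = {"Slave_IO_Running": "No", "Last_IO_Errno": "1236"}

print(slave_lost_master_events(healthy))
print(slave_lost_master_events(broken))
```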

Suggested fix:
In 5.6, LOCK_log is held while writing to the kernel buffer (flush_cache_to_file()), but is released before calling fsync() (sync_binlog_file()). So binlog dump threads may read binlog events before fsync() completes. This causes the problem. Either holding LOCK_log until fsync() completes (the 5.1 approach), or making binlog dump threads wait until fsync() completes, would be needed.
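
The difference between the two locking schemes can be sketched with a toy model. This is not the server's implementation: `ToyBinlog`, `commit`, `try_dump`, and the `before_sync` hook are hypothetical names, and a `threading.Lock` merely stands in for LOCK_log.

```python
import threading


class ToyBinlog:
    def __init__(self):
        self.lock_log = threading.Lock()  # stands in for LOCK_log
        self.written = []                 # events in the kernel buffer
        self.synced_upto = 0              # number of events on disk

    def commit(self, event, hold_lock_during_sync, before_sync=None):
        self.lock_log.acquire()
        self.written.append(event)            # like flush_cache_to_file()
        if not hold_lock_during_sync:
            self.lock_log.release()           # 5.6 behaviour per the report
        if before_sync:
            before_sync()                     # runs in the window before fsync
        self.synced_upto = len(self.written)  # like sync_binlog_file()
        if hold_lock_during_sync:
            self.lock_log.release()           # 5.1 approach: release after fsync

    def try_dump(self):
        """Return the unsynced events a dump thread can see,
        or None if it is blocked on the lock."""
        if not self.lock_log.acquire(blocking=False):
            return None
        try:
            return self.written[self.synced_upto:]
        finally:
            self.lock_log.release()


def observe(hold_lock_during_sync):
    """Attempt a dump in the window between write and fsync."""
    log = ToyBinlog()
    seen = []
    log.commit("event-1", hold_lock_during_sync,
               before_sync=lambda: seen.append(log.try_dump()))
    return seen[0]


print(observe(False))  # lock released early: unsynced event is visible
print(observe(True))   # lock held through fsync: dump attempt is blocked
```

With the lock released before the sync, the dump attempt sees an event that is not yet on disk; with the lock held until the sync completes, the attempt cannot proceed until the event is durable.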

Hello Yoshinori,

Thank you for the bug report.
Verified as described.

Thanks,
Umesh

Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

Fixed in 5.6+. Documented fix in the 5.6.17 and 5.7.4 changelogs as follows:

        Binary log events could be sent to slaves before they were flushed
        to disk on the master, even when sync_binlog was set to 1. This
        could lead to either of the following two issues when
        the master was restarted following a crash of the operating
        system:

            ·Replication cannot continue because one or more slaves are
            requesting to replicate events that do not exist on the master.

            ·Data exists on one or more slaves, but not on the master.

        Such problems are expected with less durable settings (sync_binlog
        not equal to 1), but they should not happen when sync_binlog is 1.
        To fix this issue, a lock (LOCK_log) is now held during
        synchronization and is released only after the binary events are
        actually written to disk.

Closed.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html

5.6$ bzr log -r 5838
------------------------------------------------------------
revno: 5838
committer: Libing Song <libing.song@oracle.com>
branch nick: mysql-5.6
timestamp: Tue 2014-02-25 09:39:34 +0800
message:
  BUG#17632285 SLAVE CAN'T CONTINUE REPLICATION AFTER MASTER'S
               CRASH RECOVERY
  
  Binary events might be sent to slaves before they are flushed
  to disk on master, even if sync_binlog is set to 1. It can cause
  two problems if the master restarts after an OS crash.
  * Replication cannot continue because the slaves are
    requesting to replicate events that don't exist on master.
  * Data exists on slaves, but does not exist on the master.
  
  The problems are expected on less durable settings
  (sync_binlog != 1), but they should not happen on the durable
  setting (sync_binlog = 1).
  
  Since the 5.6 binlog group commit implementation, binlog write
  and sync have been protected by separate mutexes. So dump
  threads can read the binary events at the same time as, or even
  before, they are synced to disk.
  
  To fix the problem on the durable setting, LOCK_log is held
  in the sync stage and is released after the binary events are
  synced to disk.