Bug #76746 Broken replication on SQL thread restart if gtid_mode is enabled
Submitted: 19 Apr 2015 5:12
Modified: 26 Aug 2015 10:54
Reporter: Davi Arnaut (OCA)
Status: Closed
Category: MySQL Server: Replication
Severity: S1 (Critical)
Version: 5.6
OS: Any
Assigned to:
CPU Architecture: Any
Tags: binlog checksum, broken replication, GTID, relay log, sql thread

[19 Apr 2015 5:12] Davi Arnaut
Description:
When the SQL thread starts, it needs to search the current relay log for the
Format_description event of the master before seeking to the last executed
position. This search starts at the beginning of the relay log and continues
until an event other than Format_description or Rotate is found. This method
relies on the layout of the beginning of a relay log, which is usually
composed of the Format_description event of the slave followed by the Rotate
and Format_description events from the master.

The problem is that if GTID mode is enabled, the relay log contains a
Previous_gtids event before the Rotate event, which breaks the aforementioned
method as the search will stop before the Format_description event of the
master is read. That is, the search stops once the Previous_gtids event is
found. This causes the SQL thread to use its own Format_description event to
process events from the master, leading to broken replication in a few
scenarios (for example, when binary log checksum is enabled on master but not
on slave).
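
The search described above can be modelled with a short Python sketch. This is
not the server code: the relay log is reduced to a list of (event type, origin)
pairs, and the names fd_in_effect, before_fix and after_fix are hypothetical.
The Gtid and Query events that end the scan are assumed only for illustration.

```python
# Simplified model of the SQL thread's startup scan (not the actual server code).
FD, ROTATE, PREV_GTIDS = "Format_description", "Rotate", "Previous_gtids"


def fd_in_effect(relay_log, skippable):
    """Scan from the start of the relay log and return the origin of the
    last Format_description event seen before the scan stops."""
    origin = None
    for ev_type, ev_origin in relay_log:
        if ev_type == FD:
            origin = ev_origin          # remember the most recent FD event
        elif ev_type not in skippable:
            break                       # any other event ends the search
    return origin


# Usual layout without GTIDs: slave FD, then Rotate and FD from the master.
relay_log_gtid_off = [
    (FD, "slave"), (ROTATE, "master"), (FD, "master"), ("Query", "master"),
]
# With GTID mode on, a Previous_gtids event precedes the Rotate event.
relay_log_gtid_on = [
    (FD, "slave"), (PREV_GTIDS, "slave"), (ROTATE, "master"),
    (FD, "master"), ("Gtid", "master"), ("Query", "master"),
]

before_fix = {ROTATE}              # only Rotate is skipped
after_fix = {ROTATE, PREV_GTIDS}   # Previous_gtids is skipped as well

print(fd_in_effect(relay_log_gtid_off, before_fix))  # master
print(fd_in_effect(relay_log_gtid_on, before_fix))   # slave
print(fd_in_effect(relay_log_gtid_on, after_fix))    # master
```

In this model the only difference between the broken and the repaired search
is whether Previous_gtids belongs to the set of events that may be skipped.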

How to repeat:
See attached patch.
[19 Apr 2015 5:14] Davi Arnaut
Test case and fix.

Attachment: SQL-thread-init-previous-gtid.patch (application/octet-stream, text), 4.47 KiB.

[19 Apr 2015 10:13] MySQL Verification Team
Hello Davi Arnaut,

Thank you for the report, test case and contribution.

Thanks,
Umesh
[26 Aug 2015 10:54] David Moss
Thanks for your feedback, this is fixed in upcoming versions, and the following was noted in the 5.6.27 changelog:
When a master with --binlog_checksum=none and --gtid-mode=ON was replicating to a slave with --binlog_checksum=crc32, restarting the slave's SQL thread caused an Event crc check error. This was due to the Format_description_log_event from the master not being correctly found in existing relay logs after restarting the slave's SQL thread. The fix ensures that the Previous_gtids_log_event is correctly skipped and that the correct Format_description_log_event is found in existing relay logs after restarting the slave's SQL thread.
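
To illustrate why picking the wrong Format_description event produces a crc check error, here is a rough Python sketch. It assumes a simplified event layout in which a checksummed event ends with a 4-byte little-endian CRC32 of the preceding bytes; the function verify_event and the sample payloads are hypothetical and do not reproduce the server's real event format.

```python
import struct
import zlib


def verify_event(raw: bytes, checksum_alg: str) -> bool:
    """Check an event the way a reader would, given the checksum algorithm
    announced by the Format_description event it believes is in effect."""
    if checksum_alg == "none":
        return True                      # nothing to verify
    body, trailer = raw[:-4], raw[-4:]   # treat the last 4 bytes as a CRC32
    return zlib.crc32(body) == struct.unpack("<I", trailer)[0]


# Event as written by a master with binlog_checksum=none: no trailer at all.
plain_event = b"hypothetical event bytes written without a checksum"

# Event as written with binlog_checksum=crc32: body plus CRC32 trailer.
body = b"hypothetical event bytes"
checksummed_event = body + struct.pack("<I", zlib.crc32(body))

# Correct Format_description (master's, checksum none): event is accepted.
print(verify_event(plain_event, "none"))
# Wrong Format_description (slave's own, crc32): the final payload bytes are
# misread as a checksum, so verification is expected to fail.
print(verify_event(plain_event, "crc32"))
# A genuinely checksummed event verifies under crc32.
print(verify_event(checksummed_event, "crc32"))
```

The opposite mismatch, described in the original report (checksum enabled on the master but not on the slave), would instead leave the 4 trailing checksum bytes to be parsed as part of the event body.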