Bug #76746	Broken replication on SQL thread restart if gtid_mode is enabled
Submitted:	19 Apr 2015 5:12	Modified:	26 Aug 2015 10:54
Reporter:	Davi Arnaut (OCA)
Status:	Closed
Category:	MySQL Server: Replication	Severity:	S1 (Critical)
Version:	5.6	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	binlog checksum, broken replication, GTID, relay log, sql thread

Description:
When the SQL thread starts, it needs to search the current relay log for the
Format_description event of the master before seeking to the last executed
position. This search starts at the beginning of the relay log and continues
until an event other than Format_description or Rotate is found. This method
relies on the layout of the beginning of a relay log, which is usually
composed of the Format_description event of the slave followed by the Rotate
and Format_description events from the master.
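
The search described above can be sketched roughly as follows. This is an illustrative model only, not the server's code: the `Event` class, its `origin` field, and `find_master_fd` are hypothetical names, and the real implementation reads binary events from the relay log file rather than from a list.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str    # e.g. "Format_description", "Rotate", "Query"
    origin: str  # "slave" or "master" (hypothetical tag for this sketch)


def find_master_fd(relay_log):
    """Scan from the start of the relay log and return the master's
    Format_description event, stopping at the first event that is
    neither Format_description nor Rotate."""
    master_fd = None
    for ev in relay_log:
        if ev.kind not in ("Format_description", "Rotate"):
            break  # first "real" event reached: the header is over
        if ev.kind == "Format_description" and ev.origin == "master":
            master_fd = ev
    return master_fd


# Usual layout at the beginning of a relay log (GTID mode disabled).
relay_log = [
    Event("Format_description", "slave"),
    Event("Rotate", "master"),
    Event("Format_description", "master"),
    Event("Query", "master"),
]
found = find_master_fd(relay_log)
```

With this layout the scan passes over the slave's own Format_description and the Rotate event, picks up the master's Format_description, and stops at the first Query event.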

The problem is that if GTID mode is enabled, the relay log contains a
Previous_gtids event before the Rotate event, which breaks the aforementioned
method as the search will stop before the Format_description event of the
master is read. That is, the search stops once the Previous_gtids event is
found. This causes the SQL thread to use its own Format_description event to
process events from the master, leading to broken replication in a few
scenarios (for example, when binary log checksum is enabled on master but not
on slave).
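
A minimal sketch of the failure, assuming the relay log layout described above. The tuples, the `find_master_fd` helper, and the `skippable` parameter are hypothetical and only model the behavior; this is not the attached patch.

```python
def find_master_fd(relay_log, skippable):
    """Return the master's Format_description event, scanning from the
    start and stopping at the first event whose kind is not skippable."""
    master_fd = None
    for kind, origin in relay_log:
        if kind not in skippable:
            break  # the search ends here
        if kind == "Format_description" and origin == "master":
            master_fd = (kind, origin)
    return master_fd


# Beginning of a relay log when GTID mode is enabled: a Previous_gtids
# event sits before the Rotate event.
gtid_relay_log = [
    ("Format_description", "slave"),
    ("Previous_gtids", "slave"),
    ("Rotate", "master"),
    ("Format_description", "master"),
    ("Query", "master"),
]

# Reported behavior: only Format_description and Rotate are skipped.
before_fix = find_master_fd(gtid_relay_log, {"Format_description", "Rotate"})

# Behavior if Previous_gtids is skipped as well.
after_fix = find_master_fd(
    gtid_relay_log, {"Format_description", "Rotate", "Previous_gtids"}
)
```

In this model `before_fix` is `None`, because the scan ends at the Previous_gtids event before reaching the master's Format_description, while `after_fix` is the master's event.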

How to repeat:
See attached patch.

Test case and fix.

Attachment: SQL-thread-init-previous-gtid.patch (application/octet-stream, text), 4.47 KiB.

Hello Davi Arnaut,

Thank you for the report, test case and contribution.

Thanks,
Umesh

Thanks for your feedback. This is fixed in upcoming versions, and the following was noted in the 5.6.27 changelog:
When a master with --binlog_checksum=none and --gtid-mode=ON was replicating to a slave with --binlog_checksum=crc32, restarting the slave's SQL thread caused an Event crc check error. This was due to the Format_description_log_event from the master not being correctly found in existing relay logs after restarting the slave's SQL thread. The fix ensures that the Previous_gtids_log_event is correctly skipped and that the correct Format_description_log_event is found in existing relay logs after restarting the slave's SQL thread.