Bug #92882 MTS not replication crash-safe with GTID and all the right parameters.
Submitted: 22 Oct 14:47 Modified: 23 Oct 12:58
Reporter: Jean-François Gagné Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.7.23 OS:Any
Assigned to: CPU Architecture:Any

[22 Oct 14:47] Jean-François Gagné
Description:
Hi,

when using MTS in MySQL 5.7, replication should be crash-safe if using GTID, with relay_log_info_repository to TABLE and relay_log_recovery ON.  However, as shown in how to repeat, it is not the case.

From what I gather, things are failing when relay-log-recovery is managing the gap in transaction execution by trying to look at the relay logs.  Because of the un-synced relay logs have been lost by the crash, things go wring.  So we are in a specific case of Bug#81840.  However, I do not think this should be considered as a duplicate as this bug is in the specific case of GTIDs, more on that below.

Note that things are not failing if replacing an OS crash by a mysqld crash: kill -9 $(cat node1/data/mysql_sandbox*.pid).  In this case, things are not failing because the relay logs are not lost.

Once in this situation (after an OS crash), the replication breakage is easy to fix by running "stop slave; reset slave; start slave;".  This will wipe the relay logs, reconnect to the master, and thanks to GTID download the missing transaction.

However, even if this is easy to fix manually, this situation should not happen: relay-log-recovery should not fail when GTID Auto_Position is enabled as the table mysql.slave_worker_info is not needed in this case.

Many thanks for looking into that,

JFG

How to repeat:
Create a MTS supposedly crash-safe replication environment using dbdeployer.  We need SBR to generate a gap in slave execution later.

$ dbdeployer deploy replication mysql_5.7.23 --gtid \
  -c "binlog_format = 'STATEMENT'" \
  -c "slave_parallel_workers = 10" \
  -c skip-slave-start

$ ./s1 -N <<< 'show global variables like "%relay_log_info_repository%"'
relay_log_info_repository       TABLE

$ ./s1 -N <<< 'show global variables like "%relay_log_r%"'
relay_log_recovery      ON

$ ./s1 <<< 'show slave status\G' | grep Auto_Position
                Auto_Position: 1

Create two schemas and one table per schema for using MTS:

$ ./m <<< "
    create database test_jfg1;
    create database test_jfg2;
    create table test_jfg1.t(id int primary key, v int);
    create table test_jfg2.t(id int primary key, v int)"

Stop one slave and insert data on the master using sleep to cause a gap on slave thanks to SBR (the insert will take some time because of the sleep):

$ ./s1 <<< "stop slave;"; \
  ./m <<< "
    insert into test_jfg1.t values(1,1);
    insert into test_jfg1.t values(2,sleep(30));
    insert into test_jfg2.t values(1,1)"

Start the slave, and because of SBR we expect a gap which we validate by select and show master status.  We then crash the OS.

$ ./s1 <<< "
    start slave;
    do sleep(5);
    select * from test_jfg1.t;
    select * from test_jfg2.t;
    show master status"; \
  sudo bash -c "echo c > /proc/sysrq-trigger"
id      v
1       1
id      v
1       1
File    Position        Binlog_Do_DB    Binlog_Ignore_DB        Executed_Gtid_Set
mysql-bin.000001        154                     00019024-1111-1111-1111-111111111111:1-21:23

After Linux has restarted, and after restarting MySQL (with ./start_all), I have the following in the error logs:

2018-10-22T14:21:13.372198Z 0 [Note] Slave: MTS group recovery relay log info based on Worker-Id 8, group_relay_log_name ./mysql-relay.000003, group_relay_log_pos 1320 group_master_log_name mysql-bin.000001, group_master_log_pos 5685
2018-10-22T14:21:13.373259Z 0 [ERROR] Error in Log_event::read_log_event(): 'Event too small', data_len: 0, event_type: 0
2018-10-22T14:21:13.373284Z 0 [ERROR] Error in Log_event::read_log_event(): 'Event too small', data_len: 0, event_type: 0
2018-10-22T14:21:13.373318Z 0 [ERROR] Error looking for file after ./mysql-relay.000004.
2018-10-22T14:21:13.376426Z 0 [ERROR] Slave: Failed to initialize the master info structure for channel ''; its record may still be present in 'mysql.slave_master_info' table, consider deleting it.
2018-10-22T14:21:13.376469Z 0 [ERROR] Failed to create or recover replication info repositories.
2018-10-22T14:21:13.376481Z 0 [Note] Some of the channels are not created/initialized properly. Check for additional messages above. You will not be able to start replication on those channels until the issue is resolved and the server restarted.

Suggested fix:
Do not fail relay-log-recovery in the case of vanished relay logs when GTID Auto_Position is enabled as the table mysql.slave_worker_info is not needed in this case.
[22 Oct 15:35] Jeremy Tinley
This affects us on 5.6.41 as well. Full durability, crash safe replication with multi-threaded fails on MySQL crash.
[23 Oct 12:58] Umesh Shastry
Hello Jean-François,

Thank you for the report and test case.
Observed this with 5.7.23 build.

regards,
Umesh
[23 Oct 13:00] Umesh Shastry
5.7.23 - test results

Attachment: 92882_5.7.23.results (application/octet-stream, text), 13.77 KiB.