MySQL Bugs: #92882: MTS not replication crash-safe with GTID and all the right parameters.

Bug #92882	MTS not replication crash-safe with GTID and all the right parameters.
Submitted:	22 Oct 2018 14:47	Modified:	4 Sep 2019 14:09
Reporter:	J-F Legacy Gagné	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.7.23	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Hi,

when using MTS in MySQL 5.7, replication should be crash-safe if using GTID, with relay_log_info_repository to TABLE and relay_log_recovery ON.  However, as shown in how to repeat, it is not the case.

From what I gather, things are failing when relay-log-recovery is managing the gap in transaction execution by trying to look at the relay logs.  Because of the un-synced relay logs have been lost by the crash, things go wring.  So we are in a specific case of Bug#81840.  However, I do not think this should be considered as a duplicate as this bug is in the specific case of GTIDs, more on that below.

Note that things are not failing if replacing an OS crash by a mysqld crash: kill -9 $(cat node1/data/mysql_sandbox*.pid).  In this case, things are not failing because the relay logs are not lost.

Once in this situation (after an OS crash), the replication breakage is easy to fix by running "stop slave; reset slave; start slave;".  This will wipe the relay logs, reconnect to the master, and thanks to GTID download the missing transaction.

However, even if this is easy to fix manually, this situation should not happen: relay-log-recovery should not fail when GTID Auto_Position is enabled as the table mysql.slave_worker_info is not needed in this case.

Many thanks for looking into that,

JFG

How to repeat:
Create a MTS supposedly crash-safe replication environment using dbdeployer.  We need SBR to generate a gap in slave execution later.

$ dbdeployer deploy replication mysql_5.7.23 --gtid \
  -c "binlog_format = 'STATEMENT'" \
  -c "slave_parallel_workers = 10" \
  -c skip-slave-start

$ ./s1 -N <<< 'show global variables like "%relay_log_info_repository%"'
relay_log_info_repository       TABLE

$ ./s1 -N <<< 'show global variables like "%relay_log_r%"'
relay_log_recovery      ON

$ ./s1 <<< 'show slave status\G' | grep Auto_Position
                Auto_Position: 1

Create two schemas and one table per schema for using MTS:

$ ./m <<< "
    create database test_jfg1;
    create database test_jfg2;
    create table test_jfg1.t(id int primary key, v int);
    create table test_jfg2.t(id int primary key, v int)"

Stop one slave and insert data on the master using sleep to cause a gap on slave thanks to SBR (the insert will take some time because of the sleep):

$ ./s1 <<< "stop slave;"; \
  ./m <<< "
    insert into test_jfg1.t values(1,1);
    insert into test_jfg1.t values(2,sleep(30));
    insert into test_jfg2.t values(1,1)"

Start the slave, and because of SBR we expect a gap which we validate by select and show master status.  We then crash the OS.

$ ./s1 <<< "
    start slave;
    do sleep(5);
    select * from test_jfg1.t;
    select * from test_jfg2.t;
    show master status"; \
  sudo bash -c "echo c > /proc/sysrq-trigger"
id      v
1       1
id      v
1       1
File    Position        Binlog_Do_DB    Binlog_Ignore_DB        Executed_Gtid_Set
mysql-bin.000001        154                     00019024-1111-1111-1111-111111111111:1-21:23

After Linux has restarted, and after restarting MySQL (with ./start_all), I have the following in the error logs:

2018-10-22T14:21:13.372198Z 0 [Note] Slave: MTS group recovery relay log info based on Worker-Id 8, group_relay_log_name ./mysql-relay.000003, group_relay_log_pos 1320 group_master_log_name mysql-bin.000001, group_master_log_pos 5685
2018-10-22T14:21:13.373259Z 0 [ERROR] Error in Log_event::read_log_event(): 'Event too small', data_len: 0, event_type: 0
2018-10-22T14:21:13.373284Z 0 [ERROR] Error in Log_event::read_log_event(): 'Event too small', data_len: 0, event_type: 0
2018-10-22T14:21:13.373318Z 0 [ERROR] Error looking for file after ./mysql-relay.000004.
2018-10-22T14:21:13.376426Z 0 [ERROR] Slave: Failed to initialize the master info structure for channel ''; its record may still be present in 'mysql.slave_master_info' table, consider deleting it.
2018-10-22T14:21:13.376469Z 0 [ERROR] Failed to create or recover replication info repositories.
2018-10-22T14:21:13.376481Z 0 [Note] Some of the channels are not created/initialized properly. Check for additional messages above. You will not be able to start replication on those channels until the issue is resolved and the server restarted.

Suggested fix:
Do not fail relay-log-recovery in the case of vanished relay logs when GTID Auto_Position is enabled as the table mysql.slave_worker_info is not needed in this case.

This affects us on 5.6.41 as well. Full durability, crash safe replication with multi-threaded fails on MySQL crash.

Hello Jean-François,

Thank you for the report and test case.
Observed this with 5.7.23 build.

regards,
Umesh

5.7.23 - test results

Attachment: 92882_5.7.23.results (application/octet-stream, text), 13.77 KiB.

Maybe related: Bug#83713 and Bug#93397.

Probably related to Bug#81840, and this might be a regression introduced by the fix of Bug#77496.

Posted by developer:
 
Changelog entry added for MySQL 8.0.18 and 5.7.28:

On a multi-threaded slave with GTIDs in use and MASTER_AUTO_POSITION set to ON, following an unexpected halt the slave would attempt relay log recovery, which failed if relay logs had been lost, preventing replication from starting. However, this step was unnecessary as GTID auto-positioning can be used to restore any missing transactions. In a recovery situation, the slave now checks first whether MASTER_AUTO_POSITION is set to ON, and if it is, skips relay log recovery. 

Documentation change made in
https://dev.mysql.com/doc/refman/8.0/en/replication-solutions-unexpected-slave-halt.html
https://dev.mysql.com/doc/refman/5.7/en/replication-solutions-unexpected-slave-halt.html
(topic needs further revision, in progress).