Description:
Hi,
when using MTS in MySQL 5.7, replication should be crash-safe if using GTID, with relay_log_info_repository to TABLE and relay_log_recovery ON. However, as shown in how to repeat, it is not the case.
From what I gather, things are failing when relay-log-recovery is managing the gap in transaction execution by trying to look at the relay logs. Because of the un-synced relay logs have been lost by the crash, things go wring. So we are in a specific case of Bug#81840. However, I do not think this should be considered as a duplicate as this bug is in the specific case of GTIDs, more on that below.
Note that things are not failing if replacing an OS crash by a mysqld crash: kill -9 $(cat node1/data/mysql_sandbox*.pid). In this case, things are not failing because the relay logs are not lost.
Once in this situation (after an OS crash), the replication breakage is easy to fix by running "stop slave; reset slave; start slave;". This will wipe the relay logs, reconnect to the master, and thanks to GTID download the missing transaction.
However, even if this is easy to fix manually, this situation should not happen: relay-log-recovery should not fail when GTID Auto_Position is enabled as the table mysql.slave_worker_info is not needed in this case.
Many thanks for looking into that,
JFG
How to repeat:
Create a MTS supposedly crash-safe replication environment using dbdeployer. We need SBR to generate a gap in slave execution later.
$ dbdeployer deploy replication mysql_5.7.23 --gtid \
-c "binlog_format = 'STATEMENT'" \
-c "slave_parallel_workers = 10" \
-c skip-slave-start
$ ./s1 -N <<< 'show global variables like "%relay_log_info_repository%"'
relay_log_info_repository TABLE
$ ./s1 -N <<< 'show global variables like "%relay_log_r%"'
relay_log_recovery ON
$ ./s1 <<< 'show slave status\G' | grep Auto_Position
Auto_Position: 1
Create two schemas and one table per schema for using MTS:
$ ./m <<< "
create database test_jfg1;
create database test_jfg2;
create table test_jfg1.t(id int primary key, v int);
create table test_jfg2.t(id int primary key, v int)"
Stop one slave and insert data on the master using sleep to cause a gap on slave thanks to SBR (the insert will take some time because of the sleep):
$ ./s1 <<< "stop slave;"; \
./m <<< "
insert into test_jfg1.t values(1,1);
insert into test_jfg1.t values(2,sleep(30));
insert into test_jfg2.t values(1,1)"
Start the slave, and because of SBR we expect a gap which we validate by select and show master status. We then crash the OS.
$ ./s1 <<< "
start slave;
do sleep(5);
select * from test_jfg1.t;
select * from test_jfg2.t;
show master status"; \
sudo bash -c "echo c > /proc/sysrq-trigger"
id v
1 1
id v
1 1
File Position Binlog_Do_DB Binlog_Ignore_DB Executed_Gtid_Set
mysql-bin.000001 154 00019024-1111-1111-1111-111111111111:1-21:23
After Linux has restarted, and after restarting MySQL (with ./start_all), I have the following in the error logs:
2018-10-22T14:21:13.372198Z 0 [Note] Slave: MTS group recovery relay log info based on Worker-Id 8, group_relay_log_name ./mysql-relay.000003, group_relay_log_pos 1320 group_master_log_name mysql-bin.000001, group_master_log_pos 5685
2018-10-22T14:21:13.373259Z 0 [ERROR] Error in Log_event::read_log_event(): 'Event too small', data_len: 0, event_type: 0
2018-10-22T14:21:13.373284Z 0 [ERROR] Error in Log_event::read_log_event(): 'Event too small', data_len: 0, event_type: 0
2018-10-22T14:21:13.373318Z 0 [ERROR] Error looking for file after ./mysql-relay.000004.
2018-10-22T14:21:13.376426Z 0 [ERROR] Slave: Failed to initialize the master info structure for channel ''; its record may still be present in 'mysql.slave_master_info' table, consider deleting it.
2018-10-22T14:21:13.376469Z 0 [ERROR] Failed to create or recover replication info repositories.
2018-10-22T14:21:13.376481Z 0 [Note] Some of the channels are not created/initialized properly. Check for additional messages above. You will not be able to start replication on those channels until the issue is resolved and the server restarted.
Suggested fix:
Do not fail relay-log-recovery in the case of vanished relay logs when GTID Auto_Position is enabled as the table mysql.slave_worker_info is not needed in this case.