Description:
A master instance and a slave instance were running normally. Many transactions were being executed on the master and replicated to the slave.
I do not know when the slave's SQL thread hit an error, so I restarted MySQL on the slave. During startup, mysqld printed many messages like the following and then failed to create or recover the replication info repositories.
2018-08-22T02:12:09.926908Z 0 [Note] Slave: MTS group recovery relay log info based on Worker-Id 0, group_relay_log_name ../relaylog/relay-bin.000432, group_relay_log_pos 2824967 group_master_log_name mysql-bin.000159, group_master_log_pos 2824754
2018-08-22T02:12:09.983572Z 0 [Note] Slave: MTS group recovery relay log info group_master_log_name mysql-bin.000159, event_master_log_pos 4225595.
2018-08-22T02:12:09.998740Z 0 [Note] Slave: MTS group recovery relay log info group_master_log_name mysql-bin.000159, event_master_log_pos 4503416.
..................................................................................................................................................
..................................................................................................................................................
2018-08-22T02:12:32.408856Z 0 [Note] Slave: MTS group recovery relay log info group_master_log_name mysql-bin.000159, event_master_log_pos 424784.
2018-08-22T02:12:32.408921Z 0 [ERROR] Error looking for file after ../relaylog/relay-bin.000698.
2018-08-22T02:12:32.466196Z 0 [ERROR] Failed to initialize the master info structure
2018-08-22T02:12:32.466238Z 0 [ERROR] Failed to create or recover replication info repositories.
The content of relay-log.info is as follows:
../relaylog/relay-bin.000034
241
mysql-bin.000003
4
The content of worker-relay-log.info.1 is as follows:
../relaylog/relay-bin.000033
28974
mysql-bin.000003
1514067
The relay log position of the global checkpoint for parallel replay on the slave is (relay-bin.000034, 241). The relay log position of the checkpoint for parallel-replay worker 1 is (relay-bin.000033, 28974). relay-bin.000034 is newer than relay-bin.000033.
The binlog position of the global checkpoint for parallel replay on the slave is (mysql-bin.000003, 4). The binlog position of the checkpoint for parallel-replay worker 1 is (mysql-bin.000003, 1514067).
So, by relay log position, the global checkpoint is newer than the checkpoint of worker 1, but by binlog position, the global checkpoint is older than the checkpoint of worker 1. The two coordinates contradict each other.
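The contradiction between the two coordinates can be written down as a small check. This is only a sketch for illustration, not code from the MySQL server: the types LogPos and Checkpoint and the function checkpoints_disagree are hypothetical names, and it assumes the log file names share a prefix and a zero-padded sequence number of equal width, so that plain string comparison orders them.

```cpp
#include <string>
#include <tuple>

// Hypothetical value type: one (file, offset) coordinate in a log.
// Assumption: file names such as relay-bin.000034 share a prefix and use a
// zero-padded sequence number of equal width, so string comparison orders them.
struct LogPos {
    std::string file;
    unsigned long long pos;
};

// Order coordinates by file name first, then by offset inside the file.
inline bool operator<(const LogPos &a, const LogPos &b) {
    return std::tie(a.file, a.pos) < std::tie(b.file, b.pos);
}

// Hypothetical record holding the four values shown above for
// relay-log.info and worker-relay-log.info.1.
struct Checkpoint {
    LogPos relay;   // relay log file and position
    LogPos master;  // master binlog file and position
};

// Returns true when the global checkpoint is ahead of the worker checkpoint
// by relay log position but behind it by master binlog position.
inline bool checkpoints_disagree(const Checkpoint &global,
                                 const Checkpoint &worker) {
    const bool relay_global_ahead = worker.relay < global.relay;
    const bool master_global_behind = global.master < worker.master;
    return relay_global_ahead && master_global_behind;
}
```

With the values from this report, global = {(../relaylog/relay-bin.000034, 241), (mysql-bin.000003, 4)} and worker = {(../relaylog/relay-bin.000033, 28974), (mysql-bin.000003, 1514067)}, the function returns true.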
How to repeat:
Ensure the following conditions hold:
(1) The master and slave instances are running normally, and the master has sent transactions to the slave.
(2) The IO thread and the SQL thread are running normally on the slave.
Log in to the slave and execute the following commands:
(1) stop slave io_thread.
(2) start slave io_thread. What happens after this command was split into the stages below.
I reproduced this with gdb by setting a breakpoint in the function Rotate_log_event::do_update_pos(Relay_log_info *rli).
stage1: the SQL thread replays the fake rotate event and updates the binlog position in relay-log.info to (mysql-bin.000003,4).
stage2: kill the slave mysqld process.
stage3: restart MySQL on the slave; the error is produced.
Suggested fix:
The error is produced when the function mts_recovery_groups is called:
bool mts_recovery_groups(Relay_log_info *rli)
{
........................
}
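One reading of the log output in the description is that recovery scans the relay logs forward from the global checkpoint while looking for the worker's position, and fails once it runs past the last relay log file. The toy model below illustrates only that reading. It is not the body of mts_recovery_groups; RelayFileModel and find_forward are hypothetical names, and the positions in the usage note are made up except for 1514067 and 4, which come from this report.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of one relay log file: the master binlog positions
// of the events stored in it, in order.
struct RelayFileModel {
    std::vector<unsigned long long> master_positions;
};

// Scan forward from files[start_index] looking for an event whose master
// binlog position equals target. Returns the index of the file that holds
// it, or -1 when the scan runs past the last file without finding it.
inline long find_forward(const std::vector<RelayFileModel> &files,
                         std::size_t start_index,
                         unsigned long long target) {
    for (std::size_t i = start_index; i < files.size(); ++i) {
        for (unsigned long long p : files[i].master_positions) {
            if (p == target) {
                return static_cast<long>(i);
            }
        }
    }
    return -1;  // no later file holds the target position
}
```

If the event at master position 1514067 lives in the file at index 0 (standing in for relay-bin.000033) and the scan starts at index 1 (standing in for relay-bin.000034), find_forward returns -1 no matter how many later files exist, which would match the "Error looking for file after" message.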