Description:
I recently found the following error when dealing with a replication interruption problem:
2024-12-08T11:28:10.383094+08:00 36 [Note] Slave: MTS group recovery relay log info based on Worker-Id 0, group_relay_log_name /flash/rds/mysql-inst/14360/mysql/slave-relay.000009, group_relay_log_pos 11704 group_master_log_name mysql-bin.000005, group_master_log_pos 11491
2024-12-08T11:28:10.383221+08:00 36 [Note] Slave: MTS group recovery relay log info group_master_log_name mysql-bin.000004, event_master_log_pos 2509.
2024-12-08T11:28:10.383236+08:00 36 [Note] Slave: MTS group recovery relay log info group_master_log_name mysql-bin.000004, event_master_log_pos 4592.
......
2024-12-08T11:28:10.383376+08:00 36 [ERROR] Error looking for file after /flash/rds/mysql-inst/14360/mysql/slave-relay.000013.
I checked the slave_relay_log_info table and the slave_worker_info table:
mysql> select * from mysql.slave_relay_log_info\G
*************************** 1. row ***************************
Number_of_lines: 7
Relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000008
Relay_log_pos: 269867403
Master_log_name: mysql-bin.000004
Master_log_pos: 269867190
Sql_delay: 0
Number_of_workers: 8
Id: 1
Channel_name:
1 row in set (0.00 sec)
mysql> select * from mysql.slave_worker_info\G
*************************** 1. row ***************************
Id: 1
Relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000009
Relay_log_pos: 11704
Master_log_name: mysql-bin.000005
Master_log_pos: 11491
Checkpoint_relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000009
Checkpoint_relay_log_pos: 407
Checkpoint_master_log_name: mysql-bin.000005
Checkpoint_master_log_pos: 194
Checkpoint_seqno: 4
Checkpoint_group_size: 64
Checkpoint_group_bitmap:
Channel_name:
*************************** 2. row ***************************
Id: 2
Relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000009
Relay_log_pos: 13867
Master_log_name: mysql-bin.000005
Master_log_pos: 13654
Checkpoint_relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000009
Checkpoint_relay_log_pos: 407
Checkpoint_master_log_name: mysql-bin.000005
Checkpoint_master_log_pos: 194
Checkpoint_seqno: 5
Checkpoint_group_size: 64
Checkpoint_group_bitmap:
Channel_name:
*************************** 3. row ***************************
Id: 3
Relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000009
Relay_log_pos: 16092
Master_log_name: mysql-bin.000005
Master_log_pos: 15879
Checkpoint_relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000009
Checkpoint_relay_log_pos: 407
Checkpoint_master_log_name: mysql-bin.000005
Checkpoint_master_log_pos: 194
Checkpoint_seqno: 6
Checkpoint_group_size: 64
Checkpoint_group_bitmap: @
Channel_name:
*************************** 4. row ***************************
Id: 4
Relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000009
Relay_log_pos: 18335
Master_log_name: mysql-bin.000005
Master_log_pos: 18122
Checkpoint_relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000009
Checkpoint_relay_log_pos: 407
Checkpoint_master_log_name: mysql-bin.000005
Checkpoint_master_log_pos: 194
Checkpoint_seqno: 7
Checkpoint_group_size: 64
Checkpoint_group_bitmap: �
Channel_name:
*************************** 5. row ***************************
Id: 5
Relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000009
Relay_log_pos: 20588
Master_log_name: mysql-bin.000005
Master_log_pos: 20375
Checkpoint_relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000009
Checkpoint_relay_log_pos: 407
Checkpoint_master_log_name: mysql-bin.000005
Checkpoint_master_log_pos: 194
Checkpoint_seqno: 8
Checkpoint_group_size: 64
Checkpoint_group_bitmap:
Channel_name:
*************************** 6. row ***************************
Id: 6
Relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000008
Relay_log_pos: 269529594
Master_log_name: mysql-bin.000004
Master_log_pos: 269529381
Checkpoint_relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000008
Checkpoint_relay_log_pos: 269111706
Checkpoint_master_log_name: mysql-bin.000004
Checkpoint_master_log_pos: 269111493
Checkpoint_seqno: 64
Checkpoint_group_size: 64
Checkpoint_group_bitmap: @
Channel_name:
*************************** 7. row ***************************
Id: 7
Relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000008
Relay_log_pos: 269554431
Master_log_name: mysql-bin.000004
Master_log_pos: 269554218
Checkpoint_relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000008
Checkpoint_relay_log_pos: 269111706
Checkpoint_master_log_name: mysql-bin.000004
Checkpoint_master_log_pos: 269111493
Checkpoint_seqno: 67
Checkpoint_group_size: 64
Checkpoint_group_bitmap: �@
Channel_name:
*************************** 8. row ***************************
Id: 8
Relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000008
Relay_log_pos: 269564287
Master_log_name: mysql-bin.000004
Master_log_pos: 269564074
Checkpoint_relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000008
Checkpoint_relay_log_pos: 269111706
Checkpoint_master_log_name: mysql-bin.000004
Checkpoint_master_log_pos: 269111493
Checkpoint_seqno: 68
Checkpoint_group_size: 64
Checkpoint_group_bitmap: �
Channel_name:
8 rows in set (0.00 sec)
===============================================
We can clearly see that there was a problem with the replication recovery.
We know that in the mts_recovery_groups function, we need to obtain the gap transactions that need to be recovered based on the information in the slave_relay_log_info table and the slave_worker_info table.
In the above situation:
LOG_POS_COORD cp=
{
(char *) rli->get_group_master_log_name(),
rli->get_group_master_log_pos()
};
will be:
cp = {"mysql-bin.000004", 269867190}
And worker 1 needs to be checked because:
LOG_POS_COORD w_last= { const_cast<char*>(worker->get_group_master_log_name()),
worker->get_group_master_log_pos() };
will be
w_last = {"mysql-bin.000005", 11491}
so mts_event_coord_cmp(&w_last, &cp) > 0
But we will never be able to meet the following conditions:
(ret= mts_event_coord_cmp(&ev_coord, &w_last)) == 0
bacause mysql-bin.000004 is never equal to mysql-bin.000005!!!
Therefore, the replication recovery failed.
I think there are two reasons for the above problem:
1. The information in the slave_relay_log_info table is lagging behind the information in the slave_worker_info table because we call mta_checkpoint_routine periodically.
2.In the mts_recovery_groups function, we did not process the rotate log event, but simply ignored it.
The master error log I gave is from MySQL 5.7.44, but MySQL 8.0 also has the same problem because the logic is the same.
How to repeat:
Frequently kill replicas. In some cases, you can see the above error. For MySQL 8.0, you need to set MASTER_AUTO_POSITION to 0.
In the description, I have explained the cause of the above problem in detail. If necessary, I can provide a test case, but it will take some time.
Suggested fix:
I think adding a handle to the rotate log event in mts_recovery_groups will fix this problem. I will provide a fix later.