Bug #116932 Not update the slave_relay_log_info table in time causes replication recovery failure
Submitted: 11 Dec 2024 1:45 Modified: 11 Dec 2024 8:10
Reporter: karry zhang (OCA) Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version:MySQL8.0,5.7 OS:Any
Assigned to: CPU Architecture:Any

[11 Dec 2024 1:45] karry zhang
Description:
I recently found the following error when dealing with a replication interruption problem:
2024-12-08T11:28:10.383094+08:00 36 [Note] Slave: MTS group recovery relay log info based on Worker-Id 0, group_relay_log_name /flash/rds/mysql-inst/14360/mysql/slave-relay.000009, group_relay_log_pos 11704 group_master_log_name mysql-bin.000005, group_master_log_pos 11491
2024-12-08T11:28:10.383221+08:00 36 [Note] Slave: MTS group recovery relay log info group_master_log_name mysql-bin.000004, event_master_log_pos 2509.
2024-12-08T11:28:10.383236+08:00 36 [Note] Slave: MTS group recovery relay log info group_master_log_name mysql-bin.000004, event_master_log_pos 4592.
......
2024-12-08T11:28:10.383376+08:00 36 [ERROR] Error looking for file after /flash/rds/mysql-inst/14360/mysql/slave-relay.000013.

I checked the slave_relay_log_info table and the slave_worker_info table:
mysql> select * from mysql.slave_relay_log_info\G
*************************** 1. row ***************************
  Number_of_lines: 7
   Relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000008
    Relay_log_pos: 269867403
  Master_log_name: mysql-bin.000004
   Master_log_pos: 269867190
        Sql_delay: 0
Number_of_workers: 8
               Id: 1
     Channel_name:
1 row in set (0.00 sec)

mysql> select * from mysql.slave_worker_info\G
*************************** 1. row ***************************
                        Id: 1
            Relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000009
             Relay_log_pos: 11704
           Master_log_name: mysql-bin.000005
            Master_log_pos: 11491
 Checkpoint_relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000009
  Checkpoint_relay_log_pos: 407
Checkpoint_master_log_name: mysql-bin.000005
 Checkpoint_master_log_pos: 194
          Checkpoint_seqno: 4
     Checkpoint_group_size: 64
   Checkpoint_group_bitmap:
              Channel_name:
*************************** 2. row ***************************
                        Id: 2
            Relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000009
             Relay_log_pos: 13867
           Master_log_name: mysql-bin.000005
            Master_log_pos: 13654
 Checkpoint_relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000009
  Checkpoint_relay_log_pos: 407
Checkpoint_master_log_name: mysql-bin.000005
 Checkpoint_master_log_pos: 194
          Checkpoint_seqno: 5
     Checkpoint_group_size: 64
   Checkpoint_group_bitmap:
              Channel_name:
*************************** 3. row ***************************
                        Id: 3
            Relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000009
             Relay_log_pos: 16092
           Master_log_name: mysql-bin.000005
            Master_log_pos: 15879
 Checkpoint_relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000009
  Checkpoint_relay_log_pos: 407
Checkpoint_master_log_name: mysql-bin.000005
 Checkpoint_master_log_pos: 194
          Checkpoint_seqno: 6
     Checkpoint_group_size: 64
   Checkpoint_group_bitmap: @
              Channel_name:
*************************** 4. row ***************************
                        Id: 4
            Relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000009
             Relay_log_pos: 18335
           Master_log_name: mysql-bin.000005
            Master_log_pos: 18122
 Checkpoint_relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000009
  Checkpoint_relay_log_pos: 407
Checkpoint_master_log_name: mysql-bin.000005
 Checkpoint_master_log_pos: 194
          Checkpoint_seqno: 7
     Checkpoint_group_size: 64
   Checkpoint_group_bitmap: �
              Channel_name:
*************************** 5. row ***************************
                        Id: 5
            Relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000009
             Relay_log_pos: 20588
           Master_log_name: mysql-bin.000005
            Master_log_pos: 20375
 Checkpoint_relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000009
  Checkpoint_relay_log_pos: 407
Checkpoint_master_log_name: mysql-bin.000005
 Checkpoint_master_log_pos: 194
          Checkpoint_seqno: 8
     Checkpoint_group_size: 64
   Checkpoint_group_bitmap:
              Channel_name:
*************************** 6. row ***************************
                        Id: 6
            Relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000008
             Relay_log_pos: 269529594
           Master_log_name: mysql-bin.000004
            Master_log_pos: 269529381
 Checkpoint_relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000008
  Checkpoint_relay_log_pos: 269111706
Checkpoint_master_log_name: mysql-bin.000004
 Checkpoint_master_log_pos: 269111493
          Checkpoint_seqno: 64
     Checkpoint_group_size: 64
   Checkpoint_group_bitmap:   @
              Channel_name:
*************************** 7. row ***************************
                        Id: 7
            Relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000008
             Relay_log_pos: 269554431
           Master_log_name: mysql-bin.000004
            Master_log_pos: 269554218
 Checkpoint_relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000008
  Checkpoint_relay_log_pos: 269111706
Checkpoint_master_log_name: mysql-bin.000004
 Checkpoint_master_log_pos: 269111493
          Checkpoint_seqno: 67
     Checkpoint_group_size: 64
   Checkpoint_group_bitmap:   �@
              Channel_name:
*************************** 8. row ***************************
                        Id: 8
            Relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000008
             Relay_log_pos: 269564287
           Master_log_name: mysql-bin.000004
            Master_log_pos: 269564074
 Checkpoint_relay_log_name: /flash/rds/mysql-inst/14360/mysql/slave-relay.000008
  Checkpoint_relay_log_pos: 269111706
Checkpoint_master_log_name: mysql-bin.000004
 Checkpoint_master_log_pos: 269111493
          Checkpoint_seqno: 68
     Checkpoint_group_size: 64
   Checkpoint_group_bitmap:    �
              Channel_name:
8 rows in set (0.00 sec)
===============================================
We can clearly see that there was a problem with the replication recovery.

We know that in the mts_recovery_groups function, we need to obtain the gap transactions that need to be recovered based on the information in the slave_relay_log_info table and the slave_worker_info table.

In the above situation:

  LOG_POS_COORD cp=
  {
    (char *) rli->get_group_master_log_name(),
    rli->get_group_master_log_pos()
  };

will be:
cp = {"mysql-bin.000004", 269867190}

And worker 1 needs to be checked because:

LOG_POS_COORD w_last= { const_cast<char*>(worker->get_group_master_log_name()),
                            worker->get_group_master_log_pos() };
will be

w_last = {"mysql-bin.000005", 11491}

so mts_event_coord_cmp(&w_last, &cp) > 0

But we will never be able to meet the following conditions:

(ret= mts_event_coord_cmp(&ev_coord, &w_last)) == 0

bacause mysql-bin.000004 is never equal to mysql-bin.000005!!!

Therefore, the replication recovery failed.

I think there are two reasons for the above problem:

1. The information in the slave_relay_log_info table is lagging behind the information in the slave_worker_info table because we call mta_checkpoint_routine periodically.

2.In the mts_recovery_groups function, we did not process the rotate log event, but simply ignored it.

The master error log I gave is from MySQL 5.7.44, but MySQL 8.0 also has the same problem because the logic is the same.

How to repeat:
Frequently kill replicas. In some cases, you can see the above error. For MySQL 8.0, you need to set MASTER_AUTO_POSITION to 0.
In the description, I have explained the cause of the above problem in detail. If necessary, I can provide a test case, but it will take some time.

Suggested fix:
I think adding a handle to the rotate log event in mts_recovery_groups will fix this problem. I will provide a fix later.
[11 Dec 2024 8:10] MySQL Verification Team
Thank you for the report.
Having a test case and a proposed fix would be great.