Bug #104334 wrong Master_log_pos in mysql.slave_relay_log_info cause mts recovery failed
Submitted: 16 Jul 2021 9:17 Modified: 20 Jul 2021 12:49
Reporter: zkong kong Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version:8.0.25 OS:CentOS
Assigned to: MySQL Verification Team CPU Architecture:Any

[16 Jul 2021 9:17] zkong kong
Description:
In our production server, after outof disk space and the server can't start replication again. after some investigation I found that the Master_log_pos is much smaller than it should be;
 
mysql> select * from slave_relay_log_info;
+-----------------+---------------------------+---------------+-----------------+----------------+-----------+-------------------+------+--------------+---------------------------+---------------------------+--------------------+---------------------------------+---------------------------------------------+----------------------------------------------+
| Number_of_lines | Relay_log_name            | Relay_log_pos | Master_log_name | Master_log_pos | Sql_delay | Number_of_workers | Id   | Channel_name | Privilege_checks_username | Privilege_checks_hostname | Require_row_format | Require_table_primary_key_check | Assign_gtids_to_anonymous_transactions_type | Assign_gtids_to_anonymous_transactions_value |
+-----------------+---------------------------+---------------+-----------------+----------------+-----------+-------------------+------+--------------+---------------------------+---------------------------+--------------------+---------------------------------+---------------------------------------------+----------------------------------------------+
|              14 | ./node01-relay-bin.000013 |           449 | binlog.000004   |            156 |         0 |                 4 |    1 |              | NULL                      | NULL                      |                  0 | STREAM                          | OFF                                         |                                              |
+-----------------+---------------------------+---------------+-----------------+----------------+-----------+-------------------+------+--------------+---------------------------+---------------------------+--------------------+---------------------------------+---------------------------------------------+----------------------------------------------+

After lookup the relay_log found it's the rotate log set it. 

# at 361
#210716 16:31:46 server id 0  end_log_pos 0 CRC32 0x775352ad    Rotate to binlog.000004  pos: 156
# at 405
#210716 16:31:46 server id 0  end_log_pos 0 CRC32 0x0b55c3a7    Rotate to binlog.000004  pos: 22103933

How to repeat:
set up gtid and stop and stop slave

Suggested fix:
remove the wrong pos rotate event in relay log
[19 Jul 2021 12:46] MySQL Verification Team
Hi,

> after outof disk space

You should not have allowed this to happen, MySQL Enterprise Server can monitor this for you or any other of the thousands of monitoring tools. 

MySQL does not behave nicely if you fill your disk and that will not change in near future. Random problems can happen, you can get into "impossible state" with disk full problem. 

> Master_log_pos is much smaller than it should be;

> #210716 16:31:46 server id 0  end_log_pos 0 CRC32 0x775352ad    Rotate to binlog.000004  pos: 156
> # at 405
> #210716 16:31:46 server id 0  end_log_pos 0 CRC32 0x0b55c3a7    Rotate to binlog.000004  pos: 22103933

> set up gtid and stop and stop slave

I cannot reproduce this. Can you provide a way to reproduce this problem.

Kind regards
[20 Jul 2021 12:43] zkong kong
Hi:
Reproduce steps:
1. set up a master and slave use gtid;
2. create a table on master and wait slave catch up with it.
   create table t(id int primary key, name varchar(20));
3. stop slave and write a record at master:
   insert into t values(1,'kk');
4. start slave again the check the last relay log file:
   
   # at 361
#210720 20:35:49 server id 0  end_log_pos 0 CRC32 0x86972457    Rotate to binlog.000004  pos: 156
# at 405
#210720 20:35:49 server id 0  end_log_pos 0 CRC32 0x08426073    Rotate to binlog.000004  pos: 373
# at 449
#210720 20:36:29 server id 1  end_log_pos 452 CRC32 0x5320e983  GTID    last_committed=1        sequence_number=2       rbr_only=yes    original_committed_timestamp=1626784589649812   immediate_commit_timestamp=1626784589649812     transaction_length=283
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;
# original_commit_timestamp=1626784589649812 (2021-07-20 20:36:29.649812 CST)
# immediate_commit_timestamp=1626784589649812 (2021-07-20 20:36:29.649812 CST)
/*!80001 SET @@session.original_commit_timestamp=1626784589649812*//*!*/;
/*!80014 SET @@session.original_server_version=80025*//*!*/;
/*!80014 SET @@session.immediate_server_version=80025*//*!*/;
SET @@SESSION.GTID_NEXT= '9ec76a39-e951-11eb-a9b4-3c15fb3ccbeb:2'/*!*/;
# at 528
#210720 20:36:29 server id 1  end_log_pos 527 CRC32 0x39880c3b  Query   thread_id=8     exec_time=0     error_code=0
SET TIMESTAMP=1626784589/*!*/;
SET @@session.pseudo_thread_id=8/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1168113696/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!\C latin1 *//*!*/;
SET @@session.character_set_client=8,@@session.collation_connection=8,@@session.collation_server=255/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
/*!80011 SET @@session.default_collation_for_utf8mb4=255*//*!*/;
BEGIN
/*!*/;
# at 603
#210720 20:36:29 server id 1  end_log_pos 582 CRC32 0x16f813d5  Table_map: `test`.`t` mapped to number 119
# at 658
#210720 20:36:29 server id 1  end_log_pos 625 CRC32 0xe21c38b7  Write_rows: table id 119 flags: STMT_END_F

BINLOG '
TcP2YBMBAAAANwAAAEYCAAAAAHcAAAAAAAEABHRlc3QAAXQAAgMPAlAAAgEBAAID/P8A1RP4Fg==
TcP2YB4BAAAAKwAAAHECAAAAAHcAAAAAAAEAAgAC/wABAAAAAmtrtzgc4g==
'/*!*/;
### INSERT INTO `test`.`t`
### SET
###   @1=1 /* INT meta=0 nullable=0 is_null=0 */
###   @2='kk' /* VARSTRING(80) meta=80 nullable=1 is_null=0 */

Best regards
[20 Jul 2021 12:49] zkong kong
After some debug, I found that the first previous_gtids_log_event can't be exclude by the slave executed_gtids and it's changed to the wrong rotate_event by the slave.