MySQL Bugs: #104334: wrong Master_log_pos in mysql.slave_relay_log

Bug #104334	wrong Master_log_pos in mysql.slave_relay_log_info cause mts recovery failed
Submitted:	16 Jul 2021 9:17	Modified:	20 Jul 2021 12:49
Reporter:	zkong kong	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	8.0.25	OS:	CentOS
Assigned to:	MySQL Verification Team	CPU Architecture:	Any

Description:
In our production server, after outof disk space and the server can't start replication again. after some investigation I found that the Master_log_pos is much smaller than it should be;
 
mysql> select * from slave_relay_log_info;
+-----------------+---------------------------+---------------+-----------------+----------------+-----------+-------------------+------+--------------+---------------------------+---------------------------+--------------------+---------------------------------+---------------------------------------------+----------------------------------------------+
| Number_of_lines | Relay_log_name            | Relay_log_pos | Master_log_name | Master_log_pos | Sql_delay | Number_of_workers | Id   | Channel_name | Privilege_checks_username | Privilege_checks_hostname | Require_row_format | Require_table_primary_key_check | Assign_gtids_to_anonymous_transactions_type | Assign_gtids_to_anonymous_transactions_value |
+-----------------+---------------------------+---------------+-----------------+----------------+-----------+-------------------+------+--------------+---------------------------+---------------------------+--------------------+---------------------------------+---------------------------------------------+----------------------------------------------+
|              14 | ./node01-relay-bin.000013 |           449 | binlog.000004   |            156 |         0 |                 4 |    1 |              | NULL                      | NULL                      |                  0 | STREAM                          | OFF                                         |                                              |
+-----------------+---------------------------+---------------+-----------------+----------------+-----------+-------------------+------+--------------+---------------------------+---------------------------+--------------------+---------------------------------+---------------------------------------------+----------------------------------------------+

After lookup the relay_log found it's the rotate log set it. 

# at 361
#210716 16:31:46 server id 0  end_log_pos 0 CRC32 0x775352ad    Rotate to binlog.000004  pos: 156
# at 405
#210716 16:31:46 server id 0  end_log_pos 0 CRC32 0x0b55c3a7    Rotate to binlog.000004  pos: 22103933

How to repeat:
set up gtid and stop and stop slave

Suggested fix:
remove the wrong pos rotate event in relay log

Hi,

> after outof disk space

You should not have allowed this to happen, MySQL Enterprise Server can monitor this for you or any other of the thousands of monitoring tools. 

MySQL does not behave nicely if you fill your disk and that will not change in near future. Random problems can happen, you can get into "impossible state" with disk full problem. 

> Master_log_pos is much smaller than it should be;

> #210716 16:31:46 server id 0  end_log_pos 0 CRC32 0x775352ad    Rotate to binlog.000004  pos: 156
> # at 405
> #210716 16:31:46 server id 0  end_log_pos 0 CRC32 0x0b55c3a7    Rotate to binlog.000004  pos: 22103933

> set up gtid and stop and stop slave

I cannot reproduce this. Can you provide a way to reproduce this problem.

Kind regards

Hi:
Reproduce steps:
1. set up a master and slave use gtid;
2. create a table on master and wait slave catch up with it.
   create table t(id int primary key, name varchar(20));
3. stop slave and write a record at master:
   insert into t values(1,'kk');
4. start slave again the check the last relay log file:
   
   # at 361
#210720 20:35:49 server id 0  end_log_pos 0 CRC32 0x86972457    Rotate to binlog.000004  pos: 156
# at 405
#210720 20:35:49 server id 0  end_log_pos 0 CRC32 0x08426073    Rotate to binlog.000004  pos: 373
# at 449
#210720 20:36:29 server id 1  end_log_pos 452 CRC32 0x5320e983  GTID    last_committed=1        sequence_number=2       rbr_only=yes    original_committed_timestamp=1626784589649812   immediate_commit_timestamp=1626784589649812     transaction_length=283
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;
# original_commit_timestamp=1626784589649812 (2021-07-20 20:36:29.649812 CST)
# immediate_commit_timestamp=1626784589649812 (2021-07-20 20:36:29.649812 CST)
/*!80001 SET @@session.original_commit_timestamp=1626784589649812*//*!*/;
/*!80014 SET @@session.original_server_version=80025*//*!*/;
/*!80014 SET @@session.immediate_server_version=80025*//*!*/;
SET @@SESSION.GTID_NEXT= '9ec76a39-e951-11eb-a9b4-3c15fb3ccbeb:2'/*!*/;
# at 528
#210720 20:36:29 server id 1  end_log_pos 527 CRC32 0x39880c3b  Query   thread_id=8     exec_time=0     error_code=0
SET TIMESTAMP=1626784589/*!*/;
SET @@session.pseudo_thread_id=8/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1168113696/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!\C latin1 *//*!*/;
SET @@session.character_set_client=8,@@session.collation_connection=8,@@session.collation_server=255/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
/*!80011 SET @@session.default_collation_for_utf8mb4=255*//*!*/;
BEGIN
/*!*/;
# at 603
#210720 20:36:29 server id 1  end_log_pos 582 CRC32 0x16f813d5  Table_map: `test`.`t` mapped to number 119
# at 658
#210720 20:36:29 server id 1  end_log_pos 625 CRC32 0xe21c38b7  Write_rows: table id 119 flags: STMT_END_F

BINLOG '
TcP2YBMBAAAANwAAAEYCAAAAAHcAAAAAAAEABHRlc3QAAXQAAgMPAlAAAgEBAAID/P8A1RP4Fg==
TcP2YB4BAAAAKwAAAHECAAAAAHcAAAAAAAEAAgAC/wABAAAAAmtrtzgc4g==
'/*!*/;
### INSERT INTO `test`.`t`
### SET
###   @1=1 /* INT meta=0 nullable=0 is_null=0 */
###   @2='kk' /* VARSTRING(80) meta=80 nullable=1 is_null=0 */

Best regards

After some debug, I found that the first previous_gtids_log_event can't be exclude by the slave executed_gtids and it's changed to the wrong rotate_event by the slave.