Bug #101964 relay_log_space_limit + stopped sql_thread will stop fetching logs silently
Submitted: 10 Dec 2020 19:57 Modified: 11 Dec 2020 8:19
Reporter: Carlos Tutte Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version:5.7.32, 8.0.22 OS:Any
Assigned to: CPU Architecture:Any

[10 Dec 2020 19:57] Carlos Tutte
Description:
When using relay_log_space_limit , if io_thread goes faster than sql_thread and relay_log_space_limit is reached, the following message is printed on "SHOW SLAVE STATUS \G":
"Waiting for the slave SQL thread to free enough relay log space"

Problem is, if sql_thread is stopped (io_thread running), after relay_log_space_limit is reached, it will stop fetching binary logs from source but in a silent way, i.e, not saying that relay_log_space_limit was reached and there might be more binary logs to fetch from source, which are not fetched even though io_thread is running. This prevents automatic monitoring, and might cause the replica to fall out of sync if sql_thread is stopped for long, and binary logs from source are purged.

How to repeat:
On replica:
1) Add to my.cnf:
[mysqld]
relay_log_space_limit = 25000
max-relay-log-size = 8192

2) Restart replica
systemctl restart mysqld

3) on replica, only start io_thread;
stop slave;
start slave io_thread;

4) On source:
create database test;
use test;
CREATE TABLE `joinit` (
  `i` int(11) NOT NULL AUTO_INCREMENT,
  `s` varchar(64) DEFAULT NULL,
  `t` time NOT NULL,
  `g` int(11) NOT NULL,
  PRIMARY KEY (`i`)
) ENGINE=InnoDB  DEFAULT CHARSET=latin1;
INSERT INTO joinit VALUES (NULL, uuid(), time(now()),  (FLOOR( 1 + RAND( ) *60 )));
INSERT INTO joinit SELECT NULL, uuid(), time(now()),  (FLOOR( 1 + RAND( ) *60 )) FROM joinit;
INSERT INTO joinit SELECT NULL, uuid(), time(now()),  (FLOOR( 1 + RAND( ) *60 )) FROM joinit;
INSERT INTO joinit SELECT NULL, uuid(), time(now()),  (FLOOR( 1 + RAND( ) *60 )) FROM joinit;
INSERT INTO joinit SELECT NULL, uuid(), time(now()),  (FLOOR( 1 + RAND( ) *60 )) FROM joinit;
INSERT INTO joinit SELECT NULL, uuid(), time(now()),  (FLOOR( 1 + RAND( ) *60 )) FROM joinit;
INSERT INTO joinit SELECT NULL, uuid(), time(now()),  (FLOOR( 1 + RAND( ) *60 )) FROM joinit;
INSERT INTO joinit SELECT NULL, uuid(), time(now()),  (FLOOR( 1 + RAND( ) *60 )) FROM joinit;
INSERT INTO joinit SELECT NULL, uuid(), time(now()),  (FLOOR( 1 + RAND( ) *60 )) FROM joinit;
INSERT INTO joinit SELECT NULL, uuid(), time(now()),  (FLOOR( 1 + RAND( ) *60 )) FROM joinit; -- +256 rows
INSERT INTO joinit SELECT NULL, uuid(), time(now()),  (FLOOR( 1 + RAND( ) *60 )) FROM joinit; -- +512 rows
INSERT INTO joinit SELECT NULL, uuid(), time(now()),  (FLOOR( 1 + RAND( ) *60 )) FROM joinit; -- +1024 rows
INSERT INTO joinit SELECT NULL, uuid(), time(now()),  (FLOOR( 1 + RAND( ) *60 )) FROM joinit; 
INSERT INTO joinit SELECT NULL, uuid(), time(now()),  (FLOOR( 1 + RAND( ) *60 )) FROM joinit; 
INSERT INTO joinit SELECT NULL, uuid(), time(now()),  (FLOOR( 1 + RAND( ) *60 )) FROM joinit; 
INSERT INTO joinit SELECT NULL, uuid(), time(now()),  (FLOOR( 1 + RAND( ) *60 )) FROM joinit; -- +16k rows
INSERT INTO joinit SELECT NULL, uuid(), time(now()),  (FLOOR( 1 + RAND( ) *60 )) FROM joinit; 
INSERT INTO joinit SELECT NULL, uuid(), time(now()),  (FLOOR( 1 + RAND( ) *60 )) FROM joinit; 
INSERT INTO joinit SELECT NULL, uuid(), time(now()),  (FLOOR( 1 + RAND( ) *60 )) FROM joinit; 
INSERT INTO joinit SELECT NULL, uuid(), time(now()),  (FLOOR( 1 + RAND( ) *60 )) FROM joinit; 
INSERT INTO joinit SELECT NULL, uuid(), time(now()),  (FLOOR( 1 + RAND( ) *60 )) FROM joinit; 

5) After source finished execution, check replica :
show slave status \G

It will be seen that relay log file/pos reached the following:
               Relay_Log_File: carlos-tutte-replupstream80-replication-2-relay-bin.000004
                Relay_Log_Pos: 360

6) After executing "start slave'" again, and as replica catches up and applies relay log (purging prior relay log and making room for newer relay logs), the following new relay log coords will be seen:

               Relay_Log_File: carlos-tutte-replupstream80-replication-2-relay-bin.000026
                Relay_Log_Pos: 277

This means that there were more binary logs from replica that were NOT fetched until space was freed, but nowhere it mentioned it was stuck

Suggested fix:
Add a message on io_thread saying something similar to:
"Cannot fetch more binary logs from source since relay_log_space_limit has been reached"
[11 Dec 2020 8:19] MySQL Verification Team
Hello Carlos Tutte,

Thank you for the report.

regards,
Umesh