MySQL Bugs: #80084: Seconds_behind_master distorted because of verbatim master's FDE in relaylog

Bug #80084	Seconds_behind_master distorted because of verbatim master's FDE in relaylog
Submitted:	20 Jan 2016 19:49	Modified:	2 Jan 2020 22:30
Reporter:	Kenny Gryp	Email Updates:
Status:	No Feedback	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	5.7.9, 5.6.22	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Reopening http://bugs.mysql.com/bug.php?id=66921

Master's FDE is written in relay log, with its verbatim timestamp, which is usually start of master's binary log.
That means that in relay log one event may be hours or days or minutes older than the remaining context.

Those timestamps are used to show "Seconds behind master", which is the only usable metric telling about slave health, and when SQL thread is reading master's FDE (when it is reading rotated binlog), Seconds_behind_master will be calculated against start of binlog position and not the binlog position that matches relay log position, which is obviously a bug.

In the relay log I see this:

This still happens, check the timestamps in in the commented lines:

/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#160105 14:23:18 server id 27436  end_log_pos 120 CRC32 0xca565278 	Start: binlog v 4, server v 5.6.22-log created 160105 14:23:18
BINLOG '
xsOLVg8sawAAdAAAAHgAAABAAAQANS42LjIyLWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAXAAEGggAAAAICAgCAAAACgoKGRkAAXhS
Vso=
'/*!*/;
# at 120
#160115 15:51:40 server id 29436  end_log_pos 134218413 CRC32 0x49967c04 	Rotate to mysql-bin.024718  pos: 4
# at 167
#160105 14:23:18 server id 27436  end_log_pos 216 CRC32 0xdd391511 	Rotate to mysql-relay.049495  pos: 4
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

How to repeat:

In order to reproduce the distorted Seconds_behind_master, 
- set max_relay_log_size low so a lot of relay logs are created
- create a write heavy workload
- fetch SHOW SLAVE STATUS at least once per second
- watch seconds_behind_master, at slave rotation slave lag will be wrong and increasing over time..

Suggested fix:
SQL thread either should not change 'seconds behind master' timestamp if it encounters FDE.
Alternatively, master's FDE written to relay log can be rewritten to reflect the context better (either use just relaylog timestamp, or surrounding binlog events, somehow ;-)

Example output of per second metrics over a day.

Attachment: Asynchronous_Slave_Lag.png (image/png, text), 75.39 KiB.

As you can see in the attached image which displays SHOW SLAVE STATUS every second (this is 5.7.9). Having single second points in time where there is 3.500.00 replication lag is not normal. All this is because of the relay log timestamp.

Hello,

Would you please have a look at this bug? It's been open for quite a long time and it's not been verified.

I have various companies I work with that have this issue when we're looking at seconds_behind_master every second. It kind of breaks trending.

Thanks!

MySQL 8.0 introduces a better infrastructure in performance schema to track
how far behind in time a slave is w.r.t. its immediate master, as well as
to the original master where the transaction executed in the first place.

https://mysqlhighavailability.com/new-monitoring-replication-features-and-more/

Please see note: [19 Sep 2018 16:55] Luis Soares. Thanks.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".