MySQL Bugs: #66921: Seconds_behind_master distorted because of verbatim master's FDE in relaylog

Bug #66921	Seconds_behind_master distorted because of verbatim master's FDE in relaylog
Submitted:	21 Sep 2012 20:33	Modified:	21 May 2013 8:56
Reporter:	Domas Mituzas	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	5.1, probably others	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Master's FDE is written in relay log, with its verbatim timestamp, which is usually start of master's binary log.
That means that in relay log one event may be hours or days or minutes older than the remaining context.

Those timestamps are used to show "Seconds behind master", which is the only usable metric telling about slave health, and when SQL thread is reading master's FDE (when it is reading rotated binlog), Seconds_behind_master will be calculated against start of binlog position and not the binlog position that matches relay log position, which is obviously a bug.

How to repeat:
Replicate, observe Seconds_behind_master show incorrect values at relay log rotations. 

Suggested fix:
SQL thread either should not change 'seconds behind master' timestamp if it encounters FDE.
Alternatively, master's FDE written to relay log can be rewritten to reflect the context better (either use just relaylog timestamp, or surrounding binlog events, somehow ;-)

Example relay log (11111 is a master, 22222 is a slave):

/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#120921 13:21:42 server id 22222  end_log_pos 106 	Start: binlog v 4, server v 5.1.53-log created 120921 13:21:42
BINLOG '
VsxcUA8ZsqAHZgAAAGoAAAAAAAQANS4xLjUzX2JsYWNraG9sZV9ub21lbWNhY2hlLWxvZwAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAUwAEGggAAAAICAgC
'/*!*/;
# at 106
#120921 13:21:42 server id 4294967295  end_log_pos 151 	Rotate to binary-logs.017880  pos: 81259576
# at 151
#691231 16:00:00 server id 11111  end_log_pos 0 	Rotate to binary-logs.017880  pos: 81259576
# at 196
#120921 13:17:39 server id 11111  end_log_pos 0 	Start: binlog v 4, server v 5.1.53-log created 120921 13:17:39
BINLOG '
Y8tcUA8vVywGZgAAAAAAAAAAAAQANS4xLjUzX2JsYWNraG9sZV9ub21lbWNhY2hlLWxvZwAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAUwAEGggAAAAICAgC
'/*!*/;
# at 298
#120921 13:21:42 server id 11111  end_log_pos 81259647 	Query	thread_id=279844609	exec_time=0	error_code=0
SET TIMESTAMP=1348258902/*!*/;
SET @@session.pseudo_thread_id=279844609/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=1, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1073741824/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!\C latin1 *//*!*/;
SET @@session.character_set_client=8,@@session.collation_connection=8,@@session.collation_server=8/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN

Verified as described for 5.1. Other versions not tested yet.

I run into this bug all the time. Seconds behind master will jump from a real value (usually 0) to a very large number (9000+ seconds), and back again. It really annoys our monitoring, and the people receiving the alarms (me). It does appear to correlate very strongly with relay log rotation on the slave itself. I increased slave_net_timeout so inactivity doesn't cause rotations as often, but other processes such as backups or bulk purge transactions will trigger it as well. I really hope this gets fixed soon. Thanks for listening :)

Not repeatable on later versions, fixed in 5.6 along with BUG#52166.

This still happens:

/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#160105 14:23:18 server id 27436  end_log_pos 120 CRC32 0xca565278 	Start: binlog v 4, server v 5.6.22-log created 160105 14:23:18
BINLOG '
xsOLVg8sawAAdAAAAHgAAABAAAQANS42LjIyLWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAXAAEGggAAAAICAgCAAAACgoKGRkAAXhS
Vso=
'/*!*/;
# at 120
#160115 15:51:40 server id 29436  end_log_pos 134218413 CRC32 0x49967c04 	Rotate to mysql-bin.024718  pos: 4
# at 167
#160105 14:23:18 server id 27436  end_log_pos 216 CRC32 0xdd391511 	Rotate to mysql-relay.049495  pos: 4
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

In order to reproduce the distorted Seconds_behind_master, 
- set max_relay_log_size low so a lot of relay logs are created
- create a write heavy workload
- fetch SHOW SLAVE STATUS at least once per second
- watch seconds_behind_master, at slave rotation slave lag will be high

Version: MySQL Community 5.6.22

I also see this on 5.7.9

As the bug is closed, I created a new bug: http://bugs.mysql.com/bug.php?id=80084