Bug #52308 chained replication + different system times = seconds_behind_master is wrong
Submitted: 23 Mar 2010 16:19 Modified: 25 Oct 2018 6:40
Reporter: Sven Sandberg Email Updates:
Status: Won't fix Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.0.88, 5.1+ OS:Any
Assigned to: Assigned Account CPU Architecture:Any
Tags: chained replication, Seconds_Behind_Master

[23 Mar 2010 16:19] Sven Sandberg
Description:
There are two problems, but only one fix, so I'm reporting them as one bug.

(P1) Assume:
 - There are three servers replicating in a chain:
   master_1 -> master_2 -> master_3
 - Each server has local time, and the servers reside in Reykjavik, Stockholm,
   and Helsinki, respectively. So when master_1 has system time 9:00, then
   master_2 has system time 10:00 and master_3 has system time 11:00.

Further, assume that master_1 executes a query at 9:00, and the query is replicated to master_2 and master_3 instantly. Then:

 - The query is binlogged with time 9:00 on master_1
 - When master_3 receives the query, the query still has timestamp 9:00, but
   master_3 has system time 11:00.
 - master_3 knows that its clock is 1 hour ahead of master_2. I.e.:
   mi->clock_diff_with_master = 3600
 - master_3 takes the known difference between its own clock and the immediate
   master master_2 into account when computing seconds_behind_master. However,
   the event's timestamp is stored relative to the original master's system
   time. So master_3 will report Seconds_behind_master as:
     11:00 - 9:00 - mi->clock_diff_with_master = 1 hour.

This is wrong since master_3 is not behind master_2 at all.

(P2) mi->clock_diff_with_master is lost when the IO thread reconnects. So if the SQL thread is behind the IO thread and the user issues CHANGE MASTER to another server that has a different clock difference, then Seconds_behind_master will be wrong for the events of the relay log that originate from the first master.

A problem similar to (P1) exists in the current work-in-progress implementation of WL#344. The fix outlined below would fix the problem also for WL#344.

How to repeat:
Difficult to setup since it requires three different computers.

Suggested fix:
Actions:
(A1) Clarify if seconds_behind_master denotes:
     (A1-1) the number of seconds behind the immediate master (master_2); or
     (A1-2) the number of seconds behind the originating master (master_1).

(A2) Ensure it works consistently with the semantics defined in (A1).

I think (A1-1) makes more sense than (A1-2). (If we used (A1-2), and master_1 and master_2 would both receive updates from clients, then seconds_behind_master on master_3 would alternate between the time that master_3 is behind master_2 and the time that master_3 is behind master_1. That would be unusable.)

To understand the fix, we need to understand more of the background. Some background facts:

(F1) Every binlog event has a timestamp. The timestamp is equivalent to the
     return value of the mysql function UNIX_TIMESTAMP() on the server where
     the query was first executed. In other words, it is set to the value of
     the mysql variable @@session.timestamp if @@session.timestamp is set; or
     to the return value of the C function call time(0) at the start of the
     query otherwise.

(F2) When the binlog is copied to the slave relay log, the timestamp is
     preserved. When the slave SQL thread executes the event, it first sets the
     slave SQL thread's @@session.timestamp to the event's timestamp.

(F3) A consequence of (F1) and (F2): when the query is binlogged on the slave,
     it still has the same timestamp as it had on the master.

(F3) A consequence of (F1) and (F2): when queries like
     'INSERT INTO t1 VALUES(UNIX_TIMESTAMP())' are executed on the slave, the
     inserted value is the time that the query was executed on the originating
     master. This prevents the slave from going out of sync.

(F4) When the IO thread starts, it computes the difference in system clock
     between itself and its master (it does this by connecting to the master as
     a client and executing 'SELECT UNIX_TIMESTAMP()', and subtracts the return
     value of the C function call time(0)). The difference is stored in
     mi->clock_diff_with_master.

(F5) When the slave computes Seconds_behind_master is computed, it uses the
     last timestamp of any event retrieved from the master, and adds
     mi->clock_diff_with_master to it.

Due to (F3), we have to keep the time that the query was executed on the originating master as part of the event, at least for Query_log_events. In order to not break replication from old to new server, we even have to keep this time in the event's timestamp. But we can change the timestamp of other event types, and we can add new status variables to Query_log_events (unknown status variables are ignored on old servers).

So a suggested fix is to change the binlog format as follows:
(B1) Add a new status variable to Query_log_event, say
     Q_IMMEDIATE_MASTER_TIMESTAMP
(B2) For event types other than Query_log_event, the event's timestamp is
     modified as follows: if the event is in a binlog, it holds the time when
     the event was executed on the server that wrote the binlog. If the event
     is in a relay log, it holds the time when the event was executed on the
     immediate master, using the slave's system clock (i.e., the time
     difference between the master and the slave is taken into account in this
     timestamp).
(B3) For Query_log_events, the event's timestamp holds the time that the event
     was executed on the original server (just like before the bugfix). The
     status variable Q_IMMEDIATE_MASTER_TIMESTAMP holds the same thing that
     other events have in their timestamps.

The new format can be implemented as follows:
(I1) Add a new field to THD, say real_start_time, that holds the time that
     UNIX_TIMESTAMP() would return if @@session.timestamp was not set.

(I2) Remove Log_event::when, add the new field
     Log_event::last_executed_timestamp, and add the new field
     Query_log_event::original_timestamp.

     Query_log_event::original_timestamp holds the value that UNIX_TIMESTAMP()
     would return on the original master.

     Log_event::last_executed_timestamp holds the time when the event was
     executed on the immediate master (if the event is read from a relay log)
     or the time when the event was executed on the same server (if the event
     was created by this server or if it was read from a binlog by
     mysqlbinlog). All these times are relative to the system time of the
     *current* server.)

(I3) In Log_event::Log_event(const char *buf...), read the event's timestamp
     into Log_event::last_executed_timestamp.

(I4) In Log_event::write_header(), write THD::real_start_time as the timestamp
     of non-Query_log_events, and write Query_log_event::original_timestamp as
     the timestamp of Query_log_events.

(I5) In Query_log_event::write(IO_CACHE*), write the new status variable. It
     should have the value Log_event::last_executed_timestamp.

(I6) In Query_log_event::Query_log_event(const char *buf...):
     (1) copy Log_event::last_executed_timestamp to
         Query_log_event::original_timestamp
     (2) check if the event has the new status variable. If it has, read the
         new status variable into Log_event::last_executed_timestamp. Otherwise,
         set Log_event::last_executed_timestamp = -1

(I7) In the IO thread, when it reads a Query_log_event that has the new status
     variable, add mi->clock_diff_with_master to the new status variable before
     writing to the relay log.

(I8) In the IO thread, when it reads a non-Query_log_event, add
     mi->clock_diff_with_master to the event's timestamp before writing to the
     relay log.

(I9) In Query_log_event::do_apply_event(), set @@session.timestamp to
     Query_log_event::original_timestamp (instead of to Log_event::when).

(I10) When computing seconds_behind_master: if
      Log_event::last_executed_timestamp == -1, then the event is a
      Query_log_event originating from an old server. Return
      Query_log_event::original_timestamp + mi->clock_diff_with_master.
      Otherwise, return Log_event::last_executed_timestamp.

(I probably made some mistakes in the implementation, but I think the architecture should work).
[23 Mar 2010 16:24] Sven Sandberg
In the fix outlined above, probably Execute_load_query_log_event, that inherits from Query_log_event, needs to be treated like Query_log_event.
[29 Mar 2010 19:05] Omer Barnir
triage: setting P3 as customer issue - support will look if more are affected and will be retriaged then
[12 May 2010 6:19] Zhenxing He
I think P1 is a dup of Bug#52704
[12 May 2010 8:18] Sven Sandberg
Zhenxing, P1 happens because the event that master_3 executes has a timestamp originating from master_1 and not master_2. After the fix of BUG#52704 at http://lists.mysql.com/commits/107138 , the event still does not have the timestamp of master_2. So I think the bugs are different.