MySQL Bugs: #62839: race in computing Seconds_Behind

Bug #62839	race in computing Seconds_Behind_Master overstates replication lag
Submitted:	20 Oct 2011 5:51	Modified:	12 Nov 2011 4:14
Reporter:	Mark Callaghan	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	5.1	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	replication, show, slave, STATUS

Description:
Note that replication lag is something that has made many of my users unhappy with MySQL, so reporting it incorrectly is really bad PR for MySQL. I am sure other deployments feel the same way.

I am trying to figure out why a server has intermittent spikes in replication lag. In one case it goes 0 -> 45 -> 0 over a period of a few seconds. I compared consecutive SET TIMESTAMP values from the binlog and the difference is never greater than 2 seconds. 

Statements replayed during replication don't run for more than one or two seconds.

I think the problem is a race condition in this code. If last_master_timestamp is 0 when time_diff is computed and non-zero when protocol->store(..) is reached, then this returns the following as the lag:
   time(0) - mi->clock_diff_with_master
And that can explain the bogus jumps.

Now I must explain why this can happen. There are two places where it will be zero:
1) last_master_timestamp is zero before the first binlog event has been completed and possibly after a reconnect
2) last_master_timestamp is temporarily set to zero when the SQL thread catches up

So if the read of last_master_timestamp to compute time_diff is done during #1 or #2 above and then last_master_timestamp is set to a non-zero value before protocol->store(...) is called, then I think I have explained this. 

    /*
      Seconds_Behind_Master: if SQL thread is running and I/O thread is
      connected, we can compute it otherwise show NULL (i.e. unknown).
    */
    if ((mi->slave_running == MYSQL_SLAVE_RUN_CONNECT) &&
        mi->rli.slave_running)
    {
      long time_diff= ((long)(time(0) - mi->rli.last_master_timestamp)
                       - mi->clock_diff_with_master);
      /*
        Apparently on some systems time_diff can be <0. Here are possible
        reasons related to MySQL:
        - the master is itself a slave of another master whose time is ahead.
        - somebody used an explicit SET TIMESTAMP on the master.
        Possible reason related to granularity-to-second of time functions
        (nothing to do with MySQL), which can explain a value of -1:
        assume the master's and slave's time are perfectly synchronized, and
        that at slave's connection time, when the master's timestamp is read,
        it is at the very end of second 1, and (a very short time later) when
        the slave's timestamp is read it is at the very beginning of second
        2. Then the recorded value for master is 1 and the recorded value for
        slave is 2. At SHOW SLAVE STATUS time, assume that the difference
        between timestamp of slave and rli->last_master_timestamp is 0
        (i.e. they are in the same second), then we get 0-(2-1)=-1 as a result.
        This confuses users, so we don't go below 0: hence the max().

        last_master_timestamp == 0 (an "impossible" timestamp 1970) is a
        special marker to say "consider we have caught up".
      */
      protocol->store((longlong)(mi->rli.last_master_timestamp ?
                                 max(0, time_diff) : 0));

How to repeat:
Read above

Suggested fix:
assign last_master_timestamp to a local variable

Filing a bug past my bedtime might explain this, but it isn't obvious now how the race would explain the result jumping to 45. It would explain a jump to a value near the current unix timestamp.
     time(0) - mi->clock_diff_with_master

Looking at this again data_lock might prevent the race. Although the lack of docs for any of the RPL mutexes makes this code difficult to read:
  /*
    standard lock acquistion order to avoid deadlocks:
    run_lock, data_lock, relay_log.LOCK_log, relay_log.LOCK_index
  */
  pthread_mutex_t data_lock,run_lock;
and
pthread_mutex_t data_lock,run_lock;

Regardless, I still get unexplained jumps in replication lag and I think part of the problem is that SHOW SLAVE STATUS returns incorrect values sometimes.

My current theory is that this is expected behavior on a slave that has lag because it does keep up with the master and last_master_timestamp is reset to 0 while the SQL thread waits. This has confused me before and will confuse me again. I thought I added a SHOW STATUS variable to the FB patch for reporting the lag using a counter that isn't reset in that case, but I cannot find it.

This seems to be a duplicate of bug #52166.

Indeed, looks similar to that bug Mikiya mentioned. No plans to fix it in 5.1 :(

My servers aren't idle. This should be closed as it isn't a bug. It is a result of the design -- SBM is reset to zero whenever the SQL thread catches the IO thread. If that server really has lag and the IO thread can't keep up with the master then SBM flips between zero (when SQL thread catches IO thread) and the real lag.

I published an option to change mysqld to not reset SBM when the SQL thread catches the IO thread. For people with non-idle servers I think this is a good thing -- http://bazaar.launchpad.net/~mysqlatfacebook/mysqlatfacebook/5.1/revision/3753

Mark - I guess the situation would be better summed up like this:

Seconds_behind_master should be named more like Seconds_behind_last_master_event_received.  And I guess your modified version is like a Seconds_since_last_master_event_executed.

To more accurately report Seconds_behind_master you would need to somehow communicate more often with the master to determine it's actual master log position using some second in band channel or header - or a separate connection.

The design is faulty as you say, your solution prefers very active servers, the default solution prefers more idle ones.  The best solution would require more major changes.

This is a duplicate of Bug #59598