Bug #98115 seconds_behind_master is not correct
Submitted: 3 Jan 2020 2:23 Modified: 28 Jan 2020 15:34
Reporter: hailiang zhao Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.7.25 OS:CentOS (6.5)
Assigned to: MySQL Verification Team CPU Architecture:x86
Tags: MTS, Seconds_Behind_Master

[3 Jan 2020 2:23] hailiang zhao
Description:
we have a double master architecture which is circle replication and MTS ,Also we use after-sync semisync,the detail is the following:

(writer/read)         (no request )

     A        ---->
              <----     B

A and B forms a circle replication, only A node accept all application request(write and read), there is no writing and reading on B,  the "seconds_behind_master" of B is zero and is normal, However,  the "seconds_behind_master" of A is abnormal, which doesn't equal zero from time to time

How to repeat:
you can use 5.7.25, MTS and after-sync semisync to create a circle replication,for example:
(writer/read)         (no request )

     A        ---->
              <----     B

only A accept write request, some time you can find seconds_behind_master of A node is abnormal
[3 Jan 2020 7:57] hailiang zhao
additional informatioin:  use GTID ,MTS(8 thread)

we also use gdb to debug, setting break point for "apply_event_and_update_pos" function,  and when A insert a row,  The results of (*ptr_ev)->common_header->type_code are all ROTATE_EVENT

Does ROTATE_EVENT cause master's "seconds_behind_master" to be not correct?
[9 Jan 2020 2:30] Darren Zhao
How to imitate this question?

sysbench /usr/local/sysbench-master/src/lua/oltp_read_write.lua --mysql-host=localhost --mysql-port=3306 --mysql-user=root --mysql-password='root' --mysql-socket=/tmp/mysql3306.sock --mysql-db=sbtest --db-driver=mysql --tables=10 --table-size=10000000 prepare

and then you can observer A's seconds_behind_master value from time to time.
[9 Jan 2020 2:34] Darren Zhao
the following is results:
 Seconds_Behind_Master: 0
        Seconds_Behind_Master: 0
        Seconds_Behind_Master: 0
        Seconds_Behind_Master: 0
        Seconds_Behind_Master: 0
        Seconds_Behind_Master: 0
        Seconds_Behind_Master: 0
        Seconds_Behind_Master: 0
        Seconds_Behind_Master: 70
        Seconds_Behind_Master: 0
        Seconds_Behind_Master: 0
        Seconds_Behind_Master: 0
        Seconds_Behind_Master: 0
        Seconds_Behind_Master: 0
[16 Jan 2020 1:03] MySQL Verification Team
Hi,

I'm not sure this is a bug. I see the SBM value jumps from zero up (I don't see as high values as you do, so nothing as high as 70 seconds) but that's normal
[16 Jan 2020 1:04] MySQL Verification Team
Are you sure DATETIME on both servers is properly set and that you did not have some NTP events happening during your test?
[16 Jan 2020 1:28] Darren Zhao
@ Bogdan Kecman 
Are you sure DATETIME on both servers is properly set and that you did not have some NTP events happening during your test?

I'm sure NTP events are normal and  DATETIME on both two servers is properly set. we have tested for many times,and we found when binlog is rotating , SBM may come up.
[16 Jan 2020 1:35] MySQL Verification Team
Hi,

> I'm sure NTP events are normal and  DATETIME on both two servers is properly set.

perfect, as I remember a bug we chased for hours till we figured out NTP had time jumps that were messing with timestamp comparison.

> we have tested for many times,and we found when binlog is rotating , SBM may come up.

It is expected that SBM jumps when binlog is rotating as you get slowdown in execution of statements but are you sure the jump is "higher than expected"? How are you fetching the data, you are executing "show slave status" and parsing the output or? As you can also extract the statement that way and see if that's really something that happened that many seconds ago? When I'm running this test on my setup I see the SBM jump to 5-6 seconds but it's a "real value" not a bug.
[16 Jan 2020 1:48] Darren Zhao
1、How are you fetching the data, you are executing "show slave status" and parsing the output or? 
YES, we use script to fetch the data:
while true
do
    mysql -A -uroot -pxx   -e"show slave status\G" | grep 'Seconds_Behind_Master' >> delay.log
    echo `date` >> delay.log
done

2、A is master, this is a circular replication, and A hasn't any events from B,we think A should't occur delay, but some time SBM is very big data:
Slave_IO_State: Waiting for master to send event
                  Master_Host: 172.28.65.36
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 10
              Master_Log_File: mysql-bin.000168
          Read_Master_Log_Pos: 888959738
               Relay_Log_File: mysql-relay-bin.000344
                Relay_Log_Pos: 367
        Relay_Master_Log_File: mysql-bin.000168
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 888959104
              Relay_Log_Space: 621
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 7082
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 77161
                  Master_UUID: e223a4d5-222e-11ea-8b0e-08f1eaf15364
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Reading event from the relay log
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 
            Executed_Gtid_Set: cdf5562d-222e-11ea-a850-08f1eaf16af2:1-2959959331
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version:
[16 Jan 2020 17:25] MySQL Verification Team
Hi,

This does not seem normal, I have never seen this. Let me try to reproduce this again, this is not what I reproduced.
[28 Jan 2020 15:34] MySQL Verification Team
Hi,

I cannot reproduce what you are reporting no matter what I try. 

Can you reproduce this with 5.7.29 too ?

thanks
[15 Jul 2020 13:44] Buchan Milne
Seconds_Behind_Master change after enabling MTS

Attachment: MySQL_MTS_Seconds_Behind_Master.png (image/png, text), 30.81 KiB.

[15 Jul 2020 13:47] Buchan Milne
We are impacted by this issue

We have a similar setup, with circular replication.

We recently upgraded to 5.7, then enabled GTID online, then enabled MTS. We aren't currently using semi-sync.

Before enabling MTS, we didn't ever see the Seconds_Behind_Master on the writable master being above 0. Within 15m of enabling MTS, Seconds_Behind_Master  began varying between 0 and up to over 600

We have performed this migration following the same process in 4 installations, and one is in progress (GTID has been enabled, but MTS hasn't). All 4 installations on MTS exhibit this issue, and the installation on GTID but not MTS does not currently exhibit the issue.

I have attached a graph indicating the Seconds_Behind_Master p0 and p100 for one of the installations we recently upgraded, as long as the catchup-rate, indicating the improvement we got once MTS had been enabled on the circular master as well:
https://bugs.mysql.com/file.php?id=29691&bug_id=98115

Percona has also noted this issue:

https://www.percona.com/blog/2017/01/27/wrong-seconds_behind_master-with-slave_parallel_wo...

Our automation tooling that performs (among other things) failovers checks the Seconds_Behind_Master on the writable before performing a failover to the circular master, and is now often failing due to this issue.

Is it possible to fix this, or is there another alternative to ensuring that a writable master is safely replicating from a writable before performing a failover?

(We are currently running Percona-Server-5.7.27).
[1 Oct 2021 22:42] Artsiom Lipauka
I have same problem with 5.7.17. seconds_behind_master abnormal values periodically (~600000 and etc)