MySQL Bugs: #94847: Seconds_Behind_Master jumps to a huge number and return 0 with a idle Master

Bug #94847	Seconds_Behind_Master jumps to a huge number and return 0 with a idle Master
Submitted:	1 Apr 2019 4:04	Modified:	5 Apr 2019 1:37
Reporter:	Wang Wen'an	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Server	Severity:	S3 (Non-critical)
Version:	5.7.23	OS:	Debian (Linux 4.9.0-8-amd64 #1 SMP Debian 4.9.144-3.1 (2019-02-19) x86_64)
Assigned to:	MySQL Verification Team	CPU Architecture:	x86
Tags:	replication, Seconds_Behind_Master

Description:
Env: Linux 4.9.0-8-amd64 #1 SMP Debian 4.9.144-3.1 (2019-02-19) x86_64 
Version: MySQL 5.7.23, Master-Master structure
What we did: MySQL is updated from 5.7.12 to 5.7.23
Desc: Assume the structure: A<-->B, A is RW and B is standby. "Seconds_behind_Master" in A's slave status jumps from 0 to a huge number, and return 0 quickly very often. B is a standby server so the "Seconds_behind_Master" in A should always be 0.

What's more, the "huge number" seems like the create time of the last B's binlog in Linux.

Some slave status in A:

-----------------------------------
2019-04-01 10:34:48
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.200.102.122
                  Master_User: _rpl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: binlog.012083
          Read_Master_Log_Pos: 646485456
               Relay_Log_File: relaylog.000958
                Relay_Log_Pos: 401
        Relay_Master_Log_File: binlog.012083
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 646485425
              Relay_Log_Space: 685
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 7176
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 102122
                  Master_UUID: a4f8d90e-e463-11e7-a37c-001b21bd324e
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: a4f8d90e-e463-11e7-a37c-001b21bd324e:14995562843
            Executed_Gtid_Set: a4f8d90e-e463-11e7-a37c-001b21bd324e:1-14995562843,
a62e430c-e463-11e7-a13d-001b21bd32ce:1-1791986605
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name:
           Master_TLS_Version:

And I collect the info when SBM is greater than 5, the appeared time seems irregular:

2019-04-01 11:38:26
Read_Master_Log_Pos: 1068288908
Exec_Master_Log_Pos: 1068288877
10994
2019-04-01 11:38:38
Read_Master_Log_Pos: 1070128833
Exec_Master_Log_Pos: 1070128802
11006
2019-04-01 11:38:48
Read_Master_Log_Pos: 1071536157
Exec_Master_Log_Pos: 1071536092
11016
2019-04-01 11:39:34
Read_Master_Log_Pos: 4545396
Exec_Master_Log_Pos: 4545365
30
2019-04-01 11:40:19
Read_Master_Log_Pos: 12050257
Exec_Master_Log_Pos: 12050226
75
2019-04-01 11:40:50
Read_Master_Log_Pos: 16817092
Exec_Master_Log_Pos: 16817061
106
2019-04-01 11:40:54
Read_Master_Log_Pos: 17421637
Exec_Master_Log_Pos: 17421606
110
2019-04-01 11:40:54
Read_Master_Log_Pos: 17487823
Exec_Master_Log_Pos: 17487758
110
2019-04-01 11:41:14
Read_Master_Log_Pos: 20543107
Exec_Master_Log_Pos: 20543076
130

The SBM changes because binlog in B is rotated near 11:39.

-rw-r----- 1 mysql mysql 1.1G 4月   1 05:00 binlog.012081
-rw-r----- 1 mysql mysql 1.1G 4月   1 08:35 binlog.012082
-rw-r----- 1 mysql mysql 1.1G 4月   1 11:39 binlog.012083
-rw-r----- 1 mysql mysql 204M 4月   1 11:58 binlog.012084
-rw-r----- 1 mysql mysql 3.8K 4月   1 11:39 binlog.index

How to repeat:
None

Suggested fix:
None

Hi,

Do you have regular replication where A is master and B is slave or you have master-master circular replication?

kind regards
Bogdan

Hi,

Fot the two questions :

Do you have regular replication where A is master and B is slave?
Yes, A(Master) has Insert/Update/Delete, and B(Standby) sets log_slave_update = ON, so B(Standby) has binlog rotate without any user connections.

You have master-master circular replication?
No, these two MySQL instances keep the same UUID after udpated from 5.7.12, and erverything is OK in 5.7.12.

Hi,

if A is a master and you don't have circular replication, where does the "Seconds_behind_Master on A"  comes from? Why is "show slave status" showing anything on A ?!

I think this is not a bug but you need to contact support team to review your setup. If you think otherwise please describe properly your setup and the problem.

kind regards
Bogdan

furthermore, did you run mysql_upgrade after upgrading?

Hi,
    Maybe I misunderstand the "circular replication". Our structure is A<-->B, a Master-Master structure, both A and B are Master and Slave, but only A have Insert/Update/Delete. B is used as Standby and will accept user connections when A is down.

    mysql_upgrade was ran after upgrading correctly.

Hi,

>    Maybe I misunderstand the "circular replication". Our structure is A<-->B, a Master-Master structure, both A and B are Master and Slave

Yes, that is circular replication. 

I was not able to reproduce your problem. Did upgrade from 5.7.12 to 5.7.25, ran a circular replication with a high load on single node and have not notice slave "lagging" on that node. 

You might want to turn on full log on your "standby" server to check if there are any queries executed there.

kind regards
Bogdan

Hi,
    Load on A(Master) is low, nearly 2000 qps and 3 average load in Linux. GTID of B(Standby) is not changed, but I will check the general log later.
    What's more, I have tried to change multi-thread-replication to single-thread-replication by only setting slave_parallel_work to 0 yesterday,  the SBM looks normally. But MTR is necessary, so I changed back after several hours and DBM jumps again.