MySQL Bugs: #5116: Seconds_Behind_Master incorrectly reported

Bug #5116	Seconds_Behind_Master incorrectly reported
Submitted:	20 Aug 2004 0:23	Modified:	21 Sep 2004 2:52
Reporter:	Devananda v	Email Updates:
Status:	No Feedback	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	4.1.3-beta	OS:	Linux (Mandrake 9.2)
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
We have 7 pairs of servers set up with bidirectional replication (InnoDB), such that, for example, my1 replicates to my1r and my1r replicates to my1. my1r does not handle any client connections, but is completely ready to if we need or want to fail over, which we often do. What I am seeing is Seconds_Behind_Master reports as ~0 on both my1 and my1r most of the time, but sometimes my1 will report it as increasingly large, even though replication has not failed (ie, the Exec_Master_Log_Pos is correct).

Here is an example of this.

*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: db04.sjo.neopets.com
Master_User: repuser
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: db04-bin.000026
Read_Master_Log_Pos: 29800242
Relay_Log_File: my4r-relay-bin.000015
Relay_Log_Pos: 681720672
Relay_Master_Log_File: db04-bin.000026
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB: 
Replicate_Ignore_DB: 
Replicate_Do_Table: 
Replicate_Ignore_Table: 
Replicate_Wild_Do_Table: 
Replicate_Wild_Ignore_Table: 
Last_Errno: 0
Last_Error: 
Skip_Counter: 0
Exec_Master_Log_Pos: 29800242
Relay_Log_Space: 681720672
Until_Condition: None
Until_Log_File: 
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File: 
Master_SSL_CA_Path: 
Master_SSL_Cert: 
Master_SSL_Cipher: 
Master_SSL_Key: 
Seconds_Behind_Master: 0

**************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: rac04.neopets.com
Master_User: repuser
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: my4r-bin.000010
Read_Master_Log_Pos: 79
Relay_Log_File: db04-relay-bin.000001
Relay_Log_Pos: 9676
Relay_Master_Log_File: my4r-bin.000010
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB: 
Replicate_Ignore_DB: 
Replicate_Do_Table: 
Replicate_Ignore_Table: 
Replicate_Wild_Do_Table: 
Replicate_Wild_Ignore_Table: 
Last_Errno: 0
Last_Error: 
Skip_Counter: 0
Exec_Master_Log_Pos: 79
Relay_Log_Space: 9676
Until_Condition: None
Until_Log_File: 
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File: 
Master_SSL_CA_Path: 
Master_SSL_Cert: 
Master_SSL_Cipher: 
Master_SSL_Key: 
Seconds_Behind_Master: 59344

Here, rac04 is reporting that it is 59344 seconds behind db04. One possibility that was explained to me is that since there have not been any replicated events from db04 to rac04, it believes itself to be behind. I do not find this plausible, since on other servers this is not the case - no events are replicated and Seconds_Behind_Master remains 0. Also, I have not found a consistent way to reproduce this error; it seems to happen randomly to any of our servers.

How to repeat:
Unknown, sorry, but it is happening frequently.

Hi,

You may be having system time problems for some reason.  Can you enable the logging of
warnings with the msyqld startup option --log-warnings and see if you see this in
your .err file:

Warning: "SELECT UNIX_TIMESTAMP()" failed on master,
do not trust column Seconds_Behind_Master of SHOW SLAVE STATUS"

Thanks!

After looking at this again I'm not sure that this is a bug but maybe a feature request.    The 
Seconds_Behind_Master is a comparison of the the current system time and the timestamp for 
the last statement executed by replication's SQL thread.

Based on the example you presented the seconds behind master came out to about 14 hours.  Is 
it possible that no updates had happened on the master for that long?

Best Regards

We could add a feature request that adds a check to try and determine if the slave is fully caught 
up and display 0 instead of

"The number of seconds that have elapsed since the timestamp of the last master's event 
executed by the slave SQL thread. "

That is from http://dev.mysql.com/doc/mysql/en/SHOW_SLAVE_STATUS.html

>After looking at this again I'm not sure that this is a bug but maybe a
>feature request.    The Seconds_Behind_Master is a comparison of the the current system time
>and the timestamp for the last statement executed by replication's SQL thread.

>Based on the example you presented the seconds behind master came out
>to about 14 hours.  Is it possible that no updates had happened on the master for that long?

It is entirely possible, but then I have two questions regarding this. 

We have currently 7 pairs of servers each replicating A -> B and B -> A, where only A recieves client connections - therefor B is continuously processing A's binlog whereas A is indefinitely 'waiting for master (B) to send event'. Why would I see Seconds_Behind_Master increasing on only 3 of the 7 B's, when they are all (theoretically) following this model?

Also, a hypothetical question: if I set up 2 servers, C and D, and began replication from C to D only, then connected to C and ran a couple transactions and verfied that D had replicated them (at this point seconds_behind_master should be ~0) and then *left them alone for a while* and did nothing on either C or D until an hour later, would D report that it was 3600 seconds_behind_master ? If this is the case (and how mysql is intended to behave), then this shows me that we have some other sort of problem, and 4 of our pairs (the ones that are reporting seconds_behind_master=0) are somehow misbehaving and I need to investigate that instead, and then I must apologize for thinking it was your bug, and not ours :)

Thanks for your feedback!
Devananda
Neopets, Inc

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

This issue is popping up here, as well.
Setup: replication A -> B, using 5.0.18-standard (Solaris 8). Seconds_Behind_Master sometimes reaches very high values for only a short period of time.

Example: at Thu Jun  1 13:45:49 2006, the value of 'Seconds_Behind_Master' is 2449, for just one second. No queries were being executed at that time, and the latest query occurred just a few minutes before.

The system time on the servers is identical and they are located closely together, on the same network segment.

Here are the excerpts from the master and slave binlogs from around that time:

MASTER

#060601 13:42:21 server id 32  end_log_pos 125  Query   thread_id=50809 exec_time=0     error_code=0
SET TIMESTAMP=1149162141;
update LOGGER set VISITORS = (VISITORS + 1) where LOGGER_ID = 1;
# at 2145566
#060601 13:42:21 server id 32  end_log_pos 2145593      Xid = 1185303
COMMIT;
# at 2145593
#060601 13:52:59 server id 32  end_log_pos 28   Intvar
SET INSERT_ID=17;
# at 2145621
#060601 13:52:59 server id 32  end_log_pos 259  Query   thread_id=54329 exec_time=0     error_code=0
SET TIMESTAMP=1149162779;
insert into TIME_STATUS (MACHINE_ID, START_CODE, START_TIME, STATUS_TYPE_ID, MACHINE_STATUS_ID, MODE) values (327,'TM-1100', from_unixtime(1149162777),26,35,'N'
);
# at 2145852
#060601 13:52:59 server id 32  end_log_pos 2145879      Xid = 1186847
COMMIT;
# at 2145879

SLAVE

#060601 13:42:21 server id 32  end_log_pos 125  Query   thread_id=50809 exec_time=0     error_code=0
use msm;
SET TIMESTAMP=1149162141;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=1, @@session.unique_checks=1;
SET @@session.sql_mode=0;
SET @@session.character_set_client=8,@@session.collation_connection=8,@@session.collation_server=8;
update LOGGER set VISITORS = (VISITORS + 1) where LOGGER_ID = 1;
# at 16748
#060601 13:42:21 server id 32  end_log_pos 16775        Xid = 1446
COMMIT;
# at 16775
#060601 13:52:59 server id 32  end_log_pos 28   Intvar
SET INSERT_ID=17;
# at 16803
#060601 13:52:59 server id 32  end_log_pos 259  Query   thread_id=54329 exec_time=0     error_code=0
SET TIMESTAMP=1149162779;
SET @@session.time_zone='SYSTEM';
insert into TIME_STATUS (MACHINE_ID, START_CODE, START_TIME, STATUS_TYPE_ID, MACHINE_STATUS_ID, MODE) values (327,'TM-1100', from_unixtime(1149162777),26,35,'N'
);
# at 17034

I would like to add the following: Seconds_Behind_Master returns to 0 after the previously mentioned value of 2449, but 33 minutes and 42 seconds later it suddenly becomes 4472 (and then drops back to 0):

Thu Jun  1 13:45:49 2006  Seconds_Behind_Master : 2449

- 0 in between -

Thu Jun  1 14:19:31 2006  Seconds_Behind_Master : 4472