MySQL Bugs: #50296: Slave reconnects earlier than the prescribed slave_net

Bug #50296	Slave reconnects earlier than the prescribed slave_net_timeout value
Submitted:	13 Jan 2010 2:13	Modified:	7 May 2018 14:44
Reporter:	Elena Stepanova	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	5.5.1-m2	OS:	Solaris (Sparc)
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
If master_heartbeat_periodis set to 30 or higher in CHANGE MASTER query, Slave_received_heartbeats does not increase on slave.

Same happens if master_heartbeat_period is set implicitly based on slave_net_timeout.

The test case in 'How to repeat' section, the 1st 'show status' returns 1 in Slave_received_heartbeats, while the 2nd and the 3rd return 0.

Master log and error files look like the slave reconnects every 30 seconds, regardless the heartbeat value; e.g. for the provided test case, master error log shows

# this is initial start slave (called from the inc file)
4:54:27 [Note] Start binlog_dump to slave_server(2), pos(, 4)

# this is our first start slave, with master_heartbeat_period=10
4:54:28 [Note] Start binlog_dump to slave_server(2), pos(, 4)

# this is our second start slave, with master_heartbeat_period=30
4:54:39 [Note] Start binlog_dump to slave_server(2), pos(, 4)

# We did not do this, it appeared while we were in sleep 35
4:55:09 [Note] Start binlog_dump to slave_server(2), pos(master-bin.000001, 107)

# this is our third start slave, with master_heartbeat_period=40
4:55:14 [Note] Start binlog_dump to slave_server(2), pos(, 4)

# We did not do this, it appeared while we were in sleep 45
4:55:45 [Note] Start binlog_dump to slave_server(2), pos(master-bin.000001, 107)

Corresponding connects can be seen in the master log file.

I'm not sure what's the primary issue -- whether slave reconnects because it does not receive heartbeats, or it does not receive heartbeats because it reconnects every 30 seconds anyway.

So far only was only observed on Solaris Sparc.

SunOS 5.10 Generic_141444-09 sun4v sparc SUNW,SPARC-Enterprise-T5120
SunOS 5.11 snv_116 sun4v sparc SUNW,SPARC-Enterprise-T5120

Tried and could not reproduce on Linux (SUSE 10.2), Solaris 10 x86_64, Windows Server 2008.

How to repeat:
--source include/master-slave.inc

connection slave;
stop slave;
set global slave_net_timeout=80;
eval change master to master_host='127.0.0.1',master_port=$MASTER_MYPORT, master_user='root', master_heartbeat_period=10;
start slave;
--real_sleep 11
show status like '%heartbeat%';
stop slave;
eval change master to master_host='127.0.0.1',master_port=$MASTER_MYPORT, master_user='root', master_heartbeat_period=30;
start slave;
--real_sleep 35
show status like '%heartbeat%';
stop slave;
eval change master to master_host='127.0.0.1',master_port=$MASTER_MYPORT, master_user='root';
start slave;
--real_sleep 45
show status like '%heartbeat%';

Elena, what is slave_max_timeout on your slaves?

If slave_max_timeout > hb_period there will be no hb events sent by master 
because the slave reconnects earlier.

Elena, sorry: slave_net_timeout in above ^.

Andrei,

I assume you meant there will be no hb events if slave_net_timeout is *less* than hb period, not greater -- ?

slave_net_timeout is set in the test case explicitly to 80 (to show that the issue can also be observed when hb_period is not provided in the CHANGE MASTER command, but is set to a half of slave_net_timeout, and at the same time not to wait too long for the test to finish), although I experienced the problem with different values of slave_net_timeout.

Elena, yes, your assumed what i meant. Sorry for the double-confusion.
Answering `what's the primary issue -- whether slave reconnects because it does not receive heartbeats, or it does not receive heartbeats because it reconnects every 30 seconds anyway',
Slave should not try reconnecting, that is to maintain the current link, while 80 seconds of the slave_net_timeout is not elapsed and no event incl a hb event is received.
Because of reconnecting happens every 30 seconds and the hb period is to >= 30
it looks like the actual value of slave_net_timeout is 30 seconds not 80.
Could you please SELECT @@global.slave_net_timeout as well, and set the value via the server init param --slave_net_timeout=80.

Also notice that time values in 

eval change master to master_host='127.0.0.1',master_port=$MASTER_MYPORT,
master_user='root', master_heartbeat_period=10;
start slave;
--real_sleep 11
show status like '%heartbeat%';

are way too close and the hb event practically can be missed (although it can not happen every single try).

See the modified test case below.
- Now it selects @@global.slave_net_timeout before and after the flow (to make sure that the value does not get damaged);
- I increased sleep interval to ~2 times of hb_period: 70 seconds for hb_period=31;
- I removed the last part with implicit hb_period to make the test faster, although the problem is still there;
- slave_net_timeout is set on startup.

# Test output:

SELECT @@global.slave_net_timeout;
@@global.slave_net_timeout
80
stop slave;
change master to master_host='127.0.0.1',master_port=13000, master_user='root', master_heartbeat_period=10;
start slave;
# After 11 seconds sleep:
show status like '%heartbeat%';
Variable_name  Value
Slave_heartbeat_period 10.000
Slave_received_heartbeats      1
stop slave;
change master to master_host='127.0.0.1',master_port=13000, master_user='root', master_heartbeat_period=31;
start slave;
# After 70 seconds sleep:
show status like '%heartbeat%';
Variable_name  Value
Slave_heartbeat_period 31.000
Slave_received_heartbeats      0
SELECT @@global.slave_net_timeout;
@@global.slave_net_timeout
80

Master log excerpt -- please note 2 extra connects with hb_period=31, with 30 second interval.

                    6 Connect   root@localhost on
                    6 Query     SELECT UNIX_TIMESTAMP()
                    6 Query     SHOW VARIABLES LIKE 'SERVER_ID'
                    6 Query     SET @master_heartbeat_period= 10000000000
                    6 Binlog Dump       Log: ''  Pos: 4
100113 17:11:37     7 Connect   root@localhost on
                    7 Query     SELECT UNIX_TIMESTAMP()
                    7 Query     SHOW VARIABLES LIKE 'SERVER_ID'
                    7 Query     SET @master_heartbeat_period= 31000000512
                    7 Binlog Dump       Log: ''  Pos: 4
100113 17:12:07     8 Connect   root@localhost on
                    8 Query     SET NAMES latin1
                    8 Query     SELECT UNIX_TIMESTAMP()
                    8 Query     SHOW VARIABLES LIKE 'SERVER_ID'
                    8 Query     SET @master_heartbeat_period= 31000000512
                    8 Binlog Dump       Log: 'master-bin.000001'  Pos: 107
100113 17:12:37     9 Connect   root@localhost on
                    9 Query     SET NAMES latin1
                    9 Query     SELECT UNIX_TIMESTAMP()
                    9 Query     SHOW VARIABLES LIKE 'SERVER_ID'
                    9 Query     SET @master_heartbeat_period= 31000000512
                    9 Binlog Dump       Log: 'master-bin.000001'  Pos: 107

# Test case:

run as 
perl ./mysql-test-run.pl --mysqld=--slave_net_timeout=80 rpl_bug50296

> cat rpl_bug50296.test

--source include/master-slave.inc

connection slave;
SELECT @@global.slave_net_timeout;
stop slave;
eval change master to master_host='127.0.0.1',master_port=$MASTER_MYPORT, master_user='root', master_heartbeat_period=10;
start slave;
--real_sleep 11
--echo # After 11 seconds sleep:
show status like '%heartbeat%';

stop slave;
eval change master to master_host='127.0.0.1',master_port=$MASTER_MYPORT, master_user='root', master_heartbeat_period=31;
start slave;
--real_sleep 70
--echo # After 70 seconds sleep:
show status like '%heartbeat%';

SELECT @@global.slave_net_timeout;

#<EOF>

Elena, thanks for details.

I think we need to remove hb out of scope.
It's obvious the slave IO reconnects
in 30 secs regularly that is earlier than the prescribed slave_net_timeout's 80 secs.

Could you please disable the hb

eval change master to master_host='127.0.0.1',master_port=$MASTER_MYPORT,
master_user='root', master_heartbeat_period=0

and test it again.

The point of testing this time, to repeat, it that there must be no reconnecting while sleep time is less than the slave net timeout.

Thanks!

Modified test case:
- added dummy selects in master, just to indicate actual test start/end in the server log;
- set master_heartbeat_period=0 in CHANGE MASTER;
- sleep time is 70 seconds, the test is run with slave_net_timeout=80, so there should be no reconnect

run as 
perl ./mysql-test-run.pl --mysqld=--slave_net_timeout=80 rpl_bug50296_2

cat rpl_bug50296_2.test

--source include/master-slave.inc

connection master;
SELECT 'Test start', SLEEP(2);

connection slave;
SELECT @@global.slave_net_timeout;
stop slave;
eval change master to master_host='127.0.0.1',master_port=$MASTER_MYPORT, master_user='root', master_heartbeat_period=0;
start slave;
--real_sleep 70

connection master;
SELECT 'Test finish';

#<EOF>

Master log excerpt:

                    3 Query     SELECT 'Test start', SLEEP(2)
100113 18:34:40     6 Connect   root@localhost on
                    6 Query     SELECT UNIX_TIMESTAMP()
                    6 Query     SHOW VARIABLES LIKE 'SERVER_ID'
                    6 Binlog Dump       Log: ''  Pos: 4
100113 18:35:10     7 Connect   root@localhost on
                    7 Query     SET NAMES latin1
                    7 Query     SELECT UNIX_TIMESTAMP()
                    7 Query     SHOW VARIABLES LIKE 'SERVER_ID'
                    7 Binlog Dump       Log: 'master-bin.000001'  Pos: 107
100113 18:35:40     8 Connect   root@localhost on
                    8 Query     SET NAMES latin1
                    8 Query     SELECT UNIX_TIMESTAMP()
                    8 Query     SHOW VARIABLES LIKE 'SERVER_ID'
                    8 Binlog Dump       Log: 'master-bin.000001'  Pos: 107
100113 18:35:50     3 Query     SELECT 'Test finish'

Excellent!

You made this:

   Modified test case:
   - added dummy selects in master, just to indicate actual test start/end in   the server log;
   - set master_heartbeat_period=0 in CHANGE MASTER;
   - sleep time is 70 seconds, the test is run with slave_net_timeout=80, 
     so there should be no reconnect

    run as 

    perl ./mysql-test-run.pl --mysqld=--slave_net_timeout=80 rpl_bug50296_2

and the query log shows 

1. auto-reconnecting happened
2. the same 30 secs between the manual and the auto-reconnecting

100113 18:35:10     7 Connect   root@localhost on
                    7 Query     SET NAMES latin1
                    7 Query     SELECT UNIX_TIMESTAMP()
                    7 Query     SHOW VARIABLES LIKE 'SERVER_ID'
                    7 Binlog Dump       Log: 'master-bin.000001'  Pos: 107
100113 18:35:40     8 Connect   root@localhost on
                    8 Query     SET NAMES latin1
                    ...

Do I read it correct?

That's correct.

Just to sum up. 
Synopsis time hb involvement appeared not to relate to the issue.
The refined scenario proves Slave reconnects seeing no traffic from the master earlier than the prescribed slave_net_timeout value.

That's right, if master heartbeats count as 'traffic from the master', then everything I've seen so far can be generalized this way -- Andrei, thanks for the help in clearing this up. I've updated the synopsis.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/100854

2977 Andrei Elkin	2010-02-19
      Bug #50296 Slave reconnects earlier than the prescribed slave_net_timeout value
      
      There was auto-reconnecting by slave earlier than a prescribed by slave_net_timeout value.
      The issue happened on 64bit solaris that spotted rather incorrect casting of 
      the ulong slave_net_timeout into the uint of mysql.options.read_timeout.
      
      Notice, that there is no reason for slave_net_timeout to be of type of ulong.
      Since it's primarily passed as arg to mysql_options the type can be made
      as uint to avoid all conversion hassles.
      That's what the fixes are made.
      A "side" effect of the patch is the new default value for the server system variable
      to be as max of uint type.
     @ sql/mysql_priv.h
        changing type for slave_net_timeout from ulong to uint
     @ sql/mysqld.cc
        changing type for slave_net_timeout from ulong to uint
     @ sql/sys_vars.cc
        Refining the max value for slave_net_timeout to be as the max for uint type.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/100907

2978 Andrei Elkin	2010-02-19
      Bug #50296 Slave reconnects earlier than the prescribed slave_net_timeout value
      
      adding a regression test. The test cant be run reliably without making it to last over some 20 secs. There it is placed in suite/large_tests.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

http://lists.mysql.com/commits/101774

2977 Andrei Elkin 2010-02-28
Bug #50296 Slave reconnects earlier than the prescribed slave_net_timeout value

There was auto-reconnecting by slave earlier than a prescribed by slave_net_timeout value.
The issue happened on 64bit solaris that spotted rather incorrect casting of
the ulong slave_net_timeout into the uint of mysql.options.read_timeout.

Notice, that there is no reason for slave_net_timeout to be of type of ulong.
Since it's primarily passed as arg to mysql_options the type can be made
as uint to avoid all conversion hassles.
That's what the fixes are made.
A "side" effect of the patch is a new value for the max of slave_net_timeout
to be the max of the unsigned int type (therefore to vary across platforms).

Note, a regression test can't be made to run reliably without making it to last over some
20 secs. That's why it is placed in suite/large_tests.
@ mysql-test/suite/large_tests/r/rpl_slave_net_timeout.result
the new test results.
@ mysql-test/suite/large_tests/t/rpl_slave_net_timeout-slave.opt
Initialization of the option that yields slave_net_timeout's default.
@ sql/mysql_priv.h
changing type for slave_net_timeout from ulong to uint
@ sql/mysqld.cc
changing type for slave_net_timeout from ulong to uint
@ sql/sys_vars.cc
Refining the max value for slave_net_timeout to be as the max for uint type.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/101775

2978 Andrei Elkin	2010-02-28
      bug#50296
      
      a small refinement to the test

Pushed to {next-mr,6.0-codebase}-bugfixing trees.

Doc team, please pay attention to defined (seems like for the first time) by the fixes patch valid range of slave_net_timeout.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/101888

3115 Marc Alff	2010-03-01
      Bug#50296 Slave reconnects earlier than the prescribed slave_net_timeout value
      
      Fixed failed assert on 64 bit platforms, introduced by the previous fix.
      
      mysqld: sys_vars.h:125: Assertion `size == sizeof(T)' failed.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/101913

3116 Andrei Elkin	2010-03-01
      BUG#50296
      
      restoring the maximum of slave_net_timeout as defined
      #define LONG_TIMEOUT ((ulong) 3600L*24L*365L)

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/101916

3117 Andrei Elkin	2010-03-01
      BUG#50296
      
      restoring the maximum of slave_net_timeout as defined
      #define LONG_TIMEOUT ((ulong) 3600L*24L*365L)

To the docs team.

Please notice, an early patch comments
`A "side" effect of the patch is a new value for the max of slave_net_timeout
 to be the max of the unsigned int type (therefore to vary across platforms).
'
are reverted by the last commit.
The max for slave_net_timeout had been set and remains to be
#define LONG_TIMEOUT ((ulong) 3600L*24L*365L)

Pushed into 6.0.14-alpha (revid:alik@sun.com-20100306102742-yw9zzgw9ac5r65m5) (version source revid:bar@mysql.com-20100305074327-h09o5lw290s04lcf) (merge vers: 6.0.14-alpha) (pib:16)

Pushed into mysql-next-mr (revid:alik@sun.com-20100306102638-qna09hbjb5gm940h) (version source revid:alik@sun.com-20100304153932-9hajxhhyanqbckmu) (pib:16)

Pushed into 5.5.3-m3 (revid:alik@sun.com-20100306103849-hha31z2enhh7jwt3) (version source revid:alik@sun.com-20100304153932-9hajxhhyanqbckmu) (merge vers: 5.5.99-m3) (pib:16)

Documented in the 5.5.3 and 6.0.14 changelogs as follows:

      If a CHANGE MASTER TO statement set MASTER_HEARTBEAT_PERIOD to 
      30 or higher, Slave_received_heartbeats did not increase on 
      the slave. This caused the slave to reconnect before the time 
      indicated by slave_net_timeout_value had elapsed.

NOTE: I am returning this bug report to Open status and requesting further testing and a possible backport.

REASON: MASTER_HEARTBEAT_PERIOD also exists in MySQL Cluster, NDB versions 6.3+ (Slave_received_heartbeats status variable also exists in these versions). Therefore, it seems that we need to determine whether this bug also exists there as well. 

Recommend testing with recent telco-6.3/7.0/7.1 trees and backporting fix to these if the bug is present. If bug is not found in Cluster trees, then this bug report can simply be closed, as the fix is already made in documented in all (other) affected versions.

Also, please indicate who verified this bug, as the Verifier field appears to be empty.

Thanks!

Added to changelog entry:

        This issue was observed only on Solaris/SPARC.

Amended last sentence of changelog entry to read "This issue affected big-endian 64-bit platforms such as Solaris/SPARC." (as suggested by Andrei)

Andrei, I believe you've misunderstand my intention. I set this bug to Open, not Need Doc Info -- I was not requesting *information*; I was requesting a *backport*.

Reason: This is a bug in a feature which also exists in 5.1-telco, but not in 5.1-main.

Returning to Open status, since the issue I raised has not yet been addressed.

Note1: IIRC, I discussed this bug with BOcklin and he agreed this needs to go into Cluster.

I'm changing lead so it won't fall off his radar this time.

Note2: If it's decided NOT to backport this fix to Cluster, then it merely needs to be closed. If it IS backported, then it needs to go through the normal bug cycle ending with [Pushes to affected telco trees] -> [Documenting status] -> [Jon adds changelog entries] -> [Closed status].

Thanks!

Jon, thanks for explaining it to me. Let it stay in Open. 
At the cluster-re-verifying we just need to disregard former comments'

  `REASON: MASTER_HEARTBEAT_PERIOD also exists in MySQL Cluster'

and instead to pay attention to the actual problem of slave_net_timeout on 64-bit endians.

&rei

It's still valid to keep this bug open after 8 years?. Thanks.

Since it's long since been fixed in 5.5+ and was apparently never backported, there's no reason to keep this open. Closing.