Bug #50296 | Slave reconnects earlier than the prescribed slave_net_timeout value | ||
---|---|---|---|
Submitted: | 13 Jan 2010 2:13 | Modified: | 7 May 2018 14:44 |
Reporter: | Elena Stepanova | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S3 (Non-critical) |
Version: | 5.5.1-m2 | OS: | Solaris (Sparc) |
Assigned to: | Assigned Account | CPU Architecture: | Any |
[13 Jan 2010 2:13]
Elena Stepanova
[13 Jan 2010 11:49]
Andrei Elkin
Elena, what is slave_max_timeout on your slaves? If slave_max_timeout > hb_period there will be no hb events sent by master because the slave reconnects earlier.
[13 Jan 2010 11:52]
Andrei Elkin
Elena, sorry: slave_net_timeout in above ^.
[13 Jan 2010 12:08]
Elena Stepanova
Andrei, I assume you meant there will be no hb events if slave_net_timeout is *less* than hb period, not greater -- ? slave_net_timeout is set in the test case explicitly to 80 (to show that the issue can also be observed when hb_period is not provided in the CHANGE MASTER command, but is set to a half of slave_net_timeout, and at the same time not to wait too long for the test to finish), although I experienced the problem with different values of slave_net_timeout.
[13 Jan 2010 13:32]
Andrei Elkin
Elena, yes, your assumed what i meant. Sorry for the double-confusion. Answering `what's the primary issue -- whether slave reconnects because it does not receive heartbeats, or it does not receive heartbeats because it reconnects every 30 seconds anyway', Slave should not try reconnecting, that is to maintain the current link, while 80 seconds of the slave_net_timeout is not elapsed and no event incl a hb event is received. Because of reconnecting happens every 30 seconds and the hb period is to >= 30 it looks like the actual value of slave_net_timeout is 30 seconds not 80. Could you please SELECT @@global.slave_net_timeout as well, and set the value via the server init param --slave_net_timeout=80. Also notice that time values in eval change master to master_host='127.0.0.1',master_port=$MASTER_MYPORT, master_user='root', master_heartbeat_period=10; start slave; --real_sleep 11 show status like '%heartbeat%'; are way too close and the hb event practically can be missed (although it can not happen every single try).
[13 Jan 2010 14:20]
Elena Stepanova
See the modified test case below. - Now it selects @@global.slave_net_timeout before and after the flow (to make sure that the value does not get damaged); - I increased sleep interval to ~2 times of hb_period: 70 seconds for hb_period=31; - I removed the last part with implicit hb_period to make the test faster, although the problem is still there; - slave_net_timeout is set on startup. # Test output: SELECT @@global.slave_net_timeout; @@global.slave_net_timeout 80 stop slave; change master to master_host='127.0.0.1',master_port=13000, master_user='root', master_heartbeat_period=10; start slave; # After 11 seconds sleep: show status like '%heartbeat%'; Variable_name Value Slave_heartbeat_period 10.000 Slave_received_heartbeats 1 stop slave; change master to master_host='127.0.0.1',master_port=13000, master_user='root', master_heartbeat_period=31; start slave; # After 70 seconds sleep: show status like '%heartbeat%'; Variable_name Value Slave_heartbeat_period 31.000 Slave_received_heartbeats 0 SELECT @@global.slave_net_timeout; @@global.slave_net_timeout 80 Master log excerpt -- please note 2 extra connects with hb_period=31, with 30 second interval. 6 Connect root@localhost on 6 Query SELECT UNIX_TIMESTAMP() 6 Query SHOW VARIABLES LIKE 'SERVER_ID' 6 Query SET @master_heartbeat_period= 10000000000 6 Binlog Dump Log: '' Pos: 4 100113 17:11:37 7 Connect root@localhost on 7 Query SELECT UNIX_TIMESTAMP() 7 Query SHOW VARIABLES LIKE 'SERVER_ID' 7 Query SET @master_heartbeat_period= 31000000512 7 Binlog Dump Log: '' Pos: 4 100113 17:12:07 8 Connect root@localhost on 8 Query SET NAMES latin1 8 Query SELECT UNIX_TIMESTAMP() 8 Query SHOW VARIABLES LIKE 'SERVER_ID' 8 Query SET @master_heartbeat_period= 31000000512 8 Binlog Dump Log: 'master-bin.000001' Pos: 107 100113 17:12:37 9 Connect root@localhost on 9 Query SET NAMES latin1 9 Query SELECT UNIX_TIMESTAMP() 9 Query SHOW VARIABLES LIKE 'SERVER_ID' 9 Query SET @master_heartbeat_period= 31000000512 9 Binlog Dump Log: 'master-bin.000001' Pos: 107 # Test case: run as perl ./mysql-test-run.pl --mysqld=--slave_net_timeout=80 rpl_bug50296 > cat rpl_bug50296.test --source include/master-slave.inc connection slave; SELECT @@global.slave_net_timeout; stop slave; eval change master to master_host='127.0.0.1',master_port=$MASTER_MYPORT, master_user='root', master_heartbeat_period=10; start slave; --real_sleep 11 --echo # After 11 seconds sleep: show status like '%heartbeat%'; stop slave; eval change master to master_host='127.0.0.1',master_port=$MASTER_MYPORT, master_user='root', master_heartbeat_period=31; start slave; --real_sleep 70 --echo # After 70 seconds sleep: show status like '%heartbeat%'; SELECT @@global.slave_net_timeout; #<EOF>
[13 Jan 2010 15:02]
Andrei Elkin
Elena, thanks for details. I think we need to remove hb out of scope. It's obvious the slave IO reconnects in 30 secs regularly that is earlier than the prescribed slave_net_timeout's 80 secs. Could you please disable the hb eval change master to master_host='127.0.0.1',master_port=$MASTER_MYPORT, master_user='root', master_heartbeat_period=0 and test it again. The point of testing this time, to repeat, it that there must be no reconnecting while sleep time is less than the slave net timeout. Thanks!
[13 Jan 2010 15:42]
Elena Stepanova
Modified test case: - added dummy selects in master, just to indicate actual test start/end in the server log; - set master_heartbeat_period=0 in CHANGE MASTER; - sleep time is 70 seconds, the test is run with slave_net_timeout=80, so there should be no reconnect run as perl ./mysql-test-run.pl --mysqld=--slave_net_timeout=80 rpl_bug50296_2 cat rpl_bug50296_2.test --source include/master-slave.inc connection master; SELECT 'Test start', SLEEP(2); connection slave; SELECT @@global.slave_net_timeout; stop slave; eval change master to master_host='127.0.0.1',master_port=$MASTER_MYPORT, master_user='root', master_heartbeat_period=0; start slave; --real_sleep 70 connection master; SELECT 'Test finish'; #<EOF> Master log excerpt: 3 Query SELECT 'Test start', SLEEP(2) 100113 18:34:40 6 Connect root@localhost on 6 Query SELECT UNIX_TIMESTAMP() 6 Query SHOW VARIABLES LIKE 'SERVER_ID' 6 Binlog Dump Log: '' Pos: 4 100113 18:35:10 7 Connect root@localhost on 7 Query SET NAMES latin1 7 Query SELECT UNIX_TIMESTAMP() 7 Query SHOW VARIABLES LIKE 'SERVER_ID' 7 Binlog Dump Log: 'master-bin.000001' Pos: 107 100113 18:35:40 8 Connect root@localhost on 8 Query SET NAMES latin1 8 Query SELECT UNIX_TIMESTAMP() 8 Query SHOW VARIABLES LIKE 'SERVER_ID' 8 Binlog Dump Log: 'master-bin.000001' Pos: 107 100113 18:35:50 3 Query SELECT 'Test finish'
[13 Jan 2010 15:59]
Andrei Elkin
Excellent! You made this: Modified test case: - added dummy selects in master, just to indicate actual test start/end in the server log; - set master_heartbeat_period=0 in CHANGE MASTER; - sleep time is 70 seconds, the test is run with slave_net_timeout=80, so there should be no reconnect run as perl ./mysql-test-run.pl --mysqld=--slave_net_timeout=80 rpl_bug50296_2 and the query log shows 1. auto-reconnecting happened 2. the same 30 secs between the manual and the auto-reconnecting 100113 18:35:10 7 Connect root@localhost on 7 Query SET NAMES latin1 7 Query SELECT UNIX_TIMESTAMP() 7 Query SHOW VARIABLES LIKE 'SERVER_ID' 7 Binlog Dump Log: 'master-bin.000001' Pos: 107 100113 18:35:40 8 Connect root@localhost on 8 Query SET NAMES latin1 ... Do I read it correct?
[13 Jan 2010 16:22]
Elena Stepanova
That's correct.
[13 Jan 2010 18:16]
Andrei Elkin
Just to sum up. Synopsis time hb involvement appeared not to relate to the issue. The refined scenario proves Slave reconnects seeing no traffic from the master earlier than the prescribed slave_net_timeout value.
[13 Jan 2010 18:46]
Elena Stepanova
That's right, if master heartbeats count as 'traffic from the master', then everything I've seen so far can be generalized this way -- Andrei, thanks for the help in clearing this up. I've updated the synopsis.
[19 Feb 2010 11:43]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/100854 2977 Andrei Elkin 2010-02-19 Bug #50296 Slave reconnects earlier than the prescribed slave_net_timeout value There was auto-reconnecting by slave earlier than a prescribed by slave_net_timeout value. The issue happened on 64bit solaris that spotted rather incorrect casting of the ulong slave_net_timeout into the uint of mysql.options.read_timeout. Notice, that there is no reason for slave_net_timeout to be of type of ulong. Since it's primarily passed as arg to mysql_options the type can be made as uint to avoid all conversion hassles. That's what the fixes are made. A "side" effect of the patch is the new default value for the server system variable to be as max of uint type. @ sql/mysql_priv.h changing type for slave_net_timeout from ulong to uint @ sql/mysqld.cc changing type for slave_net_timeout from ulong to uint @ sql/sys_vars.cc Refining the max value for slave_net_timeout to be as the max for uint type.
[19 Feb 2010 17:52]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/100907 2978 Andrei Elkin 2010-02-19 Bug #50296 Slave reconnects earlier than the prescribed slave_net_timeout value adding a regression test. The test cant be run reliably without making it to last over some 20 secs. There it is placed in suite/large_tests.
[28 Feb 2010 17:32]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/101774 2977 Andrei Elkin 2010-02-28 Bug #50296 Slave reconnects earlier than the prescribed slave_net_timeout value There was auto-reconnecting by slave earlier than a prescribed by slave_net_timeout value. The issue happened on 64bit solaris that spotted rather incorrect casting of the ulong slave_net_timeout into the uint of mysql.options.read_timeout. Notice, that there is no reason for slave_net_timeout to be of type of ulong. Since it's primarily passed as arg to mysql_options the type can be made as uint to avoid all conversion hassles. That's what the fixes are made. A "side" effect of the patch is a new value for the max of slave_net_timeout to be the max of the unsigned int type (therefore to vary across platforms). Note, a regression test can't be made to run reliably without making it to last over some 20 secs. That's why it is placed in suite/large_tests. @ mysql-test/suite/large_tests/r/rpl_slave_net_timeout.result the new test results. @ mysql-test/suite/large_tests/t/rpl_slave_net_timeout-slave.opt Initialization of the option that yields slave_net_timeout's default. @ sql/mysql_priv.h changing type for slave_net_timeout from ulong to uint @ sql/mysqld.cc changing type for slave_net_timeout from ulong to uint @ sql/sys_vars.cc Refining the max value for slave_net_timeout to be as the max for uint type.
[28 Feb 2010 17:39]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/101775 2978 Andrei Elkin 2010-02-28 bug#50296 a small refinement to the test
[28 Feb 2010 17:58]
Andrei Elkin
Pushed to {next-mr,6.0-codebase}-bugfixing trees. Doc team, please pay attention to defined (seems like for the first time) by the fixes patch valid range of slave_net_timeout.
[1 Mar 2010 14:33]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/101888 3115 Marc Alff 2010-03-01 Bug#50296 Slave reconnects earlier than the prescribed slave_net_timeout value Fixed failed assert on 64 bit platforms, introduced by the previous fix. mysqld: sys_vars.h:125: Assertion `size == sizeof(T)' failed.
[1 Mar 2010 16:47]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/101913 3116 Andrei Elkin 2010-03-01 BUG#50296 restoring the maximum of slave_net_timeout as defined #define LONG_TIMEOUT ((ulong) 3600L*24L*365L)
[1 Mar 2010 16:49]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/101916 3117 Andrei Elkin 2010-03-01 BUG#50296 restoring the maximum of slave_net_timeout as defined #define LONG_TIMEOUT ((ulong) 3600L*24L*365L)
[3 Mar 2010 14:54]
Andrei Elkin
To the docs team. Please notice, an early patch comments `A "side" effect of the patch is a new value for the max of slave_net_timeout to be the max of the unsigned int type (therefore to vary across platforms). ' are reverted by the last commit. The max for slave_net_timeout had been set and remains to be #define LONG_TIMEOUT ((ulong) 3600L*24L*365L)
[6 Mar 2010 10:30]
Bugs System
Pushed into 6.0.14-alpha (revid:alik@sun.com-20100306102742-yw9zzgw9ac5r65m5) (version source revid:bar@mysql.com-20100305074327-h09o5lw290s04lcf) (merge vers: 6.0.14-alpha) (pib:16)
[6 Mar 2010 10:31]
Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100306102638-qna09hbjb5gm940h) (version source revid:alik@sun.com-20100304153932-9hajxhhyanqbckmu) (pib:16)
[6 Mar 2010 10:59]
Bugs System
Pushed into 5.5.3-m3 (revid:alik@sun.com-20100306103849-hha31z2enhh7jwt3) (version source revid:alik@sun.com-20100304153932-9hajxhhyanqbckmu) (merge vers: 5.5.99-m3) (pib:16)
[8 Mar 2010 20:22]
Jon Stephens
Documented in the 5.5.3 and 6.0.14 changelogs as follows: If a CHANGE MASTER TO statement set MASTER_HEARTBEAT_PERIOD to 30 or higher, Slave_received_heartbeats did not increase on the slave. This caused the slave to reconnect before the time indicated by slave_net_timeout_value had elapsed. NOTE: I am returning this bug report to Open status and requesting further testing and a possible backport. REASON: MASTER_HEARTBEAT_PERIOD also exists in MySQL Cluster, NDB versions 6.3+ (Slave_received_heartbeats status variable also exists in these versions). Therefore, it seems that we need to determine whether this bug also exists there as well. Recommend testing with recent telco-6.3/7.0/7.1 trees and backporting fix to these if the bug is present. If bug is not found in Cluster trees, then this bug report can simply be closed, as the fix is already made in documented in all (other) affected versions. Also, please indicate who verified this bug, as the Verifier field appears to be empty. Thanks!
[9 Mar 2010 8:40]
Jon Stephens
Added to changelog entry: This issue was observed only on Solaris/SPARC.
[9 Mar 2010 9:06]
Jon Stephens
Amended last sentence of changelog entry to read "This issue affected big-endian 64-bit platforms such as Solaris/SPARC." (as suggested by Andrei)
[15 Mar 2010 5:14]
Jon Stephens
Andrei, I believe you've misunderstand my intention. I set this bug to Open, not Need Doc Info -- I was not requesting *information*; I was requesting a *backport*. Reason: This is a bug in a feature which also exists in 5.1-telco, but not in 5.1-main. Returning to Open status, since the issue I raised has not yet been addressed. Note1: IIRC, I discussed this bug with BOcklin and he agreed this needs to go into Cluster. I'm changing lead so it won't fall off his radar this time. Note2: If it's decided NOT to backport this fix to Cluster, then it merely needs to be closed. If it IS backported, then it needs to go through the normal bug cycle ending with [Pushes to affected telco trees] -> [Documenting status] -> [Jon adds changelog entries] -> [Closed status]. Thanks!
[22 Mar 2010 13:20]
Andrei Elkin
Jon, thanks for explaining it to me. Let it stay in Open. At the cluster-re-verifying we just need to disregard former comments' `REASON: MASTER_HEARTBEAT_PERIOD also exists in MySQL Cluster' and instead to pay attention to the actual problem of slave_net_timeout on 64-bit endians. &rei
[30 Apr 2018 11:43]
MySQL Verification Team
It's still valid to keep this bug open after 8 years?. Thanks.
[7 May 2018 14:44]
Jon Stephens
Since it's long since been fixed in 5.5+ and was apparently never backported, there's no reason to keep this open. Closing.