Bug #57918 rpl_get_master_version_and_clock times out sporadically
Submitted: 2 Nov 2010 8:54 Modified: 6 Jan 2011 4:42
Reporter: Alexander Nozdrin Email Updates:
Status: Closed Impact on me:
None 
Category:Tests: Replication Severity:S3 (Non-critical)
Version:5.6 (Dahlia, Erica) OS:Any
Assigned to: Daogang Qu CPU Architecture:Any
Tags: experimental

[2 Nov 2010 8:54] Alexander Nozdrin
Description:
rpl.rpl_get_master_version_and_clock [ fail ] timeout after 900 seconds
        Test ended at 2010-10-30 22:43:38

Test case timeout after 900 seconds

== /export/home/pb2/test/sb_1-2471220-1288466864.57/mysql-5.6.99-m5-linux-x86_64-test/mysql-test/var-n_mix/log/rpl_get_master_version_and_clock.log == 

**** SHOW BINLOG EVENTS on slave ****
SHOW BINLOG EVENTS IN 'slave-bin.000001';
Log_name	Pos	Event_type	Server_id	End_log_pos	Info
slave-bin.000001	4	Format_desc	2	109	Server ver: 5.6.99-m5-debug-log, Binlog ver: 4
slave-bin.000001	109	Query	2	176	BEGIN
slave-bin.000001	176	Query	2	402	use `mtr`; INSERT INTO test_suppressions (pattern) VALUES ( NAME_CONST('pattern',_latin1'Slave I/O: Master command COM_REGISTER_SLAVE failed: .*' COLLATE 'latin1_swedish_ci'))
slave-bin.000001	402	Query	2	470	COMMIT
slave-bin.000001	470	Query	2	537	BEGIN
slave-bin.000001	537	Query	2	816	use `mtr`; INSERT INTO test_suppressions (pattern) VALUES ( NAME_CONST('pattern',_latin1'Slave I/O: .* failed with error: Lost connection to MySQL server at \'reading initial communication packet\'' COLLATE 'latin1_swedish_ci'))
slave-bin.000001	816	Query	2	884	COMMIT
slave-bin.000001	884	Query	2	951	BEGIN
slave-bin.000001	951	Query	2	1218	use `mtr`; INSERT INTO test_suppressions (pattern) VALUES ( NAME_CONST('pattern',_latin1'Fatal error: The slave I/O thread stops because master and slave have equal MySQL server ids; .*' COLLATE 'latin1_swedish_ci'))
slave-bin.000001	1218	Query	2	1286	COMMIT
slave-bin.000001	1286	Query	2	1353	BEGIN
slave-bin.000001	1353	Query	2	1562	use `mtr`; INSERT INTO test_suppressions (pattern) VALUES ( NAME_CONST('pattern',_latin1'Slave I/O thread .* register on master' COLLATE 'latin1_swedish_ci'))
slave-bin.000001	1562	Query	2	1630	COMMIT

How to repeat:
Log: http://pb2.norway.sun.com/?action=archive_download&archive_id=2472613&pretty=please

XRef: http://pb2.norway.sun.com/?template=mysql_show_test_failure&test_failure_id=&test_output_i...
[18 Nov 2010 8:12] Alexander Nozdrin
Another symptom:
------------------------------------------------------------
rpl.rpl_get_master_version_and_clock [ fail ] Found warnings/errors in server log file!
        Test ended at 2010-11-18 00:49:46
line
101118  2:49:33 [ERROR] Error on COM_BINLOG_DUMP: 2003  Can't connect to MySQL server on '127.0.0.1' (111), will retry in 1 secs
101118  2:49:33 [ERROR] Failed on request_dump()
101118  2:49:33 [Warning] Slave I/O: Master command COM_BINLOG_DUMP failed: failed dump request, reconnecting to try again, log 'master-bin.000002' at postion 109, Error_code: 1597
^ Found warnings in /export/home2/pb2/test/sb_2-2544708-1290036552.5/mysql-5.6.1-m5-linux-x86_64-test/mysql-test/var-n_mix/log/mysqld.2.err
ok

 - saving '/export/home2/pb2/test/sb_2-2544708-1290036552.5/mysql-5.6.1-m5-linux-x86_64-test/mysql-test/var-n_mix/log/rpl.rpl_get_master_version_and_clock/' to '/export/home2/pb2/test/sb_2-2544708-1290036552.5/mysql-5.6.1-m5-linux-x86_64-test/mysql-test/var-n_mix/log/rpl.rpl_get_master_version_and_clock/'

Retrying test rpl.rpl_get_master_version_and_clock, attempt(2/3)...

rpl.rpl_get_master_version_and_clock     [ retry-pass ]  23680

Retrying test rpl.rpl_get_master_version_and_clock, attempt(3/3)...

rpl.rpl_get_master_version_and_clock     [ retry-pass ]  24193
------------------------------------------------------------

Log: http://pb2.norway.sun.com/?action=archive_download&archive_id=2546897&pretty=please

Tree: http://pb2.norway.sun.com/?template=show_pushes&branch=daily-trunk-bugfixing
[18 Nov 2010 9:03] Alexander Nozdrin
The test case has been made experimental due to this bug.
[23 Nov 2010 20:17] Alfranio Junior
Please, check if the problem is not in 
  
  . extra/rpl_tests/rpl_get_master_version_and_clock.test

and could be easily fixed by the following patch:

--- mysql-test/extra/rpl_tests/rpl_get_master_version_and_clock.test	2010-10-21 13:24:31 +0000
+++ mysql-test/extra/rpl_tests/rpl_get_master_version_and_clock.test	2010-11-23 20:01:42 +0000
@@ -60,6 +60,7 @@
 # Send shutdown to the connected server and give
 # it 10 seconds to die before zapping it
 shutdown_server 10;
+--source include/wait_until_disconnected.inc
 
 connection slave;
 --echo slave is unblocked

Any test case that crashes or shuts down a server must follow the following pattern:

--exec echo "wait" > $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
# CRASH OR SHUTDOWN
--source include/wait_until_disconnected.inc
--enable_reconnect
--exec echo "restart" > $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
--source include/wait_until_connected_again.inc
[24 Nov 2010 15:04] Bjørn Munch
Since this test restarts the server, a timeout could potentially have been caused by some test failure combined with Bug #58424. Although this only triggers if the server that's expected to die dies within 100 ms *after* the test client has failed.
[9 Dec 2010 8:39] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/126392

3427 Dao-Gang.Qu@sun.com	2010-12-09
      Bug #57918  	rpl_get_master_version_and_clock times out sporadically
      
      The reason is that sometimes the slave is unblocked before the
      master server is really shutdown.
      
      Symptom 1, found warnings/errors in server log file.
      In the situation, the 'get_master_version_and_clock' func
      will return normally inster of network error expected. Then
      the 'handle_slave_io' func will go to invoke 'get_master_uuid'
      func, which will return normally if the master server is
      still alive. Then the process will go on untill invoke
      'request_dump' func, which will print the "Error on
      COM_BINLOG_DUMP: 2003  Can't connect to MySQL server on
      '127.0.0.1' (111), will retry in 1 secs" error as symptom
      reported if the server is really shutdown in the moment.
      while the test is waiting for the network error.
      
      Symptom 2, timeout after 900 seconds
      In the situation, the 'get_master_version_and_clock',
      get_master_uuid, and 'request_dump' funcs will retrurn
      normally if the master server is alive during the period,
      then the 'handle_slave_io' func will enter
      'while (!io_slave_killed(thd,mi))' block and try to connect
      master while the master sever is really shutdown and the
      test will not have a chance to start master, because during
      the period the test will stall there to wait for a network
      error until it exits with a timeout error.
      
      To fix these problems to source wait_until_disconnected.inc
      after shutdown master server for guaranteeing the master
      server is really shutdown before the slave is unblocked.
     @ mysql-test/extra/rpl_tests/rpl_get_master_version_and_clock.test
        Update test to guarantee the master server is really
        shutdown before the slave is unblocked.
[10 Dec 2010 6:08] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/126496

3427 Dao-Gang.Qu@sun.com	2010-12-10
      Bug #57918  	rpl_get_master_version_and_clock times out sporadically
      
      The reason is that sometimes the slave is unblocked before the
      master server is really shutdown.
      
      Symptom 1, found warnings/errors in server log file.
      In the situation, the 'get_master_version_and_clock' func
      will return normally instead of network error expected.
      Then the 'handle_slave_io' func will go to invoke
      'get_master_uuid' func, which will return normally if
      the master server is still alive. Then the process will
      go on until invoke 'request_dump' func, which will print
      the "Error on COM_BINLOG_DUMP: 2003  Can't connect to
      MySQL server on '127.0.0.1' (111), will retry in 1 secs"
      error as symptom reported if the server is really shutdown
      in the moment, while the test is waiting for the network
      error.
      
      Symptom 2, timeout after 900 seconds
      In the situation, the 'get_master_version_and_clock',
      get_master_uuid, and 'request_dump' funcs will return
      normally if the master server is alive during the period,
      then the 'handle_slave_io' func will enter
      'while (!io_slave_killed(thd,mi))' block and exit with error
      when fail to reconnect master, which is really shutdown
      at that time. Then the test will not have a chance to start
      master, because during the period it will stall there to wait
      for a network error until it exits with a timeout error.
      
      
      To fix these problems, we source wait_until_disconnected.inc
      after shutdown master server for guaranteeing the master
      server is really shutdown before the slave is unblocked.
     @ mysql-test/extra/rpl_tests/rpl_get_master_version_and_clock.test
        Update test to guarantee the master server is really
        shutdown before the slave is unblocked.
[17 Dec 2010 9:56] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/127151

3443 Dao-Gang.Qu@sun.com	2010-12-17
      Bug #57918  	rpl_get_master_version_and_clock times out sporadically
      
      The reason is that sometimes the slave is unblocked before the
      master server is really shutdown.
      
      Symptom 1, found warnings/errors in server log file.
      In the situation, the 'get_master_version_and_clock' func
      will return normally instead of network error expected.
      Then the 'handle_slave_io' func will go to invoke
      'get_master_uuid' func, which will return normally if
      the master server is still alive. Then the process will
      go on until invoke 'request_dump' func, which will print
      the "Error on COM_BINLOG_DUMP: 2003  Can't connect to
      MySQL server on '127.0.0.1' (111), will retry in 1 secs"
      error as symptom reported if the server is really shutdown
      in the moment, while the test is waiting for the network
      error.
      
      Symptom 2, timeout after 900 seconds
      In the situation, the 'get_master_version_and_clock',
      get_master_uuid, and 'request_dump' funcs will return
      normally if the master server is alive during the period,
      then the 'handle_slave_io' func will enter
      'while (!io_slave_killed(thd,mi))' block and exit with error
      when fail to reconnect master, which is really shutdown
      at that time. Then the test will not have a chance to start
      master, because during the period it will stall there to wait
      for a network error until it exits with a timeout error.
      
      
      To fix these problems, we source wait_until_disconnected.inc
      after shutdown master server for guaranteeing the master
      server is really shutdown before the slave is unblocked.
     @ mysql-test/extra/rpl_tests/rpl_get_master_version_and_clock.test
        Update test to guarantee the master server is really
        shutdown before the slave is unblocked.
[17 Dec 2010 10:02] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/127152

3444 Dao-Gang.Qu@sun.com	2010-12-17
      Bug #57918  	rpl_get_master_version_and_clock times out sporadically
      
      The reason is that sometimes the slave is unblocked before the
      master server is really shutdown.
      
      Symptom 1, found warnings/errors in server log file.
      In the situation, the 'get_master_version_and_clock' func
      will return normally instead of network error expected.
      Then the 'handle_slave_io' func will go to invoke
      'get_master_uuid' func, which will return normally if
      the master server is still alive. Then the process will
      go on until invoke 'request_dump' func, which will print
      the "Error on COM_BINLOG_DUMP: 2003  Can't connect to
      MySQL server on '127.0.0.1' (111), will retry in 1 secs"
      error as symptom reported if the server is really shutdown
      in the moment, while the test is waiting for the network
      error.
      
      Symptom 2, timeout after 900 seconds
      In the situation, the 'get_master_version_and_clock',
      get_master_uuid, and 'request_dump' funcs will return
      normally if the master server is alive during the period,
      then the 'handle_slave_io' func will enter
      'while (!io_slave_killed(thd,mi))' block and exit with error
      when fail to reconnect master, which is really shutdown
      at that time. Then the test will not have a chance to start
      master, because during the period it will stall there to wait
      for a network error until it exits with a timeout error.
      
      
      To fix these problems, we source wait_until_disconnected.inc
      after shutdown master server for guaranteeing the master
      server is really shutdown before the slave is unblocked.
     @ mysql-test/extra/rpl_tests/rpl_get_master_version_and_clock.test
        Update test to guarantee the master server is really
        shutdown before the slave is unblocked.
[17 Dec 2010 12:57] Bugs System
Pushed into mysql-trunk 5.6.1 (revid:georgi.kodinov@oracle.com-20101217125013-y8pb3az32rtbplc9) (version source revid:georgi.kodinov@oracle.com-20101217112525-t71wcarh16fii9bz) (merge vers: 5.6.1) (pib:24)
[18 Dec 2010 13:11] Jon Stephens
Testing only; no changelog entry needed. Closed.
[27 Dec 2010 9:36] Anitha Gopi
Re opening since this test is still in the experimental group. See http://trollheim.norway.sun.com/archive/2712721.disabled_tests.html. If the test is passing now please remove it from collections/default.experimental
[27 Dec 2010 10:33] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/127594

3472 Dao-Gang.Qu@sun.com	2010-12-27
      Bug #57918  	rpl_get_master_version_and_clock times out sporadically
      
      Post fix
      Get rid of 'rpl_get_master_version_and_clock' entry from
      collections/default.experimental.
     @ mysql-test/collections/default.experimental
        Get rid of 'rpl_get_master_version_and_clock' entry.
[29 Dec 2010 12:52] Bugs System
Pushed into mysql-trunk 5.6.2 (revid:alexander.nozdrin@oracle.com-20101229113850-5rhksndrwz5dwcd7) (version source revid:alexander.nozdrin@oracle.com-20101229113149-c4hnmghox8obr75c) (merge vers: 5.6.2) (pib:24)
[6 Jan 2011 4:42] Jon Stephens
Testing changes only; no changelog entry needed. Closed.