Bug #44799 rpl_backup_shutdown test fails on PB1 in pb-valgrind platform with timeout
Submitted: 11 May 2009 18:31 Modified: 12 Jun 2009 13:31
Reporter: Hema Sridharan Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version:mysql-6.0-backup OS:Any
Assigned to: Luis Soares CPU Architecture:Any

[11 May 2009 18:31] Hema Sridharan
Description:
The error that is noticed in pushbuild is a valgrind error and occurs only in platform pb-valgrind that runs valgrind.

How to repeat:
Please see this webpage for failure,

https://intranet.mysql.com/secure/pushbuild/xref.pl?testname=rpl.rpl_backup_shutdown
[11 May 2009 18:33] Hema Sridharan
Please take a look at the valgrind error that is visible in pb-valgrind platform

Attachment: valgrind_error.txt (text/plain), 16.66 KiB.

[12 May 2009 9:26] Sveta Smirnova
Thank you for the report.

Verified as described:

mysqltest: In included file "./include/wait_for_slave_param.inc": Result content mismatch

==27424== 
==27424== ERROR SUMMARY: 3322732 errors from 100 contexts (suppressed: 5 from 1)
==27424== malloc/free: in use at exit: 2,104 bytes in 2 blocks.
==27424== malloc/free: 72,323 allocs, 72,321 frees, 165,784,892 bytes allocated.
==27424== For counts of detected errors, rerun with: -v
==27424== searching for pointers to 2 not-freed blocks.
==27424== checked 2,296,104 bytes.
==27424== 
==27424== 
==27424== 2,048 bytes in 1 blocks are still reachable in loss record 2 of 2
==27424==    at 0x4A05809: malloc (vg_replace_malloc.c:149)
==27424==    by 0x4A05883: realloc (vg_replace_malloc.c:306)
==27424==    by 0x3429E06D19: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.5.so)
==27424==    by 0x4C98C8F: my_thread_global_init (my_thr_init.c:148)
==27424==    by 0x4C8FC4D: my_init (my_init.c:99)
==27424==    by 0x417542: main (mysqltest.cc:7512)
==27424== 
==27424== LEAK SUMMARY:
==27424==    definitely lost: 0 bytes in 0 blocks.
==27424==      possibly lost: 0 bytes in 0 blocks.
==27424==    still reachable: 2,048 bytes in 1 blocks.
==27424==         suppressed: 56 bytes in 1 blocks.
[14 May 2009 0:01] Luis Soares
The mentioned failure (rpl_backup_shutdown) is due to timeout in
valgrind build (test case hangs while waiting for the slave to stop
when facing a RESTORE event). This seems to be causing some extra
valgrind output. However, it is unclear why it is failing on valgrind
build only.

Aside from valgrind error/warnings the following is also found in PB
output:

--- /data0/pushbuild/pb1/pb/bzr_mysql-6.0-backup/245/mysql-6.0.11-alpha-pb245/mysql-test/suite/rpl/r/rpl_backup_shutdown.result	2009-05-12 22:04:53.000000000 +0300
+++ /data0/pushbuild/pb1/pb/bzr_mysql-6.0-backup/245/mysql-6.0.11-alpha-pb245/mysql-test/suite/rpl/r/rpl_backup_shutdown.reject	2009-05-13 10:20:34.000000000 +0300
@@ -116,31 +116,38 @@
 #
 # Slave will be stopped as result of restore incident event
 # Connecting to slave...
-# Check the slave status
+**** ERROR: timeout after  seconds while waiting for slave parameter Slave_SQL_Running = No ****
+Message: Failed while waiting for slave SQL thread to stop
+Current connection is 'slave'
+Note: the following output may have changed since the failure was detected
+

(...)

Going through the test case (rpl_backup_shutdown), the test seems to
hang at this point (at least by cross-referencing from output from
failure above):

 76 --echo # Slave will be stopped as result of restore incident event
 77 --echo # Connecting to slave...
 78 connection slave;
 79 source include/wait_for_slave_sql_to_stop.inc;

Placing an exit; just after line 79 and running the test case locally
with valgrind enabled, seems to corroborate the suspicion.
[14 May 2009 0:27] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/73986

2810 Luis Soares	2009-05-14
      BUG#44799: Some rpl tests fails in pushbuild because of bad usage of
      dbug.c
      
      The motivation for this bug report was due to fact that
      rpl_backup_shutdown was failing on PB in valgrind run. It happens that
      the test case actually times out causing extra valgrind output.
      
      This test case explicitly stops and restarts the master server. This
      is fine in all PB runs except one, the one in the valgrind host. Since
      valgrind is makes execution slower, it may happen that the master
      takes too long to restart, making the slave reconnection routine to
      give up before the master is ready again. In the case of an MTR run
      the reconnect period seems small enough for this to happen, ultimately
      causing the test case to timeout.
      
      This patch addresses this, by setting the --master-retry-count to a
      large value on the slave.
     @ mysql-test/suite/rpl/t/rpl_backup_shutdown-slave.opt
        Added slave option to reconfigure master-retry-count.
[14 May 2009 15:45] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/74081

2810 Luis Soares	2009-05-14
      BUG#44799: rpl_backup_shutdown test fails on PB1 in pb-valgrind 
      platform with timeout
      
      The motivation for this bug report was due to fact that
      rpl_backup_shutdown was failing on PB in valgrind run. It happens that
      the test case actually times out causing extra valgrind output.
            
      This test case explicitly stops and restarts the master server. This
      is fine in all PB runs except one, the one in the valgrind host. Since
      valgrind makes execution slower, it may happen that the master
      takes too long to restart, making the slave reconnection routine to
      give up before the master is ready again. In the case of an MTR run
      the reconnect period seems small enough for this to happen, ultimately
      causing the test case to timeout.
            
      This patch addresses this, by setting the --master-retry-count to a
      large value on the slave.
[14 May 2009 15:46] Luis Soares
This last commit is just to fix bug synopsis.
[14 May 2009 15:55] Luis Soares
Pushed to 6.0-backup.
[9 Jun 2009 6:41] Jørgen Løland
Merged to Azalea June 2
[12 Jun 2009 13:31] Jon Stephens
Does not appear to be any user-facing changes associated with this fix, so nothing to document for the Manual; closed without taking further action.