MySQL Bugs: #44799: rpl_backup_shutdown test fails on PB1 in pb-valgrind platform with timeout

Bug #44799	rpl_backup_shutdown test fails on PB1 in pb-valgrind platform with timeout
Submitted:	11 May 2009 18:31	Modified:	12 Jun 2009 13:31
Reporter:	Hema Sridharan	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	mysql-6.0-backup	OS:	Any
Assigned to:	Luis Soares	CPU Architecture:	Any

Description:
The error that is noticed in pushbuild is a valgrind error and occurs only in platform pb-valgrind that runs valgrind.

How to repeat:
Please see this webpage for failure,

https://intranet.mysql.com/secure/pushbuild/xref.pl?testname=rpl.rpl_backup_shutdown

Please take a look at the valgrind error that is visible in pb-valgrind platform

Attachment: valgrind_error.txt (text/plain), 16.66 KiB.

Thank you for the report.

Verified as described:

mysqltest: In included file "./include/wait_for_slave_param.inc": Result content mismatch

==27424== 
==27424== ERROR SUMMARY: 3322732 errors from 100 contexts (suppressed: 5 from 1)
==27424== malloc/free: in use at exit: 2,104 bytes in 2 blocks.
==27424== malloc/free: 72,323 allocs, 72,321 frees, 165,784,892 bytes allocated.
==27424== For counts of detected errors, rerun with: -v
==27424== searching for pointers to 2 not-freed blocks.
==27424== checked 2,296,104 bytes.
==27424== 
==27424== 
==27424== 2,048 bytes in 1 blocks are still reachable in loss record 2 of 2
==27424==    at 0x4A05809: malloc (vg_replace_malloc.c:149)
==27424==    by 0x4A05883: realloc (vg_replace_malloc.c:306)
==27424==    by 0x3429E06D19: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.5.so)
==27424==    by 0x4C98C8F: my_thread_global_init (my_thr_init.c:148)
==27424==    by 0x4C8FC4D: my_init (my_init.c:99)
==27424==    by 0x417542: main (mysqltest.cc:7512)
==27424== 
==27424== LEAK SUMMARY:
==27424==    definitely lost: 0 bytes in 0 blocks.
==27424==      possibly lost: 0 bytes in 0 blocks.
==27424==    still reachable: 2,048 bytes in 1 blocks.
==27424==         suppressed: 56 bytes in 1 blocks.

The mentioned failure (rpl_backup_shutdown) is due to timeout in
valgrind build (test case hangs while waiting for the slave to stop
when facing a RESTORE event). This seems to be causing some extra
valgrind output. However, it is unclear why it is failing on valgrind
build only.

Aside from valgrind error/warnings the following is also found in PB
output:

--- /data0/pushbuild/pb1/pb/bzr_mysql-6.0-backup/245/mysql-6.0.11-alpha-pb245/mysql-test/suite/rpl/r/rpl_backup_shutdown.result	2009-05-12 22:04:53.000000000 +0300
+++ /data0/pushbuild/pb1/pb/bzr_mysql-6.0-backup/245/mysql-6.0.11-alpha-pb245/mysql-test/suite/rpl/r/rpl_backup_shutdown.reject	2009-05-13 10:20:34.000000000 +0300
@@ -116,31 +116,38 @@
 #
 # Slave will be stopped as result of restore incident event
 # Connecting to slave...
-# Check the slave status
+**** ERROR: timeout after  seconds while waiting for slave parameter Slave_SQL_Running = No ****
+Message: Failed while waiting for slave SQL thread to stop
+Current connection is 'slave'
+Note: the following output may have changed since the failure was detected
+

(...)

Going through the test case (rpl_backup_shutdown), the test seems to
hang at this point (at least by cross-referencing from output from
failure above):

 76 --echo # Slave will be stopped as result of restore incident event
 77 --echo # Connecting to slave...
 78 connection slave;
 79 source include/wait_for_slave_sql_to_stop.inc;

Placing an exit; just after line 79 and running the test case locally
with valgrind enabled, seems to corroborate the suspicion.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/73986

2810 Luis Soares	2009-05-14
      BUG#44799: Some rpl tests fails in pushbuild because of bad usage of
      dbug.c
      
      The motivation for this bug report was due to fact that
      rpl_backup_shutdown was failing on PB in valgrind run. It happens that
      the test case actually times out causing extra valgrind output.
      
      This test case explicitly stops and restarts the master server. This
      is fine in all PB runs except one, the one in the valgrind host. Since
      valgrind is makes execution slower, it may happen that the master
      takes too long to restart, making the slave reconnection routine to
      give up before the master is ready again. In the case of an MTR run
      the reconnect period seems small enough for this to happen, ultimately
      causing the test case to timeout.
      
      This patch addresses this, by setting the --master-retry-count to a
      large value on the slave.
     @ mysql-test/suite/rpl/t/rpl_backup_shutdown-slave.opt
        Added slave option to reconfigure master-retry-count.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/74081

2810 Luis Soares	2009-05-14
      BUG#44799: rpl_backup_shutdown test fails on PB1 in pb-valgrind 
      platform with timeout
      
      The motivation for this bug report was due to fact that
      rpl_backup_shutdown was failing on PB in valgrind run. It happens that
      the test case actually times out causing extra valgrind output.
            
      This test case explicitly stops and restarts the master server. This
      is fine in all PB runs except one, the one in the valgrind host. Since
      valgrind makes execution slower, it may happen that the master
      takes too long to restart, making the slave reconnection routine to
      give up before the master is ready again. In the case of an MTR run
      the reconnect period seems small enough for this to happen, ultimately
      causing the test case to timeout.
            
      This patch addresses this, by setting the --master-retry-count to a
      large value on the slave.

This last commit is just to fix bug synopsis.

Pushed to 6.0-backup.

Merged to Azalea June 2

Does not appear to be any user-facing changes associated with this fix, so nothing to document for the Manual; closed without taking further action.