Bug #37714 rpl.rpl_heartbeat fails sporadically in pushbuild due to timeout
Submitted: 28 Jun 2008 10:52 Modified: 24 Jun 2009 11:09
Reporter: Sven Sandberg
Status: Closed
Category:Tests: Replication Severity:S7 (Test Cases)
Version:6.0 OS:Any
Assigned to: Andrei Elkin CPU Architecture:Any
Tags: disabled, pushbuild, rpl.rpl_heartbeat, sporadic, test failure, timeout

[28 Jun 2008 10:52] Sven Sandberg
Description:
TEST: rpl.rpl_heartbeat

The test occasionally times out in pushbuild.

How to repeat:
 WHERE: 6.0/alik Mon Jun 2 14:18:22 2008/'vm-win2003-32-a' Win32 VS2003 -max-nt/n_mix
 URL: https://intranet.mysql.com/secure/pushbuild/showpush.pl?dir=bzr_mysql-6.0&order=7

 WHERE: 6.0/chad Fri May 23 20:27:32 2008/'vm-win2003-32-a' Win32 VS2003 -max-nt/n_stm
 URL: https://intranet.mysql.com/secure/pushbuild/showpush.pl?dir=bzr_mysql-6.0&order=3

 WHERE: 6.0/mleich on Thu Jun 26 13:18:34 2008/'vm-win2003-32-a' Win32 VS2003 -max-nt/ps_stm
 URL: https://intranet.mysql.com/secure/pushbuild/showpush.pl?dir=bzr_mysql-6.0&order=15
[28 Jun 2008 11:11] Sven Sandberg
WHERE: 6.0/azundris on Sat Jun 21 09:16:21 2008/'powermacg5' -max/n_mix
 URL: https://intranet.mysql.com/secure/pushbuild/showpush.pl?dir=bzr_mysql-6.0&order=11
 --
 WHERE: 6.0-rpl/skozlov on Mon Jun 23 21:26:22 2008/'vm-win2003-64-b' Win64 VS2005 -max-nt/n_mix
 URL: https://intranet.mysql.com/secure/pushbuild/showpush.pl?dir=bzr_mysql-6.0-rpl&order=17
[15 Jul 2008 7:19] Alexander Nozdrin
Test case has been disabled because it fails too often.
[12 Dec 2008 17:37] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/61533

2807 Andrei Elkin	2008-12-12
      Bug #37714 rpl.rpl_heartbeat fails sporadically in pushbuild due to timeout
      
      The cause of the failure on the Windows platform has not been identified. However, a piece of
      heartbeat code had a flaw that was fixed with Bug #39077.
      The patch for that bug, which is going to be pushed to 6.0 main, will probably
      help with this one as well.
      
      Attempted to fix with the patch for Bug #39077.
[12 Dec 2008 21:34] Andrei Elkin
The fixes for the possibly related bug#39077 have been pushed so that we can monitor
whether the test passes.
The status stays in-progress until there is further confirmation that the bug is really gone.
[19 Dec 2008 9:36] Sven Sandberg
Setting to "Can't repeat" since it has not happened since 2008-07-04. Please re-open the bug if it happens again.
[19 Dec 2008 17:35] Sven Sandberg
xref: http://tinyurl.com/3q7nr9
[20 Jan 2009 18:57] Bugs System
Pushed into 6.0.10-alpha (revid:joro@sun.com-20090119171328-2hemf2ndc1dxl0et) (version source revid:azundris@mysql.com-20081230114916-c290n83z25wkt6e4) (merge vers: 6.0.9-alpha) (pib:6)
[30 Jan 2009 14:25] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/64652

2988 Andrei Elkin	2009-01-30
      Bug #37714 rpl.rpl_heartbeat fails sporadically in pushbuild due to timeout
      
      The timeout has finally occurred again:
      https://intranet.mysql.com/secure/pushbuild/showpush.pl?dir=bzr_mysql-6.0-bugteam&order=45...
      
      The test is conditionally disabled so that it does not run on Windows.
      
      Todo: remove +-- source include/not_windows.inc once the case has been fixed.
[4 Feb 2009 11:15] Bugs System
Pushed into 6.0.10-alpha (revid:kostja@sun.com-20090204104420-mw1i2u9lum4bxjo6) (version source revid:joro@sun.com-20090131161307-ydhtowoaf0m3nzu0) (merge vers: 6.0.10-alpha) (pib:6)
[5 Feb 2009 13:09] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/65335

3027 Andrei Elkin	2009-02-05
      mysql-test/suite/rpl/t/rpl_heartbeat.test is allowed to run on Windows again, to watch whether bug#37714 shows up after mtr2 has been pushed; the former mtr may have contributed to the bug.
[6 Feb 2009 21:25] Andrei Elkin
Setting it to in-progress to gather the regression evidence that pb can supply.
If the timeout failure does not show up, we can attribute it to the old mtr.
[14 Feb 2009 13:00] Bugs System
Pushed into 6.0.10-alpha (revid:matthias.leich@sun.com-20090212211028-y72faag15q3z3szy) (version source revid:alexey.kopytov@sun.com-20090206100220-tkvd9v83791i895x) (merge vers: 6.0.10-alpha) (pib:6)
[23 Feb 2009 13:22] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/67188

3075 Andrei Elkin	2009-02-23
      Bug #37714 rpl.rpl_heartbeat fails sporadically in pushbuild due to timeout
      
      Logs on PB show that the IO thread was already down by the time of the test's clean-up (drop table t1).
      A probable reason for the IO thread to stop is a small value of slave_net_timeout, chosen as
      a tradeoff between the need to test counting of heartbeats and the test execution time.
      In a slow environment the timeout can elapse before any heartbeat has arrived.
      
      Fixed by performing the clean-up separately on the master and on the slave.
[23 Feb 2009 18:54] Andrei Elkin
Alfranio, I think either you or Luis should be substituted by Serge, who is
involved in rpl heartbeat testing.
This patch should be of interest to him, not least because he spotted the test failure last time. I hope you're okay with giving him a chance :-)

De nada, Andrei.
[23 Feb 2009 18:56] Andrei Elkin
Two patches have been committed; the bug stays in-progress until the second patch proves the correlation between the small slave_net_timeout value and the failure.
So far I have been watching whether the test passes.
[24 Feb 2009 15:04] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/67394

3075 Andrei Elkin	2009-02-24
      Only comments regarding bug#37714. The push is to make pb execute rpl_heartbeat.
[18 Mar 2009 13:17] Bugs System
Pushed into 6.0.11-alpha (revid:joro@sun.com-20090318122208-1b5kvg6zeb4hxwp9) (version source revid:azundris@mysql.com-20090224072212-51w0xg6doju2drup) (merge vers: 6.0.10-alpha) (pib:6)
[3 Apr 2009 16:26] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/71335

3177 Andrei Elkin	2009-04-03
      bug#37714
      
      A debug printout is added to the test.
[3 Apr 2009 16:27] Andrei Elkin
Still in-progress, a debug push is about to be done.
[6 May 2009 14:09] Bugs System
Pushed into 6.0.12-alpha (revid:svoj@sun.com-20090506125450-yokcmvqf2g7jhujq) (version source revid:aelkin@mysql.com-20090403162450-66ih5occv33rsc6a) (merge vers: 6.0.11-alpha) (pib:6)
[3 Jun 2009 15:43] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/75539

2858 Andrei Elkin	2009-06-03
      Bug #37714 rpl.rpl_heartbeat fails sporadically in pushbuild due to timeout
      
      The cause of the bug is a property of pthread_cond_timedwait(): there is a time window
      between the timer elapsing, which wakes up the thread, and
      the thread re-acquiring the mutex. Signals could be sent to the dump thread
      during that window, so the dump thread was not aware that the binlog had been updated
      and stayed in the loop.
      
      Fixed by augmenting the MYSQL_BIN_LOG class with a counter that is checked before and after
      the wake-up to detect that the binlog has been updated.
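
The counter idea described above can be sketched roughly as follows. This is a minimal illustration only, not the actual patch: `BinlogSignal`, `signal_update`, `snapshot`, `wait_for_update` and `demo_counter_check` are hypothetical names, and standard C++ `std::condition_variable` stands in for the server's pthread wrappers. The point it tries to show is that the waiter decides by comparing the counter, not by whether the wait reported a timeout.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the binlog update notification.
// This is NOT the real MYSQL_BIN_LOG class; all names are illustrative.
class BinlogSignal {
 public:
  // Writer side: record the update in the counter, then wake all waiters.
  void signal_update() {
    std::lock_guard<std::mutex> guard(lock_);
    ++signal_cnt_;
    cond_.notify_all();
  }

  // Reader side: remember the counter value seen so far.
  unsigned long snapshot() {
    std::lock_guard<std::mutex> guard(lock_);
    return signal_cnt_;
  }

  // Wait at most `timeout` for an update newer than `seen`.
  // The result depends only on the counter, never on whether the wait
  // reported a timeout, so an update that raced with the timer is not lost.
  bool wait_for_update(unsigned long seen, std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> guard(lock_);
    if (signal_cnt_ == seen) {
      cond_.wait_for(guard, timeout);  // timeout, signal, or spurious wake-up
    }
    return signal_cnt_ != seen;
  }

 private:
  std::mutex lock_;
  std::condition_variable cond_;
  unsigned long signal_cnt_ = 0;
};

// Single-threaded walk-through of the two outcomes.
inline void demo_counter_check() {
  BinlogSignal binlog;
  const unsigned long seen = binlog.snapshot();
  // No update yet: the wait ends and reports "nothing new".
  assert(!binlog.wait_for_update(seen, std::chrono::milliseconds(10)));
  // An update made while the reader was not waiting is still noticed.
  binlog.signal_update();
  assert(binlog.wait_for_update(seen, std::chrono::milliseconds(10)));
}
```

In this sketch a reader would take `snapshot()` before it starts waiting and pass that value to `wait_for_update()`, so an update signalled at any moment in between shows up as a changed counter.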
[3 Jun 2009 15:48] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/75540

2859 Andrei Elkin	2009-06-03
      Bug #37714   rpl.rpl_heartbeat fails sporadically in pushbuild due to timeout 
      
      Removing the debug print from the test.
[8 Jun 2009 17:31] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/75868

2860 Andrei Elkin	2009-06-08
      Bug #37714   rpl.rpl_heartbeat fails sporadically in pushbuild due to timeout 
      
      Restoring the pre-debug push (aelkin@mysql.com-20090223133029-31b45i2aw9uaompa)
      values of slave net timeout and hb to halve the test run time.
[15 Jun 2009 14:01] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/76283

2864 Andrei Elkin	2009-06-15
      Bug #37714  rpl.rpl_heartbeat fails sporadically in pushbuild due to timeout 
      
      The cause of the bug is a property of pthread_cond_timedwait(): there is a time window
      between the timer elapsing, which wakes up the thread, and
      the thread re-acquiring the mutex. Signals could be sent to the dump thread
      during that window, so the dump thread was not aware that the binlog had been
      updated and stayed in the loop.
      
      Fixed by augmenting the MYSQL_BIN_LOG class with a counter that is checked before and
      after the wake-up to detect that the binlog has been updated.
[16 Jun 2009 12:51] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/76388

2866 Andrei Elkin	2009-06-16
      Bug #37714 rpl.rpl_heartbeat fails sporadically in pushbuild due to timeout
      
      rpl_backup_multi revealed that the assert
      DBUG_ASSERT(ret == 0 && signal_cnt != mysql_bin_log.signal_cnt || thd->killed)
      does not hold in the case of multiple dump threads. A thread waiting for a binlog
      update can catch a broadcast signal without the binlog actually having been refreshed.
      
      The assert is removed.
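
To make the failing case concrete, here is a small sketch that evaluates the removed condition for a few wake-up scenarios. The `WakeUp` struct and both helper functions are hypothetical and only model the three values named in the assert; they are not taken from sql/sql_repl.cc. Note that `&&` binds tighter than `||`, so the condition reads as `(ret == 0 && counter changed) || killed`.

```cpp
#include <cassert>

// Hypothetical snapshot of what a dump thread observes after its wait returns.
struct WakeUp {
  int ret;                   // 0 = woken by a signal, nonzero = e.g. timed out
  unsigned long cnt_before;  // counter value saved before the wait
  unsigned long cnt_after;   // counter value read after the wake-up
  bool killed;               // stands in for thd->killed
};

// The condition of the removed assertion, with its precedence made explicit.
inline bool removed_assert_holds(const WakeUp &w) {
  return (w.ret == 0 && w.cnt_before != w.cnt_after) || w.killed;
}

// What a waiter can still rely on: the counter comparison alone.
inline bool binlog_updated(const WakeUp &w) {
  return w.cnt_before != w.cnt_after;
}

// Walk through the scenarios; the second one is the case named in the report.
inline void demo_assert_cases() {
  const WakeUp real_update{0, 5, 6, false};     // signal and a newer binlog
  const WakeUp signal_no_update{0, 5, 5, false};  // broadcast, nothing new
  const WakeUp killed_thread{0, 5, 5, true};    // thread was killed

  assert(removed_assert_holds(real_update));
  assert(!removed_assert_holds(signal_no_update));  // the assert would fire
  assert(removed_assert_holds(killed_thread));
  assert(!binlog_updated(signal_no_update));  // waiter simply keeps waiting
}
```

Under these assumptions, a waiter that wakes on a broadcast with an unchanged counter is a legitimate state, so treating the counter comparison as a plain check instead of an invariant avoids the false alarm.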
[19 Jun 2009 7:54] Bugs System
Pushed into 5.4.4-alpha (revid:zhenxing.he@sun.com-20090619074435-4mlfkqqol4nzpq10) (version source revid:zhenxing.he@sun.com-20090619074435-4mlfkqqol4nzpq10) (merge vers: 5.4.4-alpha) (pib:11)
[24 Jun 2009 11:09] Jon Stephens
Test failure only, no user-facing changes to document. Closed.
[27 Oct 2009 9:17] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/88270

3137 He Zhenxing	2009-10-27
      Backport Bug #37714 rpl.rpl_heartbeat fails sporadically in pushbuild due to timeout
      
      rpl_backup_multi revealed that the assert
      DBUG_ASSERT(ret == 0 && signal_cnt != mysql_bin_log.signal_cnt || thd->killed)
      does not hold in the case of multiple dump threads. A thread waiting for a binlog
      update can catch a broadcast signal without the binlog actually having been refreshed.
      
      The assert is removed.
     @ sql/sql_repl.cc
        The assert does not hold in the case of multiple dump threads. A thread waiting for a binlog update can catch a broadcast signal without the binlog actually having been refreshed.
[23 Dec 2009 11:26] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/95491

3074 Andrei Elkin	2009-12-23
      Bug #49802: backport Bug #37714  rpl.rpl_heartbeat  to telco
      
      Fixed by backporting two patches of bug#37714.