Bug #37714 rpl.rpl_heartbeat fails sporadically in pushbuild due to timeout
Submitted: 28 Jun 2008 12:52 Modified: 24 Jun 13:09
Reporter: Sven Sandberg
Status: Closed
Category:Tests: Replication Severity:S2 (Serious)
Version:6.0 OS:Any
Assigned to: Andrei Elkin Target Version:6.0
Tags: rpl.rpl_heartbeat, pushbuild, test failure, sporadic, timeout, disabled
Triage: D3 (Medium)

[28 Jun 2008 12:52] Sven Sandberg
Description:
TEST: rpl.rpl_heartbeat

The test occasionally times out in pushbuild.

How to repeat:
 WHERE: 6.0/alik Mon Jun 2 14:18:22 2008/'vm-win2003-32-a' Win32 VS2003 -max-nt/n_mix
 URL: https://intranet.mysql.com/secure/pushbuild/showpush.pl?dir=bzr_mysql-6.0&order=7

 WHERE: 6.0/chad Fri May 23 20:27:32 2008/'vm-win2003-32-a' Win32 VS2003 -max-nt/n_stm
 URL: https://intranet.mysql.com/secure/pushbuild/showpush.pl?dir=bzr_mysql-6.0&order=3

 WHERE: 6.0/mleich on Thu Jun 26 13:18:34 2008/'vm-win2003-32-a' Win32 VS2003 -max-nt/ps_stm
 URL: https://intranet.mysql.com/secure/pushbuild/showpush.pl?dir=bzr_mysql-6.0&order=15
[28 Jun 2008 13:11] Sven Sandberg
WHERE: 6.0/azundris on Sat Jun 21 09:16:21 2008/'powermacg5' -max/n_mix
 URL: https://intranet.mysql.com/secure/pushbuild/showpush.pl?dir=bzr_mysql-6.0&order=11
 --
 WHERE: 6.0-rpl/skozlov on Mon Jun 23 21:26:22 2008/'vm-win2003-64-b' Win64 VS2005 -max-nt/n_mix
 URL: https://intranet.mysql.com/secure/pushbuild/showpush.pl?dir=bzr_mysql-6.0-rpl&order=17
[15 Jul 2008 9:19] Alexander Nozdrin
Test case has been disabled because it fails too often.
[12 Dec 2008 18:37] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/61533

2807 Andrei Elkin	2008-12-12
      Bug #37714 rpl.rpl_heartbeat fails sporadically in pushbuild due to timeout
      
      The cause of the failure on Windows has not been identified. However, a piece of the
      heartbeat code had a flaw that was fixed with Bug #39077.
      The patch for that bug, which is going to be pushed to 6.0 main,
      will probably help with this one as well.
      
      Attempted to fix with the patch for Bug #39077.
[12 Dec 2008 22:34] Andrei Elkin
The fixes for the possibly related bug#39077 have been pushed so that we can monitor
whether the test passes.
The status is set to in-progress until there is confirmation that the bug is really
gone.
[19 Dec 2008 10:36] Sven Sandberg
Setting to "Can't repeat" since it has not happened since 2008-07-04. Please re-open the
bug if it happens again.
[19 Dec 2008 18:35] Sven Sandberg
xref: http://tinyurl.com/3q7nr9
[20 Jan 19:57] Bugs System
Pushed into 6.0.10-alpha (revid:joro@sun.com-20090119171328-2hemf2ndc1dxl0et) (version
source revid:azundris@mysql.com-20081230114916-c290n83z25wkt6e4) (merge vers:
6.0.9-alpha) (pib:6)
[30 Jan 15:25] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/64652

2988 Andrei Elkin	2009-01-30
      Bug #37714 rpl.rpl_heartbeat fails sporadically in pushbuild due to timeout
      
      The timeout has finally happened again:
      https://intranet.mysql.com/secure/pushbuild/showpush.pl?dir=bzr_mysql-6.0-bugteam&order=45...
      
      The test is conditionally disabled so that it does not run on Windows.
      
      Todo: remove the added "-- source include/not_windows.inc" line once the case has been fixed.
[4 Feb 12:15] Bugs System
Pushed into 6.0.10-alpha (revid:kostja@sun.com-20090204104420-mw1i2u9lum4bxjo6) (version
source revid:joro@sun.com-20090131161307-ydhtowoaf0m3nzu0) (merge vers: 6.0.10-alpha)
(pib:6)
[5 Feb 14:09] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/65335

3027 Andrei Elkin	2009-02-05
      mysql-test/suite/rpl/t/rpl_heartbeat.test is allowed to run on Windows again to watch
      whether bug#37714 shows up now that mtr2 has been pushed; the former mtr may have
      contributed to the issue.
[6 Feb 22:25] Andrei Elkin
Setting it in-progress to gather regression evidence that pb can supply.
If the timeout failure does not show up, we can attribute it to the old mtr.
[14 Feb 14:00] Bugs System
Pushed into 6.0.10-alpha (revid:matthias.leich@sun.com-20090212211028-y72faag15q3z3szy)
(version source revid:alexey.kopytov@sun.com-20090206100220-tkvd9v83791i895x) (merge
vers: 6.0.10-alpha) (pib:6)
[23 Feb 14:22] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/67188

3075 Andrei Elkin	2009-02-23
      Bug #37714 rpl.rpl_heartbeat fails sporadically in pushbuild due to timeout
      
      Logs on PB show that the IO thread was already down at the clean-up (drop table t1) of
      the test.
      A probable reason for the IO thread to stop is the small value of slave_net_timeout,
      chosen as a tradeoff between the need to test the counting of heartbeats and the test
      execution time.
      In a slow environment the timeout can elapse before any heartbeat has arrived.
      
      Fixed by performing the clean-up separately on the master and on the slave.
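
The timing argument in this commit message can be put as a small model. This is only a sketch under stated assumptions: the function name, the delivery_delay parameter, and all numbers are hypothetical and are not taken from rpl_heartbeat.test or from the pushbuild logs. It assumes that the master sends a heartbeat one heartbeat period after the last event, and that the slave IO thread gives up once slave_net_timeout passes without receiving anything.

```python
def heartbeat_arrives_in_time(slave_net_timeout, heartbeat_period, delivery_delay):
    """Return True if the first heartbeat reaches the slave before the
    IO thread's read timeout expires (simplified, hypothetical model)."""
    arrival = heartbeat_period + delivery_delay
    return arrival < slave_net_timeout


# Hypothetical values in seconds; not the ones used by the real test.
fast_env = heartbeat_arrives_in_time(1.0, 0.5, 0.1)  # heartbeat at 0.6 s
slow_env = heartbeat_arrives_in_time(1.0, 0.5, 0.7)  # heartbeat at 1.2 s
print(fast_env, slow_env)  # prints: True False
```

With a small slave_net_timeout there is little slack, so an extra delay on a slow host is enough for the timeout to win.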
[23 Feb 19:54] Andrei Elkin
Alfranio, I think one of you two, you or Luis, should be substituted by Serge, who is
involved in rpl heartbeat testing.
This patch should be of interest to him, not least because he spotted the test failure
last time. I hope you're okay with giving him a chance :-)

De nada, Andrei.
[23 Feb 19:56] Andrei Elkin
There are two patches committed; the bug stays in-progress until the second patch proves
the correlation between the small value of slave net timeout and the failure.
So far I have been watching whether the test passes.
[24 Feb 16:04] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/67394

3075 Andrei Elkin	2009-02-24
      Only comments regarding bug#37714. The push is to make pb execute rpl_heartbeat.
[18 Mar 14:17] Bugs System
Pushed into 6.0.11-alpha (revid:joro@sun.com-20090318122208-1b5kvg6zeb4hxwp9) (version
source revid:azundris@mysql.com-20090224072212-51w0xg6doju2drup) (merge vers:
6.0.10-alpha) (pib:6)
[3 Apr 18:26] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/71335

3177 Andrei Elkin	2009-04-03
      bug#37714
      
      A debug printout is added to the test.
[3 Apr 18:27] Andrei Elkin
Still in-progress, a debug push is about to be done.
[6 May 16:09] Bugs System
Pushed into 6.0.12-alpha (revid:svoj@sun.com-20090506125450-yokcmvqf2g7jhujq) (version
source revid:aelkin@mysql.com-20090403162450-66ih5occv33rsc6a) (merge vers: 6.0.11-alpha)
(pib:6)
[3 Jun 17:43] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/75539

2858 Andrei Elkin	2009-06-03
      Bug #37714 rpl.rpl_heartbeat fails sporadically in pushbuild due to timeout
      
      The cause of the bug is a property of pthread_cond_timedwait(): there is a time window
      between the moment the timer elapses, which wakes up the thread, and
      the moment the thread re-acquires the mutex. Signals could be sent to the dump thread
      within that window, so the dump thread was not aware that the binlog had been updated
      and stayed in the loop.
      
      Fixed by augmenting the MYSQL_BIN_LOG class with a counter that is checked before and
      after the wake-up to detect that the binlog got updated.
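
The counter idea described in this commit message can be sketched as follows. This is a Python analogue written for illustration, not the actual patch: the class BinlogSignal and its methods are hypothetical, threading.Condition stands in for the pthread condition variable, and only the name signal_cnt is borrowed from the thread. The point is that the waiter decides from the counter, not from whether the wait reported a timeout.

```python
import threading


class BinlogSignal:
    """Hypothetical stand-in for the binlog update notification."""

    def __init__(self):
        self.cond = threading.Condition()
        self.signal_cnt = 0  # bumped on every binlog update

    def signal_update(self):
        # Writer side: bump the counter under the lock, then wake waiters.
        with self.cond:
            self.signal_cnt += 1
            self.cond.notify_all()

    def wait_for_update(self, timeout):
        """Return True if the binlog was updated while waiting."""
        with self.cond:
            cnt_before = self.signal_cnt
            # The return value (woken vs. timed out) is deliberately
            # ignored; only the counter decides.
            self.cond.wait(timeout)
            return self.signal_cnt != cnt_before


binlog = BinlogSignal()
idle = binlog.wait_for_update(0.05)  # nobody writes: a plain timeout

writer = threading.Timer(0.2, binlog.signal_update)
writer.start()
updated = binlog.wait_for_update(5.0)  # the writer signals during the wait
writer.join()
print(idle, updated)
```

Because the counter is read under the same lock before and after the wait, an update that lands in the window where the wait has already timed out is still noticed.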
[3 Jun 17:48] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/75540

2859 Andrei Elkin	2009-06-03
      Bug #37714   rpl.rpl_heartbeat fails sporadically in pushbuild due to timeout 
      
      Removing the debug print from the test.
[8 Jun 19:31] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/75868

2860 Andrei Elkin	2009-06-08
      Bug #37714   rpl.rpl_heartbeat fails sporadically in pushbuild due to timeout 
      
      Restoring the values of slave net timeout and hb from before the debug push
      aelkin@mysql.com-20090223133029-31b45i2aw9uaompa to cut the test run time in half.
[15 Jun 16:01] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/76283

2864 Andrei Elkin	2009-06-15
      Bug #37714  rpl.rpl_heartbeat fails sporadically in pushbuild due to timeout 
      
      The cause of the bug is a property of pthread_cond_timedwait(): there is a time window
      between the moment the timer elapses, which wakes up the thread, and
      the moment the thread re-acquires the mutex. Signals could be sent to the dump thread
      within that window, so the dump thread was not aware that the binlog had been updated
      and stayed in the loop.
            
      Fixed by augmenting the MYSQL_BIN_LOG class with a counter that is checked before and
      after the wake-up to detect that the binlog got updated.
[16 Jun 14:51] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/76388

2866 Andrei Elkin	2009-06-16
      Bug #37714 rpl.rpl_heartbeat fails sporadically in pushbuild due to timeout
      
      rpl_backup_multi revealed that the assert
      DBUG_ASSERT(ret == 0 && signal_cnt != mysql_bin_log.signal_cnt || thd->killed)
      does not hold in the case of multiple dump threads. A thread waiting for a binlog
      update can catch a broadcast signal without the binlog actually having been refreshed.
      
      The assert is removed.
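
Why such an assert cannot hold can be illustrated with a Python sketch. Everything in it is hypothetical (the class, the method names, and the unrelated broadcast); the thread does not say which broadcast woke the waiting thread. It assumes only that a waiter can be woken normally, the analogue of ret == 0, while the counter it saw before the wait is unchanged.

```python
import threading


class BinlogSignal:
    """Hypothetical stand-in for the binlog update notification."""

    def __init__(self):
        self.cond = threading.Condition()
        self.signal_cnt = 0

    def broadcast_without_update(self):
        # Hypothetical wake-up that does not correspond to a binlog update.
        with self.cond:
            self.cond.notify_all()

    def wait_once(self, timeout, ready):
        """Return (woken, updated) for a single wait."""
        with self.cond:
            cnt_before = self.signal_cnt
            ready.set()  # lock is still held; it is released only in wait()
            woken = self.cond.wait(timeout)
            return woken, self.signal_cnt != cnt_before


binlog = BinlogSignal()
ready = threading.Event()
result = []
waiter = threading.Thread(
    target=lambda: result.append(binlog.wait_once(5.0, ready)))
waiter.start()
ready.wait()                       # the waiter is about to block in wait()
binlog.broadcast_without_update()  # wakes it without bumping the counter
waiter.join()

woken, updated = result[0]
print(woken, updated)  # prints: True False
```

A check of the form "woken implies counter changed" fails here, so the waiter has to treat an unchanged counter as "no update yet" rather than assert on it.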
[19 Jun 9:54] Bugs System
Pushed into 5.4.4-alpha (revid:zhenxing.he@sun.com-20090619074435-4mlfkqqol4nzpq10)
(version source revid:zhenxing.he@sun.com-20090619074435-4mlfkqqol4nzpq10) (merge vers:
5.4.4-alpha) (pib:11)
[24 Jun 13:09] Jon Stephens
Test failure only, no user-facing changes to document. Closed.
[27 Oct 10:17] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/88270

3137 He Zhenxing	2009-10-27
      Backport Bug #37714 rpl.rpl_heartbeat fails sporadically in pushbuild due to timeout
      
      rpl_backup_multi revealed that the assert
      DBUG_ASSERT(ret == 0 && signal_cnt != mysql_bin_log.signal_cnt || thd->killed)
      does not hold in the case of multiple dump threads. A thread waiting for a binlog
      update can catch a broadcast signal without the binlog actually having been refreshed.
      
      The assert is removed.
     @ sql/sql_repl.cc
        The assert does not hold in the case of multiple dump threads. A thread waiting for
        a binlog update can catch a broadcast signal without the binlog actually having
        been refreshed.