Bug #58426 Crashing tests not failing as they are supposed to on Solaris 10 debug
Submitted: 23 Nov 2010 19:25 Modified: 20 Dec 2010 3:12
Reporter: Bjørn Munch Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Tests Severity:S3 (Non-critical)
Version:5.1+ OS:Solaris (S10, debug)
Assigned to: Tor Didriksen CPU Architecture:Any

[23 Nov 2010 19:25] Bjørn Munch
Description:
In 5.5-secuity since Oct 29, test binlog.binlog_index usually times out. It's actually a failure but is masked by Bug #58424 into a timeout.

The actual failure is like this:

---
mysqltest: At line 130: query 'purge binary logs TO 'master-bin.000006'' succeeded - should have failed with errno 2013...
---

It may also fail with a file-exists which either fails or succeeds unexpectedly; in those causes the test fails normally.

How to repeat:
I was able to repeat with binaries from a recent PB2 build on Solaris 10 x86 32 bit, running on skade11 (there was other load on the system)

Run ./mtr --mem binlog.binlog_index

If it hangs, kill it and then look at var/log/current_test

It's not consistently reproducable and the exact failure varies.
[23 Nov 2010 23:34] Luis Soares
This is not specific to binlog_index test case. This test case
is "unfortunate" to make heavy use of "crash the server" test
pattern. For instance, the same behavior is observed in another
test case (that crashes the server), main.crash_commit_before:

main.crash_commit_before                 [ fail ]
        Test ended at 2010-11-24 00:21:40

CURRENT_TEST: main.crash_commit_before
mysqltest: At line 22: query 'COMMIT' succeeded - should have failed with errno 2013...

The result from queries just before the failure was:
CREATE TABLE t1(a int) engine=innodb;
START TRANSACTION;
insert into t1 values(9);
SET SESSION debug="d,crash_commit_before";

 - saving '/export/home/tmp/ls229627/tests/mysql-5.5.8-solaris10-x86_64/mysql-test/var/log/main.crash_commit_before/' to '/export/home/tmp/ls229627/tests/mysql-5.5.8-solaris10-x86_64/mysql-test/var/log/main.crash_commit_before/'
------------------------------------------------------------
The servers were restarted 0 times
Spent 0.000 of 8 seconds executing testcases

Completed: Failed 1/1 tests, 0.00% were successful.

Failing test(s): main.crash_commit_before

How-to repeat:

1. Take a debug build for solaris10
2. run the test case main.crash_commit_before several times
3. Observe it failing sometimes with timeout (this is BUG#58424)
   and sometimes with the server not crashing where it was
   supposed to (this is BUG#58426 - this one).
[23 Nov 2010 23:57] Luis Soares
This may have to do with with the recent change on how crashes are
injected, DBUG_ABORT was replaced by DBUG_SUICIDE in BUG#52002.

DBUG_SUICIDE is implemented using SIGKILL + pause():
  
  #define DBUG_SUICIDE() (_db_flush_(), kill(getpid(), SIGKILL), pause())

Also, the behavior described in this bug report seems to match
closely on this review comments, which would happen when DBUG_SUICIDE
was implemented as without the pause() (which was one of the original
proposals):

  http://lists.mysql.com/commits/119761

This makes me think that even with pause(), there is still some code 
that runs before the process actually dies? or pause() wakes earlier 
than expected, due to some other signal!? 

As a consequence, crashing tests tend to exhibit this random behavior
on solaris10.
[26 Nov 2010 3:58] Anitha Gopi
Tor,
Assigning to you since this may have been introduced by your fix for Bug#52002
[26 Nov 2010 10:24] Tor Didriksen
with this definition of DBUG_SUICIDE
#define DBUG_SUICIDE() (_db_flush_(), fprintf(stderr, "SIGKILL myself\n"), fflush(stderr), kill(getpid(), SIGKILL), pause(), fprintf(stderr, "Continue after SIGKILL\n"), fflush(stderr))

we can actually get this server log:
SIGKILL myself
Continue after SIGKILL

So we have received some other signal (alarm maybe?) that interrupted the pause().
Simple fix: just loop forever waiting for SIGKILL to be delivered.
[26 Nov 2010 10:41] Tor Didriksen
Interesting, with this implementation of DBUG_SUICIDE
  do
  {
    fprintf(stderr, "SIGKILL myself\n");
    fflush(stderr);
    kill(getpid(), SIGKILL);
    pause();
  }
  while(1);

I sometimes see several hundred lines of
SIGKILL myself
in the server log.
[26 Nov 2010 14:04] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/125156

3521 Tor Didriksen	2010-11-26
      Bug #58426 Crashing tests not failing as they are supposed to on Solaris 10 debug
        
      On this platform we either
       - get lots of other signals
       - or maybe we lose the KILL signal
        
      Solution: loop with { kill; pause; } until process dies.
     @ include/my_dbug.h
        Let DBUG_SUICE loop forever until the KILL signal is delivered, and process dies.
[2 Dec 2010 11:25] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/125801

3512 Tor Didriksen	2010-11-26
      Bug #58426 Crashing tests not failing as they are supposed to on Solaris 10 debug
        
      On this platform we either
       - get lots of other signals
       - or maybe we lose the KILL signal
        
      Solution: loop with { kill; pause; } until process dies.
     @ include/my_dbug.h
        Let DBUG_SUICE loop forever until the KILL signal is delivered, and process dies.
[8 Dec 2010 8:07] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/126292

3513 Tor Didriksen	2010-11-26
      Bug #58426 Crashing tests not failing as they are supposed to on Solaris 10 debug
        
      On this platform we either
       - get lots of other signals
       - or maybe we lose the KILL signal
        
      Solution: loop with { kill; pause; } until process dies.
     @ include/my_dbug.h
        Let DBUG_SUICE loop forever until the KILL signal is delivered, and process dies.
[13 Dec 2010 12:18] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/126624

3518 Tor Didriksen	2010-12-13
      Bug #58426 Crashing tests not failing as they are supposed to on Solaris 10 debug
        
      On this platform we seem to get lots of other signals
      while waiting for SIGKILL to be delivered.
      
      Solution: use sigsuspend(<all signals blocked>)
     @ dbug/dbug.c
        New function _db_suicide_() which does kill(myself, -9) and then waits forever.
     @ include/my_dbug.h
        Let DBUG_SUICE wait forever until the KILL signal is delivered, and process dies.
[13 Dec 2010 16:13] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/126681

3519 Tor Didriksen	2010-12-13
      Bug #58426 Crashing tests not failing as they are supposed to on Solaris 10 debug
        
      On this platform we seem to get lots of other signals
      while waiting for SIGKILL to be delivered.
      
      Solution: use sigsuspend(<all signals blocked>)
     @ dbug/dbug.c
        New function _db_suicide_() which does kill(myself, -9) and then waits forever.
     @ include/my_dbug.h
        Let DBUG_SUICE wait forever until the KILL signal is delivered, and process dies.
[13 Dec 2010 16:18] Tor Didriksen
pushed to 5.1-bugteam, 5.5-bugteam, trunk-bugfixing.
[17 Dec 2010 12:48] Bugs System
Pushed into mysql-5.1 5.1.55 (revid:georgi.kodinov@oracle.com-20101217124435-9imm43geck5u55qw) (version source revid:tor.didriksen@oracle.com-20101213161301-81lprlbune7r98dl) (merge vers: 5.1.55) (pib:24)
[17 Dec 2010 12:52] Bugs System
Pushed into mysql-5.5 5.5.9 (revid:georgi.kodinov@oracle.com-20101217124733-p1ivu6higouawv8l) (version source revid:sergey.glukhov@oracle.com-20101214104600-v0ndu721rf61nbml) (merge vers: 5.5.8) (pib:24)
[17 Dec 2010 12:56] Bugs System
Pushed into mysql-trunk 5.6.1 (revid:georgi.kodinov@oracle.com-20101217125013-y8pb3az32rtbplc9) (version source revid:sergey.glukhov@oracle.com-20101214111513-9j68fg7s48a986ng) (merge vers: 5.6.1) (pib:24)
[20 Dec 2010 3:12] Paul DuBois
Changes to test suite. No changelog entry needed.