Bug #22320 my_atomic-t unit test fails
Submitted: 13 Sep 2006 17:57 Modified: 14 Oct 2010 13:09
Reporter: Guilhem Bichot
Status: Closed
Category: MySQL Server Severity: S3 (Non-critical)
Version: Celosia (M3) OS: Linux (Ubuntu x86 debug)
Assigned to: Davi Arnaut CPU Architecture:Any
Tags: pb2, regression, test failure

[13 Sep 2006 17:57] Guilhem Bichot
Description:
On some machines, this unit test fails randomly like this:
mysys/my_atomic-t...dubious
	Test returned status 255 (wstat -1, 0xffffffff)
	test program seems to have generated a core
	after all the subtests completed successfully

How to repeat:
cd unittest
(while true; do HARNESS_VERBOSE=1 perl unit.pl run mysys/my_atomic-t || exit 1; done)
We (Kristian, I) could repeat it using only a few iterations.
[13 Sep 2006 18:01] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/11878

ChangeSet@1.2323, 2006-09-13 19:58:57+02:00, guilhem@gbichot3.local +1 -0
  fixes for the my_atomic-t unit test:
  - compiler warning
  - detection of pthread_create failure (you will see this message
  only if you run with "make test-verbose" in unittest; otherwise
  unit.pl masks all messages from the test but "ok" ones).
  - the test fails randomly on some machines (I filed it as BUG#22320),
  on one host it looks like a crash at exit() which a sleep(2) makes
  disappear. So I add the sleep(2), which can be removed
  when BUG#22320 is fixed.
[13 Sep 2006 18:52] Guilhem Bichot
It may well be a problem with Test::Harness (i.e. not a crash of the test program itself), because this:
(while true; do HARNESS_VERBOSE=1 ./mysys/my_atomic-t || exit 1; done)    
never fails.
[13 Sep 2006 20:01] Kristian Nielsen
This seems to be a Perl problem. Here is a simpler way to reproduce:

perl -e 'for(;;) { open(FH, "mysys/my_atomic-t|") || die "open"; 1 while(<FH>); if(!close FH) { print "close() failed: ", 0+$!, ": $!\n"; exit 1}; print "ok: $?\n"; }'

I get this (fails randomly):

ok: 0
close() failed: 10: No child processes
[13 Sep 2006 22:12] Kristian Nielsen
I got curious and investigated a bit more. It is actually not a Perl bug; it looks more like a kernel/NPTL bug, since the problem can also be reproduced with a C program (attached).

Basically, there seems to be a race where the parent calls fork() and exec() to run the my_atomic-t program, then calls waitpid(), but sometimes waitpid() fails with ECHILD, so the exit status of the child is lost.

My guess is that the main thread in my_atomic-t exits before one or more other threads. Then in the small window between the exit of the main thread and the exit of the last thread, the waitpid() wrongly fails, because the main thread (= child pid) is gone, and the exit status for the whole thread group (=pid) has not been stored yet.

Note that this failure in Pushbuild is only seen on hosts rh-x86-32 and rhas4-ia64, both of which have kernel 2.6.9-22.0.1.

Also note that it only fails when using NPTL:

$ LD_ASSUME_KERNEL=2.4.20  ~/bug22320 mysys/my_atomic-t
Child=3958 waitpid=3958
wait4: No child processes

$ LD_ASSUME_KERNEL=2.4.19  ~/bug22320 mysys/my_atomic-t
Child=9841 waitpid=9841
Child=10218 waitpid=10218
Child=10528 waitpid=10528
Child=10837 waitpid=10837
Child=11164 waitpid=11164
...

If my guess is correct, we can fix it by explicitly joining the threads in my_atomic-t, instead of spawning them with PTHREAD_CREATE_DETACHED.

Or maybe a kernel upgrade ...
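
A minimal sketch of the joinable-thread alternative suggested above, assuming a GCC-compatible compiler for the __sync_fetch_and_add builtin. The names run_joined, worker, worker_arg and MAX_THREADS are made up for illustration and are not taken from my_atomic-t.c:

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define MAX_THREADS 64                  /* hypothetical upper bound */

struct worker_arg
{
  int32_t *counter;                     /* shared counter to increment */
  int iterations;                       /* increments per thread */
};

static void *worker(void *p)
{
  struct worker_arg *arg = p;
  for (int i = 0; i < arg->iterations; i++)
    __sync_fetch_and_add(arg->counter, 1);   /* GCC/Clang builtin */
  return NULL;
}

/* Start n_threads joinable threads (default attributes, i.e. not
   PTHREAD_CREATE_DETACHED) and wait for every one of them before
   returning, so no worker can outlive the caller. */
static int32_t run_joined(int n_threads, int iterations)
{
  pthread_t threads[MAX_THREADS];
  int32_t counter = 0;
  struct worker_arg arg = { &counter, iterations };
  int started = 0;

  assert(n_threads > 0 && n_threads <= MAX_THREADS);
  for (int i = 0; i < n_threads; i++)
  {
    if (pthread_create(&threads[started], NULL, worker, &arg) != 0)
      break;                            /* join only what was started */
    started++;
  }
  for (int i = 0; i < started; i++)
    pthread_join(threads[i], NULL);
  return counter;
}
```

With this shape, the main thread cannot reach exit() while a worker is still running, which is the window the guess above is about.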
[13 Sep 2006 22:13] Kristian Nielsen
C program to expose the kernel/NPTL bug with my_atomic-t

Attachment: bug22320.c (text/x-csrc), 956 bytes.
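
The attachment itself is not reproduced in this thread. As a rough sketch of the fork()/exec()/waitpid() sequence described in the previous comment (this is not the attached bug22320.c, and run_once is a hypothetical helper name):

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run `path` once as a child process and collect its wait status.
   Returns 0 when waitpid() delivered a status, -1 when fork() or
   waitpid() failed. A waitpid() failure with ECHILD is the symptom
   discussed in this report. */
static int run_once(const char *path, int *status)
{
  assert(path != NULL && status != NULL);

  pid_t child = fork();
  if (child < 0)
  {
    perror("fork");
    return -1;
  }
  if (child == 0)
  {
    execl(path, path, (char *) NULL);
    _exit(127);                         /* only reached if exec failed */
  }

  pid_t waited;
  do
    waited = waitpid(child, status, 0);
  while (waited < 0 && errno == EINTR); /* retry if interrupted */

  if (waited < 0)
  {
    perror("waitpid");                  /* e.g. "No child processes" */
    return -1;
  }
  printf("Child=%d waitpid=%d\n", (int) child, (int) waited);
  return 0;
}
```

Calling run_once("mysys/my_atomic-t", &status) in a loop until it returns -1 would be one way to look for the lost exit status.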

[4 Dec 2006 10:13] Guilhem Bichot
If I remove the sleep(2) and use pthread_join() instead of a threads counter, it still fails on rh-x86-32:
mysys/my_atomic-t.............................# N CPUs: 2, atomic ops: dubious
... (all subtests say "ok") and then:
	Test returned status 255 (wstat -1, 0xffffffff)
	test program seems to have generated a core
	after all the subtests completed successfully
[4 Feb 2010 18:17] Alexander Nozdrin
my_atomic-t now fails in Celosia (M3) on 'Ubuntu x86 debug only'.
Symptoms:
mysys/my_atomic-t......FAILED--Further testing stopped: Signal 11 thrown

It seems that it does not fail in 5.1 anymore.
It also seems to work in Betony (M2). So it might be a regression.

Requesting re-triage.
[4 Mar 2010 13:52] Olav Sandstå
Running the my_atomic-t test manually gives the following output:

# N CPUs: 2, atomic ops: gcc-x86lock
1..6
ok 1 - my_atomic_initialize() returned 0
# Testing my_atomic_add32 with 30 threads, 3000 iterations...
ok 2 - tested my_atomic_add32 in 0.001289 secs (0)
# Testing my_atomic_fas32 with 30 threads, 3000 iterations...
ok 3 - tested my_atomic_fas32 in 0.001762 secs (0)
# Testing my_atomic_cas32 with 30 threads, 3000 iterations...
ok 4 - tested my_atomic_cas32 in 0.002988 secs (0)
Bail out! Signal 11 thrown
[4 Mar 2010 13:55] Olav Sandstå
Running the test in gdb gives the following call stack:

#0  0x0804d696 in my_atomic_cas64 (U_a={i = 0x807e860, u = 0x807e860}, U_cmp=
      {i = 0xbf870e78, u = 0xbf870e78}, U_set=
      {i = 1152956689784258560, u = 1152956689784258560})
    at ../../include/my_atomic.h:222
#1  0x0804d647 in my_atomic_add64 (U_a={i = 0x807e860, u = 0x807e860}, U_v=
      {i = 1152956689784258560, u = 1152956689784258560})
    at ../../include/my_atomic.h:230
#2  0x0804da86 in do_tests () at my_atomic-t.c:176
#3  0x0804d37f in main (argc=1, argv=0xbf870f74) at thr_template.c:79
[4 Mar 2010 14:05] Olav Sandstå
Note that this crash occurs when I compile using gcc version 4.2.4 on an Ubuntu 8.04.3 LTS server.

When doing the same using gcc version 4.3.2 on an Ubuntu 8.04.02 server, the test runs fine.
[5 Jul 2010 12:01] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/112882

3463 Davi Arnaut	2010-07-05
      Bug#22320: my_atomic-t unit test fails
      
      The atomic operations implementation on 5.1 has a few problems,
      which might cause tests to abort randomly. Since no code in 5.1
      uses atomic operations, simply remove the code.
[5 Jul 2010 13:26] Davi Arnaut
Removal queued to mysql-5.1-bugteam, null merged into mysql-trunk-merge.
[6 Jul 2010 0:48] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/112922

3086 Davi Arnaut	2010-07-05
      Bug#22320: my_atomic-t unit test fails
      Bug#52261: 64 bit atomic operations do not work on Solaris i386
                 gcc in debug compilation
      
      One of the various problems was that the source operand to
      CMPXCHG8b was marked as an input/output operand, causing GCC
      to use the EBX register as the destination register for the
      CMPXCHG8b instruction. This could lead to crashes as the EBX
      register is also implicitly used by the instruction, causing
      the value to be potentially corrupted and a protection fault
      once the value is used to access a position in memory.
      
      Another problem was the lack of proper clobbers for the atomic
      operations and, also, a discrepancy between the implementations
      for the Compare and Set operation. The specific problems are
      described and fixed by Kristian Nielsen's patches:
      
      Patch 1:
      
      Fix bugs in my_atomic_cas*(val,cmp,new) where *cmp is accessed
      after CAS succeeds.
      
      In the gcc builtin implementation, problem was that *cmp was
      read again after atomic CAS to check if old *val == *cmp;
      this fails if CAS is successful and another thread modifies
      *cmp in-between.
      
      In the x86-gcc implementation, problem was that *cmp was set
      also in the case of successful CAS; this means there is a
      window where it can clobber a value written by another thread
      after successful CAS.
      
      Patch 2:
      
      Add a GCC asm "memory" clobber to primitives that imply a
      memory barrier.
      
      This signifies to GCC that any potentially aliased memory
      must be flushed before the operation, and re-read after the
      operation, so that read or modification in other threads of
      such memory values will work as intended.
      
      In effect, it makes these primitives work as memory barriers
      for the compiler as well as the CPU. This is better and more
      correct than adding "volatile" to variables.
     @ include/atomic/gcc_builtins.h
        Do not read from *cmp after the operation as it might be
        already gone if the operation was successful.
     @ include/atomic/nolock.h
        Prefer system provided atomics over the broken x86 asm.
     @ include/atomic/x86-gcc.h
        Do not mark source operands as input/output operands.
        Add proper memory clobbers.
     @ include/my_atomic.h
        Add notes about my_atomic_add and my_atomic_cas behaviors.
     @ unittest/mysys/my_atomic-t.c
        Remove workaround; if it fails, there is either a problem
        with the atomic operations code or the specific compiler
        version should be black-listed.
[8 Jul 2010 16:16] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/113160

3095 Davi Arnaut	2010-07-08
      Bug#22320: my_atomic-t unit test fails
      Bug#52261: 64 bit atomic operations do not work on Solaris i386
                 gcc in debug compilation
      
      One of the various problems was that the source operand to
      CMPXCHG8b was marked as an input/output operand, causing GCC
      to use the EBX register as the destination register for the
      CMPXCHG8b instruction. This could lead to crashes as the EBX
      register is also implicitly used by the instruction, causing
      the value to be potentially corrupted and a protection fault
      once the value is used to access a position in memory.
      
      Another problem was the lack of proper clobbers for the atomic
      operations and, also, a discrepancy between the implementations
      for the Compare and Set operation. The specific problems are
      described and fixed by Kristian Nielsen's patches:
      
      Patch 1:
      
      Fix bugs in my_atomic_cas*(val,cmp,new) where *cmp is accessed
      after CAS succeeds.
      
      In the gcc builtin implementation, problem was that *cmp was
      read again after atomic CAS to check if old *val == *cmp;
      this fails if CAS is successful and another thread modifies
      *cmp in-between.
      
      In the x86-gcc implementation, problem was that *cmp was set
      also in the case of successful CAS; this means there is a
      window where it can clobber a value written by another thread
      after successful CAS.
      
      Patch 2:
      
      Add a GCC asm "memory" clobber to primitives that imply a
      memory barrier.
      
      This signifies to GCC that any potentially aliased memory
      must be flushed before the operation, and re-read after the
      operation, so that read or modification in other threads of
      such memory values will work as intended.
      
      In effect, it makes these primitives work as memory barriers
      for the compiler as well as the CPU. This is better and more
      correct than adding "volatile" to variables.
     @ include/atomic/gcc_builtins.h
        Do not read from *cmp after the operation as it might be
        already gone if the operation was successful.
     @ include/atomic/nolock.h
        Prefer system provided atomics over the broken x86 asm.
     @ include/atomic/x86-gcc.h
        Do not mark source operands as input/output operands.
        Add proper memory clobbers.
     @ include/my_atomic.h
        Add notes about my_atomic_add and my_atomic_cas behaviors.
     @ unittest/mysys/my_atomic-t.c
        Remove workaround; if it fails, there is either a problem
        with the atomic operations code or the specific compiler
        version should be black-listed.
[23 Jul 2010 12:28] Bugs System
Pushed into mysql-trunk 5.5.6-m3 (revid:alik@sun.com-20100723121820-jryu2fuw3pc53q9w) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (merge vers: 5.5.5-m3) (pib:18)
[23 Jul 2010 12:35] Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100723121929-90e9zemk3jkr2ocy) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (pib:18)
[23 Jul 2010 21:30] Davi Arnaut
Queued to mysql-trunk-bugfixing
[4 Aug 2010 7:50] Bugs System
Pushed into mysql-trunk 5.5.6-m3 (revid:alik@sun.com-20100731131027-1n61gseejyxsqk5d) (version source revid:marko.makela@oracle.com-20100621094008-o9fa153s3f09merw) (merge vers: 5.1.49) (pib:18)
[4 Aug 2010 8:10] Bugs System
Pushed into mysql-trunk 5.6.1-m4 (revid:alik@ibmvm-20100804080001-bny5271e65xo34ig) (version source revid:marko.makela@oracle.com-20100621094008-o9fa153s3f09merw) (merge vers: 5.1.49) (pib:18)
[4 Aug 2010 8:26] Bugs System
Pushed into mysql-trunk 5.6.1-m4 (revid:alik@ibmvm-20100804081533-c1d3rbipo9e8rt1s) (version source revid:marko.makela@oracle.com-20100621094008-o9fa153s3f09merw) (merge vers: 5.1.49) (pib:18)
[4 Aug 2010 9:05] Bugs System
Pushed into mysql-next-mr (revid:alik@ibmvm-20100804081630-ntapn8bf9pko9vj3) (version source revid:marko.makela@oracle.com-20100621094008-o9fa153s3f09merw) (pib:20)
[12 Aug 2010 19:43] Paul DuBois
Noted in 5.5.6 changelog.

Problems in the atomic operations implementation could lead to server crashes.
[19 Aug 2010 15:41] Bugs System
Pushed into mysql-5.1 5.1.51 (revid:build@mysql.com-20100819151858-muaaor6jojb5ouzj) (version source revid:build@mysql.com-20100819151858-muaaor6jojb5ouzj) (merge vers: 5.1.51) (pib:20)
[14 Oct 2010 8:37] Bugs System
Pushed into mysql-5.1-telco-7.0 5.1.51-ndb-7.0.20 (revid:martin.skold@mysql.com-20101014082627-jrmy9xbfbtrebw3c) (version source revid:martin.skold@mysql.com-20101014082627-jrmy9xbfbtrebw3c) (merge vers: 5.1.51-ndb-7.0.20) (pib:21)
[14 Oct 2010 8:52] Bugs System
Pushed into mysql-5.1-telco-6.3 5.1.51-ndb-6.3.39 (revid:martin.skold@mysql.com-20101014083757-5qo48b86d69zjvzj) (version source revid:martin.skold@mysql.com-20101014083757-5qo48b86d69zjvzj) (merge vers: 5.1.51-ndb-6.3.39) (pib:21)
[14 Oct 2010 9:07] Bugs System
Pushed into mysql-5.1-telco-6.2 5.1.51-ndb-6.2.19 (revid:martin.skold@mysql.com-20101014084420-y54ecj85j5we27oa) (version source revid:martin.skold@mysql.com-20101014084420-y54ecj85j5we27oa) (merge vers: 5.1.51-ndb-6.2.19) (pib:21)
[14 Oct 2010 13:09] Jon Stephens
Also noted in the 5.1.51 changelog. No additional changelog entries required. Closed.