Bug #56760 my_atomics failures on osx10.5-x86-64bit
Submitted: 13 Sep 2010 21:56 Modified: 10 Jan 2011 3:52
Reporter: Marc ALFF
Status: Closed
Category:MySQL Server: General Severity:S2 (Serious)
Version:5.5.6-m3-release OS:MacOS (osx10.5-x86-64bit)
Assigned to: Davi Arnaut CPU Architecture:Any

[13 Sep 2010 21:56] Marc ALFF
Description:
This bug report is a follow up from bug#56521

While the fix for bug#56521 addressed one of the possible root causes,
the failure can still be observed after applying the fix.

Using this bug number to track the remaining issues.

At this point, after some analysis, the root cause is probably in the my_atomic implementation for this particular platform.

How to repeat:
See the internal build farm logs

Suggested fix:
N/A
[13 Sep 2010 22:19] Vladislav Vaintroub
There is a gcc-atomic implementation for versions of GCC that do not have atomic builtins on x86 (and x64?). It was used on Solaris/GCC 3.4 in the past, and it had problems (e.g. Bug#52261).
[20 Sep 2010 21:38] Marc ALFF
Status update:

See related bug#52419.

However, even with the fix for bug#52419, the following platforms still fail on the same atomics-related assert in mysql-5.5.6-m3-release:
- osx10.5-x86-64bit
- rhel4-x86-64bit

Note that the following platforms are ok:
- osx10.6-x86-64bit
- rhel5-x86-64bit

At this point, having the *exact* version of the compiler used on the working and failing x86-64bit platforms is needed to make any progress investigating this bug.
Also, please document the output of the unit tests for my_atomic, on the working and failing platforms.

An example for a platform that works: linux, x86-64, 4 core:

malff@linux-su11:mysql-5.5> gcc --version
gcc (SUSE Linux) 4.3.2 [gcc-4_3-branch revision 141291]

malff@linux-su11:mysql-5.5> ./unittest/mysys/my_atomic-t
# N CPUs: 4, atomic ops: gcc-builtins-smp
1..6
ok 1 - my_atomic_initialize() returned 0
# Testing my_atomic_add32 with 30 threads, 3000 iterations...
ok 2 - tested my_atomic_add32 in 0.0030544 secs (0)
# Testing my_atomic_fas32 with 30 threads, 3000 iterations...
ok 3 - tested my_atomic_fas32 in 0.0024723 secs (0)
# Testing my_atomic_cas32 with 30 threads, 3000 iterations...
ok 4 - tested my_atomic_cas32 in 0.0110504 secs (0)
ok 5 - add64
# Testing my_atomic_add64 with 30 threads, 3000 iterations...
ok 6 - tested my_atomic_add64 in 0.0032054 secs (0)
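
For readers without the unit test source at hand, here is a minimal sketch of the kind of check such a test can perform. It is not the code of my_atomic-t: the names adder and run_add32_check are hypothetical, and GCC's __sync_fetch_and_add builtin stands in for my_atomic_add32. The thread and iteration counts mirror the output above. The idea is that every thread adds and then subtracts the same amount, so the shared counter must end at zero unless an update was lost.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

enum { THREADS = 30, ITERATIONS = 3000 };

/* Shared counter updated concurrently by all worker threads. */
static volatile int32_t counter;

/* Each worker adds 1 and subtracts 1 atomically, ITERATIONS times. */
static void *adder(void *arg)
{
  (void) arg;
  for (int i = 0; i < ITERATIONS; i++)
  {
    __sync_fetch_and_add(&counter, 1);
    __sync_fetch_and_sub(&counter, 1);
  }
  return NULL;
}

/* Runs the workers and returns the final counter value.
   A result of 0 means no atomic update was lost. */
static int32_t run_add32_check(void)
{
  pthread_t tid[THREADS];

  counter = 0;
  for (int i = 0; i < THREADS; i++)
  {
    int rc = pthread_create(&tid[i], NULL, adder, NULL);
    assert(rc == 0);
    (void) rc;
  }
  for (int i = 0; i < THREADS; i++)
  {
    int rc = pthread_join(tid[i], NULL);
    assert(rc == 0);
    (void) rc;
  }
  return counter;
}
```

Calling run_add32_check() from a small main() and comparing the result with 0 is enough to exercise the sketch. On most toolchains it has to be built with the -pthread option.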
[8 Oct 2010 15:55] Marc ALFF
Mark Leith also reported performance schema failures on Mac OS X 10.5, 64-bit:

<Leith> Cerberus:trunk mark$ gcc --version
<Leith> i686-apple-darwin9-gcc-4.0.1 (GCC) 4.0.1 (Apple Inc. build 5493)
[19 Oct 2010 9:35] Marc ALFF
See related:
Bug#57524 A few P_S tests crashed the server on OSX 10.5
[20 Oct 2010 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[5 Nov 2010 19:14] Davi Arnaut
I took a look at the generated assembly and it looks fine. Furthermore, the crashes are always during server shutdown or past the end of the test case, with stack traces that look very similar (exactly the same in some cases) to other bug reports about crashes in PS during server shutdown. Perhaps this is a duplicate of Bug#56666 and related bugs.
[16 Nov 2010 7:26] Marc ALFF
Analysis
========

Because of bug#56666, two threads can call mysql_mutex_destroy on the same mutex.

This will cause:
- a valid PFS_LOCK_ALLOCATED to PFS_LOCK_FREE transition,
in PFS_lock::allocated_to_free
- a broken PFS_LOCK_FREE to PFS_LOCK_FREE transition,
in PFS_lock::allocated_to_free

The assert will fail, and rightly so, because a free record was freed again.
So, the assert is correct, and the technical root cause of this bug is bug#56666.

However, I don't think we should close bug#56760 as a duplicate of bug#56666.
The consequences of this failure are far more severe than those of bug#56666: they affect mysql-test-run and the test suite in general, as well as debug builds shipped as part of debug packages.

The code needs a workaround now, until a fix for bug#56666 is available.

The workaround is to keep a strict assert during normal server operation (when bug#56666 does not affect the server), to enforce data integrity, and to relax the assert during shutdown, when bug#56666 is known to happen.
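
A minimal sketch of the relaxed check described above, written in plain C for illustration only. This is not the PFS_lock code: the names record_lock, shutting_down and may_free are hypothetical, the numeric value of the free state is an assumption, and the sketch is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical states, loosely modeled on PFS_LOCK_FREE / PFS_LOCK_ALLOCATED.
   The numeric values are assumptions made for this sketch. */
enum lock_state { LOCK_FREE = 0, LOCK_ALLOCATED = 2 };

struct record_lock
{
  enum lock_state state;
};

/* Hypothetical flag, set once server shutdown has started. */
static bool shutting_down = false;

/* Strict during normal operation: only an allocated record may be freed.
   Relaxed during shutdown: a second free of the same record is tolerated. */
static bool may_free(const struct record_lock *lock)
{
  return lock->state == LOCK_ALLOCATED || shutting_down;
}

/* Performs the allocated-to-free transition, guarded by the check above. */
static void allocated_to_free(struct record_lock *lock)
{
  assert(may_free(lock));
  lock->state = LOCK_FREE;
}
```

With shutting_down left at false, freeing the same record twice trips the assert on the second call, which is the strict behaviour wanted during normal operation. Once shutting_down is true, the second call passes.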
[16 Nov 2010 8:37] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/123994

3122 Marc Alff	2010-11-16
      Bug#56760 PFS_lock::allocated_to_free() assert failures on osx10.5-x86-64bit
      
      Before this fix, an assert could fail in PFS_lock::allocated_to_free(), during shutdown.
      The assert itself is valid, and detects an anomaly caused by bug 56666.
      
      While bug 56666 has no real consequences in production,
      the failure caused by this new assert in the code is negatively
      impacting the test suite with automated tests.
      
      This fix is a work around only, that relaxes the integrity checks 
      during the server shutdown.
[19 Nov 2010 19:17] Christopher Powers
Patch approved.
[21 Nov 2010 13:52] Marc ALFF
Pushed into:
- mysql-5.5-bugteam
- mysql-trunk-bugfixing
[26 Nov 2010 9:14] Marc ALFF
Even with the work around applied, the modified assert:
  DBUG_ASSERT((m_state == 2) || ready_to_exit)
still fails on the same platforms.

When it fails, it is now *during* a query and not at shutdown,
so the m_state internal PFS_lock member is definitely corrupted.

Note that, for the platforms (always "old", always x86-64) where this fails,
the unit tests for my_atomic are also failing in the test suite in PB2 ...

The conclusion at this point is that this bug really is a bug in the my_atomic code, confirmed independently with the my_atomic-t unit test.

Example of unit test failure:

Running tests: .
debug/unittest/mysys/base64-t...........ok
debug/unittest/mysys/bitmap-t...........ok
debug/unittest/mysys/lf-t...............ok
debug/unittest/mysys/my_atomic-t........ok
debug/unittest/mysys/my_malloc-t........ok
debug/unittest/mysys/my_rdtsc-t.........ok
debug/unittest/mysys/my_vsnprintf-t.....ok
release/unittest/mysys/base64-t.........ok
release/unittest/mysys/bitmap-t.........ok
release/unittest/mysys/lf-t.............ok
release/unittest/mysys/my_atomic-t......dubious
	Test returned status 1 (wstat 256, 0x100)
DIED. FAILED test 4
	Failed 1/6 tests, 83.33% okay
release/unittest/mysys/my_malloc-t......ok
release/unittest/mysys/my_rdtsc-t.......ok
release/unittest/mysys/my_vsnprintf-t...ok
Failed Test                       Stat Wstat Total Fail  Failed  List of Failed
-------------------------------------------------------------------------------
release/unittest/mysys/my_atomic-    1   256     6    1  16.67%  4
Failed 1/14 test scripts, 92.86% okay. 1/6212 subtests failed, 99.98% okay.

Re-opening this bug, and changing category.
[30 Nov 2010 14:18] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/125509

3142 Davi Arnaut	2010-11-30
      Bug#56760: my_atomics failures on osx10.5-x86-64bit
      
      The problem was due to a misuse of GCC asm constraints used to
      implement an atomic load. On x86_64, the load was implemented
      as a cmpxchg which implicitly uses the eax register as a
      source and destination operand, yet the dummy value used for
      comparison wasn't being properly loaded into eax (and other
      problems).
      
      The core problem is that cmpxchg is unnecessary as a load
      on x86_64 as there are other simpler instructions such
      as xadd. Even so, such instructions are only used to
      provide a memory barrier, as loads and stores are atomic by
      definition. Hence, the solution is to explicitly issue the
      required CPU and compiler barriers.
     @ include/atomic/x86-gcc.h
        Issue a synchronizing instruction before loading the value.
        Afterwards, issue a compiler barrier to prevent reordering.
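
To illustrate what "a load implemented as a cmpxchg" means, here is a minimal sketch that uses GCC's __sync_val_compare_and_swap builtin instead of hand-written inline assembly. It is not the code from include/atomic/x86-gcc.h, and the function names are hypothetical. On x86 this builtin is typically compiled to a lock cmpxchg, with the compiler taking care of placing the comparison value in eax, which is the step the hand-written constraints got wrong according to the message above.

```c
#include <assert.h>
#include <stdint.h>

/* Reads *p by running a compare-and-swap with the same dummy value as
   both the expected and the new value. If *p equals the dummy, the dummy
   is written back (no visible change); otherwise nothing is written.
   Either way the builtin returns the value *p held before the operation. */
static int32_t load32_via_cas(volatile int32_t *p)
{
  const int32_t dummy = 0;
  return __sync_val_compare_and_swap(p, dummy, dummy);
}

/* Small usage example: the load returns the value and leaves it intact. */
static void load32_via_cas_demo(void)
{
  volatile int32_t x = 7;
  assert(load32_via_cas(&x) == 7);
  assert(x == 7);
}
```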
[30 Nov 2010 14:21] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/125511

3142 Davi Arnaut	2010-11-30
      Bug#56760: my_atomics failures on osx10.5-x86-64bit
      
      The problem was due to a misuse of GCC asm constraints used to
      implement an atomic load. On x86_64, the load was implemented
      as a cmpxchg which implicitly uses the eax register as a
      source and destination operand, yet the dummy value used for
      comparison wasn't being properly loaded into eax (and other
      problems).
      
      The core problem is that cmpxchg is unnecessary as a load
      on x86_64 as there are other simpler instructions such
      as xadd. Even so, such instructions are only used to
      provide a memory barrier, as loads and stores are atomic by
      definition. Hence, the solution is to explicitly issue the
      required CPU and compiler barriers.
     @ include/atomic/x86-gcc.h
        Issue a synchronizing instruction before loading the value.
        Afterwards, issue a compiler barrier to prevent reordering.
[30 Nov 2010 14:30] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/125512

3142 Davi Arnaut	2010-11-30
      Bug#56760: my_atomics failures on osx10.5-x86-64bit
      
      The problem was due to a misuse of GCC asm constraints used to
      implement an atomic load. On x86_64, the load was implemented
      as a cmpxchg which implicitly uses the eax register as a
      source and destination operand, yet the dummy value used for
      comparison wasn't being properly loaded into eax (and other
      problems).
      
      The core problem is that cmpxchg is unnecessary as a load
      on x86_64 as there are other simpler instructions such
      as xadd. Even so, such instructions are only used to
      provide a memory barrier, as loads and stores are atomic by
      definition. Hence, the solution is to explicitly issue the
      required CPU and compiler barriers.
     @ include/atomic/x86-gcc.h
        Issue a synchronizing instruction before loading the value.
        Afterwards, issue a compiler barrier to prevent reordering.
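
The approach described in the commit message (a synchronizing instruction, then a plain load, then a compiler barrier) can be sketched as follows. This is an illustration under stated assumptions, not the actual patch: the function name is hypothetical, __sync_synchronize() stands in for whatever synchronizing instruction the patch emits, and that builtin is not available in compilers as old as GCC 4.0.1. The alignment check is also an addition of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Loads a 32-bit value with a full barrier before the load and a
   compiler-only barrier after it. */
static int32_t load32_with_barriers(const volatile int32_t *p)
{
  int32_t value;

  /* The plain load below is only expected to be atomic for a properly
     aligned variable, so reject misaligned pointers in debug builds. */
  assert(((uintptr_t) p % sizeof(*p)) == 0);

  __sync_synchronize();                   /* full CPU and compiler barrier */
  value = *p;                             /* plain load of the variable */
  __asm__ __volatile__("" ::: "memory");  /* compiler barrier only, emits no instruction */
  return value;
}
```

The empty asm statement with a "memory" clobber generates no machine code. It only tells GCC not to move memory accesses across that point.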
[30 Nov 2010 22:23] Daniel Fischer
I've ticked my box but there is a caveat; unaligned loads across two cache lines will no longer be atomic. We don't need this but it has to be said.
[30 Nov 2010 22:26] Davi Arnaut
Indeed. All use of the atomic functions should be upon properly aligned variables.
[30 Nov 2010 23:23] Davi Arnaut
Queued to mysql-5.5-bugteam and up.
[5 Dec 2010 12:39] Bugs System
Pushed into mysql-trunk 5.6.1 (revid:alexander.nozdrin@oracle.com-20101205122447-6x94l4fmslpbttxj) (version source revid:alexander.nozdrin@oracle.com-20101205122447-6x94l4fmslpbttxj) (merge vers: 5.6.1) (pib:23)
[15 Dec 2010 0:44] Paul DuBois
Bug does not appear in any released 5.6.x version.

Setting report to Need Merge pending push to 5.5.x.
[16 Dec 2010 22:28] Bugs System
Pushed into mysql-5.5 5.5.9 (revid:jonathan.perkin@oracle.com-20101216101358-fyzr1epq95a3yett) (version source revid:jonathan.perkin@oracle.com-20101216101358-fyzr1epq95a3yett) (merge vers: 5.5.9) (pib:24)
[17 Dec 2010 12:51] Bugs System
Pushed into mysql-5.5 5.5.9 (revid:georgi.kodinov@oracle.com-20101217124733-p1ivu6higouawv8l) (version source revid:davi.arnaut@oracle.com-20101130231949-qlfy9rzpx0idrcrt) (merge vers: 5.5.8) (pib:24)
[10 Jan 2011 3:52] Paul DuBois
Noted in 5.5.9 changelog. Same changelog entry as for Bug#56521.