Bug #47213 InnoDB mutex/rw_lock should be conscious about memory ordering other than Intel
Submitted: 9 Sep 2009 9:46 Modified: 18 Jul 2014 9:20
Reporter: Yasufumi Kinoshita Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S3 (Non-critical)
Version:5.0, 5.1.36 + plugin 1.0.3 OS:Linux (PowerPC SMP)
Assigned to: Yasufumi Kinoshita CPU Architecture:ARM
Tags: Contribution

[9 Sep 2009 9:46] Yasufumi Kinoshita
Description:
The Intel processors always maintain cache consistency automatically.
All InnoDB code (both of builtin and Plugin, using atomic or not)
depends on the automatic consistency of Intel SMP.

But some other processors don't maintain cache consistency automatically.

For example, PowerPC need "Memory Synchronization Instructions" to respect
the other processors' memory operations.

So, reading unprotected (without mutex/rw_lock) shared memory needs
explicit syncronization before the reading.

At least,

mutex->lock_word
mutex->waiters
lock->lock_word
lock->waiters

must synchronize before reading.
Otherwise, the wrong reading causes hangup of the mutex or rw_lock.
And the other variable or operation (writing?) may need synchronization also.

We can synchronize by using
__sync_synchronize()
builtin function of GCC.

The following patch for InnoDB-Plugin 1.0.3

innodb_1.0.3_sync_fix_for_ppc_smp.patch

seems more stable enough than unpatched
at our customer's ppc64-smp server

If it will need another patch to stabilize, I will upload the patch here.

How to repeat:
Heavy benchmarking (using concurrent many sessions) on PowerPC SMP server.
And you will meet hangup sometimes.

Suggested fix:
the following patch
[9 Sep 2009 9:47] Yasufumi Kinoshita
Current patch example for InnoDB Plugin 1.0.3

Attachment: innodb_1.0.3_sync_fix_for_ppc_smp.patch (text/x-patch), 8.33 KiB.

[9 Sep 2009 15:32] Vasil Dimov
Let's check whether __sync_synchronize() is available separately from other GCC builtin functions and define a macro like HAVE__SYNC_SYNCRONIZE if it is available instead of using HAVE_GCC_ATOMIC_BUILTINS.

Thank you!
[9 Sep 2009 17:52] Sergei Golubchik
Where did you read that PowerPC don't maintain cache coherence automatically ?
[10 Sep 2009 0:12] Yasufumi Kinoshita
https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/F7E732FF811F783187256FDD004D3797
"The Programming Environments Manual for 64-bit Microprocessors"

5.1.1.2 Synchronize Instruction
[10 Sep 2009 6:39] Sergei Golubchik
I think you read it incorrectly. Check this one "970 User Manuals":

https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/DC3D43B729FDAD2C00257419006FB955

for example, it says explicitly "The 970FX automatically maintains the coherency of all data cached <...>" all other PowerPC User Manuals say basically the same. Some go into details explaining bus snooping and the particular protocol.

Synchronization instructions are different - that's for memory coherence between CPUs. And "sync" is just a glorified memory barrier (see p.532 in the "Programming Environment") which is important for mutexes, for example, (you want to be sure that all accesses to the mutex-protected memory are executed by the CPU *after* you locked the mutex and *before* you unlocked it). But it has nothing to do with caches.

On on Intel the situation is the same - caches are coherent. Accesses to the shared memory may need barriers.

What's different - on Intel in certain cases one needs to use LOCK prefix for an instruction to be really atomic, and LOCK prefix is an implicit full memory barrier. On PPC (probably on any non-Intel), I presume, one needs to add a memory barrier explicitly.
[10 Sep 2009 10:33] Yasufumi Kinoshita
OK.

I should use another expression.

"InnoDB depends on the memory ordering of Intel CPU."

gcc's __sync_synchronize() does nothing at x86, x86_64, but "sync" at ppc64.
I also think from my experience that Intel SMP doesn't need barrier for reading
the memory which is always changed by atomic operation from another processor.
(atomic operation affect to memory ordering of the another processor? by locking memory or..?)
It is I called "maintain consistency automatically".

But PowerPC SMP seems not. The atomic operation seems to "loop until success".
The operation seems more passive operation than Intel's. So, it may not affect
to the other processor's already-fetched values.
[10 Sep 2009 11:46] Sergei Golubchik
the fact that gcc __sync_synchronize() does nothing at x86, x86_64 is a bug.
See http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36793

You don't need a barrier if all your atomic instructions use LOCK prefix
(because LOCK prefix *is* an implicit full memory barrier).

But memory reads and writes are often atomic without any LOCK prefix, and then you need to write memory barriers explicitly.
[10 Sep 2009 19:00] Yasufumi Kinoshita
Sergei,

I know what the LOCK prefix is...

I am writing about the interaction between processors.
I don't writing about memory barrier of atomic operation itself.
(memory barrier is not affect to the other CPU affect only )

If intel-CPU,
when CPU-1 does atomic operations using LOCK prefix,
the target memory (and cached value in all CPU) is locked or arbitrated actively.
(So, it should affect (or be affected) to CPU-2's prefetch/cache of the memory ordering (CPU-2 respects CPU-1 automatically?))
At the time, if CPU-2 have to read the memory with consistency,
the barrier before the reading is not so needed by experience.
(Recently, InnoDB on x86 x86_64 SMP seems not to hangup...)

The problem here is whether the explicit barrier before the "CPU-2" reading is needed or not. (In other words, the atomic operation of CPU1 causes implicit local-memory barrier effect to CPU2 reading? or not?)

In ppc64 case, the atomic operation seems to be barriered trial loop.
So, I think it doesn't lock or arbitrate actively, and I suspect
doesn't affect to the another CPU's prefetched value (cache may be affected by the some cache coherent mechanism). So CPU2 needed barrier explicitly to read consistently.

Though it is my analogy from the each manual's description. But it fits to our experience and can explain the difference of hangup possibility between the architectures.

How do you think about the interaction? Or did you find explicit description about it in the manuals?
[11 Sep 2009 8:11] Sergei Golubchik
Memory barrier affects only one CPU but it is important for multi-CPU interaction. For example the first CPU does:

   mutex=LOCKED;
   var1=++;
   var2--;

while another CPU does

  if (mutex != LOCKED) {
    mutex=LOCKED;
    local=var1+var2;
    ...

if there will be no memory barriers between mutex and shared_var operations on both CPUs the second CPU may end up accessing var1 and/or var2 without mutex protection and seeing inconsistent data.

LOCK indeed locks the bus (that's why it's called "LOCK" :), but I doubt that it affect local CPU caches in any way. Nor it works as a memory barrier on other CPUs.

Whether you need an explicit memory barrier on intel depends on what you use. InnoDB - as far as I can see - uses only __sync_add_and_fetch() and __sync_bool_compare_and_swap() - both need LOCK prefix to work corectly on SMP, so every atomic operation in InnoDB is accompanied with a full memory barrier.
[11 Sep 2009 11:15] Yasufumi Kinoshita
Sergei,

Have you tested in real by yourself?
I am writing based on my trials to improve InnoDB's lock implementation in these years...

<other discussion in intel.com>

http://software.intel.com/en-us/forums/threading-on-intel-parallel-architectures/topic/650...

http://software.intel.com/en-us/articles/single-producer-single-consumer-queue/

- "hardware fence is implicit on x86"

- __memory_barrier() is equal to "__asm__ __volatile__ ("" : : : "memory")"
  (not MFENCE, LFENCE or SFENCE, it affects only to compiler optimizing)

<in the developer's manual>

Intel 64 and IA-32 Architectures Software Developer's Manual
Volume 3A: System Programming Guide
http://www.intel.com/Assets/PDF/manual/253668.pdf

8.2.2 Memory Ordering in P6 and More Recent Processor Families

In a single-processor system
...
- Reads may be reordered with older writes to different locations but not with older
  writes to the same location.
...

In a multiple-processor system
...
- Individual processors use the same ordering principles as in a single-processor
  system.

- Writes by a single processor are observed in the same order by all processors.
...

So, writing from CPU1 is not reordered with reading same place of CPU2 in the CPU2 memory ordering.

In this case, memory barrier for reading (lfence or mfence) is not needed at x86, x86_64.
[14 Sep 2009 15:34] Sergei Golubchik
No, but I wrote above that InnoDB on x86 did not need explicit __sync_synchronize() or any other memory barriers. I hope that matches results of your experiments with InnoDB :)
[19 Oct 2009 7:13] Christopher Yeoh
Hi Yasufumi, Just wondering if you have an updated patch to what is attached to this bug report (since you mentioned it wasn't complete).
[20 Oct 2009 8:50] Yasufumi Kinoshita
The new more stable patch for InnoDB Plugin 1.0.3

Attachment: innodb_1.0.3_sync_fix_for_ppc_smp_2.patch (text/x-diff), 9.13 KiB.

[20 Oct 2009 8:51] Yasufumi Kinoshita
The new more stable patch for InnoDB Plugin 1.0.4

Attachment: innodb_1.0.4_sync_fix_for_ppc_smp_2.patch (text/x-diff), 8.65 KiB.

[20 Oct 2009 8:54] Yasufumi Kinoshita
Christopher,

The new patch was added.
In the end, we have to use asm() for "isync" instruction.
[31 Dec 2009 6:46] Sveta Smirnova
Thank you for the report.

Which benchmark did you use? I used sysbench oltp read-write test and got similar results with and without patch.
[31 Dec 2009 7:27] Yasufumi Kinoshita
Unfortunately, I discontinued to make effort about it, because the our customer gave up to use the PowerPC SMP server...

The engineer of IBM said that the my patch is only for the specific version of the PowerPC and it is wrong for the other PowerPC version...

We may have to various optimized codes for the each versions of PowerPC or rewrite whole of InnoDB in slow sync code for all shared variables on memory...
[31 Dec 2009 11:23] Sveta Smirnova
Thank you for the feedback.

I'll set status of the report to "Can't repeat" because above discussion and because I was not able to see difference between patched and not patched versions. Feel free to reopen the report when you have possibility to provide more information.
[4 May 2014 5:10] MySQL Verification Team
Two more references for us:

Stewart's report:  http://lists.mysql.com/internals/38786

The one I filed internally with help from Yasufumi:
Bug 17573535 - INVESTIGATE INNODB OWN ATOMICS/CACHE/MEMORY COHERENCY ISSUE ON MIPS64
[12 May 2014 4:46] Yasufumi Kinoshita
emperimental patch for 5.6

Attachment: memory_barrier_experimental_5.6_4.diff (text/x-patch), 9.24 KiB.

[12 May 2014 4:48] Yasufumi Kinoshita
This is memory model conscious experimental patch for mysql-5.6.
It should fix mutex/rw_lock problems about memory ordering and cache coherency.

But I don't have account to use the problematic server and I cannot confirm by myself.

If you feel something problematic on your server (!x86 && !x86_64),
please test the patch and confirm the patch fixes the problem.

After confirmed that the some cases are fixed actually,
this bug will be fixed in public releases.

Thanks.
[20 May 2014 5:05] Stewart Smith
Hi Yasufumi,

I'll have a look at and test your patch. Cursory glance looks good, I'll build+test and run through some workloads on a POWER8 machine.
[27 May 2014 1:47] Stewart Smith
Hi Yasufumi,

I've hammered on this patch fairly heavily now and I haven't found a problem with it.

I have other patches that fix other problems, but this seems to be all correct from my testing (and decently faster than my initial patch of using just heavyweight memory barriers).

I'll point to other bugs with the rest of my patches shortly.
[27 May 2014 2:08] Stewart Smith
Fix test and set/release of InnoDB mutex lock_word

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: mysql-power-sync-mutex-lockword.patch (text/x-patch), 1.16 KiB.

[27 May 2014 2:13] Stewart Smith
my patch here fixes an issue in the initial patch around getting and releasing the lock_word part of the mutex as well as fixing a missing barrier around reading the lock_word.

If we don't have the memory barrier around reading the lock word, we'll spin on a stale cacheline rather than obtain the lock.

I believe this was the missing bit that led to this comment being included in innodb:
 	/* In theory __sync_lock_release should be used to release the lock.
 	Unfortunately, it does not work properly alone. The workaround is
 	that more conservative __sync_lock_test_and_set is used instead. */
[27 May 2014 2:42] Yasufumi Kinoshita
Thank you for testing.

I have a question.

> If we don't have the memory barrier around reading the lock word, we'll spin on a stale cacheline rather than obtain the lock.

My patch already has the memory barrier just before reading the lock word in the critical paths (mutex_spin_wait() and sync_arr_cell_can_wake_up()). Because I'd like to reduce the expensive memory barrier as possible as we can.

But, did you actually still need adding extra memory barrier to mutex_get_lock_word()? It means POWER8 needs memory barrier instruction "twice".
[30 May 2014 7:08] Stewart Smith
Let me have a closer look at it (on Monday) and see if I can work out exactly what's going on.

I initially wrote my patch when I used my memory barrier patch rather than yours. My patch was much cruder (quite literally written in a few minutes) and likely missed something.

I think that the test-and_set vs sync_lock-release is something that may be needed... although I forget the conversation I had about it and will have to go look at the generated instructions :)

I made a (very quick) effort on porting your patch to 5.7 too, I'll attach it here. It has proven correct enough for benchmarks, but I am not yet happy with the stability of the set of patches I have for 5.7 and have not yet ruled out that I've missed something in this initial port of your memory barrier patch.
[30 May 2014 7:11] Stewart Smith
(very quick) port of Yasufumi patch to 5.7.4 - good enough for benchmarking!

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: memory_barrier-experimental_5.7.4.diff (text/x-patch), 8.53 KiB.

[20 Jun 2014 14:21] Daniel Price
Fixed as of the upcoming 5.6.20, 5.7.5 release, and here's the changelog entry:

Due to differences in memory ordering on non-Intel CPUs, some mutex and
read-write lock flags were not read. 

Thank you for the bug report.
[10 Jul 2014 17:29] Daniel Price
Fixed as of the upcoming 5.6.20, 5.7.5 release, and here's the changelog
entry:

Due to differences in memory ordering on different processor types, some
mutex and read-write lock flags were not read consistently.
[7 Aug 2014 6:43] Laurynas Biveinis
$ bzr log -n0 -r 6004
------------------------------------------------------------
revno: 6004
committer: Yasufumi Kinoshita <yasufumi.kinoshita@oracle.com>
branch nick: mysql-5.6
timestamp: Thu 2014-06-19 07:33:57 +0200
message:
  Bug#11755438: Bug#47213: INNODB MUTEX/RW_LOCK SHOULD BE CONSCIOUS ABOUT MEMORY ORDERING OTHER THAN INTEL
  
  Because of difference about memory ordering, some critical flags of mutex/rw_lock might be missed to read on non-Intel CPUs.
  Even for Intel-CPUs, the explicit memory barrier instruction might cause positive effects for performance.
  
  Approved by Kevin Lewis in rb#5561
[7 Aug 2014 6:49] Laurynas Biveinis
$ bzr log -n0 -r 6016
------------------------------------------------------------
revno: 6016
committer: Yasufumi Kinoshita <yasufumi.kinoshita@oracle.com>
branch nick: mysql-5.6
timestamp: Tue 2014-06-24 14:22:27 +0200
message:
  Bug #19048563 : PERFORMANCE REGRESSION IN UPDATE OPS CAUSED BY TRUNK REVNO 8242 (BUG#11755438)
  
  Some of the added memory barrier (internal of spin loops of mutex/rw_lock) was too expensive. Removed.
  This is partial reverting of the fix for Bug#11755438.
  
  ------------------------------------------------------------
  revno: 6004
  committer: Yasufumi Kinoshita <yasufumi.kinoshita@oracle.com>
  branch nick: mysql-5.6
  timestamp: Thu 2014-06-19 07:33:57 +0200
  message:
    Bug#11755438: Bug#47213: INNODB MUTEX/RW_LOCK SHOULD BE CONSCIOUS ABOUT MEMORY ORDERING OTHER THAN INTEL
  
    Because of difference about memory ordering, some critical flags of mutex/rw_lock might be missed to read on non-Intel CPUs.
    Even for Intel-CPUs, the explicit memory barrier instruction might cause positive effects for performance.
  
    Approved by Kevin Lewis in rb#5561
  ------------------------------------------------------------
[7 Aug 2014 6:49] Laurynas Biveinis
$ bzr log -n0 -r 6017
------------------------------------------------------------
revno: 6017
committer: Yasufumi Kinoshita <yasufumi.kinoshita@oracle.com>
branch nick: mysql-5.6
timestamp: Tue 2014-06-24 23:57:39 +0200
message:
  fix build fail for win32.
  follow up for Bug#11755438: Bug#47213: INNODB MUTEX/RW_LOCK SHOULD BE CONSCIOUS ABOUT MEMOR
[7 Aug 2014 6:50] Laurynas Biveinis
$ bzr log -n0 -r 6018
------------------------------------------------------------
revno: 6018
committer: Yasufumi Kinoshita <yasufumi.kinoshita@oracle.com>
branch nick: mysql-5.6
timestamp: Wed 2014-06-25 05:36:54 +0200
message:
  Disable memory barrier only for Intel CPU, because performance regression was observed at some conditions for Intel CPU.
  follow up for Bug#11755438: Bug#47213: INNODB MUTEX/RW_LOCK SHOULD BE CONSCIOUS ABOUT MEMORY ORDERING OTHER THAN INTEL