Bug #72805 mutex_delay() creating excess memory traffic, GCC mem barrier needed
Submitted: 30 May 2014 6:17 Modified: 29 Apr 2015 0:14
Reporter: Stewart Smith Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Compiling Severity:S3 (Non-critical)
Version:5.7.7 OS:Any
Assigned to: CPU Architecture:Any
Tags: MySQL, performance, PowerPC

[30 May 2014 6:17] Stewart Smith
Description:
the mutex_delay() function in mysys/thr_mutex.c uses a volatile variable to force GCC not to optimize away the loop, which is used when spinning on a mutex.

However, this use of volatile needlessly creates memory traffic which I've seen show up on a profile.

The better thing to do is use a GCC memory barrier:
asm volatile("":::"memory");

patch attached.

This likely also affects Intel but I've done my analysis on POWER8.

How to repeat:
Code analysis and profiling.

Suggested fix:
see attached patch:

--- mysql-5.7.4-m14.orig/mysys/thr_mutex.c
+++ mysql-5.7.4-m14/mysys/thr_mutex.c
@@ -333,12 +333,15 @@ void safe_mutex_end(FILE *file __attribu
 ulong mutex_delay(ulong delayloops)
 {
   ulong        i;
-  volatile ulong j;
+  ulong j;
 
   j = 0;
 
   for (i = 0; i < delayloops * 50; i++)
+  {
     j += i;
+    asm volatile("":::"memory");
+  }
 
   return(j); 
 }
[30 May 2014 6:20] Stewart Smith
use GCC memory barrier in mutex_delay instead of generating memory traffic.

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: mutex_delay_gcc_barrier.patch (text/x-patch), 471 bytes.

[30 May 2014 8:04] MySQL Verification Team
Hello Stewart,

Thank you for the report and contribution.

Thanks,
Umesh
[13 Nov 2014 6:17] Stewart Smith
This is still present in 5.7.5 - and is also an issue on x86.

When the server is built with fast mutexes (cmake -DWITH_FAST_MUTEXES=1), we get mutex_delay in mysys/thr_mutex.cc.

As it stands, it looks like this:
0000000000000000 <.mutex_delay>:
   0:   fb e1 ff f8     std     r31,-8(r1)
   4:   1c 63 00 32     mulli   r3,r3,50
   8:   f8 21 ff b1     stdu    r1,-80(r1)
   c:   39 20 00 00     li      r9,0
  10:   7c 69 03 a6     mtctr   r3
  14:   2f a3 00 00     cmpdi   cr7,r3,0
  18:   7c 3f 0b 78     mr      r31,r1
  1c:   f9 3f 00 30     std     r9,48(r31)
  20:   41 9e 00 34     beq     cr7,54 <.mutex_delay+0x54>
  24:   48 00 00 1c     b       40 <.mutex_delay+0x40>
  28:   60 00 00 00     nop
  2c:   60 00 00 00     nop
  30:   60 00 00 00     nop
  34:   60 00 00 00     nop
  38:   60 00 00 00     nop
  3c:   60 00 00 00     nop
  40:   e9 5f 00 30     ld      r10,48(r31)
  44:   7d 49 52 14     add     r10,r9,r10
  48:   39 29 00 01     addi    r9,r9,1
  4c:   f9 5f 00 30     std     r10,48(r31)
  50:   42 00 ff f0     bdnz    40 <.mutex_delay+0x40>
  54:   38 3f 00 50     addi    r1,r31,80
  58:   e8 7f 00 30     ld      r3,48(r31)
  5c:   eb e1 ff f8     ld      r31,-8(r1)
  60:   4e 80 00 20     blr

Which (just trust me on this): isn't what you want. As mentioned in this bug report (and in http://bugs.mysql.com/bug.php?id=72806 ) it misses relaxing the CPU and is just using a work around rather than an actual solution to avoiding the compiler from optimizing away the loop.

My updated patch (attached) works well for POWER. Adding the pause instruction for x86 is left as an exercise for the reader.

with my patch applied:

0000000000000000 <.mutex_delay>:
   0:   fb e1 ff f8     std     r31,-8(r1)
   4:   f8 21 ff c1     stdu    r1,-64(r1)
   8:   7c 3f 0b 78     mr      r31,r1
   c:   7c 21 0b 78     mr      r1,r1
  10:   1c 63 00 32     mulli   r3,r3,50
  14:   7c 69 03 a6     mtctr   r3
  18:   2f a3 00 00     cmpdi   cr7,r3,0
  1c:   41 9e 00 08     beq     cr7,24 <.mutex_delay+0x24>
  20:   42 00 00 00     bdnz    20 <.mutex_delay+0x20>
  24:   7c 42 13 78     mr      r2,r2
  28:   38 3f 00 40     addi    r1,r31,64
  2c:   eb e1 ff f8     ld      r31,-8(r1)
  30:   4e 80 00 20     blr

Which is pretty much *exactly* what we want - it sets low priority on the thread, spins for a while, restores priority and exits.
[13 Nov 2014 6:18] Stewart Smith
Updated patch: sets thread priority on POWER, uses proper compiler barrier

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: mutex_delay_gcc_barrier.patch (text/x-patch), 969 bytes.

[10 Apr 2015 7:54] Stewart Smith
I still see this in 5.7.7.
[29 Apr 2015 0:14] Paul DuBois
Noted in 5.7.8 changelog.

The so-called "fast mutex" code has been removed from the server
sources. It provides no measurable benefit, complicates the code, and
is problematic for certain architectures such as POWER8. The
(undocumented) WITH_FAST_MUTEXES CMake option has also been removed.