Bug #74832 ut_delay missing compiler barrier, UT_RELAX_CPU isn't at all relaxing
Submitted: 13 Nov 2014 5:54    Modified: 5 Jan 2016 20:45
Reporter: Stewart Smith
Status: Closed
Category: MySQL Server: InnoDB storage engine    Severity: S3 (Non-critical)
Version: 5.7.7    OS: Any
CPU Architecture: Any
Tags: barrier, compiler, innodb, PowerPC

[13 Nov 2014 5:54] Stewart Smith
Description:
The code here has seemingly changed in 5.7.5 compared to previous versions. I'm talking about 5.7.5-m15 here.

The purpose of ut_delay() is to spin for a while before InnoDB attempts to again acquire a mutex. Optimizations include (on x86) calling the pause instruction inside the spin loop and (on POWER) setting the thread priority to low for the duration of ut_delay.

Here is the current (MySQL 5.7.5) implementation of ut_delay:
ulint
ut_delay(
/*=====*/
        ulint   delay)  /*!< in: delay in microseconds on 100 MHz Pentium */
{
        ulint   i, j;

        UT_LOW_PRIORITY_CPU();

        j = 0;

        for (i = 0; i < delay * 50; i++) {
                j += i;
                UT_RELAX_CPU();
        }

        if (ut_always_false) {
                ut_always_false = (ibool) j;
        }

        UT_RESUME_PRIORITY_CPU();

        return(j);
}

There are a couple of problems with this code:
1) ut_always_false could quite legitimately be compiled away by the compiler
2) j is actually unneeded, and if UT_RELAX_CPU() were not implemented, the compiler could legitimately optimize the entire loop away

But there's another problem that's a bit hidden.... 

In ut0ut.h we have the following:
#ifndef UNIV_HOTBACKUP
# if defined(HAVE_PAUSE_INSTRUCTION)
   /* According to the gcc info page, asm volatile means that the                
   instruction has important side-effects and must not be removed.               
   Also asm volatile may trigger a memory barrier (spilling all registers        
   to memory). */
#  ifdef __SUNPRO_CC
#   define UT_RELAX_CPU() asm ("pause" )
#  else
#   define UT_RELAX_CPU() __asm__ __volatile__ ("pause")
#  endif /* __SUNPRO_CC */

# elif defined(HAVE_FAKE_PAUSE_INSTRUCTION)
#  define UT_RELAX_CPU() __asm__ __volatile__ ("rep; nop")
# elif defined(HAVE_ATOMIC_BUILTINS)
#  define UT_RELAX_CPU() do { \
     volatile lint      volatile_var; \
     os_compare_and_swap_lint(&volatile_var, 0, 1); \
   } while (0)
# elif defined(HAVE_WINDOWS_ATOMICS)
   /* In the Win32 API, the x86 PAUSE instruction is executed by calling         
   the YieldProcessor macro defined in WinNT.h. It is a CPU architecture-        
   independent way by using YieldProcessor. */
#  define UT_RELAX_CPU() YieldProcessor()
# else
#  define UT_RELAX_CPU() ((void)0) /* avoid warning for an empty statement */
# endif

So if HAVE_PAUSE_INSTRUCTION or HAVE_FAKE_PAUSE_INSTRUCTION is defined (i.e. on recent x86), you get the desired effect: there will be a pause instruction.

However, if you HAVE_ATOMIC_BUILTINS, then you get this:
do { \
     volatile lint      volatile_var; \
     os_compare_and_swap_lint(&volatile_var, 0, 1); \
   } while (0)

Which is anything *but* relaxing. So, on POWER, where we have atomics but no pause instruction, we get that instead of an empty statement.

This likely affects other platforms too (e.g. SPARC, ARM, ia64, MIPS, m68k... basically everything that isn't x86).

What we really want here is just a compiler barrier, so that the compiler knows it cannot optimize away the loop.
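A minimal sketch of what such a barrier looks like (the macro and function names here are illustrative, not InnoDB source; the empty asm with a "memory" clobber is the standard GCC idiom):

```c
#include <assert.h>

/* A pure compiler barrier: emits no instructions at all, but the
   "memory" clobber forbids the compiler from caching values across it
   or deleting the loop that contains it. */
#define COMPILER_BARRIER() __asm__ __volatile__ ("" ::: "memory")

/* Hypothetical stand-in for the ut_delay() spin loop. */
static unsigned long spin(unsigned long delay)
{
        unsigned long i;

        for (i = 0; i < delay * 50; i++) {
                COMPILER_BARRIER(); /* keeps the loop from being elided */
        }

        return i;
}
```

Without the barrier, a compiler at -O2 is entitled to collapse the whole loop to `return delay * 50;` (or nothing at all if the result is unused); with it, the loop body is actually executed.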

Back to ut_delay, if we look at the original PowerPC assembler for this, it's rather larger than it needs to be:

0000000000000380 <._Z8ut_delaym>:
     380:       fb e1 ff f8     std     r31,-8(r1)
     384:       f8 21 ff b1     stdu    r1,-80(r1)
     388:       7c 3f 0b 78     mr      r31,r1
     38c:       7c 21 0b 78     mr      r1,r1
     390:       1d 03 00 32     mulli   r8,r3,50
     394:       38 60 00 00     li      r3,0
     398:       2f a8 00 00     cmpdi   cr7,r8,0
     39c:       41 9e 00 44     beq     cr7,3e0 <._Z8ut_delaym+0x60>
     3a0:       39 20 00 00     li      r9,0
     3a4:       38 e0 00 01     li      r7,1
     3a8:       60 00 00 00     nop
     3ac:       60 00 00 00     nop
     3b0:       7c 00 04 ac     sync    
     3b4:       7c 63 4a 14     add     r3,r3,r9
     3b8:       38 df 00 30     addi    r6,r31,48
     3bc:       7d 40 30 a8     ldarx   r10,0,r6
     3c0:       2c 2a 00 00     cmpdi   r10,0
     3c4:       40 82 00 0c     bne     3d0 <._Z8ut_delaym+0x50>
     3c8:       7c e0 31 ad     stdcx.  r7,0,r6
     3cc:       40 a2 ff ec     bne     3b8 <._Z8ut_delaym+0x38>
     3d0:       4c 00 01 2c     isync
     3d4:       39 29 00 01     addi    r9,r9,1
     3d8:       7f a9 40 40     cmpld   cr7,r9,r8
     3dc:       40 9e ff d4     bne     cr7,3b0 <._Z8ut_delaym+0x30>
     3e0:       3c c2 00 00     addis   r6,r2,0
     3e4:       e9 26 00 00     ld      r9,0(r6)
     3e8:       2f a9 00 00     cmpdi   cr7,r9,0
     3ec:       41 9e 00 08     beq     cr7,3f4 <._Z8ut_delaym+0x74>
     3f0:       f8 66 00 00     std     r3,0(r6)
     3f4:       7c 42 13 78     mr      r2,r2
     3f8:       38 3f 00 50     addi    r1,r31,80
     3fc:       eb e1 ff f8     ld      r31,-8(r1)
     400:       4e 80 00 20     blr

The bits that stare at me are the sync and isync instructions. We're executing memory barriers in there! In a loop! When we're meant to be relaxing!

So, once I remove the buggy UT_RELAX_CPU() implementation and simplify ut_delay (patch attached), I end up with:

0000000000000380 <._Z8ut_delaym>:
     380:       fb e1 ff f8     std     r31,-8(r1)
     384:       f8 21 ff c1     stdu    r1,-64(r1)
     388:       7c 3f 0b 78     mr      r31,r1
     38c:       7c 21 0b 78     mr      r1,r1
     390:       1c 63 00 32     mulli   r3,r3,50
     394:       7c 69 03 a6     mtctr   r3
     398:       2f a3 00 00     cmpdi   cr7,r3,0
     39c:       41 9e 00 08     beq     cr7,3a4 <._Z8ut_delaym+0x24>
     3a0:       42 00 00 00     bdnz    3a0 <._Z8ut_delaym+0x20>
     3a4:       7c 42 13 78     mr      r2,r2
     3a8:       38 3f 00 40     addi    r1,r31,64
     3ac:       eb e1 ff f8     ld      r31,-8(r1)
     3b0:       4e 80 00 20     blr
     3b4:       00 00 00 00     .long 0x0
     3b8:       00 09 00 00     .long 0x90000
     3bc:       80 01 00 00     lwz     r0,0(r1)

Which is *exactly* what we should be doing - we go into low priority (mr r1,r1), spin for a while, then resume normal priority (mr r2, r2) and return. We also avoid doing unnecessary work (which is good).

This may also have a positive performance impact on x86, since the extra arithmetic around the loop no longer has to be done, and IIRC modern KVM on x86 will trap the pause instruction and attempt to schedule a vCPU that may hold the lock we're spinning on.

How to repeat:
Look at profiles, or disassemble the code and examine it (as I've done above).

Suggested fix:
Merge my patch (attached), which fixes this.
[13 Nov 2014 5:55] Stewart Smith
fix UT_RELAX_CPU and ut_delay for POWER and others

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: innodb-gcc-barrier.patch (text/x-patch), 2.38 KiB.

[17 Nov 2014 17:00] MySQL Verification Team
Hello Stewart,

Thank you for the report and contribution.

Thanks,
Umesh
[10 Apr 2015 8:03] Stewart Smith
I still see this showing up on profiles. On some workloads, it's rather significant - several percent of CPU time (and with my number of CPUs, that's several CPUs worth).

By instead using my patch, you do a bunch less needless work.
[27 Aug 2015 7:06] Stewart Smith
We just hit this *again* when profiling real world applications.

Any action on this patch?
[27 Aug 2015 21:39] Stewart Smith
Mirrored to https://mariadb.atlassian.net/browse/MDEV-8684 as we're hitting it in MariaDB too.
[27 Aug 2015 21:56] Stewart Smith
script to run concurrent queries

Attachment: psdoit (application/octet-stream, text), 614 bytes.

[27 Aug 2015 21:56] Stewart Smith
Test SQL

Attachment: sql (application/octet-stream, text), 313.90 KiB.

[27 Aug 2015 21:58] Stewart Smith
Attached are SQL and a small script to reproduce this. This data and query were being run on MariaDB 10.0.17, and while I haven't tried *this* test on MySQL, I have seen the problem on MySQL previously, so it's a problem there too.

If you look at the perf output, you can see that we spend a tonne of time "relaxing", which isn't at all relaxing to the chip.

(Pasting in from Anton, who has been finding this most recently)

# mysql -u root < sql

 # time ./psdoit 1
 0.022s

 # time ./psdoit 10
 0.112s

 # time ./psdoit 100
 7.150s

 # time ./psdoit 200
 30.422s

 # time ./psdoit 400
 190.378s

 Things get very bad, very fast. We spend almost all our time in a locking
 food fight:

    37.42%  mysqld         mysqld              [.] ut_delay(unsigned long)
            |
            ---ut_delay(unsigned long)
               |
               |--99.74%-- mutex_spin_wait(void*, bool, char const*, unsigned long)
               |          |
               |          |--74.70%-- pfs_mutex_enter_func(ib_mutex_t*, char const*, unsigned long) [clone .constprop.71]
               |          |          lock_rec_convert_impl_to_expl(buf_block_t const*, unsigned char const*, dict_index_t*, unsigned long const*)
               |          |          lock_clust_rec_read_check_and_lock(unsigned long, buf_block_t const*, unsigned char const*, dict_index_t*, unsigned long const
               |          |          sel_set_rec_lock(buf_block_t const*, unsigned char const*, dict_index_t*, unsigned long const*, unsigned long, unsigned long,
               |          |          row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
               |          |          |
               |          |          |--76.07%-- ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)
               |          |          |          handler::index_read_map(unsigned char*, unsigned char const*, unsigned long, ha_rkey_function)
               |          |          |          handler::ha_index_read_map(unsigned char*, unsigned char const*, unsigned long, ha_rkey_function)
               |          |          |          join_read_key2(THD*, st_join_table*, TABLE*, st_table_ref*)
               |          |          |          sub_select(JOIN*, st_join_table*, bool)
               |          |          |

 and:

    27.96%  mysqld         [kernel.kallsyms]   [k] _raw_spin_lock
            |
            ---_raw_spin_lock
               |
               |--58.11%-- futex_wait_setup
               |          |
               |          |--99.96%-- 0
               |          |          futex_wait
               |          |          do_futex
               |          |          sys_futex
               |          |          system_call
               |          |          |
               |          |          |--64.91%-- __lll_lock_wait
               |          |          |          |
               |          |          |          |--83.30%-- pthread_mutex_lock
               |          |          |          |          |
               |          |          |          |          |--52.72%-- os_event_reset(os_event*)
               |          |          |          |          |          sync_array_reserve_cell(sync_array_t*, void*, unsigned long, char const*, unsigned long, unsi
               |          |          |          |          |          mutex_spin_wait(void*, bool, char const*, unsigned long)
               |          |          |          |          |          |
               |          |          |          |          |          |--75.08%-- pfs_mutex_enter_func(ib_mutex_t*, char const*, unsigned long) [clone .constprop.7
               |          |          |          |          |          |          lock_rec_convert_impl_to_expl(buf_block_t const*, unsigned char const*, dict_index
               |          |          |          |          |          |          lock_clust_rec_read_check_and_lock(unsigned long, buf_block_t const*, unsigned cha
               |          |          |          |          |          |          sel_set_rec_lock(buf_block_t const*, unsigned char const*, dict_index_t*, unsigned
               |          |          |          |          |          |          row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long
[31 Aug 2015 4:28] Stewart Smith
One comment from the MariaDB devs was that the patch could reduce performance in some cases. I delved a bit deeper:

Using the microbenchmarks from https://github.com/stewart-ibm/microbenchmarks/commit/9210e849374f14e1ffd652ed869637f3d70b...

Here I use the timebase register on ppc to work out how long things take in wall time (the timebase is 512MHz, which is "good enough" for this purpose).

[stewart@p82 ~]$ gcc -O3 ut_delay_optim.c -o ut_delay_optim
[stewart@p82 ~]$ gcc -O3 ut_delay_mysql.c -o ut_delay_mysql
[stewart@p82 ~]$ for i in `seq 1 10`; do ./ut_delay_mysql ; done
tb change (avg over 1000000): 3209
[stewart@p82 ~]$ ./ut_delay_optim 
tb change (avg over 1000000): 225

so with the optimized ut_delay, we're actually delaying for less than a tenth of the time.

So, to tune that to be approximately the same amount of wall time, we try changing the loop multiplier in ut_delay from *50 to *1024 (which should just optimize down to a bit shift).
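A sketch of that tuning (hypothetical, mirroring the microbenchmark variant rather than actual MySQL source; since 1024 is a power of two, `delay * 1024` compiles to `delay << 10`):

```c
#include <assert.h>

/* Pure compiler barrier, as in the proposed UT_RELAX_CPU() fix. */
#define COMPILER_BARRIER() __asm__ __volatile__ ("" ::: "memory")

/* Variant with the loop multiplier raised from 50 to 1024, so the
   barrier-only loop burns roughly the same wall time as the old
   atomic-op loop did. */
static unsigned long ut_delay_tuned(unsigned long delay)
{
        unsigned long i;

        for (i = 0; i < delay * 1024; i++)
                COMPILER_BARRIER();

        return i;
}
```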

With this (ut_delay_optim2.c) I compare the run of the MySQL one to the optim2 one (see https://github.com/stewart-ibm/microbenchmarks/commit/fa26a1cb71b5799ae846e49b951bb54d55a6... )

[stewart@p82 ~]$ for i in `seq 1 10`; do ./ut_delay_optim2 ; done
tb change (avg over 1000000): 2771
tb change (avg over 1000000): 2811
tb change (avg over 1000000): 2640
tb change (avg over 1000000): 2571
tb change (avg over 1000000): 2789
tb change (avg over 1000000): 2776
tb change (avg over 1000000): 2859
tb change (avg over 1000000): 2739
tb change (avg over 1000000): 2665
tb change (avg over 1000000): 2799

[stewart@p82 ~]$ for i in `seq 1 10`; do ./ut_delay_mysql ; done
tb change (avg over 1000000): 2882
tb change (avg over 1000000): 2880
tb change (avg over 1000000): 2882
tb change (avg over 1000000): 2878
tb change (avg over 1000000): 2880
tb change (avg over 1000000): 2878
tb change (avg over 1000000): 3117
tb change (avg over 1000000): 3071
tb change (avg over 1000000): 2880
tb change (avg over 1000000): 3209

So it's probably worth trying that implementation to see if there's still a negative performance impact. It should burn the same amount of (wall) CPU time without placing additional stress on CPU parts.
[4 Jan 2016 6:43] Marko Mäkelä
Posted by developer:
 
The pushed patch differs from the contributed patch. It does the following:

1. The fall-back implementation of UT_RELAX_CPU was changed (affecting non-Windows and non-x86 platforms):
-#  define UT_RELAX_CPU() do { \
-     volatile lint     volatile_var; \
-     os_compare_and_swap_lint(&volatile_var, 0, 1); \
-   } while (0)
+#  define UT_RELAX_CPU() __asm__ __volatile__ ("":::"memory")

2. The run-time constant ut_always_false was removed from ut_delay().

No adjustment was made on the ut_delay() iteration loop.
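Putting the two changes together, ut_delay() after the fix reduces to roughly the following (an editorial sketch reconstructed from the description above, not the exact pushed code; the priority macros are omitted for brevity):

```c
#include <assert.h>

/* Post-fix UT_RELAX_CPU(): a pure compiler barrier, as in the diff above. */
#define UT_RELAX_CPU() __asm__ __volatile__ ("" ::: "memory")

/* ut_delay() with ut_always_false removed; the iteration loop itself
   is unchanged, and the barrier keeps it from being optimized away. */
static unsigned long ut_delay(unsigned long delay)
{
        unsigned long i, j = 0;

        for (i = 0; i < delay * 50; i++) {
                j += i;
                UT_RELAX_CPU();
        }

        return j;
}
```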
[5 Jan 2016 20:44] Daniel Price
Posted by developer:
 
commit a053720c905437d533bebd5b89c54f6670dfc21c
Author: Marko Mäkelä <marko.makela@oracle.com>
Date:   Tue Dec 29 17:19:41 2015 +0200

    Bug#20045167 UT_DELAY MISSING COMPILER BARRIER
    
    UT_RELAX_CPU(): Use a compiler barrier.
    
    ut_delay(): Remove the dummy global variable ut_always_false.
[5 Jan 2016 20:45] Daniel Price
Posted by developer:
 
Fixed as of the upcoming 5.7.11 and 5.8.0 releases, and here's the changelog entry:

A compiler barrier was added to ut_relax_cpu(). The ut_always_false dummy
global variable was removed from ut_delay(). 

Thank you for the bug report.
[26 Sep 2017 11:01] Sandeep Sethia
For the Arm platform I see a huge degradation in performance when using the current compiler barrier, i.e. #define UT_RELAX_CPU() __asm__ __volatile__ ("":::"memory"). Performance drops significantly from 32 threads onwards.

If we revert to the explicit hardware barrier, e.g. os_atomic_compare_exchange_n(a, cmp, set, 0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST), I see a good jump in performance at 32, 64, and 128 threads, and the scalability is restored.

Also, in ib0mutex.h, if we replace
- ut_delay(ut_rnd_interval(0, max_delay));
+ ut_delay(max_delay);
I see a good jump on x86 and Arm architectures at 128 and 256 threads. On Arm I see a good improvement from 32 threads onwards.

Please let me know your thoughts.
[27 Sep 2017 15:56] Sandeep Sethia
The above ARM scenario was tested on the latest MySQL version, 5.7.19, with the new sysbench Lua scripts, using the update_index and update_nonindex tests.
[1 Oct 2017 13:59] Sandeep Sethia
There seems to be an issue with the setup I tested; it is not reproducible on another Arm machine. Hence closing the thread.