Bug #87933 Scalibility issue on Arm platform with the current UT_RELAX_CPU () code.
Submitted: 2 Oct 2017 4:06 Modified: 12 Dec 2018 20:59
Reporter: Sandeep Sethia (OCA) Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S2 (Serious)
Version:5.7.19 OS:CentOS
Assigned to: CPU Architecture:ARM
Tags: aarch64, arm, barrier, innodb

[2 Oct 2017 4:06] Sandeep Sethia
Description:
Hi All,

I was evaluating UT_RELAX_CPU() code on non x86 platform(Arm and PowerPC) and it seems the current compiler barrier for Arm platform is not sufficient enough to give the require delay before InnoDB attempts to again acquire a mutex. Optimizations include (on x86) calling the pause instruction inside the spin loop and (on POWER) setting the thread priority to low for the duration of ut_delay. However for ARM platform i see  we are using current compiler barrier ie define UT_RELAX_CPU() __asm__ __volatile__ ("":::"memory") . 

Using sysbench for write operations if we go beyond 32 threads i see scalibility issue in TPS. If we revert back to the explicit hardware barrier as  os_atomic_compare_exchange_n(a, cmp, set, 0,__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ,I see a good jump in performance for threads from 32-64-128 and the scalibility is restored.

Also in ib0mutex.h if we replace
- ut_delay(ut_rnd_interval(0, max_delay));
+ ut_delay(max_delay); i see good jump in x86 and Arm architecure for threads 128 and 256 and beyond for all write operations using sysbench. On Arm i see good improvement from 32 threads onwards only.

In Mariadb , I see the compiler barrier patch was rejected and replaced with hardware barrier.

https://jira.mariadb.org/browse/MDEV-8684

In a nutshell: the patch causing this scalability regression was rejected by
MariaDB, but apparently it was accepted by MySQL.

Please let me know your thoughts. I am testing the same scenario on more Arm platform and will revert back with the findings.

How to repeat:
Run the current MYSQL version on Arm platform with latest sysbench lua scripts.
Using any write operations for tests. 

Suggested fix:
Replace the compiler barrier with  a hardware barrier. I can provide the patch files.
[2 Oct 2017 4:19] Sandeep Sethia
Added tags
[5 Oct 2017 8:26] Sandeep Sethia
Any update on the issue?
[7 Nov 2017 3:42] Daniel Black
UT_RELAX_CPU just needs to be a CPU instruction(s) that takes a bit of time to exectute, it doesn't needs to be a hardware barrier. Obviously a compile barrier takes no amount of time and ends up being and empty loop.

On Power the following is suggested as this is a moderately slow instruction that has no impacts on other CPU core elements. (note - please include if changing - OCA submitted by me)

#elif defined(__powerpc__) && defined __GLIBC__
# include <sys/platform/ppc.h>
# define UT_RELAX_CPU() __ppc_get_timebase()

For ARM I suggest looking for a similar slow instruction. The reverting to os_atomic operations have effects on other threads and cores. Alexey Kopytov was evaluating however I didn't see an end result (https://github.com/MariaDB/server/pull/168#r58172391 ).

A compile barrier is needed always needed (Windows - _ReadWriteBarrier() ) as without it the loop could be removed in ut_delay if/when the compiler realises that the UT_RELAX_CPU implementation isn't required to be executed delay*50 times and the return value can a static number. In fact the entire j += i and return j (never used) can be removed when a compile barrier is there (and a non 0 UT_RELAX_CPU implementation) as this is exceptionally quick compared to UT_RELAX_CPU (and it ties up ALU CPU aspects if shared with concurrent threads).

On replacing ut_delay(ut_rnd_interval(0, max_delay)); with just max_delay, this is probably because a larger than default value of innodb_spin_wait_delay suites this test better. Or alternately (perhaps less likely) ut_rnd_interval and the functions it calls are for some reason is exceptionally slow - check with a perf record -g/report -g --no-children. This could be the case if there is any synchronisation around ut_rnd_ulint_counter (because its not needed).

When you mention x86 are you talking 32 or 64 bit?

I think what's really needed is a measure the distribution of the number of CPU cycles the mutexes actually need to be delayed. Because "The argument gives the desired delay in microseconds on 100 MHz Pentium + Visual C++." is clearly not relevant any more.
[12 Dec 2018 20:59] Bogdan Kecman
I'm having issues reproducing problem on my hardware but reviewing the code I see that 8.0.13 is still with

  for (i = 0; i < delay * 50; i++) {
    j += i;
    UT_RELAX_CPU();
  }

  UT_RESUME_PRIORITY_CPU();

and while bug #74832 states that  5.7.11, 5.8.0 have this solved by "A compiler barrier was added to ut_relax_cpu()" I don't see it in 8.0.13.

Bogdan Kecman