Bug #100664 Contribution: Aarch64 support
Submitted: 27 Aug 2020 19:42  Modified: 30 Sep 2020 18:10
Reporter: OCA Admin (OCA)  Email Updates:
Status: Closed  Impact on me: None
Category: MySQL Server: Compiling  Severity: S3 (Non-critical)
Version:  OS: Any
Assigned to:  CPU Architecture: ARM

[27 Aug 2020 19:42] OCA Admin
Description:
This bug tracks a contribution by Tzachi Zidenberg (Github user: tsahee) as described in http://github.com/mysql/mysql-server/pull/305

How to repeat:
See description

Suggested fix:
See contribution code attached
[27 Aug 2020 19:42] OCA Admin
Contribution submitted via Github - Aarch64 support 
(*) Contribution by Tzachi Zidenberg (Github tsahee, mysql-server/pull/305#issuecomment-680690709): I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: git_patch_473426209.txt (text/plain), 2.13 KiB.

[27 Aug 2020 20:21] MySQL Verification Team
Thank you for the contribution.
[28 Aug 2020 11:54] Jakub Lopuszanski
Hello Tzachi Zidenberg, 
thank you for your contribution!

Could you please share some of the reasoning/experiments which led you to these solutions?

In particular, I'm interested: why use `isb` instead of, say, `yield`?

Also, it seems that "With GCC 10.1+, out-of-line atomics are enabled by default" (source: https://en.opensuse.org/ARM_architecture_support), so I wonder if it is worth introducing this compiler-and-platform-specific logic (given it is not needed on modern ARM with modern GCC) - what gains motivated your patch?
[30 Aug 2020 10:59] Tzachi Zidenberg
Thank you for your questions!

outline atomics:
gcc 8 & 9 are used quite often for aarch64, and both now support outline-atomics without setting it as default (gcc-8.5, gcc-9.4). The difference in performance is very high - we've measured almost 3x performance improvement for sysbench read_only, on m6g.12xlarge.
It is of course possible for a user to add compilation flags even without this patch, but I think making it default would drive a much better user experience.
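
For illustration, the manual opt-in would look something like this (hypothetical command line; it assumes a gcc where -moutline-atomics has been backported, e.g. on Ubuntu 20.04):

CFLAGS="-moutline-atomics" CXXFLAGS="-moutline-atomics" cmake ../mysql-server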

isb:
The pause instruction used on x86 has two roles: one is to hint to the processor that the thread is spin-waiting (so it can yield pipeline resources), and the other is to create a small delay. That delay is useful as backoff from attempts to capture spin-locks, which improves the behavior of the system and allows more efficient lock acquisition.
The "yield" instruction on aarch64 is essentially a nop and does not cause enough delay to help with backoff. "isb" is a barrier that, especially inside a loop, creates a small delay without consuming ALU resources.
In our experiments, we found that adding the isb instruction improves stability and reduces result jitter, in line with our expectations. We also tried adding more delay to UT_RELAX_CPU than a single isb, and found that it reduces performance.
A combined solution, defining UT_RELAX_CPU as "isb\n yield", might be interesting. In a simple unit test it does not seem to behave any differently from isb-only, but it could be considered more complete and might prove beneficial on some (future?) systems.
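
To make this concrete, here is a sketch of the kind of definition being discussed (illustrative only; the exact form in the patch may differ):

#if defined(__aarch64__)
/* isb drains the pipeline, giving a pause-like backoff delay without
   consuming ALU resources */
#define UT_RELAX_CPU() __asm__ __volatile__("isb" ::: "memory")
/* the combined variant would be:
   #define UT_RELAX_CPU() __asm__ __volatile__("isb\n\tyield" ::: "memory") */
#endif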
[3 Sep 2020 9:36] Jakub Lopuszanski
Hello Tsahi Zidenberg, 

thanks for answering my questions, it gives me a better understanding of the background.

Would it be possible for you to run this simple diagnostic program on your machine, just to assess the impact of LSE on various atomic operations?

#include<atomic>
std::atomic<int> x{0};
int main(){
  for(int i=0;i<100000000;++i){
#ifdef EXCHANGE
     x.exchange(true);
#endif
#ifdef LOAD
     x.load();
#endif
#ifdef STORE
     x.store(true);
#endif
#ifdef FENCE
     std::atomic_thread_fence(std::memory_order_acquire);
#endif
#ifdef BAD_CAS
     int e{1};
     x.compare_exchange_strong(e,false, std::memory_order_acq_rel);
#endif
#ifdef OK_CAS
     int e{0};
     x.compare_exchange_strong(e,false, std::memory_order_acq_rel);
#endif
#ifdef FETCH_OR
     x.fetch_or(0);
#endif
#ifdef FETCH_ADD
     x.fetch_add(1);
#endif
  }
  return 0;
}

I'm on gcc 9.1.1 right now, so I don't have access to `-moutline-atomics`, but I believe that at best it should behave as fast as `-march=armv8-a+lse`, which forces the use of LSE even on gcc 9.1.1.
When I run this test like this on our arm machine:

for op in EXCHANGE LOAD STORE FENCE BAD_CAS OK_CAS FETCH_OR FETCH_ADD;
do
  echo $op
  for march in "" "-march=armv8-a" "-march=armv8-a+lse" "-march=native";
  do
    printf "%20s " $march
    g++ -o bin/one -O2 -std=c++14 -D$op $march justonething.cpp
    (time bin/one)2>&1 | xargs -l3 echo
  done
done

I get:

EXCHANGE
                     real 0m1.870s user 0m1.870s sys 0m0.000s
      -march=armv8-a real 0m1.871s user 0m1.869s sys 0m0.001s
  -march=armv8-a+lse real 0m1.460s user 0m1.460s sys 0m0.000s
       -march=native real 0m1.460s user 0m1.459s sys 0m0.001s
LOAD
                     real 0m0.047s user 0m0.046s sys 0m0.001s
      -march=armv8-a real 0m0.047s user 0m0.047s sys 0m0.000s
  -march=armv8-a+lse real 0m0.047s user 0m0.047s sys 0m0.000s
       -march=native real 0m0.047s user 0m0.046s sys 0m0.001s
STORE
                     real 0m0.867s user 0m0.866s sys 0m0.001s
      -march=armv8-a real 0m0.867s user 0m0.867s sys 0m0.000s
  -march=armv8-a+lse real 0m0.867s user 0m0.866s sys 0m0.001s
       -march=native real 0m0.867s user 0m0.866s sys 0m0.001s
FENCE
                     real 0m1.369s user 0m1.368s sys 0m0.000s
      -march=armv8-a real 0m1.369s user 0m1.367s sys 0m0.002s
  -march=armv8-a+lse real 0m1.369s user 0m1.368s sys 0m0.001s
       -march=native real 0m1.369s user 0m1.368s sys 0m0.001s
BAD_CAS
                     real 0m0.964s user 0m0.962s sys 0m0.001s
      -march=armv8-a real 0m0.959s user 0m0.958s sys 0m0.001s
  -march=armv8-a+lse real 0m2.280s user 0m2.278s sys 0m0.002s
       -march=native real 0m2.280s user 0m2.279s sys 0m0.001s
OK_CAS
                     real 0m1.870s user 0m1.870s sys 0m0.000s
      -march=armv8-a real 0m1.870s user 0m1.869s sys 0m0.001s
  -march=armv8-a+lse real 0m2.280s user 0m2.279s sys 0m0.001s
       -march=native real 0m2.281s user 0m2.279s sys 0m0.001s
FETCH_OR
                     real 0m1.870s user 0m1.869s sys 0m0.001s
      -march=armv8-a real 0m1.870s user 0m1.869s sys 0m0.001s
  -march=armv8-a+lse real 0m1.460s user 0m1.460s sys 0m0.000s
       -march=native real 0m1.460s user 0m1.459s sys 0m0.001s
FETCH_ADD
                     real 0m1.870s user 0m1.869s sys 0m0.001s
      -march=armv8-a real 0m1.871s user 0m1.870s sys 0m0.000s
  -march=armv8-a+lse real 0m1.460s user 0m1.460s sys 0m0.000s
       -march=native real 0m1.460s user 0m1.459s sys 0m0.001s

A few things I notice:
1. we can focus on the `real` column, as it almost equals `user`, and `sys` is almost zero
2. "" (the default) and "-march=armv8-a" are equal, which I interpret as LSE not being used by default (makes sense, and I can confirm it by looking at the assembly; see the commands below)
3. "-march=native" and "-march=armv8-a+lse" are equal, which I interpret as LSE being available on this platform (makes sense, and I can confirm it by looking at the assembly)
4. EXCHANGE, FETCH_OR and FETCH_ADD all behave similarly and run **faster** with LSE
5. BAD_CAS and OK_CAS both appear to be **slower** with LSE
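
(For reference, one way to double-check which instructions get emitted - LSE uses mnemonics like ldadd/ldset/swp/cas, while the LL/SC sequences use ldxr/ldaxr + stxr/stlxr - is something like:

g++ -O2 -std=c++14 -DFETCH_ADD -march=armv8-a -S -o - justonething.cpp | grep -E 'ldaxr|stlxr'
g++ -O2 -std=c++14 -DFETCH_ADD -march=armv8-a+lse -S -o - justonething.cpp | grep -E 'ldadd'
)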

These results suggest that enabling LSE poses a trade-off:
it can make things worse for programs that often use atomic<T>::compare_exchange_strong(..),
while it certainly can help apps that use a lot of atomic<T>::fetch_add(..) or atomic<T>::exchange(..).
The reason this trade-off is relevant to InnoDB is that InnoDB's rw_lock_t implementation uses both of these operations (see storage/innobase/include/sync0rw.ic).

Therefore, I wonder if you see similar behaviour (LSE causing faster fetch_add and slower compare_exchange_strong) on your machine?
[4 Sep 2020 14:10] Jakub Lopuszanski
Hello Tzachi Zidenberg!

You wrote

> gcc 8 & 9 are used quite often for aarch64, and both now support outline-atomics without setting it as default (gcc-8.5, gcc-9.4). 

According to https://gcc.gnu.org/releases.html, 8.5 and 9.4 have not been released yet, so how can I get access to them?
Also, I see https://en.opensuse.org/ARM_architecture_support mentions gcc 9.3.1+ has `-moutline-atomics` support, but I can't find official documentation for gcc 9.3.1. Was 9.3.1 officially released?
The latest 9.3.x for which I can see official documentation is 9.3.0 https://gcc.gnu.org/onlinedocs/gcc-9.3.0/gcc/AArch64-Options.html#AArch64-Options which doesn't mention `-moutline-atomics`.

Could you please help clear my confusion?
[9 Sep 2020 8:34] Tzachi Zidenberg
I made a few modifications to the program and script you gave me.

First - I moved the atomic call to a different function, in a different file:

-------
#include <atomic>
int global;

void one_action(std::atomic<int> &x){
  global = 0;
#ifdef EXCHANGE
  x.exchange(true);
#endif
  /* ... all the remaining #ifdef cases, as before ... */
}
----

Setting the global to 0 adds a simple store instruction to each loop iteration. It is required for a fair comparison because of the way the arm core behaves: at some synchronization points, all stores must be "done" before the core may proceed, which means a loop of synchronization events without any stores between them is not representative. The outline-atomic code adds a store to the stack for register backup during the function call. The difference between two stores and one store, as we'll see, is not huge.

The other thing I did was make the benchmark multithreaded. Load-exclusive/store-exclusive instructions are bad for performance because cores have to compete with each other for exclusive access.

I didn't specially synchronize the threads, but they do run long enough on an otherwise-idle system to ensure that they heavily overlap.

-----
#include <pthread.h>

std::atomic<int> x{0};

void one_action(std::atomic<int> &x);

void *thread_main(void *){

  for(int i=0;i<100000000;++i){
          one_action(x);
  }
  return 0;
}

int main() {
        int i;
        pthread_t threads[THREADS];

        for (i=0; i< THREADS; i++) {
                pthread_create(&threads[i], NULL, thread_main, NULL);
        }
        for (i=0; i< THREADS; i++) {
                pthread_join(threads[i], NULL);
        }
        return 0;
}
-------

The last change, made for readability, was narrowing down to 3 cases. Any flag that indicates support for LSE instructions should produce the same code for atomic operations, whether it's -march=armv8.1+lse, -march=native (on a new core), -march=armv8.2, etc. I have verified this, but left those results out.

-------

results for 1 Thread:

EXCHANGE
      -march=armv8-a real 0m1.088s user 0m1.082s sys 0m0.000s
  -march=armv8-a+lse real 0m0.847s user 0m0.842s sys 0m0.000s
   -moutline-atomics real 0m0.927s user 0m0.922s sys 0m0.000s
LOAD
      -march=armv8-a real 0m0.126s user 0m0.121s sys 0m0.000s
  -march=armv8-a+lse real 0m0.127s user 0m0.121s sys 0m0.000s
   -moutline-atomics real 0m0.127s user 0m0.117s sys 0m0.004s
STORE
      -march=armv8-a real 0m0.128s user 0m0.121s sys 0m0.000s
  -march=armv8-a+lse real 0m0.129s user 0m0.121s sys 0m0.000s
   -moutline-atomics real 0m0.127s user 0m0.121s sys 0m0.000s
FENCE
      -march=armv8-a real 0m0.327s user 0m0.321s sys 0m0.000s
  -march=armv8-a+lse real 0m0.327s user 0m0.317s sys 0m0.004s
   -moutline-atomics real 0m0.327s user 0m0.321s sys 0m0.000s
BAD_CAS
      -march=armv8-a real 0m0.246s user 0m0.241s sys 0m0.000s
  -march=armv8-a+lse real 0m0.928s user 0m0.922s sys 0m0.000s
   -moutline-atomics real 0m0.967s user 0m0.962s sys 0m0.000s
OK_CAS
      -march=armv8-a real 0m1.087s user 0m1.082s sys 0m0.000s
  -march=armv8-a+lse real 0m0.928s user 0m0.922s sys 0m0.000s
   -moutline-atomics real 0m0.968s user 0m0.962s sys 0m0.000s
FETCH_OR
      -march=armv8-a real 0m1.089s user 0m1.081s sys 0m0.000s
  -march=armv8-a+lse real 0m0.848s user 0m0.842s sys 0m0.000s
   -moutline-atomics real 0m0.928s user 0m0.922s sys 0m0.000s
FETCH_ADD
      -march=armv8-a real 0m1.085s user 0m1.079s sys 0m0.000s
  -march=armv8-a+lse real 0m0.850s user 0m0.842s sys 0m0.000s
   -moutline-atomics real 0m0.928s user 0m0.922s sys 0m0.000s

---
Result for 16 threads:

EXCHANGE
      -march=armv8-a real 2m46.350s user 38m23.281s sys 0m0.004s
  -march=armv8-a+lse real 0m31.490s user 7m54.585s sys 0m0.004s
   -moutline-atomics real 0m48.393s user 10m54.689s sys 0m0.000s
LOAD
      -march=armv8-a real 0m2.233s user 0m34.798s sys 0m0.000s
  -march=armv8-a+lse real 0m2.196s user 0m34.165s sys 0m0.004s
   -moutline-atomics real 0m2.202s user 0m33.677s sys 0m0.004s
STORE
      -march=armv8-a real 0m2.162s user 0m33.079s sys 0m0.000s
  -march=armv8-a+lse real 0m2.150s user 0m32.821s sys 0m0.000s
   -moutline-atomics real 0m2.150s user 0m33.192s sys 0m0.000s
FENCE
      -march=armv8-a real 0m0.535s user 0m8.073s sys 0m0.000s
  -march=armv8-a+lse real 0m0.515s user 0m8.099s sys 0m0.004s
   -moutline-atomics real 0m0.523s user 0m8.225s sys 0m0.000s
BAD_CAS
      -march=armv8-a real 0m4.167s user 1m3.777s sys 0m0.000s
  -march=armv8-a+lse real 0m27.708s user 6m55.848s sys 0m0.000s
   -moutline-atomics real 0m39.459s user 9m37.715s sys 0m0.008s
OK_CAS
      -march=armv8-a real 3m9.888s user 42m23.732s sys 0m0.000s
  -march=armv8-a+lse real 0m33.154s user 8m17.106s sys 0m0.004s
   -moutline-atomics real 0m54.817s user 11m45.582s sys 0m0.004s
FETCH_OR
      -march=armv8-a real 2m29.523s user 33m51.771s sys 0m0.004s
  -march=armv8-a+lse real 0m33.673s user 8m0.450s sys 0m0.000s
   -moutline-atomics real 0m39.512s user 10m2.263s sys 0m0.004s
FETCH_ADD
      -march=armv8-a real 3m34.322s user 50m9.499s sys 0m0.000s
  -march=armv8-a+lse real 0m33.639s user 7m58.357s sys 0m0.000s
   -moutline-atomics real 0m45.689s user 10m42.517s sys 0m0.004s
-----

These results show that LSE provides same-or-better performance than exclusive access in the single-threaded case, with a very large gap once there is actual contention on the same value.
The one exception to this rule is BAD_CAS. In this test, an attempt to compare-and-swap is unsuccessful, as the compare fails. With the old instructions, ldxr/stxr, no real exclusive access is actually generated: ldxr sets the internal exclusive monitor, but an stxr is never attempted. With LSE instructions, the compare-and-swap instruction performs the exclusive access just as it would on a successful store. That's why with LSE the BAD_CAS and OK_CAS results are very close to each other, and without LSE they are so far apart. I'd argue that LSE's behavior is the correct one: swaps should only be attempted when they are expected to succeed.
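
To illustrate (instruction sequences sketched from memory, not exact compiler output), the two code shapes for compare_exchange_strong look roughly like this:

#include <atomic>

// without LSE (-march=armv8-a), roughly:    with LSE (+lse), roughly:
//   retry: ldaxr w1, [x0]                     mov   w1, <expected>
//          cmp   w1, <expected>               casal w1, <desired>, [x0]
//          b.ne  done        // no store!
//          stlxr w2, <desired>, [x0]
//          cbnz  w2, retry
//   done:
//
// On a failed compare, the LL/SC version bails out right after the load,
// so BAD_CAS is cheap; casal always performs the full atomic operation.
bool try_cas(std::atomic<int> &x, int expected, int desired) {
  return x.compare_exchange_strong(expected, desired,
                                   std::memory_order_acq_rel);
}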
[10 Sep 2020 6:49] Tzachi Zidenberg
As for availability of the outline-atomics flag, currently the best way is probably to use either Ubuntu 20.04 or Amazon Linux 2. I'd assume the current gcc-9 and gcc-8 branches are also stable enough even if not released, though I didn't test them.

The moutline-atomics flag showed substantial enough improvements that it has been backported to GCC 9 and 8, and there is a gcc-7 branch in the works.
Ubuntu has integrated this in 20.04, Amazon Linux 2 supports it, and other distributions, including Ubuntu 18.04 and Debian, are on the way.
All distributions, including the upcoming Ubuntu with GCC-10, have moutline-atomics turned off by default.

Thank you!
tsahi
[11 Sep 2020 8:41] Jakub Lopuszanski
Hello Tzachi Zidenberg,
Thank you for conducting the tests, expanding them and sharing results.

I interpret the results for THREADS=1 as showing a similar pattern to the one in my testing. 
Results for THREADS=16 are indeed interesting - I'll try them on our machine, too.

As I wrote earlier, my only concern is that BAD_CAS is not necessarily as rare a case as you suggest. If you take a look at our rw_lock_t implementation (in sync0rw.h, sync0rw.ic, sync0rw.cc) you'll see that compare_exchange is heavily used in `rw_lock_lock_word_decr(lock,amount,threshold)` to acquire a lock (which requires a thread to "decrease `lock->lock_word` by `amount` but only if it was larger than `threshold`", implemented as a CAS loop; a simplified sketch follows after this list). In situations where many threads try to acquire the same rw_lock_t instance in parallel, they may invalidate each other's attempts in several ways:
- the most obvious is when one of them takes an eXclusive lock, because then the others must wait, spinning. This situation doesn't "count" as "BAD_CAS", because we don't even attempt CAS when lock_word is below the `threshold`, but...
- even if all of the threads are trying to get a Shared lock (i.e. nobody is trying to get an eXclusive lock), each time one of them succeeds the value gets decremented, and the other threads in the CAS loop will fail (once), as the value no longer equals the expected value (or in the case of ARM64 without LSE: because a modification of the address was observed by the monitor). This is the situation I'm worried about.
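
A simplified sketch of that CAS loop (my own rendering of the shape described above, not the actual InnoDB code):

#include <atomic>

// Decrement lock_word by `amount`, but only while it is above `threshold`.
// Each successful decrement by one thread makes the previously loaded
// `local` of the others stale, so their next compare_exchange_strong
// fails once (the "BAD_CAS" case) before they retry.
bool lock_word_decr(std::atomic<int> &lock_word, int amount, int threshold) {
  int local = lock_word.load();
  while (local > threshold) {
    if (lock_word.compare_exchange_strong(local, local - amount)) {
      return true;  // lock word decremented, lock acquired
    }
    // on failure, compare_exchange_strong refreshed `local` with the
    // current value, so the loop condition is re-checked
  }
  return false;  // at or below threshold: cannot decrement now
}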

You wrote about conducting "sysbench read_only" tests and observing big gains.
I wonder what the situation is for sysbench oltp rw with pareto distribution, which should cause more contention in several places, and thus perhaps "BAD_CAS" will happen more often.
I plan to conduct such tests. 
Please, let me know if you already have some data about impact of LSE on TPS during sysbench with pareto distribution.
[11 Sep 2020 15:17] Jakub Lopuszanski
Results for THREADS=1 are quite similar to the ones I've presented earlier (LOAD is slower, probably due to global=0 ?)

For THREADS=16 I get 

EXCHANGE
      -march=armv8-a real 7m53.587s user 119m32.916s sys 0m0.005s
  -march=armv8-a+lse real 1m25.404s user 19m13.627s sys 0m0.006s
LOAD
      -march=armv8-a real 0m0.185s user 0m2.920s sys 0m0.002s
  -march=armv8-a+lse real 0m0.192s user 0m2.927s sys 0m0.003s
STORE
      -march=armv8-a real 0m45.474s user 10m51.401s sys 0m0.009s
  -march=armv8-a+lse real 0m43.414s user 10m21.016s sys 0m0.003s
FENCE
      -march=armv8-a real 0m1.371s user 0m21.883s sys 0m0.005s
  -march=armv8-a+lse real 0m1.378s user 0m21.893s sys 0m0.001s
BAD_CAS
      -march=armv8-a real 0m1.006s user 0m16.052s sys 0m0.001s
  -march=armv8-a+lse real 1m37.292s user 21m35.103s sys 0m0.010s
OK_CAS
      -march=armv8-a real 10m15.640s user 157m13.736s sys 0m0.055s
  -march=armv8-a+lse real 1m31.202s user 20m23.168s sys 0m0.004s
FETCH_OR
      -march=armv8-a real 10m18.043s user 157m1.701s sys 0m0.006s
  -march=armv8-a+lse real 1m36.678s user 22m1.005s sys 0m0.002s
FETCH_ADD
      -march=armv8-a real 8m49.733s user 131m10.761s sys 0m0.004s
  -march=armv8-a+lse real 1m35.588s user 21m57.004s sys 0m0.001s

Above is a result of one example run. For the EXCHANGE, OK_CAS, FETCH_OR, FETCH_ADD with -march=armv8-a "real" time varies from 7m50s to 10m20s from run to run. For -march=armv8-a+lse it seems to vary in 1m20s to 1m40s range.
I'd say the overall pattern matches previous observations of Tsahi Zidenberg.

I've added one more test, intended to give a taste of rw_lock_lock_word_decr:

#ifdef RW_LOCK_LIKE
    int local_lock_word = x;
    while (local_lock_word > -2e9) {
      if (x.compare_exchange_strong(local_lock_word,
                                    local_lock_word - 1)) {
        return;
      }
    }
#endif

Example output:

THREADS=1
RW_LOCK_LIKE
      -march=armv8-a real 0m1.870s user 0m1.869s sys 0m0.001s
  -march=armv8-a+lse real 0m2.418s user 0m2.417s sys 0m0.000s
THREADS=2
RW_LOCK_LIKE
      -march=armv8-a real 0m10.862s user 0m20.117s sys 0m0.001s
  -march=armv8-a+lse real 0m8.534s user 0m15.046s sys 0m0.002s
THREADS=4
RW_LOCK_LIKE
      -march=armv8-a real 1m30.890s user 5m36.524s sys 0m0.000s
  -march=armv8-a+lse real 0m36.601s user 1m56.028s sys 0m0.002s
THREADS=8
RW_LOCK_LIKE
      -march=armv8-a real 2m32.342s user 18m31.389s sys 0m0.000s
  -march=armv8-a+lse real 3m29.766s user 25m36.240s sys 0m0.002s
THREADS=16
RW_LOCK_LIKE
      -march=armv8-a real 8m44.916s user 129m33.593s sys 0m0.004s
  -march=armv8-a+lse real 9m54.497s user 134m52.253s sys 0m0.003s

Looks a bit like, for larger numbers of threads, LSE does more harm than it helps.
In what follows, to make testing faster I changed number of iterations in thread_main() from 100M to 10M. 
As expected "real" times are roughly 10 times smaller.

To know how reliable the above numbers are, I've run the THREADS=16 RW_LOCK_LIKE case several times:

THREADS=16
RW_LOCK_LIKE
      -march=armv8-a real 0m51.083s user 12m40.401s sys 0m0.006s
  -march=armv8-a+lse real 1m19.217s user 17m50.806s sys 0m0.005s
      -march=armv8-a real 0m56.568s user 13m42.542s sys 0m0.002s
  -march=armv8-a+lse real 1m23.774s user 18m40.078s sys 0m0.003s
      -march=armv8-a real 1m6.718s user 17m6.422s sys 0m0.001s
  -march=armv8-a+lse real 1m18.046s user 17m51.447s sys 0m0.002s
      -march=armv8-a real 1m1.837s user 15m38.503s sys 0m0.002s
  -march=armv8-a+lse real 1m11.727s user 16m55.597s sys 0m0.004s
      -march=armv8-a real 1m2.058s user 15m35.530s sys 0m0.005s
  -march=armv8-a+lse real 1m27.969s user 19m59.876s sys 0m0.002s
      -march=armv8-a real 0m51.430s user 12m55.406s sys 0m0.002s
  -march=armv8-a+lse real 1m24.653s user 19m56.115s sys 0m0.008s
      -march=armv8-a real 0m58.490s user 14m31.811s sys 0m0.003s
  -march=armv8-a+lse real 1m22.409s user 20m42.513s sys 0m0.004s

So, armv8-a is from 51s to 66s,
armv8-a+lse is from 76s to 87s.

Second, I wanted to know how it looks as I increase the number of threads from 1 to 100.
(results below are noisy, as I ran each number of THREADS just once)
           armv8-a  armv8-a+lse
 THREADS=1   0.189   0.244
 THREADS=2   0.983   1.066
 THREADS=3   6.179   2.376
 THREADS=4   5.504   3.628
 THREADS=5   8.422  10.733
 THREADS=6   9.998   7.725
 THREADS=7  13.672  14.253
 THREADS=8  14.977  29.624
 THREADS=9  21.003  37.819
THREADS=10  26.572  33.748
THREADS=11  38.293  41.528
THREADS=12  37.337  56.518
THREADS=13  47.236  57.059
THREADS=14  49.195  70.594
THREADS=15  50.669  77.075
THREADS=16  58.962  75.729
THREADS=17  61.518 100.591
THREADS=18  82.074 100.862
THREADS=19  79.871 136.003
THREADS=20  98.232 122.259
THREADS=21 102.281 150.949
THREADS=22 117.569 148.648
THREADS=23 117.375 162.083
THREADS=24 161.680 165.841
THREADS=25 123.846 179.350
THREADS=26 185.146 217.032
THREADS=27 174.986 158.382
THREADS=28 193.491 266.707
THREADS=29 176.774 354.268
THREADS=30 193.289 360.768
...
I'll let it continue to run over the weekend, but so far it looks like LSE is slower most of the time.
I'm not sure how representative this artificial micro-benchmark is of real rw_lock_t performance,
but at least these results shake my confidence in the patch.
[14 Sep 2020 7:20] Jakub Lopuszanski
how many seconds it takes THREADS to do 10M "rwlock S-locks" with(out) LSE

Attachment: rwlocklike_duration_by_THREADS.png (image/png, text), 20.99 KiB.

[14 Sep 2020 12:02] Jakub Lopuszanski
Terje upgraded gcc from 9.1.1 to 9.3.1 which apparently has `-moutline-atomics` support, thanks!

Here are results for 1,16 and 64 threads on the same machine as before:

THREADS=1
EXCHANGE
      -march=armv8-a real 0m0.189s user 0m0.188s sys 0m0.001s
  -march=armv8-a+lse real 0m0.148s user 0m0.148s sys 0m0.000s
   -moutline-atomics real 0m0.198s user 0m0.196s sys 0m0.002s
LOAD
      -march=armv8-a real 0m0.020s user 0m0.019s sys 0m0.001s
  -march=armv8-a+lse real 0m0.020s user 0m0.019s sys 0m0.001s
   -moutline-atomics real 0m0.020s user 0m0.019s sys 0m0.001s
STORE
      -march=armv8-a real 0m0.088s user 0m0.087s sys 0m0.002s
  -march=armv8-a+lse real 0m0.089s user 0m0.087s sys 0m0.002s
   -moutline-atomics real 0m0.089s user 0m0.088s sys 0m0.000s
FENCE
      -march=armv8-a real 0m0.139s user 0m0.138s sys 0m0.001s
  -march=armv8-a+lse real 0m0.139s user 0m0.137s sys 0m0.002s
   -moutline-atomics real 0m0.139s user 0m0.137s sys 0m0.001s
BAD_CAS
      -march=armv8-a real 0m0.107s user 0m0.107s sys 0m0.000s
  -march=armv8-a+lse real 0m0.230s user 0m0.227s sys 0m0.003s
   -moutline-atomics real 0m0.230s user 0m0.229s sys 0m0.001s
OK_CAS
      -march=armv8-a real 0m0.189s user 0m0.188s sys 0m0.001s
  -march=armv8-a+lse real 0m0.230s user 0m0.227s sys 0m0.003s
   -moutline-atomics real 0m0.230s user 0m0.228s sys 0m0.002s
FETCH_OR
      -march=armv8-a real 0m0.189s user 0m0.189s sys 0m0.000s
  -march=armv8-a+lse real 0m0.148s user 0m0.147s sys 0m0.001s
   -moutline-atomics real 0m0.198s user 0m0.197s sys 0m0.001s
FETCH_ADD
      -march=armv8-a real 0m0.189s user 0m0.187s sys 0m0.002s
  -march=armv8-a+lse real 0m0.148s user 0m0.147s sys 0m0.001s
   -moutline-atomics real 0m0.198s user 0m0.197s sys 0m0.001s
RW_LOCK_LIKE
      -march=armv8-a real 0m0.189s user 0m0.188s sys 0m0.001s
  -march=armv8-a+lse real 0m0.244s user 0m0.243s sys 0m0.001s
   -moutline-atomics real 0m0.239s user 0m0.238s sys 0m0.001s
THREADS=16
EXCHANGE
      -march=armv8-a real 0m54.678s user 13m50.194s sys 0m0.006s
  -march=armv8-a+lse real 0m8.498s user 2m0.155s sys 0m0.011s
   -moutline-atomics real 0m16.318s user 4m10.134s sys 0m0.011s
LOAD
      -march=armv8-a real 0m0.022s user 0m0.302s sys 0m0.006s
  -march=armv8-a+lse real 0m0.022s user 0m0.297s sys 0m0.011s
   -moutline-atomics real 0m0.022s user 0m0.298s sys 0m0.010s
STORE
      -march=armv8-a real 0m4.589s user 1m4.530s sys 0m0.134s
  -march=armv8-a+lse real 0m4.124s user 0m59.029s sys 0m0.142s
   -moutline-atomics real 0m4.479s user 1m6.278s sys 0m0.142s
FENCE
      -march=armv8-a real 0m0.141s user 0m2.194s sys 0m0.011s
  -march=armv8-a+lse real 0m0.141s user 0m2.195s sys 0m0.010s
   -moutline-atomics real 0m0.141s user 0m2.194s sys 0m0.013s
BAD_CAS
      -march=armv8-a real 0m0.105s user 0m1.613s sys 0m0.008s
  -march=armv8-a+lse real 0m8.636s user 2m6.878s sys 0m0.011s
   -moutline-atomics real 0m16.267s user 3m57.898s sys 0m0.010s
OK_CAS
      -march=armv8-a real 0m58.947s user 14m54.196s sys 0m0.010s
  -march=armv8-a+lse real 0m8.677s user 2m1.059s sys 0m0.014s
   -moutline-atomics real 0m15.909s user 3m52.825s sys 0m0.012s
FETCH_OR
      -march=armv8-a real 0m56.816s user 14m28.119s sys 0m0.012s
  -march=armv8-a+lse real 0m9.280s user 2m18.993s sys 0m0.009s
   -moutline-atomics real 0m15.463s user 3m46.585s sys 0m0.016s
FETCH_ADD
      -march=armv8-a real 0m51.979s user 13m12.699s sys 0m0.005s
  -march=armv8-a+lse real 0m9.348s user 2m22.660s sys 0m0.011s
   -moutline-atomics real 0m16.150s user 3m56.887s sys 0m0.008s
RW_LOCK_LIKE
      -march=armv8-a real 1m1.216s user 15m29.153s sys 0m0.006s
  -march=armv8-a+lse real 1m31.247s user 21m16.878s sys 0m0.008s
   -moutline-atomics real 2m21.564s user 35m55.447s sys 0m0.008s
THREADS=64
EXCHANGE
      -march=armv8-a real 10m45.949s user 590m29.337s sys 0m0.445s
  -march=armv8-a+lse real 0m33.898s user 32m12.009s sys 0m0.256s
   -moutline-atomics real 1m3.809s user 62m8.990s sys 0m0.262s
LOAD
      -march=armv8-a real 0m0.036s user 0m1.319s sys 0m0.030s
  -march=armv8-a+lse real 0m0.036s user 0m1.329s sys 0m0.034s
   -moutline-atomics real 0m0.035s user 0m1.320s sys 0m0.031s
STORE
      -march=armv8-a real 0m17.372s user 16m15.012s sys 0m1.777s
  -march=armv8-a+lse real 0m17.163s user 16m36.034s sys 0m2.469s
   -moutline-atomics real 0m17.378s user 15m53.601s sys 0m1.705s
FENCE
      -march=armv8-a real 0m0.147s user 0m8.856s sys 0m0.085s
  -march=armv8-a+lse real 0m0.147s user 0m8.854s sys 0m0.087s
   -moutline-atomics real 0m0.147s user 0m8.850s sys 0m0.086s
BAD_CAS
      -march=armv8-a real 0m0.111s user 0m6.559s sys 0m0.037s
  -march=armv8-a+lse real 0m34.347s user 32m32.051s sys 0m0.249s
   -moutline-atomics real 1m3.787s user 62m9.418s sys 0m0.291s
OK_CAS
      -march=armv8-a real 10m47.816s user 601m14.334s sys 0m0.342s
  -march=armv8-a+lse real 0m34.119s user 32m26.329s sys 0m0.198s
   -moutline-atomics real 1m4.361s user 62m35.566s sys 0m0.274s
FETCH_OR
      -march=armv8-a real 10m33.839s user 570m16.317s sys 0m0.299s
  -march=armv8-a+lse real 0m33.963s user 31m52.708s sys 0m0.224s
   -moutline-atomics real 1m3.891s user 62m34.792s sys 0m0.279s
FETCH_ADD
      -march=armv8-a real 10m38.169s user 581m12.259s sys 0m0.359s
  -march=armv8-a+lse real 0m33.667s user 31m51.731s sys 0m0.275s
   -moutline-atomics real 1m3.638s user 62m5.553s sys 0m0.258s
RW_LOCK_LIKE
      -march=armv8-a real 11m0.420s user 606m42.025s sys 0m0.370s
  -march=armv8-a+lse real 20m15.770s user 1218m15.218s sys 0m1.579s
   -moutline-atomics real 25m40.590s user 1377m42.852s sys 0m0.971s

Results for `-march=armv8-a` and `-march=armv8-a+lse` seem to match those for 9.1.1.
Results for `-moutline-atomics` seem to be generally slower than native LSE (which is somewhat expected given how tight the loops are).

Having access to 9.3.1, I'll now be able to run sysbench, which should be more important than these synthetic tests.
[14 Sep 2020 13:16] Tzachi Zidenberg
Hello Jakub!

Interesting behaviour of rw_lock... I wonder if it could be converted to more classical atomics, combining __sync_fetch_and_sub with __sync_add_and_fetch (a rough sketch below).
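
Something along these lines, perhaps (hypothetical sketch only - the window where the value is temporarily decremented changes the semantics and would need care):

#include <atomic>

// Unconditional decrement: fetch_sub maps to a single LSE instruction
// (ldadd with a negated operand), so there is no retry loop. If the
// pre-decrement value shows the lock was not available, roll back.
bool lock_word_decr_fetch(std::atomic<int> &lock_word, int amount,
                          int threshold) {
  int old_word = lock_word.fetch_sub(amount);
  if (old_word > threshold) return true;  // acquired
  lock_word.fetch_add(amount);  // undo; others may observe the dip
  return false;
}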

Happy to hear you got gcc with outline-atomics support. Would be happy to hear how benchmarking goes.
Also, when you do run benchmarks - make sure to look at fairness as well, which should be an important advantage of LSE locks.
[18 Sep 2020 9:27] Krunal Bauskar
This discussion piqued my curiosity, and since I am using an ARM machine too, I thought I'd try it.

16 threads
RW_LOCK_LIKE
      -march=armv8-a real 0m37.546s user 9m0.330s sys 0m0.000s
  -march=armv8-a+lse real 1m1.291s user 12m34.864s sys 0m0.010s

24 threads
RW_LOCK_LIKE
      -march=armv8-a real 1m15.551s user 25m25.279s sys 0m0.000s
  -march=armv8-a+lse real 1m54.496s user 36m34.830s sys 0m0.000s

As expected, the LSE optimization fails to perform in the rw-lock case.

-----
Jakub,
For result/number stability, maybe you want to try taskset.
[18 Sep 2020 16:32] Geoffrey Blake
Jakub, Krunal,

Have you tried adding some 'work' to your micro-benchmarks to better emulate the fact that mysql will hold this rw_lock for a bit before releasing it? Or, in the case of using a CAS to update a counter, that it won't update the counter repeatedly without doing something else in between?

I ask because it's possible in these tight-loop cases for LDXR/STXR to outperform LSE: the core can get the cache line into its L1 and then always 'win' the atomic update without losing its exclusive monitor, while starving all the other cores. With some work in between the atomic ops, there's more chance for LDXR/STXR to start contending as the cache line bounces around; all cores will then see STXR failures, leading to a collapse in performance.
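
For example, a hypothetical variant of one_action with tunable think time between the atomic updates:

#include <atomic>

// Spend some cycles between atomic operations so the cache line has a
// chance to migrate between cores, instead of one core keeping it in
// its L1 and winning every update in the tight loop.
void one_action_with_work(std::atomic<int> &x, int work) {
  x.fetch_add(1);
  for (volatile int i = 0; i < work; ++i) {
    // simulated critical-section / per-iteration work
  }
}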

Another benchmark to try is: https://github.com/ARM-software/synchronization-benchmarks from Arm itself.  You can tune some simulated work and # of threads to see how the system behaves.

-Geoff
[24 Sep 2020 8:36] Jakub Lopuszanski
I've run tests on ellex04 for the following versions of the code, compiled with gcc 9.3.1,
all patches based on mysql-trunk@4e59f42b, with -falign-{functions,jumps}=64:
05ba503 "mysql-trunk@4e59f42b + ela + RELAX=isb + outline atomics"
7865c90 "mysql-trunk@4e59f42b + ela + RELAX=isb"
9e3f4f9 "mysql-trunk@4e59f42b + ela"
0c9b817 "mysql-trunk@4e59f42b + RELAX=isb + outline atomics"
af0f2bd "mysql-trunk@4e59f42b + RELAX=isb"
f730c2d "mysql-trunk@4e59f42b"
The code was aligned to 64-byte boundaries in a way which should add `nop`s 
only in places which aren't executed - this was done to minimize the effect of 
code alignment changes.
There's ongoing work by Ela to replace built-in atomics with std::atomics, so I 
wanted to know how it interacts with the proposed patch.
Also, I want to know what's the impact of each of the two independent changes
proposed in the contribution (RELAX=isb, outline atomics).

I've run the following scripts:
RW = /BMK/sb_exec/sb11-OLTP_RW_10M_8tab-{uniform,pareto}-ps-trx.sh
RO = /BMK/sb_exec/sb11-OLTP_RO_10M_8tab-{uniform,pareto}-ps-notrx.sh
PS = /BMK/sb_exec/sb11-OLTP_RO_10M_8tab-{uniform,pareto}-ps-p_sel1-notrx.sh
For 128 and 512 clients for RW and RO, and only for 512 clients for PS.
For RW I looked at avg TPS over 5 minutes (after 1 minute warm-up).
For RO and PS I looked at avg QPS over 5 minutes (after 1 minute warm-up).
Each combination was run 15 times (so for example I got 15 numbers for sysbench
"PS 512 uniform" scenario for commit 05ba503b3).
The load was generated on the same machine which served it (sorry).

I wanted to get answers to several questions of the form:
"Is the code in version b better than in version a?"
I've used Bayesian analysis, and below you'll find "answers" to each question I
was interested in.

Meaning of columns:
[testcase] [P(E(a)>E(b)+1%)] [P(E(a)=E(b)+-1%)] [P(E(a)+1%<E(b))] [P(a<b)]

Does the whole patch help if applied to future trunk after Ela's done? 
a=9e3f4f945e b=05ba503b3c (15 runs each)
PS 512 uniform    2%     43%     54%     61%
PS 512 pareto    12% :(  80%      8%     49%  <-- looked bad, and VERY noisy, 
                  0%     67%     33%     60%  <-- so I've increased to 24 runs
RO 128 uniform    0%    100%      0%     83% :)
RO 128 pareto     0%    100%      0%     68%
RO 512 uniform    0%    100%      0%     45%
RO 512 pareto     0%     99%      1%     58%
RW 128 uniform    0%    100%      0%     72%
RW 128 pareto     0%    100%      0%     94% :)
RW 512 uniform    0%    100%      0%     29%
RW 512 pareto     0%      1%     99% :)  99% :)

Does isb barrier improve situation on future trunk with Ela's patch?
a=9e3f4f945e b=7865c90afc (15 runs each)
PS 512 uniform    8%     52%     39%     56%
PS 512 pareto    61% :(  38%      1%     36%  <-- 15 runs
                 36% :(  63%      1%     42%  <-- 24 runs
RO 128 uniform    0%    100%      0%     33%
RO 128 pareto     0%    100%      0%     32%
RO 512 uniform    1%     99%      0%     40%
RO 512 pareto     0%    100%      0%     52%
RW 128 uniform    0%    100%      0%     40%
RW 128 pareto     0%    100%      0%     62%
RW 512 uniform    0%    100%      0%     14% :(
RW 512 pareto     0%     89%     11%     92% :)

Does moutatomics help (on top of isb and Ela's patch)? 
a=7865c90afc b=05ba503b3c (15 runs each)
PS 512 uniform    6%     62%     31%     55%
PS 512 pareto     1%     41%     58%     64%  <--- 15 runs
                  0%     19%     81%     71%  <--- 24 runs
RO 128 uniform    0%    100%      0%     91% :)
RO 128 pareto     0%    100%      0%     81% :)
RO 512 uniform    0%    100%      0%     56%
RO 512 pareto     0%     99%      1%     56%
RW 128 uniform    0%    100%      0%     82% :)
RW 128 pareto     0%    100%      0%     87% :)
RW 512 uniform    0%    100%      0%     59%
RW 512 pareto     0%     99%      1%     80% :)
[24 Sep 2020 8:37] Jakub Lopuszanski
Does Ela's patch improve trunk? 
a=f730c2dd69 (9 runs) vs b=9e3f4f945e (15 runs)
PS 512 uniform   11% :(  37%     52%     58%
PS 512 pareto     4%     69%     26%     57%
RO 128 uniform    0%    100%      0%     37%
RO 128 pareto     0%    100%      0%     68%
RO 512 uniform    0%     87%     13%     70%
RO 512 pareto     0%     99%      1%     57%
RW 128 uniform    0%    100%      0%     54%
RW 128 pareto     0%    100%      0%     56%
RW 512 uniform    0%    100%      0%     57%
RW 512 pareto     0%    100%      0%     55%

Does the whole contributor's patch help on the old trunk? 
a=f730c2dd69 vs b=0c9b81704d (9 runs each)
PS 512 uniform   17% :(  41%     41%     55%
PS 512 pareto     4%     63%     32%     59%
RO 128 uniform    0%    100%      0%     95%  :)
RO 128 pareto     0%     99%      1%     67%
RO 512 uniform    0%     83%     17%     73%
RO 512 pareto     0%     80%     20%     66%
RW 128 uniform    0%    100%      0%     75%
RW 128 pareto     0%     98%      2%     91%  :)
RW 512 uniform    0%    100%      0%     53%
RW 512 pareto     0%     10%     90% :)  93%  :))

The meaning (as I understand it) of the last four columns is:
P(E(a)>E(b)+1%) = 
    given observed 15 values of a and 15 values of b, how likely it is that a 
    has distribution with mean value larger by more than one percent than mean 
    of b? 
    Simplifying: how likely it is that avg performance of a is better than b's?
P(E(a)=E(b)+-1%) = 
    ...with mean values not further than 1% apart? 
    Simplifying: how likely it is that a and b are on par?
P(E(a)+1%<E(b)) = 
    ...with mean value smaller by more than 1%? 
    Simplifying: how likely it is that avg performance of a is worse than b's?
P(a<b) = 
    given observed 15 values of a and 15 values of b, how likely it is that if I
    draw another two values from the distributions of a and b, that a will be
    smaller than b?
    Simplifying: how likely it is that b will outperform a in a single run?

I've arbitrarily decided to mark results where 
    P(a<b)>80% with :) and 
    P(a<b)<20% with :(.

Also, given that we often experience so-called "build movements", a phenomenon
where a change in one function causes movement of the assembly code of other 
functions in such a way that execution speed is impacted due to various 
"code alignment issues", we tend to prefer patches which move the needle a lot,
thus I'm often interested in whether a change in avg performance by more than 
1% is likely. For this I've marked 
    results higher than 90% in the P(E(a)+1%<E(b)) column with :) and 
    results higher than 10% in the P(E(a)>E(b)+1%) column with :(.
Again, the choice of thresholds is arbitrary and has more to do with my 
psychology (I'm willing to accept a patch for a wrong reason 10% of the time) 
than science.
[24 Sep 2020 8:37] Jakub Lopuszanski
It's evident that the patch as a whole helps for RW 512 pareto scenario:
    RW 512 pareto     0%      1%     99% :)  99% :)

It looked like it might be causing problems for PS 512 pareto, though. 
In particular, it looks like the `isb barrier` part of the patch might be 
causing harm. At least this was my impression after the first 15 runs. However, 
looking at the results from those 15 runs, I could see a huge variance:
[runs] [min qps] < [avg qps] < [max qps] [version]
15 797478 < 838511.80 < 858226 "mysql-trunk + ela"
15 806469 < 837648.80 < 851918 "mysql-trunk + ela + RELAX=isb + outline atomics"
15 796978 < 828718.33 < 856639 "mysql-trunk + ela + RELAX=isb"
In particular, the difference between the averages (838511-837648=863) is 
minuscule in comparison to the spread (858226 - 797478 = 60748), so perhaps 
this test is too noisy to rely on.
The stdevs are in the range 10k-20k for the PS 512 scenarios, while they
are much lower for the RO 512 (1k-5k) and RW 512 (45-140) scenarios.

As it looked noisy, I've rerun this PS 512 pareto scenario 9 more times, to see
if it would converge to some decision. You can see the results for all 24 runs
in the tables above. Some were completely "flipped", and the analysis is still 
unable to assign >90% probability to any of the hypotheses, despite running for 
several days, so I'd say that this test is not to be trusted.

In conclusion, I think this contribution seems to improve the performance on the
RW 512 pareto scenario, without hurting any other in a provable way.
15 19845 < 20098.20 < 20200 "trunk + ela + RELAX=isb + outline atomics"
 9 19884 < 20097.33 < 20258 "trunk + RELAX=isb + outline atomics"
15 19780 < 20000.00 < 20247 "trunk + ela + RELAX=isb"
 9 19783 < 19996.00 < 20294 "trunk + RELAX=isb"
15 19755 < 19836.20 < 19904 "trunk + ela"
 9 19735 < 19826.89 < 19950 "trunk"

Perhaps there are better ways to test it, or to analyse it, but I think what I 
did is a good enough safety net.
I'll prepare the patch for code review soon.
[24 Sep 2020 18:38] Tzachi Zidenberg
Great! Thank you!
[30 Sep 2020 18:10] Paul DuBois
Posted by developer:
 
Fixed in 8.0.23.

Thanks to Tzachi Zidenberg, who contributed a patch for compiling
MySQL on aarch64 (ARM64).