MySQL Bugs: #100132: Using optimal memory-barrier (in form of acquire/release) for event-mutex

Bug #100132	Using optimal memory-barrier (in form of acquire/release) for event-mutex
Submitted:	7 Jul 2020 5:13	Modified:	22 Sep 2020 3:47
Reporter:	Krunal Bauskar	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S5 (Performance)
Version:	8.0.20	OS:	Any
Assigned to:		CPU Architecture:	ARM
Tags:	event-mutex, memory-barrier, performance

Description:
- EventMutex is home-grown spin-lock implementation that provide lock/unlock
  functionality. To facilitate this it uses a couple of atomic variables like
  m_lock_word and waiters with lock-word meant to track mutex lock/unlock
  operation and waiters to flag presence of waiter in turn need for a signal.

- Currently, EventMutex uses the default memory barrier offered by C++ atomic
  that defaults to sequential consistency. This order is too strict and it can
  be switched to use acquire/release order without affecting the core
  flow/correctness.

- In addition to the use of acquire/release barrier patch also identifies a
  redundant memory-fence that could be safely removed with the new proposed
  order.

How to repeat:
* I will submit the contribution patch. Please use it for evaluating performance and comparing it against baseline.

* We have seen performance improvement even with x86 (4-6%) so keeping the CPU arch open and not marking it specific to ARM.

patch for the bug (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: perf#eventmutex-memory-barrier.patch (text/x-patch), 2.25 KiB.

arm benchmark

Attachment: arm_eventmutex.png (image/png, text), 70.49 KiB.

x86 benchmark

Attachment: x86_eventmutex.png (image/png, text), 70.14 KiB.

Blog explaining the details is also published here.
https://mysqlonarm.github.io/Understanding-Memory-Barrier/

Hi Mr. Bauskar,

Thank you for your performance improvement report.

I have analysed it carefully. I consider that your analysis of the memory barrier management is quite correct.

I also like very much your patch.

This is now a fully verified performance improvement report.

Thanks a lot for everything and espeically for the patch.

Thanks appreciate mysql quick responses. Will now wait to see things in production :)

Hello,
thanks for looking into this interesting and complicated area and willingness to contribute!

At https://godbolt.org/z/H9oET- we can see three implementations:
- exit_old is what we currently have in mysql-trunk
- exit_proposed is the one proposed in the contribution
- exit_optimized is "something in between" where we only remove the std::atomic_thread_fence(std::memory_order_acquire) call

My observations:

1. exit_old() and exit_proposed() compile to different assembly on x86/x64, and in particular the assembly for exit_proposed() is buggy: it allows the processor to read the value of m_waiters before it flushes the write to m_lock_word from its store buffer. This is even acknowledged in the blog article https://mysqlonarm.github.io/Understanding-Memory-Barrier/ :

> release-barrier on lock_word followed by an acquire-barrier on waiter this could be reordered.

but somehow dismissed. Perhaps the author believes that compiler is the only agent which can reorder things, and if the "C++ standard should limit compilers from doing so" and thus assembly looks like the steps occur in right order, then its end of story. Unfortunately, the CPU itself also "reorders" things in various ways, in particular on x86/x64 the writes are buffered, and thus are published ("flushed") long after reads (unless one uses one of instructions which flush the buffer). If buffering happens, then following sequence of events is possible and leads to disaster:
1b. thread B calls lock():
2b. thread B sees CAS failed N times because that m_lock_word == true, because thread A holds the mutex
3b. thread B wants to fall back to sleep, which involves steps below
4b. thread B first reserves a wait slot
1a. thread A is calls exit_proposed():
2a. thread A enqueues a write of m_lock_word=false to store buffer, but does not flush it yet
3a. thread A reads m_waiters and sees it is still false
5b. thread B sets m_writers = true (it does not matter if it "immediately" publishes it or not, because thread A already saw it was false)
6b. thread B checks the value of m_lock_word one last time before going to sleep and sees it is still true (because change from A was not flushed yet) so it should go to sleep
7b. thread B goes to sleep
4a. thread A decides that there are no waiters so there is no need to wake up anyone
5a. thread A flushes the write to m_lock_word to the store buffer, but this does not matter because B already decided to go to sleep

End result: mutex is freed, but the thread B sleeps, and there is nobody who will wake it up.
(In practice it would "fix itself" when someone again tries to acquire this mutex, but this might never happen in some scenarios. Thankfully we have some fallback for that: the sync array is regularly scanned for threads waiting on things which are available, so in case of bug in logic like the one above they would got artificially woken up)

2. exit_optimized on x86/x64 compiles to the same code as exit_old (which might give false sense of correctness if we test only on x86/x64) but on ARM64 it exit_optimized compiles to the same code as exit_proposed, so will give new behaviour. As we test mostly on x86/x64 but not on ARM64 this might lead us to not realize there is some bug in this new behaviour (I am not ARM expert so I don't know, but that's the point: we will not know).

3. According to https://en.cppreference.com/w/cpp/atomic/atomic/store the proposed m_waiters.store(false, std::memory_order_acquire); in clear_waiters() is undefined behaviour:
> order must be one of std::memory_order_relaxed, std::memory_order_release or std::memory_order_seq_cst. Otherwise the behavior is undefined.

I think the author simply didn't realize what is the essence of the problem we try to solve in exit(). Which is not surprising, because we too, as a team, had a bug in this place exactly, and in similar places, without realizing. For recent incarnation see for example Bug #30819167 INNORWLOCKTEST DEADLOCKS ON ARM BECAUSE OF BARRIERS MISSING IN SYNC0DEBUG.CC
Thankfully we have this fallback sync array scan.
The problem is that we need to make sure that a write to m_lock_word "happens before" a read from m_waiters, which is very tricky to express in C++, because (see 3. again) C++ does not really have a concept of "acquiring store". Yes, one can try their luck with using m_lock_word.store(false,seq_cst) and hope for the best (this is `exit_optimized`), but there is really nothing I know of in C++ standard that would prevent moving the m_waiters.load(acquire) before such store. Sure, in practice, on x86/x64 we see that assembly for exit_optimized and exit_old looks the same. But then, when we look at assembly generated for ARM, we see that omitting the explicit `std::atomic_thread_fence(std::memory_order_acquire);` causes a change in the assembly (the `dmb ishld` is missing) so I, for one, worry that this enables ARM processor to read value of m_waiters from before publishing the m_lock_word written to by strlb instruction.
Therefore, I think that the current implementation is correct, but even removal of the atomic_thread_fence could lead to a bug difficult to debug, let alone changing the store's ordering from memory_order_seq_cst to memory_order_release, which would cause problems for sure.

I am against this patch as is.

Firstly, thanks for looking into this. As you rightly said memory barriers are complicated and that makes them challenging too.

----

1. The anomaly that I pointed out in my blog is addressed with the followup blog link that I have embedded in the same section of my blog. Let me requote the said external blog link one more time here for more clarity. [external blog https://preshing.com/20170612/can-reordering-of-release-acquire-operations-introduce-deadl...]. The external blog exactly addresses the question that confirms the fact that re-ordering of load-acquire, store-release (on different variables like in this case) is not possible as per C++ standards.

Said external blog, also posted compiler-generated code for confirming the said case and it is also being observed by the Godbolt link that you forward.All 3 variants (on all aarch+compiler combination) confirm the fact that they are not re-ordered.

If you find an aarch+compiler combination where it is re-ordered please share else let's go by the community confirmation.

---- That should help support the reasoning for the removal of explicit-memory-fence.

2. Coming back to buffered writes:

- As per barrier semantics if the atomic variable is released then it is bound to get visible by other thread using acquire.
[So atomic variables are supposed to get flushed to the last level ensuring all other threads see it].
To support this case let me given you another example from well-known author Fedor Pikus
https://www.youtube.com/watch?v=ZQFzMfHIxng (check slides 41/65).

Using the said analogy sequence you described doesn't hold
"2a. thread A enqueues a write of m_lock_word=false to store buffer, but does not flush it yet"
"6b. thread B checks the value of m_lock_word one last time before going to sleep and sees it is still true (because the change from A was not flushed yet) so it should go to sleep"

the release should flush the said memory to ensure the updated version is available to all the threads.

---- That should help support the reasoning for using acquire and release barrier on m_lock_word.

3. Sorry, I missed on the store case but that could be adjusted surely but not to use the release as currently is.

Code, as it is currently, is

if (m_waiter.load(memory_order_acquire)) {
clear_waiter(std::memory_order_release);
// real signal action loop
os_event_set.....
}

This can potentially cause a real signal action loop to get scheduled before clear_waiters which is not expected for sure.

------------------------------------------------------------------------

Introducing a fence is making it more restrictive at cost of performance.
A better approach is to switch too correct and optimal fence semantics.

Ad 1.
>  The external blog exactly addresses the question that confirms the fact that re-ordering of load-acquire, store-release (on different variables like in this case) is not possible as per C++ standards.

Could you please give the exact quote from https://preshing.com/20170612/can-reordering-of-release-acquire-operations-introduce-deadl...? Because, what I see in this article is completely opposite:
> Those rules don’t prevent the reordering of a write-release followed by a read-acquire. 

You wrote:
> If you find an aarch+compiler combination where it is re-ordered please share else let's go by the community confirmation.

I've already did: on x86/x64 if you have a `mov` from register to memory it's "a store", and a `mov` from memory to register is "a load", and if in assembly you see a sequence of "a store" followed by "a move" then the store can get buffered and be flushed long after the load has executed.
In particular in the godbolt link I've shared before (https://godbolt.org/z/H9oET- ) you might see exectly such a situation:

exit_proposed():                     # @exit_proposed()
        mov     byte ptr [rip + m_lock_word], 0
        mov     al, byte ptr [rip + m_waiters]
        test    al, 1
        jne     .LBB1_2
        ret
.LBB1_2:
        jmp     signal()    

One important thing to understand when looking into assembly is that whatever 'semantics' you want from the code at the C++ level, if they are not translated to assembly somehow, then they are lost. Either the compiler produces an assembly that has some particular statements (like `lock` prefix before some instructions gives the CPU exclusive access to the bus, or `xchg` which itself blocks the bus) or it doesn't - it will not help that you, as a programmer hoped for something additional.
If you want to make sure that the first `mov` gets flushed, then you need a different assembly for that to happen.
In particular the same link demonstrates how the old code gets compiled:

exit_old():                           # @exit_old()
        xor     eax, eax
        xchg    byte ptr [rip + m_lock_word], al
        mov     al, byte ptr [rip + m_waiters]
        test    al, 1
        jne     .LBB0_2
        ret
.LBB0_2:
        jmp     signal()  

Notice the differrence: iinstead of:
        mov     byte ptr [rip + m_lock_word], 0
we have:
        xor     eax, eax
        xchg    byte ptr [rip + m_lock_word], al
and indeed, `xchg` needs to lock the bus, and perform the exchange atomically, in particular it causes a flush of the write buffer.

Ad 2.
You make a common error in interpretation of how release and acquire cooperate.
The mere fact that one thread does acquire-load does not somehow cause the other thread to release something. How could it?
It's not like x.load(std::memory_order_acquire) executed in one thread, somehow magically tells the other CPU to flush the write buffer.
How can I know? Because I see that `x.load(std::memory_order_acquire)` compiles to a mere `mov` instruction.
So, what does the spec really tries to say about release-acquire semantics?
It tries to say that IF in thread T1 the x.load(std::memory_order_acquire) returns the same `value` as previously stored by another thread T2 using x.store(value,std::memory_order_release) THEN additionally T1 will see all the other writes to memory done by T2 up to the point of this store.
So, it tries to say something about all the *ORDERING* of writes to OTHER memory locations with respect to write to this single atomic.

There are 3 different points/optimization we are discussing

a. need for the fence.
b. use of acquire and release barrier.
c. wrongly place release while clearing waiter (that can potentially cause signal to re-order before it).

--------------

(a): I think the fence was added to avoid re-ordering of the said statements. The external blog I pointed is just about it. Let me point out relevant snippet again

<snippet>
However, I’m not so sure the compiler is allowed to reorder those statements. Not because of acquire and release semantics, but because of a different rule from the C++ standard. In working draft N4659, section 4.7.2:18 states:

An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

.....
Therefore, I believe the compiler shouldn’t reorder those statements, and deadlock is not possible.

.....
As a sanity check, I pasted Thread 1’s code into Matt Godbolt’s Compiler Explorer. Judging from the assembly code, none of the three major C++ compilers reorder those statements when optimizations are enabled.
</snippet>

I think the fence was added to avoid this re-ordering.

--------------

(b) use of acquire and release barrier.

Probably let's take a step back and see if we share same understanding of the existing implementation

thread-A                                       			thread-B
----------------------------------------------------------------------------------------------------

lock-word.store(false)					
(should cache in cpu-store-buffer)                      

                                                              waiter.store(true, release)
                                                              (should cache in cpu-store-buffer)

                                                              try_lock()
                                                              (will still fail as lock-word
                                                               is not set in main-memory)

                                                              goes to sleep()

atomic_fence
(this should cause lock-word to get flushed                   atomic_fence on thread-A will not cause
 to main-memory)                                              thread-B waiter to get flushed.

if (waiter.load(acquire)) {
(waiter loaded from main memory is still false and thread-A
 will skip signal routine)
  waiter.store(false, release);
  signal();
}

------- Q. With 2 threads do you agree this could potentially lead to a deadlock?

For better understand let's use gcc-5.3 (that is official supported by MySQL-8.0)
So the code now looks like this:

m_lock_word.store(false, (default=seq_cst)) => mapped too -> MOV instruction.

exit_old():
        mov     BYTE PTR m_lock_word[rip], 0
        mfence
        movzx   eax, BYTE PTR m_waiters[rip]
        test    al, al
        jne     .L4
        rep ret
.L4:
        jmp     signal()

--------------

(c) wrongly place release-barrier while clearing waiter

if (waiter.load(acquire)) {
  waiter.store(false, release);
  signal() .... this could potentially get re-order before release but will not cross the if loop boundary due to acquire
}

this clearly violates the said semantics

  /* The memory order of the array reservation and
  the change in the waiters field is important: when
  we suspend a thread, we first reserve the cell and
  then set waiters field to 1. When threads are released
  in mutex_exit, the waiters field is first set to zero
  and then the event is set to the signaled state. */

===================================================================================

I think confusion is arising since we are trying to mix atomic_fence (meant for synchronization
of non-atomic or relaxed memory order atomic) with atomic that uses acquire/release/seq-cst barrier.

Ad (a): You've provided a snippet, but the portions that you have omitted by "..." are important, because that's where the author talks about the `while` loop, which could in theory take infinite time, and thus you need make sure that a write to an atomic which in the source code appears before the `while` must also occur before the loop in the generated assembly. Otherwise the compiler would fail to satisfy the promise that every write should eventually become visible. So this is crucial, that the example involves `while` loop after the store. Without this `while` compiler is free to reorder the store with load, and this is clearly acknowleged by the author in the opening of the article.
As far I see, there is no `while` loop in the `exit()` implementation of our mutex.

Ad (b): In your ASCII art you wrote:
>   lock-word.store(false)					
>   (should cache in cpu-store-buffer)       
which I do not agree to for two reasons.
First, it mixes high level C++ concepts ("lock_word.store(false)") with low level assembly concepts for particular platform ("should cache in cpu-store-buffer").
So, I assume that what you mean here, is supposed to be interpreted as "On x86/x64, when using gcc 5.3 to compile this single line of C++ we expect emitted assembly to contain instructions which do not flush the cpu-store-buffer".
But even then, this is clearly wrong, because gcc for x86/x64 will emit either xchg, or mov+mfence for this single line of C++, both of which flush the store buffer.
If the second argument to store is omitted, then it is std::memory_order_seq_cst by default, and this means (among other things) that this store has release semantics, which in turn means, that whoever happens to observe result of this store, must also observe effects of any other (even non-atomic) stores before it. But to achieve it on x86/x64 you need to make sure that the store buffer (containing these ealier stores) gets flushed.
So, the options gcc or clang or any other compiler have here are limited, AFAIK, to: xchg, mov+mfence, and mfence+mov. The last one would also satisfy the condition that earlier writes should be visible to whoever sees the new lock_word value, however I haven't seen any compiler actually using this combination.

And you can see that there is `mfence` in the assembly you've pasted:
exit_old():
        mov     BYTE PTR m_lock_word[rip], 0
        mfence

Perhaps you were under wrong impression that the mfence in second line comes from the usage of `std::atomic_thread_fence(std::memory_order_acquire)`, because both have similar name ("*fence*").
But in reality, this `mfence` is part of what the `m_lock_word.store(false);` translates to, which you can easily see in the Godbolt output for gcc-5.3 generated for exit_optimized:

exit_optimized():
        mov     BYTE PTR m_lock_word[rip], 0
        mfence

Recall that the only difference between exit_old and exit_optimized is that the later lacks the std::atomic_thread_fence(std::memory_order_acquire) call, yet both produce the same assembly, and both contain this `mfence` instruction in particular.

So, when you ask: "Q. With 2 threads do you agree this could potentially lead to a deadlock?"
my answer is:
the code of exit_old (which is what we have in mysql-trunk now) can not lead to a deadlock, because we have std::memory_order_seq_cst store for m_lock_word, and on x86/x64 it translates to either mov+mfence or xchg, both of which ensure that store buffer is flushed.
But this is perhaps too low level.
The proof of correctness and deadlock avoidance should actually be in terms of higher level C++ semantics, and this is arguably very difficult to do, as I've already acknowledged - I even suspect that current (very conservative!) implementation could not be water-tightly proven correct using just the rules spelled out in the C++ standard.
But if we look at the assembly generated by compilers for x86/x64 and interpret it in the context of how these kind of CPU handles memory, it can be proven that the assembly can not deadlock, for following reason:
One of the two threads must go first with locking the memory bus: either it is the thread doing `mfence` (or `xchg`) in exit_old(), or it is the thread doing `mfence` (or `xchg`) after setting m_waiters to 1.
There must be some objective order of this two operations, because of how this CPU works.
If we assume that `exit_old` was first, then it follows, that the thread trying to acquire the lock, must have seen its not longer locked and thus will not go to sleep.
If we assume the contrary, that `exit_old` was second, then it follows, that it must have seen that m_waiters is non-zero and thus must have tried to wake up the other thread.

But proving it on the level of C++ semantics is problematic: there is nothing in the standard which says that there must be any objective order between stores to two different atomic variables, UNLESS they are both using std::memory_order_seq_cst. But unfortunately the store to m_waiters is using just std::memory_order_release, so this doesn't give us much.
You can read my question on SO about this issue, and a lot of smart answers to it:
https://stackoverflow.com/questions/60053973/how-to-achieve-a-storeload-barrier-in-c11
This is probably why the developer decided to additionally add `std::atomic_thread_fence(std::memory_order_acquire)` in here.
This is (as I've demonstrated with Godbolt) not needed (and not doing anything) for x86/x64.
It is required (or at least: was hoped to help) for theoretical correctness w.r.t. C++'s memory model, and on weakly-ordered platforms which we do not test like ARM, POWER etc.
And indeed (as I've demonstrated with Godbolt) at least on ARM64 this fence introduces a difference in assembly: it adds the `dmb ishld` instruction, which I believe is important to tie together the histories of the two otherwise independent atomic variables.

As stated above, I don't think I could prove correctness of the current implementation in trunk by just referring to C++'s rules.
So, I've tried to figure out implementations, which I believe are correct, and have these two proposals:
https://godbolt.org/z/n5Gv8e
// Following is only correct if we additionally modify
// set_waiters() implementation, so that it uses
// m_waiters.store(false)
void exit_correct(){
  m_lock_word.store(false);
  if (m_waiters.load()) {
      signal();
  }
}
// Following might be correct even without 
// changes to set_waiters()
void exit_correct2(){
  m_lock_word.exchange(false,std::memory_order_acq_rel);
  if (m_waiters.load()) {
      signal();
  }
}

They are inspired by respectively "Option A" and "Option B" in https://stackoverflow.com/questions/60053973/how-to-achieve-a-storeload-barrier-in-c11 so you can read the whole "proof" there.
The short version of a proof for exit_correct() is that it exploits that all seq_cst accesses for all atomic variables must form some total order (which is why it also requires us to change set_waiters() implementation to use seq_cst).
The short version of a proof for exit_correct2() is that it exploits that `exchange` is both a store and a load depending on how you look at it, and given that try_lock() uses compare_exchange_strong (which also is a store and load at the same time) you can prove that either the load part of the one will synchronize with the store part of the other, or the other way around, but either way you can use that to prove that stuff that appears before earlier operation has to be visible to stuff that happens after the later operation which is enough to prove that if someone went to sleep then they will be woken up.

Interestingly both of them compile to the same code on x86/x64 as current exit_old() on modern gcc and clang (except that for exit_correct() we would also have to modify set_waiters() to use memory_order_seq_cst instead of memory_order_release which unfortunatelly would introduce mfence of xchg).

Even more interestingly, and in line of your proposal, exit_correct() does not emit the `dmb     ishld`.
So there is a tradeoff here: we could use this implementation to make exit() faster on ARM at the expense of additional `mfence` during `set_waiters()`.
I think it might be a good trade off, actually, given that `set_waiters()` is called only when a thread already feels sleepy.
As for `exit_correct2()` it looks like it doesn't use `dmb ishld` but introduces a retry-loop which is probably even worse (I don't know).

So, given all this, I think we could consider a modified version of the patch in which we simply use seq_cst everywhere, but this would have to be tested for performance on both Intel and ARM, and is probably not what you wanted, so let me know what you think.

Sorry, in the above I wrongly wrote "Option A" where I should've written "the first solution in the question", because "Option A" is one using fences, and what I had in mind is the one using seq_cst for all stores and all loads which was the first solution presented there.
So, I wanted to mimick this part:
  std::atomic<int> x{0},y{0};
  void thread_a(){
    x.store(1);
    if(!y.load()) foo();
  }
  void thread_b(){
    y.store(1);
    if(!x.load()) bar();
  }

Sorry for adding to the confusion.

Sorry, I was wrong when I wrote this part:
> If the second argument to store is omitted, then it is std::memory_order_seq_cst by default, 
> and this means (among other things) that this store has release semantics, which in turn means, 
> that whoever happens to observe result of this store, must also observe effects of any other (even non-atomic) stores before it. 
> But to achieve it on x86/x64 you need to make sure that the store buffer (containing these ealier stores) gets flushed.

The first sentence is true, but the second is rubbish. On x86/x64 we actually get this property ("if you see result of a `mov` you also see results of those above it in assembly") for free, because either the buffer is flushed or not, but it can't happen that later write was flushed before ealier write. Sorry about this.

However, it is still true that for lock_word.store(false) on x86/x64 a compiler has to emit xchg or mov+mfence, but for a different reason.
This other reason, is that all std::memory_order_seq_cst atomic operations should form a single total order which agrees with partial order of operations done by given thread in its C++ source code, and this implies among other things, that if the same thread has an x.load() operation below in its source, then this x.load() has to appear after lock_word.store(false) to all observers. This means, that the `mov` which corresponds to `lock_word.store(false)` has to be flushed before the `mov` corresponding to this `x.load()`. Now, this could be achieved in several ways.
1. lock_word.store(false) could be translated to xchg
2. lock_word.store(false) could be translated to mov + mfence
3. x.load() could be translated to mfence+mov  
The last approach is considered shortly in https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html:
>  Note: there is an alternative mapping of C/C++11 to x86, which instead of locking (or fencing) the Seq Cst store locks/fences the Seq Cst load:
> C/C++11 Operation     x86 implementation
> Load Seq_Cst:         LOCK XADD(0) // alternative: MFENCE,MOV (from memory)
> Store Seq Cst:        MOV (into memory)
> As there are typically more loads than stores in a program, this mapping is likely to be less efficient. 
> We should also note that mixing the read-fencing and write-fencing mappings in one program 
> (e.g., by linking object files from two different compilers) can result in an incorrectly compiled program. 
> As a result, we strongly recommend to use the write-fencing mapping only. 

So, I still believe that the `mfence` is result of translating `lock_word.store(false)`, but I was wrong in attributing it solely to the "relase semantics" - it actually comes from the "single total order" semantics.
Sorry for confusion it may have caused.

After reading multiple versions, variants and semantics I think we better stick to code C++
and let gcc/clang do their work to ensure whatever we express in C++ is correctly being mapped
to the respective architecture.

-------------------------------------------------------------------------------------------------------

Patch proposes 4 main things:

1. While taking lock (m_lock_word.compare_exchange_strong), no following instructions should
get re-scheduled before the said statement.

C++ std::memory_acquire_order semantics suggest 2 things here:
- No followup operations could be re-ordered before this statement.
- If there was release executed on the same variable by other thread its effect will be observed.
[How cache-coherence is maintained is out-of-scope so let's leave it to GCC and arch].

What is not clear or missing is when the value is made visible but that should not be an issue for CAS
given it checks for existing state and then swap [this could be an issue for load only where while-loop
is needed but not even for the store as it sets value].

[EventMutex is designed to assume that m_lock_word may not be set immediately and has needed loop in place to re-try].

2. Same way while releasing the lock, flow needs to ensure nothing before said statement should get scheduled after the said statement.
This naturally suggests the use of a release barrier.

[Good example of this is existing spin-lock implementation as suggested in C++
https://en.cppreference.com/w/cpp/atomic/atomic_flag]

[We simply don't care when the effect of release is flushed or made available to other threads. Other threads have the needed checks in place].

3. Removal of the fence: If we look at C++ semantics explicit fence is meant to help synchronize operations for relaxed and non-atomic variables. We wrongly placed them for atomic variables (instead of relying on inherent atomic variables order/barrier).

m_lock_word.store(false, std::memory_order_release);
m_waiters.load(std::memory_order_acquire);
none of the arch+compiler output confirmed that these 2 operations can be re-ordered so placing
fence to control the re-order is redundant.

[Flow doesn't assume m_waiter.load() to return true so there is no need of a while loop.
If true, certain special action of the signal is kicked-off].

4. The flow needs to ensure that the signal loop order is maintained during the set and reset.

sync_array_get_and_reserve_cell()
m_waiter.store(false, std::memory_order_release);
[Release should ensure that sync_array_get_and_reserve_cell() is not scheduled beyond the said point].

m_waiters.store(false, std::memory_order_acquire); [and not release]
os_event_set()
[Acquire should ensure that os_event_set() is not scheduled before the said point].

--------------------------------------------------------------------------------------------------------

So seems like we got the context correct but as a programmer, we need to satisfy our-logic through some working proof.
Let's try to do that by remaining in C++ scope (as far as possible) and following C++ semantics only.

(1) and (2) are self-explanatory and verified by the C++ example too.
(3) has been confirmed by godbolt and so can be assumed as true. [unless we get a specific case that proves it to be wrong].
(4) again using C++ semantics comply with design.

Proof-Concept:

- What if thread-B set waiter to true and effect is not visible to thread-A when it loads the values.
- That's completely ok since it thread-A should never assume the value of waiter and if it assumes then it should enter in strong while/for a loop. In the EventMutex case, thread-A is not assuming waiter value.
- With 2 threads this could lead to a deadlock:
- If thread-A don't see waiter-value (as true)
- **AND** If thread-B fails to see lock_word values (as false) during its re-try-lock execution.
This should be handled by existing sync-array "artificially woken up mechanism" but it is wrong for main-flow
to bear the cost of this one-time mechanism.

-----------------------------------------------------------------------------------------------------------

Mixing the assembly code raises more questions:

a. mov+mfence .... does x86 assures that these 2 separate instructions (with different op-code) execute in an atomic way.
why can't context switch happen post mov.

b. How does lock cmpxchg ensure that store-buffers are flushed and cache line for all cores are invalidated?

c. just because x86 implemented mfence to drain store buffer it is not needed by semantics as confirmed here

https://software.intel.com/es-es/comment/1809815
"From a purely architectural perspective, the MFENCE instruction does not have to directly cause the draining of the store buffers on a processor .... in practice it makes sense to have the MFENCE instruction cause the store buffers to be drained"

Thank you, Mr. Bauskar.

Proposal of a patch which provides provable correctness in C++14-compliant environment and performance on ARM

Attachment: seqcst_mutex.patch (application/octet-stream, text), 7.05 KiB.

OK, I've spent a lot of time thinking about how to fix the implementation of the TTASEventMutex, so that it can be proven using C++14 standard that it does what it should.
In particular, that it will correctly wake up any thread which goes to sleep.
I believe that current implementation in mysql-trunk can not be proven correct in the light of C++14 standard, at least not by me, not in two days. (Which is not the same thing, as saying that I see a bug in the generated assembly: in practice the c++ code can translate on Intel and ARM to assembly which provably works, but in theory there might be a platform like POWER or Itanium, where it will break).

Therefore, I propose following patch[1], which:
1. Removes the unneeded `std::atomic_thread_fence(std::memory_order_acquire);` call (it was not helping in the proof in any way, and translated to a no-op on x86).
2. Changes all the operations on `m_waiters` to memory_order_seq_cst (this is crucial for the proof of correctness. I've spent 2 days trying to figure out a proof which would not require this change, but I could not, see my notes [2]). The cost of this change is that on Intel we have one additional mfence or xchg on the slow path leading to sleep, and another on the slow path leading to wake up - hopefully both should be rare. Anyway, correctness triumphs performance.
3. Provides a rigorous proof of the theorem that a thread which calls `sync_array_wait_event` will not end up sleeping indefinitely.
4. On ARM compiles to the same assembly as the code initially proposed by Krunal Bauskar, see https://godbolt.org/z/e1cha3 for comparison. I hope this is enough to satisfy their bug request.
5. On Intel compiles to assembly which does not have the bug present in Krunal Bauskar's patch, which let the CPU flush the store to m_lock_word after reading m_waiters.
6. Removes the unused `waiters()` method.

Dear, Krunal Bauskar, let me know if the patch [1] solves your bug request. 
If so, then I'll do my best to ask my team to review and incorporate it.

[1] https://bugs.mysql.com/file.php?id=29681&bug_id=100132

[2] Here's the problem with *not* using seq_cst. According to the 29.3.3:
"(...) memory_order_seq_cst operation B that loads a value from an atomic object M observes one of the following values:
  (3.1) the result of the last modification A of M that precedes B in S, if it
        exists, or
  (3.2) if A exists, the result of some modification of M that is not
        memory_order_seq_cst and that does not happen before A, or"

This opens a door for m_waiters.store(false,std::memory_order_relrelase) in clear_waiters() to pop-out as a provider of a value of almost any read of m_waiters in the future.
For example, I could not rule out following situation involving four threads: X(enter), Y(exit), Z(exit), W(enter):

    X1 epoc_counter=os_event_reset()
    X2 m_waiters.store(true)
    X3 unsuccessful m_lock_word.compare_exchange_strong(false,true)
    X4 os_event_wait(epoc_counter)

    Y1  m_lock_word.store(false)
    Y2  if (m_waiters.load()) {
    Y3    m_waiters.store(false,std::memory_order_release)
    Y4    os_event_set()
        }

    Z3   m_waiters.store(false,std::memory_order_release)
    Z4   os_event_set()

    W1  os_event_reset()
    W2  m_waiters.store(true)

in which relations mentioned in the C+14 standard are:

happens-before:  Z3 - Z4 - X1 - X2 - X3 - X4   
                 Y1 - Y2 - Y3 - Y4
            W1 - W2 /

S: X2 - X3 - Y1 - W2 - Y2

modification-order of m_waiters: X2, W2 ,Z3, Y3

This allows Y2 to read `false` written by Z3, and thus to never execute Y4, and never wake up the thread X!

One could think that if Z3 happens before X2, and X2 is before Y2 in S, then it is impossible for Y2 to read effects of Z3 (false) instead of effects of X2 (true). 
However, the exact wording of 29.3.3.2 doesn't care about X2 if it is not *the last* modification before Y2, so mere addition of an otherwise negligible thread W to the picture, along with its W2 operation, destroys the proof: now W2 is *the last* modification before Y2, and there is no `happens-before` relation between Z3 and W2, so now (3.2) allows Y2 to read effects of Z3!

I suspect this was somehow missed by the committee, because the latest version of the standard seems to be worded very differently, and in particular contains this statement:
"it also ensures that a memory_order::seq_cst load A of M gets its value either from the last modification of M that precedes A in S or from some non-memory_order::seq_cst modification of M that does not happen before any modification of M that precedes A in S."
Note, how "any modification of M that precedes A in S" used in new standard can refer X3, and thus can be used to rule out reading effects of Z3 by Y2.

I have reviewed the updated patch and surely it would solve the problem given we are moving to a complete sequential model. I also went over the proof and looks good. As proposed in the original patch the fence seems to be redundant and getting rid of it helps performance + the issue about wrong memory order while clearing the waiter is also getting addressed since everything is now seq-order.
I am still trying to convince myself and get some provable steps as to why acquire/release for m_lock_word will not work but eventually, if the low-level code is going to step back to the same op-codes proof would be just to make me feel happy and performance may continue to remain same.

I would suggest let's go ahead with the said version and if I find something we can consider it incrementally provided it helps improve performance for other cases too.

But we all agree existing things don't hold the test of time to make it proveable across the board.

Dear Krunal Bauskar, thanks for confirming what we agree about and being open-minded about the issues that still seem debatable.

You wrote 
> I am still trying to convince myself and get some provable steps as to
> why acquire/release for m_lock_word will not work 

which, I think I can help with.

First of all, the problem is not easy to spot looking just at the m_lock_word in isolation. If there were no m_waiters in the picture, then I agree that acquire/release semantics for m_lock_word would be perfectly fine, as usual in case of critical sections.

The difficulty only arises, once you notice that access to this variable has to be coordinated with accesses to m_waiters in a very specific way: we need to make sure, that the thread releasing the lock (A) will not miss to wake up the thread which went to sleep (B).
To make reasoning simpler, please consider this minimalistic version of the problem:

Thread A doing exit:
  m_lock_word.store(false, std::memory_order_release); //A1
  if(m_waiters.load(std::memory_order_acquire)){       //A2
    signal();                                          //A3
  }
Thread B doing enter:
  m_waiters.store(true);  //B1
  bool expected = false;
  if(!m_lock_word.compare_exchange_strong(expected,true)){ //B2
    sleep(); //B3
  }
}

Above code has a bug, which we can prove 
- theoretically, by referring to C++14 standard, 
- and practically, by looking at result of translation to assembly on x86/x64 and understanding how this particular processor family handles memory accesses.

https://godbolt.org/z/77obTx

exit():                               # @exit()
        mov     byte ptr [rip + m_lock_word], 0 //A1
        mov     al, byte ptr [rip + m_waiters]  //A2
        test    al, 1
        jne     .LBB0_2
        ret
.LBB0_2:
        jmp     signal()              # TAILCALL //A3
enter():                              # @enter()
        mov     al, 1
        xchg    byte ptr [rip + m_waiters], al //B1
        mov     cl, 1
        xor     eax, eax
        lock            cmpxchg byte ptr [rip + m_lock_word], cl //B2
        jne     .LBB1_2
        ret
.LBB1_2:
        jmp     sleep()               # TAILCALL //B3
m_waiters:
        .zero   1

m_lock_word:
        .zero   1

First the theoretical proof:
AFAICS following execution is possible in light of C++14:
  A1 happens-before A2 happens-before A3
  B1 happens-before B2 happens-before B3
  A2 loads false
  B2 loads true
  B1 is before B2 in the order S
  The modification order for m_waiters is: B1
  The modification order for m_lock_word is: B2, A1
as the loads of one thread do not read values stored by the other thread, there is no synchronizes-with relation between them.

If you can find anything in the standard which would refute above configuration of happens-before,loads,S,and modification orders, let me know.
But if you can not refute it, it means it is permissible by the C++14 standard, and thus thread B can go to sleep() while the thread A does not call signal().

Now the technical proof:
In the assembly, we see A1 and A2 are just simple mov operations, where A1 moves 0 to memory location, and thus can get buffered in write buffer, and A2 is a load from memory to a register, and thus is not buffered, finally, we don't see any mfence or xchg or `lock`-prefixed instruction, or anything else that could lead to flushing the write buffer, so we conclude that it is quite probable, that results of A1 will be flushed long after the A2 was executed.
Therefore it is quite possible, that what will happen (from the point of view of some wall-clock) is this:
- Thread A executes A1, but puts the result of the assignment to the write buffer, instead of publishing it immediately
- Thread A executes A2, that is loads m_waiters, and sees it is false
- Thread A seeing that m_waiters was false, returns from the function without calling signal()
- Thread B executes B1, m_waiters=true (publishing it immediately to the bus, because it uses xchg, but essentially the same scenario could simply happen even with simple mov, provided that we simply wait in this step of the timeline until the write buffer of B gets flushed, for example during the next instruction)
- Thread B executes B2, observing that m_lock_word is still true (because A's write from step A2 still sits in A's write buffer), thus decides to go to sleep
- Thread B executes B3, and goes to sleep()
- (optional) Thread A finally flushes the write buffer

Notes:
- Yes, the compiler did not reorder A1 and A2 when generating the assembly
- BUT, the compiled did not put any work into making sure that A1 and A2 will not be reordered by the CPU itself, in particular there is no `mfence` between A1 and A2, the A1 is not `lock`-prefixed, does not use `xchg`, etc. So x86/x64 CPU can, and does, simply defer flushing of A1 as long as it see fit.
- And this is not a bug in the compiler (as we see there is nothing in C++14 standard, that would force the compiler here to put such barrier) nor in CPU (as there is no reason x86/x64 should publish stores before executing loads - it is publicly known piece of this CPU's specification that they can freely "reorder stores after loads", but not the other three combinations. The purpose of write buffer is to let this happen as often as possible for performance reasons)

Now, let's see what really changes if we use seq_cst for both operations in exit(). How does it help to prove correctness?

First the theoretical change:
Assuming that thread B went to sleep in B3, it must be that B2 saw m_lock_word==true, and according to C++14 rules, this B2 should see the result of the last modification in the modification-order of m_lock_word. Let A be the thread which performed said modification (lets call it A0). 
Now consider operations A1 and A2 done by A when calling exit().
Clearly A1 happens-after A0 because it is sequenced-after A0 from A's perspective.
Thus A1 must be after A0 in modification-order of m_lock_word.
But B2 is the immediately next operation after A0 in said order, so A1 must be even further in it, somewhere after B2.
[So far, all of above could be said also for the previous implementation, but in what happens next we will start using the properties of seq_cst operations]
Now, all four operations {A1,A2,B1,B2} need to be arranged in a total order S, which has to agree with happens-before and modification-order of m_waiters and modification-order of m_lock_word.
So, we know that A1 is before A2 in S (because it is sequenced-before), and that B1 is before B2 in S (because of sequenced-before).
We also shown that A1 must be after B2 (because of modification-order of m_lock_word).
Thus we've proved that the order in S must be: B1,B2,A1,A2.
But, this in particular means that A2, which loads m_waiters, must read value from most recent seq_cst store to this variable in S (rule 29.3.3.1), which is either B1 or perhaps some completely other C, from some other thread which is somewhere between B1 and A2 in S. In the case it happens to be B1, the A2 will load true, and correctly call signal(), great.
In case C was some other thread calling enter(), the value read by A2 will also be true, great.
So, the only interesting case would be if C was a thread which stored `false` to m_waiters, but m_waiters.store(false) is only performed by threads calling exit() and is sequenced-before their call to signal(), which means that C `happens-before` a call to signal().
Either this signal() happens-after os_event_reset() done by B (in which case it will not sleep indefinitely, great) or happens-before this os_event_reset() which would introduce a cycle in S + happens-before:  C is after B1 in S, B1 happens-after os_event_reset() and os_event_reset() happens-after signal() which happens-after C.

What made this proof possible? That we now can invoke argument about "A1 must be after B2 in S" (which is only possible if A1 is seq_cst, thus appears in S at all) and derive that "A2 must read either B1 or some C which is after B1 in S" (which requires not only A2, but actually all modifications of m_waiters to be seq_cst).

Now the technical change:

exit():                               # @exit()
        xor     eax, eax
        xchg    byte ptr [rip + m_lock_word], al //A1
        mov     al, byte ptr [rip + m_waiters]  //A2
        test    al, 1
        jne     .LBB0_2
        ret
.LBB0_2:
        jmp     signal()              # TAILCALL //A3

So, now A1 is translated to `xchg` which is an instruction which on x86/x64 locks the memory bus and flushes the write buffer.
This has the implication (on x86/x64), that if another CPU (B) has in its assembly instructions B1 which flush a write to location read by A1, followed by a read B2 of location written to by A2, then (For simplicity, I assume only threads in the system are A and B):
- if A2 did not saw result of B1, then B2 must see result of A1. Why? Because the flush of A1 was before A2 which was before flush of B1 which was before B2.
- conversely, if B2 did not saw results of A1, then A2 must see result of B1. Why? Well, for one thing, (p implies q) is equivalent to (!q implies !p). But you can also see it directly, that the flush of B1 was before B2 which was before flush of A1 which was before A2.
In higher-level terms this gives us what we really need here: if A did not saw B going to sleep, then B must see release A released the lock, and if B did not saw the lock released by A, then A must see B went to sleep.

-------------

Hi Jakub Lopuszanski,

Thanks for the detailed explanation.

What continues to confuse me is how does CPU ensure that no context switched is done between [mov, mfence]?

---------------------

All of our proofs are based on the fact that store buffer will not be flushed immediately on change (when the normal store is done).
I have been digging some bit on this front including how ARM handles Load-Exclusive and Store-Exclusive operations.

Finally, I found something that looks conceivable.

As confirmed by the article too (from Intel) CAX operation should naturally cause other cores relevant cache-line to invalidate. (I presume caching protocol should limit it to cache-line and not enforce all store-buffer
invalidation but that is processor implementation optimization)

This obviously make sense because if CAX read the stale value and same is done by other CAX on other core they both may execute the same operations w/o exclusivity leading to the race.

----
https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Draining-store-buffer-on-othe...

Jim -- are you sure that a locked RMW operation will force the value to be flushed from the remote processor's store buffer? I don't see this as a documented mechanism for forcing store buffer flushing.

If it did not, then you could not possibly have two threads performing XADD without getting out of sequence. IOW the 2nd XADD could not evict the first thread's pending write (unless it could way-back the other thread like in TSX).

If the remote thread has a pending write the local thread could not perform an XADD using stale data (without corruption).
----

---------------------

With that as background, I tried to see how we can optimize the existing flow.

* m_lock_word is already using CAX and that is the only place it is loaded and evaluated.
* Other instances it is just stored. So even if the store is not flushed next in sequence CAX (from other or same core) will invalidate the cache line and get the latest value.

* m_waiters var was loaded independently but was also set immediately post evaluation. These 2 operations can be replaced with single CAX that will ensure m_waiters var is loaded with the latest value in an exclusive way.

https://godbolt.org/z/K8bKqq

> What continues to confuse me is how does CPU ensure that no context switched is done between [mov, mfence]?

AFAIK a context switch may happen in between the two. But a context switch itself requires full memory barrier, so whatever was in the store buffer, will get flushed as part of the context switching logic.

> All of our proofs are based on the fact that store buffer will not be flushed immediately on change (when the normal store is done).

I don't think that I was assuming this when proving correctness of my solution.
I only *permitted* it to happen when proving incorrectness of your solution.
In other words: the CPU can flush the store buffer in places additional to those specified explicitly by `mfence`, and my algorithm should still work fine.

A good mental model for how Intel processor works is described in "x86-TSO: A Rigorous and Usable Programmer’s Model forx86 Multiprocessors"
https://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf 
which I highly recommend, if you haven't read it.

In the model this paper presents, you can think that XCHG is implicitly LOCK'd, and 
> To execute a LOCK’d instruction, a thread must first obtain the global lock.
> At the end of the instruction, it flushes its store buffer and relinquishes the lock. 
> While the lock is held by one thread, no other thread can read.

So, if two threads attempt XCHG at the same time on Intel processor, then only one of them will be able to hold the global lock at the same time, and thus they will execute sequentially, and the later of them, will necessarily see the results of the previous one, because it must have flushed the store buffer before releasing the global lock.

I went over the said paper and it too proposes and confirms our analysis as true as done on intel forum (by Jim). I also connected with ARM expert and he too confirmed that exchange will cause flush of store buffer on other cores.

So exchange proposal/solution should help us maintain the needed memory order and achieve the needed effect and also prove that thread will not enter a deadlock.

Do you see any anomalies with the proposed alternative (proposed in godbolt)?
If no, I will roll the patch. I have been testing it locally.

Hi Krunal,

Whatever we end up implementing, I want it to:
1. Be formally correct according to C++14 standard's rules. In other words, it is not enough to demonstrate that assembly generated for Intel and ARM64 by gcc and clang looks good. We need to have a reason to believe that also on other standard-compliant platforms (including those which do not exists yet) the code will work correctly. Therefore, I treat godbolt as a tool that can *refute* some approach, but it can not really *prove* correctness. It can provide counter examples and intuitions, but the proof is something we have to figure out ourselves.
2. Be practical, that is it should generate performant assembly for the platforms we care about the most. Ultimately this should be confirmed by actual performance tests, but in case godbolt shows that new code compiles to the same assembly as the old code, we can reasonably expect performance to be the same. And if godbolt shows that new code removes some expensive operation, we should expect that performance should not deteriorate (or in case it does deteriorate, then it is because of some other flaw in the code worth fixing).

In your's godbolt link, I see:

  void exit_proposal1(){
    m_lock_word.store(false, std::memory_order_release);
    bool expected = true;
    if(m_waiters.compare_exchange_strong(expected,false,std::memory_order_acquire,std::memory_order_acquire)) {
      signal();
    }
  }

I think this can not be proven correct according to the standard, for the same reason as your earlier attempts: if the `m_waiters` access has only `memory_order_acquire`, then informally speaking there is nothing in the standard which would prevent code from above to slide below it.
Similarly, if m_lock_word access has just memory_order_release, then there is nothing preventing following lines below it from sliding above it.
Yes, on a particular platforms like Intel you get some additional guarantees (such as that exchange translates to instruction which additionally has to flush the buffer thus you get "release" semantics for free), but these are not part of C++14 standard and thus can not help in rigorous proof.
So, this "proposal1" does not achieve the goal number 1 in my view. Let me know if you can present rigorous proof of correctness for it - I can't. 

As for my second goal (the one about performance) I am not ARM64 performance expert, but my limited hands-on experience with one mysql performance bug on ARM64 was about high contention caused by multiple threads looping and invalidating each others attempts at changing a single address (it was just a single atomic::fetch_add instruction which compiles down to such retry-loop). IIUC, on ARM64 operation: 
   m_waiters.compare_exchange_strong(true,false,std::memory_order_acquire) 
gets compiled to a retry-loop:

.L16:                          // do{
        ldaxrb  w1, [x0]       //     w1 = m_waiters; acquire barrier; monitor.add(m_waiters)
        cmp     w1, 1          //     if(w1!=1)
        bne     .L17           //         break; 
        stxrb   w2, wzr, [x0]  //     m_waiters=0; w2= monitor.check(m_waiters)
        cbnz    w2, .L16       // }while(w2)    

Aren't you afraid that such looping can be costly? 
I don't have empirical performance data on this, so I trust your performance tests on ARM.

Another more obvious problem with your enter_proposal1 and enter_proposal2 is that you set `m_waiters` AFTER checking m_lock_word.
Clearly, such a code is completely wrong, as even in a naively simple world where each instruction executes fully exactly when it is its turn in the assembly (what a beautiful world would that be!:D), it can happen that the whole `exit_proposal[12]()` executes in between the compare_exchange_strong of `m_lock_word` and a store to `m_waiters`, in which case `exit_proposal[12]()` will not call `signal()`, but enter_proposal[12]() will call `sleep()` and thus will never be woken up.
A simple fix for that is obviously to swap the order of the two operation ins `enter_proposal[12]()`.
Also, the store to m_waiters "should" then have "memory_order_acquire" to prevent it from sliding back down, however there is no such thing as "acquire store" in C++14, so again, we enter the mythical "StoreLoad barrier" issue.
This can be fixed from the other side: we can force the instruction below the store to issue a release barrier, so that the store can not slide below it.
Again, there is no such thing like "release load" in C++14, but there are tricks involving read-modify-write or read-DONT-modify-write which give good enough approximation: we can exploit that they are both reads and writes and thus can have both acquire and release semantics.
In other words we need m_lock_word.compare_exchange_strong to have at least the release semantics, to prevent store on m_waiters from sliding below it.
Which brings me back to the problem mentioned earlier: that using memory_order_acquire is insufficient for the exchanges.
Thus I would propose to change the ordering to memory_order_acl_rel as is done for example in already existing functions rw_lock_debug_mutex_enter() and rw_lock_debug_mutex_exit() which you can find in sync0debug.cc and as it was proposed in "Option B" of 
https://stackoverflow.com/questions/60053973/how-to-achieve-a-storeload-barrier-in-c11 mentioned earlier

Also, the esense of Option B and rw_lock_debug_mutex_enter/exit() is to use read-modify-write operations in both: enter and exit on *the same variable*. Why it's so important? Because then we can exploit the C++14's guarantee that read-modify-write operations have to read most recent modification in variable's modification order, and thus if they have memory_order_acq_rel then they synchronize-with each other in a chain-like fashion, giving rise to a happens-before chain.
As we already (have to!) use read-modify-write for m_lock_word in enter(), it makes sense to use this variable as the cornerstone of this trick.

This would lead to this BUGGY implementation:

void BUGGY_enter_proposal3(){
  m_waiters.store(true, std::memory_order_release); //"release" not "relaxed" so that it stays below os_event_reset() 
  bool expected = false;
  if(!m_lock_word.compare_exchange_strong(expected,true, std::memory_order_acq_rel, std::memory_order_acq_rel)) {
    sleep();
  }
}
void BUGGY_exit_proposal3(){
  m_lock_word.exchange(false, std::memory_order_acq_rel); //we ignore the result, but care about the barriers
  if(m_waiters.load(std::memory_order_relaxed)) {
    m_waiters.store(false,std::memory_order_relaxed); // <- BUG! we need to force it to happen-before signal
    signal();
  }
}
Here, again we have the similar problem: we need to make sure that m_waiters.store(false,..) happens before signal().
(Note that we have already released a mutex at this point, so there might be already another round of whole algorithm going on, so we must avoid a situation where we overwrite m_waiters (set to true during next round) with false without waking up anyone!)
One way to achieve it, is again to use read-modify-write operation:
void enter_proposal3(){
  m_waiters.store(true, std::memory_order_release); //"release" not "relaxed" so that it stays below os_event_reset() 
  bool expected = false;
  if(!m_lock_word.compare_exchange_strong(expected,true, std::memory_order_acq_rel, std::memory_order_acq_rel)) {
    sleep();
  }
}
void exit_proposal3(){
  m_lock_word.exchange(false, std::memory_order_acq_rel); //we ignore the result, but care about the barriers
  if(m_waiters.exchange(false, std::memory_order_acquire)) {
    signal();
  }
}
This one, I believe, is correct. (That is I don't see any obvious bug in it despite looking at it for 10 minutes).
Why? Because if enter_proposal3() went to sleep() it must mean that its m_lock_word.compare_exchange_strong(expected,true,...) is before exit_proposal3's m_lock_word.exchange(false,..) in the m_lock_word's modification order, and thus the later synchronizes-with the earlier (perhaps indirectly, if there were other threads in the middle which also performed compare_exchange_strong on this variable).
But, this means that m_lock_word.exchange happens-before m_lock_word.compare_exchange_strong and in consequence:
m_waiters.store(true, ...) happens before if(m_waiters.exchange(false, ..)).
So, either this `if` sees the value `true` put there by `enter_proposal3`, or some other thread has already changed it to false.
In the first case, exit_proposal3() will call signal() after enter_proposal3 called os_event_reset(). Great!
In the later case, this other thread which changed true to false, must synchronize-with m_waiters.store(true, std::memory_order_release); so that this threads sleep() happens-after os_event_reset(). Great!

Also, I am not sure which one is cheaper:
  bool expected = false;
  if(!m_lock_word.compare_exchange_strong(expected,true, std::memory_order_acq_rel, std::memory_order_acq_rel)) {
or simply:
  if(!m_lock_word.exchange(true, std::memory_order_acq_rel, std::memory_order_acq_rel)) {
I found people claiming that at least on Intel there is no difference between the two.
It is perhaps something you might want to experiment with on ARM.
(Same transformation `exchange` <-> `compare_exchange_strong` could be applied to m_waiters variable, again, I am not sure if it gives any performance gains on ARM).

I must say that "proposal 3" looks fairly complicated to my eye, when compared to my most recent proposal, which was simply seq_cst everywhere and seems to generate relatively reasonable assembly.

One downside of exit_proposal3(), is that on Intel, even in the "quick path" where there are no waiters, it has to perform two xchg operations.
I believe this can be fixed like this:

void exit_proposal3b(){
  m_lock_word.exchange(false, std::memory_order_acq_rel); //we ignore the result, but care about the barriers
  if(m_waiters.load(std::memory_order_acquire)){
    m_waiters.exchange(false,std::memory_order_acquire);
    signal();
  }
}
the "proof" remains relatively similar. But the code "quick path" now has just one xchg.
Interestingly exit_proposal3b compiles down to the same assembly as exit_proposal_simplest on Intel. 
(However enter's assembly seems simpler for proposal 3 than for proposal simple as store to m_waiters is not seq_cst and thus does not need mfence nor xchg on Intel)

Even more interestingly, enter_proposal3 and enter_proposal_simplest compile to same asssembly on ARM64!
This seems to be because on ARM64 there is no big difference between x.store(v,memory_order_release) and x.store(v,memory_order_seq_cst) and because compare_exchange_strong with memory_order_acq_rel does not differ much from seq_cst.
However, there is a difference on ARM64 between exit_proposal3b and exit_proposal_simplest. 
I leave it to you to decide which of the two is more performant on ARM64 in your experiments, but at first glance it looks like exit_proposal_simplest has "simpler" assembly.

Here you can see the comparison of the versions I consider correct: 
https://godbolt.org/z/6Te4MM

So, to summarize: once I "fix" all your proposal's bugs, I end up with exit_proposal3b which seems to:
- be more complicated in terms of C++ code complexity
- compile to same assembly on Intel except for one less flush in slow path of enter
- compile on ARM64 to something which looks more complicated (involves retry-loops, ldaxrb instead of ldarb etc.) 

So, I still believe that the current "simplest" proposal is the best.

Also, please note, that for 8.0.21 we will focus on gcc 9.3 and higher

Hi Jakub,

Yes. Code has to be semantically correct that will ensure it works on all platforms. C++ as standard specify abstract semantics and respective architecture implement them pratically. So we take help of C++ from semantics perspective.

If the code is semantically correct then it is bound to pass C++ model but surely we don't prove the model for each and every code we write by taking it to standard and then converting it. I don't see MySQL or other software would ever evolve if we continue to map things to abstract model ignoring how respective arch implement it.

-----------------------------------------------------------------------------------------------------------------------------

That apart if we come back to the original code then let me help clarify some doubts.

1. ref: exit_proposal1()

  void exit_proposal1(){
    m_lock_word.store(false, std::memory_order_release);
    bool expected = true;
    if(m_waiters.compare_exchange_strong(expected,false,std::memory_order_acquire,std::memory_order_acquire)) {
      signal();
    }
  }

I have already pointed the external-blog that clearly proves the point that store-release will not be reordered with load-acquire.
(Infact the blog also points to C++ standard why this is not possible).
None of the known godbolt assembly proved it to be wrong.

[If this is the point of contention then I don't have any more pressing reasons beyond one that already presented].

2. ref: exchange flushing store buffer. I thought this was used as a point to prove why the intial solution may not work.

   Anyway, I think it is logical (and doesn't even need C++ standard) that an exchange reading stale value can cause race condition in code (intel forum expert too has confirmed it).

3. ref: performance on arm64 due to loop.
   arm has support for xchg like instruction named casalb but it needs +lse support (default with arm-v8.1).
   I am still trying to figure out how to compile it for target architecture (different scope).
   
   Even with the said looping when I tested it locally the said proposed patch (using exchange) has proved
   to show performance improvement.

4. enter_proposal1 I was considering the 1st CAX and seems like you assumed to be 2nd CAX.
   Let me re-write it to make it complete (like we have EventMutex).

  void enter_proposal1(){
    bool expected = false;
    if(!m_lock_word.compare_exchange_strong(expected, true, std::memory_order_acquire, std::memory_order_acquire)) {
      pause_and_retry_logic(N secs);
      reserve_slot();
      m_waiters.store(true, std::memory_order_release);
      if(m_lock_word.compare_exchange_strong(expected, true,std::memory_order_acquire, std::memory_order_acquire)) {
         release_slot();
         return;
      }
      sleep();
    }
    return;
  }
 
  This should also help address the instructions interleaving comment.
  t1 calling enter() -> critical-section->                     exit (no signal)
  t2 calling                               enter() -> pause ->                  -> waiter=true -> try_lock (will get it)

5. With the added completeness reserve_slot() can't get below store-release and load/xchg-acquire can't go before store-release.
   So inherent order is maintained.

6. "m_lock_word.compare_exchange_strong to have at least the release semantics, to prevent store on m_waiters from sliding below it"
   I don't think store-release can slides below load-acquire given both are atomic statements.
   Slide/Re-order is applicable for normal statement (or relaxed). Again, will refer to Fedor Pikus presentation.
   Would be happy to findout more if store-release can by-pass load-acquire (even on 2 different atomic variable).

7. I am not sure why the example of relaxed memory order was infact needed. It just adds to the confusion.

8. I surely agree proposal3 is complicated than the simple seq_cst and both seems to generate closely same code. So no gain I can think off.

-----------------------------------------------------------------------------------------------------------------------------

Let's take this to closure. We can continue to discuss which proposal comply with the C++ standard and make use of assembly to help prove correctness but that doesn't help the main issue of performance + existing correctness.

As attested before seq_cst solution of-course looks correct but I feel it is not optimal (we can have different views) but it is also true that my proposed solution (release/acquire) doesn't make an additional dent-making difference in the performance.

**** So may be you can check if seq_cst can be adapted.****

--
(Just small suggestion as we explore now m_waiter.load() and m_waiter.store() could be combined with seq_cst order. can help reduce explict fence).

void exit_proposal2(){
  m_lock_word.store(false);
  if(m_waiters.exchange(false)){
    signal();
  }
}

-----------------------------------------------------------------------------------------------------------------------------

thanks for tip for gcc-9.3 we really need to move ahead from gcc-5.3.

> I have already pointed the external-blog that clearly proves the point that store-release will not be reordered with load-acquire.

This blog post says something completely opposite, and I've already pointed it to you in one of my previous comments.

It seems to me, that you focus only on the order of the statements in the generated assembly, but do not take into account that CPU may execute this assembly out of order, unless the assembly contains `mfence` instructions. Otherwise, why would assembly even have such a keyword like `mfence`?

In particular I agree that:
    x.store(false,std::memory_order_release);
    z=y.load(std::memory_order_acquire);
on Intel will probably compile to assembly which has
    mov x <- false
    mov z <- y
in the same order as it was in C++ source, but later on, the CPU may decide to execute the two in what appears to be the opposite order.
(to be more precise: it will execute stuff in parallel, and let the x be flushed from store buffer later).

Let me give you an empirical proof of this reordering of store-release and load-acquire happening in reality:

[jlopusza@supra06]/export/home/tmp/jlopusza/mysql: cat blah.cpp
#include<atomic>
#include<thread>
#include<iostream>
std::atomic<bool> x{false},y{false},go{false};
bool y_seen_by_a,x_seen_by_b;
void a(){
    while(!go){};
    x.store(true,std::memory_order_release);
    y_seen_by_a= y.load(std::memory_order_acquire);
}
void b(){
    while(!go){};
    y.store(true);
    x_seen_by_b= x.load();
}
int main(){
  for(int i=0;i<10000;++i){
      go=x=y=false;
      std::thread ta(a);
      std::thread tb(b);
      go=true;
      ta.join();
      tb.join();
      if(!y_seen_by_a && !x_seen_by_b){
          std::cerr << "error in iteration " << i << std::endl;
      }
  }
  return 0;
}

In this simple demo, `go` is used just to start make sure both threads execute the code in parallel as soon as both of them notice go==true.
The goal of this demo is to demonstrate, that it *is* possible, that both threads will report seeing (respectively) y==false and x==false, despite them setting (respectively) x=true and y=true before the check. This clearly means, that at least one of the writes had to be reordered after a read, because otherwise, at least one of the writes would have to be seen by the other thread.

I'm compiling on an Intel processor, using this compiler:

[jlopusza@supra06]/export/home/tmp/jlopusza/mysql: /opt/rh/devtoolset-9/root/bin/g++ --version
g++ (GCC) 9.1.1 20190605 (Red Hat 9.1.1-2)
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

And I get following results:

[jlopusza@supra06]/export/home/tmp/jlopusza/mysql: /opt/rh/devtoolset-9/root/bin/g++ -O2 -lpthread -o blah blah.cpp && ./blah
error in iteration 1158
error in iteration 1984
error in iteration 2144
error in iteration 3246
error in iteration 5054
error in iteration 5626
error in iteration 8764

And if I pass `--save-temps` flag to compiler, so I can see the generated assembly, I see:
.L5:
        movzbl  go(%rip), %eax
        testb   %al, %al
        je      .L5
        movb    $1, x(%rip)
        movzbl  y(%rip), %eax
        testb   %al, %al
        setne   y_seen_by_a(%rip)
        ret
        .cfi_endproc

which confirms, what I try to tell you: yes, in the assembly the move from x is before the move to y, but the CPU can execute these two `mov`es in such a way that the value of y appears to be from future (or equivalently: the write to x seems to be deferred into the future).

Also, if I modify a()'s code so that it uses seq_cst for both x.store and y.load, then there is no longer any error reported, and the assembly contains an `mfence` between the two `mov`es.

1. I am not sure why you feel the blog is talking opposite

   Here is what I understood:
   The blog helps support the point that store-release (on x) can't be re-ordered with following load-acquire (on y).

   [Author starts with negation and prove how this is not possible. If that is creating confusion].

<snippet from blog>
However, I’m not so sure the compiler is allowed to reorder those statements. Not because of acquire and release semantics, but because of a different rule from the C++ standard. In working draft N4659, section 4.7.2:18 states:

An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

.....
And if the compiler can’t rule out that the while loop is infinite, then it shouldn’t reorder the first highlighted line to occur after the loop. If it moves that line after an infinite loop, then it is violating §4.7.2:18 of the C++ standard.

.....
Therefore, I believe the compiler shouldn’t reorder those statements, and deadlock is not possible. [Note: This is not an iron-clad guarantee; see the update at the end of this post.]

.....
Anthony Williams (author of C++ Concurrency in Action) states in the comments that he doesn’t think the above example can deadlock either.
</snippet from blog>

2. Compiler re-ordering vs CPU re-ordering. This point is also raised in the same blog and replied by the author.
   Also, to rule it out using RMW can avoid reading stale value.

3. Coming back to your experimental program effect you are observing it due to store-buffering (as you too have attested above).
   There are 2 ways to solve it. Move to seq_cst model or using RMW.

* I tried the said program replacing everything to use acquire/release. Got the needed error.
* Then to fix it I tried 2 approaches:

approach-1:
-----------

void a1(){
    while(!go){};
    x.store(true,std::memory_order_seq_cst);
    y_seen_by_a= y.load(std::memory_order_seq_cst);
}
void b1(){
    while(!go){};
    y.store(true,std::memory_order_seq_cst);
    x_seen_by_b= x.load(std::memory_order_seq_cst);
}

approach-2:
-----------

void a2(){
    while(!go){};
    x.exchange(true,std::memory_order_release);
    y_seen_by_a= y.load(std::memory_order_acquire);
}
void b2(){
    while(!go){};
    y.exchange(true,std::memory_order_release);
    x_seen_by_b= x.load(std::memory_order_acquire);
}

Both version never emitted the error.

https://godbolt.org/z/nd5zEa

Inline with *one of* the proposed CAX solution

-----------

But as I said yday on overall front there is no major performance difference so adpating either solution is fine. 

So I presume you will be checking for adaption of patch accordingly.

Let me, once again, bring your attention to the important role, played by the `while` loop, in the blog sentence you have pasted from the blog:
> And if the compiler can’t rule out that the while loop is infinite, then it shouldn’t reorder the first highlighted line to occur after the loop. If it moves that line after an infinite loop, then it is violating §4.7.2:18 of the C++ standard.

Yes, compiler can not move a store below an infinite loop.

But in EventMutex's exit() and enter() we are not having any loop below the store. We have an acquire load.

And, as you've seen, by running the demo I gave you, a release store CAN be reordered with a subsequent load acquire.
The reorder happens because C++14 rules permit it, and thus compiler is allowed to emit an assembly which does not enforce any ordering between the two, and x86 CPU takes advantage of it and reorders them as it sees fit during execution.
And yes, the very same C++ source may behave (and does behave) correctly on ARM, for the simple reason that on ARM64 release stores, and acquire loads are implemented using the same assembly instructions as seq_cst stores and seq_cst loads respectively. This is just a "happy coincidence" for this particular platform and we can not use a C++ code which only works on ARM64 but not on Intel in MySQL.
You can see this at https://godbolt.org/z/9PzWf6 - there's no difference in generated assembly on ARM64, but on Intel to get seq_cst compiler has to additionally emit `mfence`.
I hope this explains :
1. Why there is no performance difference on ARM64 between store(release)+load(acquire) vs store(seq_cst)+load(seq_cst) ? Because it is the same assembly on ARM64.
2. Why I can not accept a solution which uses store(release)+load(acquire)? Because it is theoretically flawed, and in practice causes a deadlock on Intel.

As for your "approach-2" which uses exchanges on two different variables, can you prove correctness of it?

updated patch

Attachment: patch.bug100132 (application/octet-stream, text), 2.17 KiB.

PFA the updated patch. It uses CAX along with improved mem-barrier (acq_rel).

1. Use of CAX ensures full fence behavior and avoid re-ordering
   ref: https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-268.h...

2. Use of acq_rel order for CAX ensures ensure the store-release synchronizes with load-acquire.

   
https://godbolt.org/z/WTs6rz

Hello Krunal,
1. Do you have a proof of correctness for your latest proposal?
2. Do you have performance test results comparing your latest proposal (https://bugs.mysql.com/file.php?id=29709&bug_id=100132) to the one I've proposed on 13 Jul 14:38 (https://bugs.mysql.com/file.php?id=29681&bug_id=100132)?
3. In the godbolt link you've provided in point 2 (https://godbolt.org/z/WTs6rz) there is the same bug again: in both enter_existing and enter_proposed you set m_waiter.store AFTER checking m_lock_word. As I've already explained this can lead to a deadlock if the current holder of the lock will execute the whole exit() procedure in between the two lines.

4. Is there any specific reason you pass std::memory_order_acq_rel to compare_exchange_strong calls instead of the default std::memory_order_seq_cst? It seems to me that it produces exactly the same assembly on both Intel and ARM https://godbolt.org/z/GfdooE yet makes it much harder to prove anything.
5. Is there any specific reason for using compare_exchange_strong instead of using exchange?

1. Do you have a proof of correctness for your latest proposal?

* I presume the patch is self-explanatory (after all the discussion and feedback) but let me still present it.

-----------------------------------------------------------

thread-A                              thread-B
-----------------------------------------------

obtain lock (cas)
enter critical section
                                 try to obtain lock but fails
                                 set waiters (from false -> true)
                                 [possibility: in store buffer not flushed]
                                 re-try obtaining the lock
				 sleep

lock-store (false)
[again in store-buffer]

check if the waiter is true
[being cas it would invalidate
 other core cache line and get
 exclusive access].

waiter is loaded as true
invokes signal.

proof for cas needing exclusive access was already presented from intel forum and pretty natural for anyone to derive.

-----------------------------------------------------------
(use-case based on the scenario you pointed back on [8 Jul 13:11] Jakub Lopuszanski).

thread-A                              thread-B
-----------------------------------------------

obtain lock (cas)
enter critical section
                                 try to obtain lock but fails
lock-store (false)
[again in store-buffer]

see the waiter = false
(don't call signal)
                                 set waiters (from false -> true)
                                 [possibility: in store buffer not flushed]
                                 re-try obtaining the lock
                                 [again being cas will cause other core cacheline
                                  to invalidate and get exclusive access].
                                 get lock proceed.

-----------------------------------------------------------

more such combinations could be rolled out based on same line.

idea is simple that cas causes exclusive access so last modified value of the respective variables are read.

-----------------------------------------------------------

* Flow uses CAS or store operation on the respective variables.
  store-release (on lockword or waiter) synchronizes with load-acquire from cas.
  cas uses store-release order (during store) to synchronize with load-acquire from another cas.

(check below why acq_rel is used)
===========================================================================================

2. Do you have performance test results comparing your latest proposal (https://bugs.mysql.com/file.php?id=29709&bug_id=100132) to the one I've proposed on 13 Jul 14:38 (https://bugs.mysql.com/file.php?id=29681&bug_id=100132)?

Yes, I have been executing perf-test on most of the proposal (including seq-cst) and that's how I quoted
this "my proposed solution (release/acquire) doesn't make an additional dent-making difference in the performance."
when I was proposing the other solution.

===========================================================================================

3. In the godbolt link you've provided in point 2 (https://godbolt.org/z/WTs6rz) there is the same bug again: in both enter_existing and enter_proposed you set m_waiter.store AFTER checking m_lock_word. As I've already explained this can lead to a deadlock if the current holder of the lock will execute the whole exit() procedure in between the two lines.

May be you missed my reply [[16 Jul 13:56] Krunal Bauskar check point-4].
Let's one more time sync on the existing code flow

try_lock()
  - if fail try spin_and_try_lock()[N attempt]
  - if no success reserve slot
  - set waiters.true()
  - give another try for lock using try_lock (4 times)
  - sleep_till_signaled()
so 1st cas corresponds to 1st attempted try_lock before spinning.

  void enter(uint32_t max_spins, uint32_t max_delay, const char *filename,
             uint32_t line) UNIV_NOTHROW {
    if (!try_lock()) {
      spin_and_try_lock(max_spins, max_delay, filename, line);
    }
  }
 
Anyway, godbolt was shown just to project assembly instruction but now that I have attached patch you can refer to it if godbolt is confusing you.
If you still see issues with the patch (that can expose deadlock) please help point out.
===========================================================================================

4. Is there any specific reason you pass std::memory_order_acq_rel to compare_exchange_strong calls instead of the default std::memory_order_seq_cst? It seems to me that it produces exactly the same assembly on both Intel and ARM https://godbolt.org/z/GfdooE yet makes it much harder to prove anything.

CAS is RMW operations and involves load-acquire and store-release so it is important that CAS uses ordering
on both front as if m_lock_word = true then CAS = load-acquire only operation.
Also, store-release from CAS will synchronize with followup modification order load-acquire CAS.

Also, acq-rel as per C++ standard will not allow any re-ordering before and after it.
"No memory reads or writes in the current thread can be reordered before or after this store."

This also helps ease our question of store-load re-ordering.
(BTW, the said issue is already eased out on x86 with use of CAS but we want the code to stand the test of time with other architecture).

acq-rel is meant for RMW and tend to serve purpose and clarity better than seq-cst.
[Unfortunately, we can't code based on asm for x86 and arm otherwise lot of previous solutions could be hold valid].

===========================================================================================

5. Is there any specific reason for using compare_exchange_strong instead of using exchange?

exchange is force store always vs cas can turn into load-acquire and avoid store.

proposal based on m_waiters.exchange(..) used consistently

Attachment: rb.2 (application/octet-stream, text), 6.18 KiB.

Hello Krunal,

Taking all of the above into account I can see two correct solutions which we might consider for merging into trunk (after careful code review, which hasn't finished yet).

Solution 1 is the one I've already proposed on 13 Jul 14:38: https://bugs.mysql.com/file.php?id=29681&bug_id=100132 (Please note though, that during ongoing CR it was already identified that the proof itself requires correction, so don't pay too much attention to the comments in the code. The code itself should be correct, though)

Solution 2 is something completely new and relies on accessing m_waiters only using seq_cst atomic<bool>::exchange operations. I've uploaded it today https://bugs.mysql.com/file.php?id=29718&bug_id=100132 (Please note that this was not yet code reviewed, but given that this mimics existing solution from sync0debug I highly expect it to be correct).

What would be very helpful at this stage, is if you could provide us with performance test results comparing the following three implementations:
- baseline = latest mysql 8.0.x you have access to
- baseline with Solution 1 patch applied https://bugs.mysql.com/file.php?id=29681&bug_id=100132
- baseline with Solution 2 patch applied https://bugs.mysql.com/file.php?id=29718&bug_id=100132

This could help us make an informed decision which of the two solutions to choose.
Could you please let me know if you can provide such results (otherwise we will pick the version ourselves)?

benchmark

Attachment: zipfian-x86.png (image/png, text), 129.72 KiB.

benchmark

Attachment: uniform-x86.png (image/png, text), 128.43 KiB.

benchmark

Attachment: zipfian-arm.png (image/png, text), 129.38 KiB.

benchmark

Attachment: uniform-arm.png (image/png, text), 130.82 KiB.

1. If provided a choice (between solution-1 and solution-2 (exchange)) I would opt for seq-cst.

   I am really not sure why the other flavor was added because I also foresee it to be sub-optimal given it is doing exchange but ignoring value and using another exchange which could be simple load.
[also sync0debug uses acq_rel fence vs seq-cst fence for exchange and case is simpler given single atomic].

2. I would also like to know what shortcoming do we see re-submitted approach of acq-rel.

3. I have evaluated the performance of seq-cst (Solution-1) and acq-rel (one that re-submitted. just for reference). [If need arise will re-generate it for Solution-2 but will may need time].

Attaching results you can take a look.

Hello Krunal,
Thank you for sharing available data.

AD1.
> If provided a choice (between solution-1 and solution-2 (exchange)) I would opt for seq-cst.

OK, good to know.

> I am really not sure why the other flavor was added because I also foresee it to be sub-optimal given it is doing exchange but ignoring value and using another exchange which could be simple load.

It was added, because it is the "closest approximation" of you "re-submitted approach of acq-rel" which I could come up with, that is provably correct. Also, it has some potential to be faster than "Solution-1" on Intel, because it needs just one, not two, synchronizing operations during exit with waking up waiters. Also, it has a nice property that the proof of correctness doesn't rely on implementation details of os_event_rest/set/wait(low) to the same extent as the proof of correctness for "Solution-1". Also it is very similar to existing, tested, solution in sync0debug.
So, "Solution-2" has some nice properties we care about (provable correctness, modularity, "similar to existing").

> [also sync0debug uses acq_rel fence vs seq-cst fence for exchange and case is simpler given single atomic].

To be precise, sync0debug does not use `acq_rel fence` - it uses `acq_rel exchange`. The difference is very important, for example `std::atomic_thread_fence(std::memory_order_acq_rel);` doesn't really do anything on Intel ( it compiles to no-op on Intel https://godbolt.org/z/9oeYYs ) and its semantic given by the C++14 standard are quite difficult to grasp.
What the sync0debug uses is `rw_lock_debug_waiters.exchange(.., std::memory_order_acq_rel)` for which C++14 standard gives different guarantees, ones that we care about.
For example C++14 mandates, that all `exchange` operations on a variable are always considered read-modify-write operations, and thus take part in modification-order for that variable, chain together in Release Sequences, must see most recent write in the said modification-order, etc.

Indeed, I agree that "Solution-2" could be changed to use exchange(.., std::memory_order_acq_rel) instead of default std::memory_order_seq_cst and still be provably correct, for essentially the same reasons sync0debug is (hopefully) correct.
However, since the time I've implemented the sync0debug solution, we had a lot of internal debates in our team about when we should use orderings weaker than seq_cst and most of us agree that it should only be used sparingly in situations where it gives performance gains after carefully proving correctness.
Given that there is absolutely no difference in generated assembly between exchange(..,std::memory_order_acq_rel) and exchange(..,std::memory_order_seq_cst) neither on Intel nor on ARM64 (see https://godbolt.org/z/os7Me8 ) there can't be any gain in performance neither.
Yet, the proof/reasoning might get more complicated if we use weaker orderings, in particular if more variables are involved.
Therefore, it is clear for me, that in this case it is better to use the default seq_cst than to use acq_rel for these exchanges: there is no performance penalties on platforms we care about, and the code is a bit easier to analyse.

AD2. > I would also like to know what shortcoming do we see re-submitted approach of acq-rel.

It has no proof of correctness.
And frankly, it can not have:
As you already said, a failed compare_exchange_strong operation is not considered a read-modify-write. It is considered a load (See § 29.6.5.21 of the C++14 standard).
So, in your proposed:
+ m_lock_word.store(false, std::memory_order_release);
+ bool expected = true;
+ if (m_waiters.compare_exchange_strong(expected, false,
+ std::memory_order_acq_rel)) {
the m_waiters.compare_exchange_strong in some executions can be treated as a memory_order_acquire load of `false`.
Therefore, according to the C++14 standard it's not possible to build any "synchornizes-with" relation between this thread and the thread which performs enter() in the scenario we care about (a thread enter()ing the mutex sees it is taken and goes to sleep).
You see, to establish a "synchronizes-with" relation, a thread must read the value which another thread stored.
But the value of `false` is not stored to `m_waiters` during enter(). (`false` is stored to `m_waiters` only during exit()).
And the value of `true` is not stored to `m_lock_word` during exit(). (`true` is stored to `m_lock_word` only during enter()).
So neither can the thread doing a failed m_waiters.compare_exchange_strong synchronize-with enter() (or the other way around), nor can the thread doing enter() synchronize-with m_lock_word.store(false, std::memory_order_release) done in exit() (or the other way around).
As there is no `synchornizes-with` edge between the two threads, it is impossible to deduce any "happens-before" relation between the two threads.
Therefore I see no hope in proving that a thread going to sleep must be woken up: there's no `happens-before` relation to base the proof on.

The way this difficulty is bypassed in Solution-1 is by invoking rule § 29.3.3, however this relies on that all operations are seq_cst and thus can be included in the "order S" which this rule talks about. Your solution uses weaker orderings to modify the variables which makes it possible that the difficulties mentioned in 29.3.3 (3.2) and (3.3) can occur and spoil the proof.

The way this difficulty is bypassed in Solution-2 is by invoking rule § 29.3.11, however this relies on that all operations on m_waiters are read-modify-write operations, so that all of them participate in modification-order of m_waiters. Your solution uses compare_exchange_strong which sometimes is just a load and also sometimes uses store release for same variables, which makes it impossible to deduce required `synchronizes-with` relations using this approach.

I don't want a code which can not be proved correct in the light of C++14 standard.

AD3. > I have evaluated the performance of seq-cst (Solution-1) and acq-rel (one that re-submitted. just for reference).

Thank you very much!

> [If need arise will re-generate it for Solution-2 but will may need time].

Yes, we still need to see results for Solution-2, please provide them if possible. Thank you!

Thanks for taking the time in explaining the re-submitted approach.

Unfortunately, I beg to differ from your proposal. Let me explain.

------------------------------------------------------------------------------------------

1. C++ abstract definition of synchronize with relationship just help establish that
   despite re-ordering what assurance developers can rely on.

<anything here ....>
A: store-release(M)
                    [-- synchrnoize with --]

                                     B: load-acquire(M)
                                     <.....is visible here>

<PERIOD>.
This is the only thing that could be derived from the standard.
I presume you are of the same understanding let me know if otherwise. We can sync this first.
29.3[An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic
operation B that performs an acquire operation on M and takes its value from any side effect in the release
sequence headed by A]

------------------------------------------------------------------------------------------

2. CAS is load-acquire (failure case) or load-acquire  + store-release (success case).
   store is always store-release.

------------------------------------------------------------------------------------------

3. Let's try to map it to our example with *relevant blocks only*

============================================================================
Assuming sequence is starting fresh for 1st time

            t1                                       t2
      ---------------                           ----------------

pre-section: some random shared variables.

m_lock_word.cas(true)
<true: load-acquire + store-release>                       

....critical-section .....                    

                                            m_lock_word.cas(true)
                                            <false: so load-acquire
                                             can sync with variables
                                             from pre-section but
                                             flow anyway doesn't need it>

                                            reserve_slot_for_waiting()

                                            m_waiters.store(true)
                                            <store-release>
                                           
                                            retry-last-time and sleep()

.....critical section .....
m_lock_word.store(false)
<store-release>

m_waiters.cas(false)
<true: load-acquire ensuring it
 synchronize with m_waiter.store-release
 (from t2) there-by ensuring it will
 see reserved-slot.

 store-release>
signal()

                                            m_lock_word.cas(true)
                                            <true: load-acquire
                                             can safely access shared variables
                                             from the critical-section (and before)
                                             
					     store-release>

                                            .... critical section ....

m_lock_word.cas(true)
<false: load-acquire so
can only assume changes from previous
store-release (that is its own store-release)
are visible but failure flow anyways doesn't
make use of these critical section changes>

reserve_slot_for_waiting()

                                             .....critical section .....
                                             m_lock_word.store(false)
                                             <store-release>

                                             m_waiters.cas(false)
                                             <false: load-acquire so can't assume
                                              recent reserve_slot changes from thread-1
                                              but anyway it doesn't need it as it would
                                              skip signal()>

m_waiters.store(true)
<store-release>

retry-last-time and sleep()

m_lock_word.cas(true)
<true: can assume changes from t2 critical
 section modification>

.......... and the sequence continue with more such combinations
                                          
============================================================================

Above section proves how that - synchronize with - the relationship is established
among the thread (abstract model) and ensures that each thread makes use of the
relevant data:
- failed (load-acquire) m_waiters.cas will never make an assumption or use of data that is not safe to touch
- failed m_lock_word (load-acquire) will never make an assumption or use of the data that is not safe to touch.

------------------------------------------------------------------------------------------

Suggestion:
* Do you mind if we move this discussion on mysqlcommunity-slack.
* Eventually we can copy-paste our complete discussion thread for reference to the bug report.
* this would just make it faster and interactive model can help us sync things before we flow deep into the respective idea.

Hi Krunal,

Please notice that in the 29.3.2 rule that you cite there is this crucial fragment "and takes its value from any side effect in the release sequence headed by A".

You can not ignore this crucial condition for synchronization.

If this CAS operation fails:

bool expected = true;
if (m_waiters.compare_exchange_strong(expected, false,
std::memory_order_acq_rel)) {

it means that it loaded m_waiters equal to false.
And if it saw m_waiters equal to false, then certainly it HAS NOT TAKEN this value from m_waiters.store(true) thus it can not be said to synchronize with it.

Perhaps you try to ASSUME that this CAS must see `m_waiters==true`, but this is assuming the thesis you try to prove.

Perhaps you only ASSUME that this CAS must `happen-after` m_waiters.store(true) and this is WHY it MUST see `m_waiters==true`. But this is again assuming something which was never given, as it is not obvious why such `happens-after` relation should hold here.

Perhaps you believe that this m_waiters.cas(false) must `happen-after` m_waiters.store(true), because you try to chain together three edges:
m_waiters.cas(false) `happens-after` m_lock_word.store(false) (yes indeed, because it is sequenced-after),
retry-last-time` `happens-after` m_waiters.store(true) (yes indeed, because it is sequenced-after),
and somehow sneak in the additional proposition that `retry-last-time` has to `happen-before` m_lock_word.store(false).
But this last step is again an unsupported assumption: we can not deduce a load `synchronizes-with` a store (nor `happens-before`) in this direction, because it is stores that synchronize-with loads, and moreover only if the value stored is the value read, and here the value IS NOT seen. There is no rule in the standard which would say "if load DOES NOT take a value from a store THEN it this store synchronizes with this load", unfortunately.

Perhaps you have some other proof, in which case I encourage you to write it up and present it clearly, so we can review it.

Frankly, I doubt that such a proof can exist, given that if you replace the .compare_exchange_strong with load(std::memory_order_acquire) (as was in your original proposal) you get a code which deadlocks when running ./mtr --suite=innodb on our intel machine. So, the proof would have to somehow depend on a difference between CAS and acquire-load, yet you seem to agree that CAS can degenerate to acquire-load. But, who knows, maybe you see some way to prove that this CAS always succeeds, which would be puzzling, as it would suggest `if` is unnecessarly checking a condition which always evaluates to true. So, perhaps there is some more narrow theorem which could be proven, like maybe that this CAS at least succeeds when the other thread went to sleep,... oh...wait, this is what was to be proven all along :)

As for your proposal to discuss this on chat, I find it beneficial to discuss proofs of correctness in a long-form instead of quick back-and forth. This gives me opportunity to thoroughly think the proof through, fix mistakes, and run tests before publishing.

Speaking of which, could you please share performance test results for Solution-2?

1. I presume we have a different understanding of the same concept.

I presume you are assuming that C++ "synchronizes with" defines how the thread
are scheduled and as per my understanding C++ is just trying to define
how thread sees the data with re-ordering effective.

For example m_waiters.cas can get 2 values only
- true: that suggests it has got the value from other thread (that set it to true)
and data that other thread modified is now visible (especially reserve-slot in our case)

- false: that suggests it has got the value from the thread that set it to false (during exit)
and so data visibility is limited accordingly.

[Said condition is clearly marked in the running flow proof I gave in my previous comment]

2. Just in case you have missed deciphering my running flow proof it clearly takes all possible
the outcome of CAS and so bit surprised when you said

"Perhaps you try to ASSUME that this CAS must see `m_waiters==true`, ....."

3. "Perhaps you believe that this m_waiters.cas(false) must `happen-after` m_waiters.store(true), because you try to chain together three edges:"

Again, in my running flow proof, it clearly depicts which block the code is assuming post-load-acquire for both
success and failure case.

4. CAS vs load() .... we all know this both has different semantics even if CAS act as simple load (in failure case).
Even on intel cas generate lock cmpxchg and load would generate mov so there is the difference in cache-coherence
and exclusive access. Already given needed proof. Would appreciate if you can check them and dig bit more on it.
Maybe talking to an h/w expert (like I did locally) would help.

-------

I would re-request you to re-look at the said approach (ignoring the fact whether we consider it or not)
as if you need to advocate it and then try to discover gaps.
Most of the assumptions or queries will be self-answered.

-----------------

ref: interactive communication. I think this gap of one-way communication could be avoided (and a lot of Perhaps and Assumption
could be sorted out easily) and nothing like we would hold others responsible for quoting something on slack and coming back
and correcting it. In fact, I presume you may be using some internal im (at-least it was most used during my days @ InnoDB)for internal team discussion.

But anyway this was a suggestion and I respect your personal choice.

-----------------

ref: Solution-2 benchmarking as I quoted it is on todo just taking time due to my local resource schedule.

Hi Krunal,

> I presume you are assuming that C++ "synchronizes with" defines how the thread
> are scheduled and as per my understanding C++ is just trying to define
> how thread sees the data with re-ordering effective.

I interpret `synchronizes-with` exactly as it is defined in the C++14 standard.
That is, I only see this relation between two operations where C++14 standard says it should occur between the two, and I only derive from it consequences which the C++14 standard provides.
Therefore I do not interpret it it terms of "how threads are scheduled" (AFAICS, C++14 standard doesn't talk about scheduling in chapters related to `synchronizes-with`).
But I also can not fully agree with "how thread sees the data with re-ordering effective", because "re-orderding" is not really a concept C++14 standard talks about. The word "re-ordering" presumes some reference order to which you can compare to talk about reordering. So, I agree with the "how thread sees the data" part - yes, C++14 standard tries to say what data which thread will see. And for that purpose C++14 defines several concepts:
- sequenced-before relation 
- one modification-order for each atomic variable
- synchronizes-with relation
- happens-before relation
- "takes value from" relation
and rules which tie these all concepts together.
I've searched for the word "reorder" and it appears only in "Note:"s, not in the definitions and rules themselves.

Therefore, when I say, that I need a proof of correctness according to C++14 standard, I mean something which avoids talking too much about "reordering", and rather focuses on proving how sequenced-before, modification-orders, synchronizes-with, happens-before and "taking value from" are restricted by the structure of the program and rules of the standard to prohibit executions which lead to problems.

In particular, what I am mostly interested in, is proving that execution in which one thread went to sleep and the other did not wake it up is prohibited by the rules.

In your own attempt at the proof you've put this fragment:

<QUOTE>

m_lock_word.cas(true)
<false: load-acquire so
can only assume changes from previous
store-release (that is its own store-release)
are visible but failure flow anyways doesn't
make use of these critical section changes>

reserve_slot_for_waiting()

                                             .....critical section .....
                                             m_lock_word.store(false)
                                             <store-release>

                                             m_waiters.cas(false)
                                             <false: load-acquire so can't assume
                                              recent reserve_slot changes from thread-1
                                              but anyway it doesn't need it as it would
                                              skip signal()>

m_waiters.store(true)
<store-release>

retry-last-time and sleep()

</QUOTE>

It contains "it would skip signal()" and "retry-last-time and sleep()".
Note, that if the thread on the right side indeed skips the call to signal(), and the thread on the left indeed goes to sleep() then we have a deadlock: a thread on the left waits for someone to wake it up, but the only thread that could do that missed the only chance to call signal().

It is perhaps tempting to argue that this scenario is impossible because "retry-last-time" will succeed and thus the thread on the left will not go to sleep.
However, this is assuming the thesis we need to prove using C++14 axioms and code structure.
For retry-last-time to succeed, it would have to see the false stored to m_lock_word by the thread on the right.
But what if retry-last-time doesn't see it?
To prove that it has to see it, we would have to prove that the m_lock_word.store(false) on the right `happens-before` the retry-last-time on the left. Or perhaps that it is earlier in the modification-order. Or maybe that it is ealier in S. However, by avoiding to use std::memory_order_seq_cst and insisting on using compare_exchange_strong instead of exchange, we made using this proof strategies very difficult, at least for me.
I am still open to seeing a solid proof refuting this scenario.
But how can we prove it, without having any "something on the right `synchronizes-with` something on the left" relation to build on?

> we all know this both has different semantics even if CAS act as simple load (in failure case).
> Even on intel cas generate lock cmpxchg and load would generate mov so there is the difference in cache-coherence
> and exclusive access.

<..had to split the comment..>

<..continued..>

> we all know this both has different semantics even if CAS act as simple load (in failure case).
> Even on intel cas generate lock cmpxchg and load would generate mov so there is the difference in cache-coherence
> and exclusive access.

In which sense you use the word "semantics"?

If by "semantics" of compare_exchange_strong() you mean "the assembly generated by my compiler on my platform", then indeed it may be true that there are differences between generated assembly on your platform.
For example on ARM64 https://godbolt.org/z/Ee4vKG the difference seems to be that  
  x.load(std::memory_order_acquire) 
compiles down to:
  ldarb   w0, [x0]
and a failed compare_exchange_acq_rel(expected,false,std::memory_order_acq_rel) compiles down to:
  ldaxrb  w1, [x0]
I'm not ARM64 expert, so I am not sure what are all the differences between ldarb and ldaxrb, except for enabling monitoring of the pointed location for modifications (and I don't see how this particular difference affects the problem we are discussing), so perhaps this somehow helps with proof.
(Note however that this is a waste of time, because on ARM64 the acquire-loads are implemented in the same way as seq-cst-loads, thus this code simply works on ARM64 for the same reason your original patch worked on ARM64 but deadlocked on Intel - it gets seq-cst for free on ARM64)

If by "semantics" of compare_exchange_strong() you mean "what C++14 standard says it means", then in § 29.6.5.21 it says this:

Effects: Atomically, compares the contents of the memory pointed to by object or by this for equality with that in expected, and if true,(...), and if false, updates the contents of the memory in expected with the contents of the memory pointed to by object or by this. Further, if the comparison is true, (...), and if the comparison is false, memory is affected according to the value of failure. When only one memory_order argument is supplied, the value of success is order, and the value of failure is order except that a value of memory_order_acq_rel shall be replaced by the value memory_order_acquire and a value of memory_order_release shall be replaced by the value memory_order_relaxed. If the operation returns true, (..). Otherwise, these operations are atomic load operations.

So, there is not much difference between failed x.compare_exchange(expected,false,std::memory_order_acq_rel) and expected=x.load(std::memory_order_acquire). Let me know if you meant some other rule(s) in the C++14 standard.

I am interested in this second meaning of the word "semantics", because I need argument stronger than "it works on my my machine". I want argument to be "it works on any platform with C++14 support".

> I would re-request you to re-look at the said approach (ignoring the fact whether we consider it or not)
> as if you need to advocate it and then try to discover gaps.

Yes, this is good advice, thank you.

The general difficulty with your proof is that it seems to be a "proof by enumerating all possible executions" which seems to require a lot of attention to make sure no scenario was missed.

Moreover the format of the proof (where each line denotes another moment in time) seems to presume sequential consistency - for example, how, in this format, one can depict two operations for which there is no `happens-before` relation in any direction? If I put one above the other, this might somewhat suggest I believe one `happens-before` the other, which gives bad intuition. If I put them at the same line, this somewhat fixes the issue, but what if I have two operations on the left, and two operations on the right, and no synchronizes-with relation between them? Should I put them all in a single line?
In particular, how can I depict a scenario, in which exiting thread t1 did :

 m_lock_word.store(false, std::memory_order_release);
 failed m_waiters.compare_exchange_strong(false,true,std::memory_order_acq_rel);

and entering thread t2 did:

  m_waiters.store(true, std::memory_order_release);
  failed m_lock_word.compare_exchange_strong(true,false,std::memory_order_acq_rel);

and there is no `synchornizes-with` or `happens-before` relation between the two actions of t1 and the two actions of t2?

> ref: Solution-2 benchmarking as I quoted it is on todo just taking time due to my local resource schedule.

Thank you.

Sorry, I see no way to edit my previous comment, but I've swapped all arguments to compare_exchange_strong's. It should have been:

In particular, how can I depict a scenario, in which exiting thread t1 did :

 m_lock_word.store(false, std::memory_order_release);
 failed m_waiters.compare_exchange_strong(true,false,std::memory_order_acq_rel);

and entering thread t2 did:

  m_waiters.store(true, std::memory_order_release);
  failed m_lock_word.compare_exchange_strong(false,true,std::memory_order_acq_rel);

and there is no `synchornizes-with` or `happens-before` relation between the two actions of t1 and the two actions of t2?

Sorry for adding to confusion. (And thanks to Marcin Babij for spotting this)

C++ standard define abstract model with agreement and not semantics and so I presume everyone is free to make their own
interpretation. I tend to go by what community, processor vendor, majority of the users has interpreted which seems to be
different from your understanding of the model.

I appreciate the time you have spend on this issue but I don't think we should continue to spend more times discussing
how to intepret C++ standard as we both have our own views about it.

We both agreed on seq-cst which though not-optimal is always suppose to work and easiest and safest way out to program C++.

----

As committed I would surely run Solution-2 benchmarking for reference. (you can expect it by weekend).
Finally calls ofcourse is with codebase owner.

exchange arm benchmark

Attachment: arm-exchange.png (image/png, text), 98.76 KiB.

exchange x86 benchmark

Attachment: x86-exchange.png (image/png, text), 101.55 KiB.

Added benchmark for exchange solution on arm and x86.

Thanks.

Thank you, very much, Mr. Bauskar,

For everything, including the benchmarks.

It strikes me that the gain in performance is not consistent.

Can you add couple of more informations, like number of CPU's , number of cores, frequencies, amount of RAM (and how much of it is used for our daemon) and the type of permanent storage that is used.

Thanks a lot, in advance.

x86_64: Intel(R) Xeon(R) Gold 6151 CPU @ 3.00GHz (36 cores * 2 thread = 72, 2 numa nodes)
server: 60 cores, sysbench: 12 cores
Memory: 188 GB (BPool: 128 GB)
Storage: NVME SSD 1.5 TB (data-directory hardly crosses 140 GB)

ARM: Kunpeng 920 (2.6 Ghz) (128 cores, 4 numa nodes)
server: 112 cores, sysbench :16 cores
Memory: 1 TB (used 128 GB)
Storage: NVME SSD 3 TB (data-directory hardly crosses 140 GB)

--------

* Number should be looked as relative to respective arch baseline (not to compare x86 numbers with ARM).

* ARM is subjected to NUMA noise but averaging of 3 runs is done to reduce it and relativeness is w.r.t to same arch baseline.

--------

Hope this helps. let me know if you need more info.

Hello Mr. Bauskar,
Thank you for conducting all these experiments and sharing the collected data.

I have trouble figuring out how to interpret them - either they are very noisy, or I'm missing something, and I hope you could help me understand it.

In the file you've shared [27 Jul 7:04] titled uniform-arm.png there are visible drops in performance for seq-cst version for -rand-type=uniform for upd and ni-upd:
upd ni-upd
tps tps 
9 -2
-1 -1
-1 -4
-1 -4
-2 -4
-6 -7

I assume that "seq-cst" stands for my "Proposal-1".
But if this is so, then the only difference between current trunk and Proposal-1 should be this https://godbolt.org/z/WsshWd :
- dmp isshld
So, intuitively, this should help performance, not deteriorate it.

Hence my clarifying questions:
1. Do I understand correctly that the charts you've shared 27 Jul use label "seq-cst" for my Proposal-1?
2. How stable are the results you get?
3. Are the numbers in the spreadsheets results of averaging several runs for each setup, or just results for single run? (I don't know how to interpret the header "ROUND (60/10/0/10/120)".
4. Would it be much trouble for you to rerun at least the cases which got the most extreme results:
-rand-type=uniform seq-cst upd 1024 threads (-6%)
-rand-type=uniform seq-cst ni-upd 1024 threads (-7%)
-rand-type=zipfian exchange ps 64 threads  (-12%)
-rand-type=uniform exchange ps 1024 threads (-11%)
so that we could asses the variance?

Sorry, I was writing my comment in parallel to yours, so I see that question 3. is already answered: the numbers are averages from 3 runs.

Could you share the individual 3 values so that we could see the variance?

Jakub,

1. seq-cst is Proposal-1
BTW, the original trunk code uses release memory-order (in addition to fence)
for m_waiters. If that is adding to the difference.

2. x86 is less prone to noise (due to lesser numa node) and generally bit stable
ARM is more prone to noise (part of it is contributed by more numa nodes).

Repeatability of the result of ARM for read-only is bit of challenge given
all in memory so performance is function of cpu and cpu-scheduler.
For read-write there is IO involved so part of it is overlapped.
[So we can discount part of the read-only test-cases but not always.
I tend to ignore 4-5% of difference for read-only].

Uniform (default in sysbench) is does heavy IO and so hides the real cpu
contention so we also try with zipfian and that could be used a reference.

3. You can checkout how I invoke sysbench here along with the conf used
(not completely updated but should give you fair idea).
https://github.com/mysqlonarm/benchmark-suites/blob/master/sysbench/conf/100tx5m_cpubound....

[Infact, we do 4 runs and always starts same seed-data-directory and so ignore the 1st run].

4. ref: re-run. just to be doubly sure I tend to re-run workload twice even though each run has
averaging in place. re-run is difficult at this time as I have already released the shared
machine for other project to use it. While I will get it back but some other things
are waiting. If I find time (surely not in short-term) I can try few things.

[I would discount 1024 threads regression if other threads are doing well but if
all threads are consistently doing bad then 1024 thread regression with some noise
factor can't be completely discounted].

5. ref: 3 run. Sorry, I fear we don't have them now (we cleaned common partition
since other db team needed it for their benchmarking).

--------------

On side note, variance of ARM can be further reduced once we start fixing other ARM issues.
ARM is senstive to lot of things that probably x86 is not and so those patches (from community).

Would be really happy if MySQL team can spend some times with other patches too as it would further
stablize the ARM and variance with real contention (like this one) would reduced.
[Some of them are really straight forward].

x86 is being tuned for 20+ yrs now so small variance is caught and fixed ASAP.
May be eventually we need to reach that stage for ARM too and all these fixes will help us reach there.

--------------

Let me know if you need anything more. Would be happy to support.

Data for Proposal-1 and Proposal-2 ARM zipfian

Attachment: summary-arm-zipfian.png (image/png, text), 12.40 KiB.

Hi Krunal,

Ad1.
> the original trunk code uses release memory-order (in addition to fence) for m_waiters.

Yes, I know, and this makes difference on x86. But I was asking specifically about ARM64 chart uniform-arm.png.
In the godbolt link I've shared above (https://godbolt.org/z/WsshWd), you can see that on ARM64 there is no difference between m_waiters.store(X,std::memory_order_release) [see `trunk_set_waiters`] and m_waiters.store(X) [see `proposal_1_set_waiters`], both map simply to `stlrb`. (Which makes perfect sense, because on arm stlrb gives sequential consistency "for free").
Can you spot any difference between mysql-trunk and Proposal-1 in assembly generated for ARM64, except for already mentioned `dmb ishld` being removed?

Ad2. Thanks for explanation 
Ad3. I see. Thanks for sharing testing code.
Ad4. Thanks
Ad5. Ooops. 

So, to paraphrase and summarise your perspective on perfromance results:
- you care more about read-write than read-only results
- you care more about zipfian than uniform results
- you care more about <=512 threads than 1024 threads
- you care more about differences larger than |5%| than those at most <=|5%|

This means the most important information is in charts titled zipfian-arm.org and arm-exchange.png, and I can almost ignore columns titled 'ro' and rows titled '1024'.
In other words the things you most care are those I've aggregated on the picture https://bugs.mysql.com/file.php?id=29771&bug_id=100132 
Do I get it right?

I definitely think that this analysis would benefit from re-run to minimize impact of noise (and verify the procedure itself).
If this helps anyhow to make it easier, I think it's enough to re-run the testcase scenarios you care mostly about (which, I believe to be those on my picture).
The data we have now, "seems noisy" (not just to me). 

I'm willing to wait for more data.

For moment, I thought (and was happy) you got access to some internal ARM machine
and was able to benchmark it locally.

Alas! it was short-lived.

It seems like you took my directions to help interpret results as hard-recommendations
and type-casted some statement which I fear are wrongly type-casted.

Fact at-least for now is we don't get to see the same stability in ARM results like x86.

As I pointed all our efforts to help fix all these senstive parts will help make MySQL on ARM
more predictable. (For example: spin-lock are tuned for x86 and not for ARM).

So we need to take direction from x86 and interpret results on ARM based on the data inputs.

I don't foresee re-run would change the picture completely as already pointed I tend to re-run
things 2 times.

-------

Classic case is when 8.0.21 MySQL (you) fixed the log sys to make it more ARM friendly (read numa friendly)
we saw things stablized further. We use to see a gain of 50+% for small changes which probably
was just hinting that patch helps in +ve direction but surely not 50%.

I've conducted experiment on ARM64 machine I was given access to (ellex04), comparing four revisions of code on four testing scenarios (/BMK/sb_exec/sb11-OLTP_RW_10M_8tab-{pareto,uniform}-ps-trx.sh x {128,512} threads).
For each scenario and version there were 9 runs, each starting from the same copy of prepared-data directory, with 60second warmup, and 300 seconds actual testing time. The TPS for each of 9 runs was recorded, and minimum, average and maximum of these 9 TPS are reported below (larger numbers are better).

pareto 128 threads
9 20847 < 20927.56 < 20976 "mysql-trunk@aad0504e + proposal-2 exchange"
9 20641 < 20833.67 < 20923 "mysql-trunk@aad0504e + 22 Jul 9:19 patch from Krunal"
9 20698 < 20814.56 < 20897 "mysql-trunk@aad0504e + proposal-1 seq-cst"
9 20713 < 20802.33 < 20859 "mysql-trunk@aad0504e"
pareto 512 threads
9 20376 < 20434.56 < 20479 "mysql-trunk@aad0504e + proposal-2 exchange"
9 20345 < 20413.67 < 20515 "mysql-trunk@aad0504e + 22 Jul 9:19 patch from Krunal"
9 20272 < 20320.33 < 20372 "mysql-trunk@aad0504e"
9 20222 < 20311.11 < 20433 "mysql-trunk@aad0504e + proposal-1 seq-cst"
uniform 128 threads
9 21003 < 21092.22 < 21145 "mysql-trunk@aad0504e + proposal-2 exchange"
9 20860 < 20970.56 < 21091 "mysql-trunk@aad0504e + 22 Jul 9:19 patch from Krunal"
9 20868 < 20966.00 < 21040 "mysql-trunk@aad0504e + proposal-1 seq-cst"
9 20822 < 20953.56 < 21039 "mysql-trunk@aad0504e"
uniform 512 threads
9 23641 < 23787.33 < 23910 "mysql-trunk@aad0504e + proposal-2 exchange"
9 23321 < 23641.89 < 23795 "mysql-trunk@aad0504e + proposal-1 seq-cst"
9 23033 < 23641.22 < 23897 "mysql-trunk@aad0504e + 22 Jul 9:19 patch from Krunal"
9 22977 < 23540.78 < 23676 "mysql-trunk@aad0504e"

The tests were run in interleaved pattern, to minimize temporal environmental impact, so the order was like:
pareto 128 code-version-1
pareto 128 code-version-2
pareto 128 code-version-3
pareto 128 code-version-4
pareto 128 code-version-1
pareto 128 code-version-2
...

I'd say the differences aren't statistically significant, but if treated as such they suggest that proposal-2 is the best.
By eyeballing the most pronounced difference seems to be for uniform 512, but if I try bayesian analysis on this data to compare a="mysql-trunk@aad0504e" to b="mysql-trunk@aad0504e + proposal-2 exchange" I get:

$ cat links/logs/512-uniform-RW-krunal-mutex.* | ./summarize_experiment.awk | awk '($2=="35fbe94c03f686f651a22cfb751c1e79ec3f1a39"){print "b", $3}($2=="8512f87bb90477a5f3fcc0943a1c1ac3e1ed10b2"){print "a", $3}' | ./cli.sh
{
  lengths: [ 9, 9 ],
  means: [ 23540.777777777777, 23787.333333333332 ],
  stddevs: [ 216.3040555432201, 77.67238891652553 ]
}
{
  onePercentDifference: 236.64055555555555,
  avgMeanDifference: 200.9299048970551,
  medianMeanDifference: 194.022351587515,
  smallerThanMinusOnePercent: 0,
  nearZero: 0.7624,
  largerThanOnePercent: 0.2376,
  largerThanZero: 0.9966,
  smallerThanZero: 0.0034,
  pbOfSmaller: 0.889772888213664
}
P(E(a)>E(b)+1%)=   0% P(E(a)=E(b)+-1%)=  76% P(E(a)+1%<E(b))=  24% P(a<b)=  89%

So, I'd say that most probably their actual distributions have means differing by at most 1%.

I'm puzzled by proposal-2 version being fast on ARM64 (as I was expecting the retry-loops resulting from exchanges to cause slowdowns).
After reading Krunal's article about LSE (https://mysqlonarm.github.io/ARM-LSE-and-MySQL/) I've started suspecting that perhaps the loop was somehow replaced with single operation by compiler, but, according to gdb, it was not:
(gdb) disassemble PolicyMutex<TTASEventMutex<BlockMutexPolicy> >::exit
Dump of assembler code for function PolicyMutex<TTASEventMutex<BlockMutexPolicy> >::exit():
   0x0000000002049048 <+0>:     stp     x29, x30, [sp,#-32]!
   0x000000000204904c <+4>:     mov     x29, sp
   0x0000000002049050 <+8>:     str     x19, [sp,#16]
   0x0000000002049054 <+12>:    mov     x19, x0
   0x0000000002049058 <+16>:    ldr     x0, [x0,#32]
   0x000000000204905c <+20>:    cbz     x0, 0x2049064 <PolicyMutex<TTASEventMutex<BlockMutexPolicy> >::exit()+28>
   0x0000000002049060 <+24>:    bl      0x22208b0 <pfs_unlock_mutex_v1(PSI_mutex*)>
   0x0000000002049064 <+28>:    stlrb   wzr, [x19]
   0x0000000002049068 <+32>:    add     x0, x19, #0x1
!  0x000000000204906c <+36>:    ldaxrb  w1, [x0]
!  0x0000000002049070 <+40>:    stlxrb  w2, wzr, [x0]
!  0x0000000002049074 <+44>:    cbnz    w2, 0x204906c <PolicyMutex<TTASEventMutex<BlockMutexPolicy> >::exit()+36>
   0x0000000002049078 <+48>:    tst     w1, #0xff
   0x000000000204907c <+52>:    b.ne    0x204908c <PolicyMutex<TTASEventMutex<BlockMutexPolicy> >::exit()+68>
   0x0000000002049080 <+56>:    ldr     x19, [sp,#16]
   0x0000000002049084 <+60>:    ldp     x29, x30, [sp],#32
   0x0000000002049088 <+64>:    ret
   0x000000000204908c <+68>:    ldr     x0, [x19,#8]
   0x0000000002049090 <+72>:    bl      0x1ef0140 <os_event_set(os_event*)>
   0x0000000002049094 <+76>:    ldr     x19, [sp,#16]
   0x0000000002049098 <+80>:    ldp     x29, x30, [sp],#32
   0x000000000204909c <+84>:    b       0x1fa8af0 <sync_array_object_signalled()>
End of assembler dump.

so the loop (exactly the same as I've visible earlier in our godbolt analysis) is present, yet the solution is still at least as fast (if not faster) than others.

Some machine specs:

[mysql@ellex04 q-test-root]$ lscpu
Architecture:          aarch64
Byte Order:            Little Endian
CPU(s):                256
On-line CPU(s) list:   0-255
Thread(s) per core:    4
Core(s) per socket:    32
Socket(s):             2
NUMA node(s):          2
Model:                 1
BogoMIPS:              400.00
L1d cache:             unknown size
L1i cache:             unknown size
L2 cache:              unknown size
L3 cache:              unknown size
NUMA node0 CPU(s):     0-127
NUMA node1 CPU(s):     128-255
Flags:                 fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm

[mysql@ellex04 q-test-root]$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 0 size: 130466 MB
node 0 free: 85127 MB
node 1 cpus: 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
node 1 size: 130915 MB
node 1 free: 87624 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

I'd say that these results from ellex04 are in favor of proposal-2.
I'll run one more set of tests, this time for RO workload.

Jakub,

Good to see results.

* Do you mind also running zipfian because I discover this problem with zipfian run. Now pareto and zipfian shouldn't have a big difference since both reveal contention but I think it would also help set some common line of our evaluation. (--rand-type=zipfian --rand-zipfian-exp=1.5)

* 2 numa (256 cores) I presume we can surely scale beyond 512 because 256 cores and 512 is like an optimal combination. I normally scale of 12x .. so 256 could scale upto 2048. (may be just few runs)

----------

- Not related to the bug but needs general feedback since you are using different ARM machine.

- Also, I learned (got feedback internally) that switching to 2 numa should help improve or stablize things bit. 4 numa makes it worse. So when I get the next slot I will try that too for general stability.

- ARM arch is bit new to me so full of surprises and y'day I found an interesting case when ldaxr (with load acquire) always perform better than ldxr 
and same with store-release (when in theory should be opposite).

(Of-course not a final statement I am still investigating and checking with experts).

Sorry this stmt got truncated

"- Not related to the bug but needs general feedback since you are using different ARM machine."

should have been .....

"- Not related to the bug but needs general feedback since you are using different ARM machine. Do you see lot of variance among N runs. If I compare this case for example: 

9 22977 < 23540.78 < 23676 "mysql-trunk@aad0504e"

low and high has 3% difference despite of 60s warmup and 300 secs run which should stabilize and remove all kinds of noise.

I do see ARM things are bit prone to noise (more than x86) and part of it should go off once we start to stablize MySQL on ARM through optimization patches but wondering if this is inherent property of ARM systems?

> Do you mind also running zipfian?
I can run RW on zipfian after RO finishes (takes one day).

> .. so 256 could scale up to 2048.
I can run for 2048, too. 

> Do you see lot of variance among N runs. ?

Yes, usually I see a lot of variance, even for Intel, which is why I've asked you how much iterations you have run, and asked for re-run as 3 might be too little to draw sound conclusions.

> If I compare this case for example: 
> 9 22977 < 23540.78 < 23676 "mysql-trunk@aad0504e"

For this test in particular the nine TPS values were:
a 23554
a 23580
a 22977
a 23676
a 23671
a 23619
a 23621
a 23545
a 23624

The results for RO variant of the test:
pareto 128 threads
9 30646 < 30837.11 < 30947 "mysql-trunk@aad0504e + proposal-2 exchange"
9 30752 < 30802.44 < 30840 "mysql-trunk@aad0504e + 22 Jul 9:19 patch from Krunal"
9 30515 < 30735.67 < 30824 "mysql-trunk@aad0504e"
9 30477 < 30682.11 < 30789 "mysql-trunk@aad0504e + proposal-1 seq-cst"
pareto 512 threads
9 30300 < 30795.22 < 31101 "mysql-trunk@aad0504e"
9 30464 < 30791.11 < 31369 "mysql-trunk@aad0504e + proposal-2 exchange"
9 30213 < 30680.22 < 31057 "mysql-trunk@aad0504e + 22 Jul 9:19 patch from Krunal"
9 30449 < 30642.78 < 30988 "mysql-trunk@aad0504e + proposal-1 seq-cst"
uniform 128 threads
9 30391 < 30603.78 < 30780 "mysql-trunk@aad0504e + proposal-2 exchange"
9 30378 < 30595.78 < 30658 "mysql-trunk@aad0504e + proposal-1 seq-cst"
9 30374 < 30560.44 < 30638 "mysql-trunk@aad0504e + 22 Jul 9:19 patch from Krunal"
9 30283 < 30545.67 < 30662 "mysql-trunk@aad0504e"
uniform 512 threads
9 30227 < 30561.89 < 30672 "mysql-trunk@aad0504e"
9 30151 < 30514.11 < 30818 "mysql-trunk@aad0504e + proposal-2 exchange"
9 30091 < 30430.56 < 30664 "mysql-trunk@aad0504e + 22 Jul 9:19 patch from Krunal"
9 30129 < 30365.44 < 30759 "mysql-trunk@aad0504e + proposal-1 seq-cst"

(Note how large is the min-max spread) Anyway, it still looks like proposal-2 is a good choice.

Now, I'll run the tests for 2048 threads x {uniform,pareto} x {RW,RO}.

Later I'll run the tests for zipfian.

Jakub,

just one small question given the machine is of decent configuration I feel 30K qps for RO workload is too low when the potential is to hit 1+ million (infact RW is hitting 20K) .... wondering kind of configuration you are using cpu-bound, io-bound. Can you check if anything is mis-tuned?

Hi Krunal,

>just one small question given the machine is of decent configuration I feel 30K qps for RO workload is too low when the potential is to hit 1+ million (infact RW is hitting 20K) .... wondering kind of configuration you are using cpu-bound, io-bound. Can you check if anything is mis-tuned?

thanks for bringing it to my attention.
The reported numbers are TPS - not QPS - which is part of the misunderstanding.
But a mistake on my side, was that I was running a version of the script which wraps 14 queries into BEGIN and COMMIT which makes this a bit difficult to interpret in terms of QPS (you could argue that it's like 14*TPS but what about the impact of transactions?).
After consulting this issue with Dimitri, I believe it was a bad idea to use this "transactional" version of the script for RO, so I'll rerun RO using "notrx" variant, in which the 14 queries are executed outside of explicit transactions, and I'll report QPS for them instead of TPS.
Preliminary tests indicate the values are somewhere around 430K QPS, but let's wait for full results.

Aah ok.

Also from the text is sounds like you are running tpcc and not sysbench. (do you mind running your test with sysbench).

I presume given the noise that is going to remain whether you run with 9 runs or 3 runs (that's how I reduced my runs from 1+10 to 1+3) you can also check if short runs would help.

> Also from the text is sounds like you are running tpcc and not sysbench. (do you mind running your test with sysbench).

Hm, what makes you think I am running tpcc?

AFAIU I am running /BMK/sysbench-1.1-luajit21 with ./sb_exec/lua/OLTP_RO-notrx.lua script. AFAIU both are in-house adaptations of versions of sysbench. 

> I presume given the noise that is going to remain whether you run with 9 runs or 3 runs (that's how I reduced my runs from 1+10 to 1+3) you can also check if short runs would help.

I don't understand. I'm not trying to change how code executes by running it multiple times - I'm just trying to measure how it works. To gain understanding about mean and variance, the more samples I have the better. If the reality is that the TPS changes from execution to execution, then I don't want to fight with this reality, or flinch away from it, but rather know about it and measure it.  Therefore I don't see how running it 3 times instead of 9 times can help me here. 
(In the larger perspective: yes, I do want to minimize the variance by fixing the source code [rather than modifying the testing framework], but to measure my progress towards this goal, I have to have measurements which reflect the reality along the way)
((That being said, there are already some mechanisms in the testing procedure which try to avoid variance from environment impacting the measurements, such as reading all data-files before the test [to minimize impact of io system state], or warmup, etc, but the more such artificial preparation, post-processing and "stabilizers", the less the test reflects the reality, so I am generally afraid to add more of them, unless it is clear that they only normalize the environment, and not the system under test))

Anyway, I can share results of first 3 iterations if that's helpful:

RW (TPS)

pareto 2048 threads
3 5660 < 5722.00 < 5762 "mysql-trunk@aad0504e + 22 Jul 9:19 patch from Krunal"
3 5659 < 5681.33 < 5705 "mysql-trunk@aad0504e + proposal-1 seq-cst"
3 5438 < 5622.67 < 5723 "mysql-trunk@aad0504e + proposal-2 exchange"
3 5378 < 5617.67 < 5766 "mysql-trunk@aad0504e"
uniform 2048 threads
3 20084 < 20418.33 < 20642 "mysql-trunk@aad0504e + proposal-2 exchange"
3 19963 < 20375.33 < 20727 "mysql-trunk@aad0504e + 22 Jul 9:19 patch from Krunal"
3 20131 < 20272.33 < 20364 "mysql-trunk@aad0504e + proposal-1 seq-cst"
3 19880 < 20128.00 < 20270 "mysql-trunk@aad0504e"

RO (QPS)

pareto 128 threads
3 444085 < 446001.67 < 448420 "mysql-trunk@aad0504e + proposal-2 exchange"
3 444894 < 445152.00 < 445367 "mysql-trunk@aad0504e + proposal-1 seq-cst"
3 441735 < 444620.33 < 446136 "mysql-trunk@aad0504e"
3 441227 < 444443.00 < 446413 "mysql-trunk@aad0504e + 22 Jul 9:19 patch from Krunal"
pareto 512 threads
3 445846 < 446168.00 < 446514 "mysql-trunk@aad0504e + proposal-2 exchange"
3 441703 < 442647.67 < 443195 "mysql-trunk@aad0504e + 22 Jul 9:19 patch from Krunal"
3 437797 < 441349.67 < 443741 "mysql-trunk@aad0504e + proposal-1 seq-cst"
3 437836 < 439766.00 < 440849 "mysql-trunk@aad0504e"
pareto 2048 threads
3 402555 < 402992.33 < 403233 "mysql-trunk@aad0504e + proposal-2 exchange"
3 401108 < 402099.33 < 403193 "mysql-trunk@aad0504e + 22 Jul 9:19 patch from Krunal"
3 399906 < 402044.00 < 403999 "mysql-trunk@aad0504e"
3 396511 < 400746.33 < 403086 "mysql-trunk@aad0504e + proposal-1 seq-cst"
uniform 128 threads
3 444504 < 444997.33 < 445736 "mysql-trunk@aad0504e + proposal-2 exchange"
3 442767 < 443208.00 < 443529 "mysql-trunk@aad0504e + proposal-1 seq-cst"
3 440478 < 442581.67 < 443742 "mysql-trunk@aad0504e + 22 Jul 9:19 patch from Krunal"
3 438336 < 440025.00 < 442291 "mysql-trunk@aad0504e"
uniform 512 threads
4 436343 < 437769.75 < 439844 "mysql-trunk@aad0504e + proposal-1 seq-cst"
4 435015 < 437696.25 < 441516 "mysql-trunk@aad0504e + proposal-2 exchange"
3 432387 < 435977.00 < 441192 "mysql-trunk@aad0504e"
4 432580 < 433888.25 < 437602 "mysql-trunk@aad0504e + 22 Jul 9:19 patch from Krunal"
uniform 2048 threads
3 397651 < 400943.00 < 403288 "mysql-trunk@aad0504e + proposal-2 exchange"
3 398910 < 400059.67 < 401708 "mysql-trunk@aad0504e"
3 395175 < 399503.33 < 401812 "mysql-trunk@aad0504e + proposal-1 seq-cst"
3 395589 < 398011.33 < 401469 "mysql-trunk@aad0504e + 22 Jul 9:19 patch from Krunal"

Full results from latest runs.
I wouldn't trust values for 2048 too much - the sysbench and mysqld were run on the same machine, competing for resources, and in peak load I couldn't even log in to the machine because it was out of resources. Anyway, it still looks like relatively to other options proposal-2 isn't bad.

pareto 128 threads
9 443911 < 445986.78 < 448738 "mysql-trunk@aad0504e + proposal-2 exchange"
9 441227 < 445415.56 < 447294 "mysql-trunk@aad0504e + 22 Jul 9:19 patch from Krunal"
9 441480 < 445112.22 < 446458 "mysql-trunk@aad0504e + proposal-1 seq-cst"
9 441322 < 444985.44 < 447038 "mysql-trunk@aad0504e"
pareto 512 threads
9 439525 < 444238.22 < 446824 "mysql-trunk@aad0504e + proposal-2 exchange"
9 435405 < 442085.22 < 446634 "mysql-trunk@aad0504e + 22 Jul 9:19 patch from Krunal"
9 435768 < 441278.78 < 446069 "mysql-trunk@aad0504e + proposal-1 seq-cst"
9 435744 < 439249.44 < 441694 "mysql-trunk@aad0504e"
pareto 2048 threads
9 401344 < 403343.44 < 406590 "mysql-trunk@aad0504e + proposal-2 exchange"
9 399906 < 403128.89 < 405249 "mysql-trunk@aad0504e"
9 396511 < 400908.11 < 403214 "mysql-trunk@aad0504e + proposal-1 seq-cst"
9 396564 < 400808.56 < 407538 "mysql-trunk@aad0504e + 22 Jul 9:19 patch from Krunal"
uniform 128 threads
9 431423 < 442606.78 < 445736 "mysql-trunk@aad0504e + proposal-2 exchange"
9 439502 < 442227.00 < 443586 "mysql-trunk@aad0504e + proposal-1 seq-cst"
9 436983 < 441858.78 < 443742 "mysql-trunk@aad0504e + 22 Jul 9:19 patch from Krunal"
9 438336 < 441297.44 < 443243 "mysql-trunk@aad0504e"
uniform 512 threads
9 435015 < 438064.56 < 444638 "mysql-trunk@aad0504e + proposal-2 exchange"
9 432846 < 436952.33 < 443163 "mysql-trunk@aad0504e + proposal-1 seq-cst"
9 432604 < 435281.22 < 438256 "mysql-trunk@aad0504e + 22 Jul 9:19 patch from Krunal"
9 424567 < 434693.78 < 441192 "mysql-trunk@aad0504e"
uniform 2048 threads
9 397359 < 400893.78 < 404548 "mysql-trunk@aad0504e + proposal-2 exchange"
9 395175 < 400383.22 < 402530 "mysql-trunk@aad0504e + proposal-1 seq-cst"
9 395139 < 399129.78 < 401708 "mysql-trunk@aad0504e"
9 395589 < 398480.89 < 401469 "mysql-trunk@aad0504e + 22 Jul 9:19 patch from Krunal"

Hi again - I was silent for a while, but actually still working on this task.
One thing didn't let me rest: why my sysbench results indicate that "proposal-2 exchange" is at last as fast as "proposal-1 seq-cst" or maybe even faster, given that one could reasonably expect that atomic<T>::exchange is slower than atomic<T>::store.

I've run bazzilion tests on ellex04, and here is summary of my finding.

TL;DR: it all boils down to code alignment. 

Longer story:
I wrote a synthetic test by copy&pasting code from our mutex implementations and supporting infrastructure (like os_event) with a plan to spawn T (=1,2,4,...) threads and let them spin in tight loops trying to constantly enter and exit 4 mutexes. 
That was the plan, but as soon as I got to T=1 I've observed a strange phenomenon: changing implementation of signal() impacted test duration. This was puzzling, because for T=1 you have just one thread doing enters and exits, so it never has to wait for anyone nor signal anyone (and I verified this reasonable expectation by running with asserts once).
So, apparently modifying a function which is not even executed once impacts speed.

One "hilarious" example: changing `std::chrono::milliseconds(1)` to `std::chrono::microseconds(1)` in the unused signal() function somehow impacted speed of execution for T=1.
When I've looked into assembly, I've found the only obvious difference to be:
https://godbolt.org/z/xMx3EK
<         mov     x0, 16960
<         movk    x0, 0xf, lsl 16
---
>         mov     x0, 1000

1 microsecond is 1000 nanoseconds, so it can be loaded to register x0 using a single mov instruction which takes 4 bytes to encode.
1 millisecond is 1000000 nanoseconds, and it is loaded to register x0 using a trick of first loading 16960 to lower bytes of the register, and then loading 0xf to higher 16 bits of the register, indeed (15 << 16) + 16960 == 1000000, but this takes 8 bytes to encode.
This causes the surrounding assembly code to be shifted by 4 bytes.
By adding `asm("nop");` (which is a four byte instruction) just before `std::this_thread::sleep_for(std::chrono::microseconds(1))` I was able to add artificial shift of 4 bytes so the assembly code of both code versions is more similarly aligned, however there are still differences around function's prologue and epilogue (maybe because of some other assymetries between logic of sleep_for when upper bytes of x0 are known to be zeros?).
Anyway, adding this `asm("nop")` made the difference in timing of the the two smaller.

I shared this example, just to illustrates two points:
- code alignment has impact on speed
- even something you wouldn't consider a change in the logic or flow (like changing a constant from 1000 to 1000000) can lead to assembly of different length and shifting of code which ripples throught the executable in unpredictable fashion

Anyway, once I got to run the synthetic test to compare actual "proposal-1", "proposal-2", "trunk" and "Krunal's patch" versions of enter+exit, for T=1 I got results like this:
  PROPOSAL_1 THREADS    1 real 0m0.814s user 0m0.813s sys 0m0.001s
       TRUNK THREADS    1 real 0m0.917s user 0m0.916s sys 0m0.000s
      KRUNAL THREADS    1 real 0m0.973s user 0m0.971s sys 0m0.002s
  PROPOSAL_2 THREADS    1 real 0m1.122s user 0m1.121s sys 0m0.001s
Which matches my intuitions about relative speed of store()+load() (in PROPOSAL_1), store()+fence()+load() (in TRUNK), compare_exchange_strong (in KRUNAL) and exchange (in PROPOSAL_2).
Yes, adding `asm("nop")` in various places *did* change the above timings but the impact was never on the magnitudue of 1.121 / 0.813 = 37% which would be required to make PROPOSAL_1 slower than PROPOSAL_2.

To make sure this intuitions are not totally off, I've also run an ultra-synthetic test in which a single atomic operation is run in a for loop for 100000000 iterations. Yes, this is a very naive way of testing things, but it looks like at least compiler was not optimizing the loop in any way, so the test made sufficent sense to me.
Results:
EXCHANGE
real 0m1.870s user 0m1.870s sys 0m0.000s
LOAD
real 0m0.047s user 0m0.047s sys 0m0.000s
STORE
real 0m0.868s user 0m0.867s sys 0m0.001s
FENCE
real 0m1.369s user 0m1.368s sys 0m0.001s
BAD_CAS
real 0m1.004s user 0m1.003s sys 0m0.001s
OK_CAS
real 0m1.916s user 0m1.914s sys 0m0.001s

Same, but with -march=native:
EXCHANGE
real 0m1.460s user 0m1.459s sys 0m0.001s
LOAD
real 0m0.047s user 0m0.046s sys 0m0.001s
STORE
real 0m0.868s user 0m0.867s sys 0m0.001s
FENCE
real 0m1.369s user 0m1.368s sys 0m0.000s
BAD_CAS
real 0m2.280s user 0m2.278s sys 0m0.002s
OK_CAS
real 0m2.280s user 0m2.279s sys 0m0.001s

Observe, how `-march=native` causes compare_exchange_strong to be a single operation instead of combination of "load+perhaps write on success" due to LSE, and thus, BAD_CAS and OK_CAS has same timing for this configuration. Also, it agrees with observation from Krunal's blog, that LSE appears to be slower than regular version of CAS.

Anyway, back to the alignment story, I observed that adding N `nop`s at the beginning of main() (so they are "executed" only once and thus should not really impact speed in any measurable way) impacts execution time of the synthetic test in predictable way: for same N I got almost the same time each run.
When plotting time(N) it appeared there is a cycle of length, that is time(N+16)=time(N), which given that on arm64 "nop" is four bytes, means that the cycle is 64-bytes. 
I figured that it would be interesting to try to remove lines from the synthetic mutex testing code, one by one, as long as the effect of dependence of time on number of nops is persistent, this way I could find a minimal demo of the issue.
It turned out that the minimal code which exhibits alignment issues is:

#include<atomic>

template< unsigned N >
inline static void nops(){
    asm ("nop");
    nops< N - 1 >();
}

template<> inline void nops<0>(){};

std::atomic_bool m_lock_word{false};

int main(){
        nops<NOPS>(); // compile with -DNOPS=$N
        for(int i=0;i<50000000;++i){
                m_lock_word.load();
                m_lock_word.store(false);
                nops<16>();
        }
        return 0;
}

Interestingly, I could reproduce impact of NOPS also on Skylake, but not on another machine.
It is tempting to attribute the 64-byte cycle to "cache line", but actually there are myriads of various small effects from instruction fetcher (probably 16 bytes?), uop cache (32 bytes "lines"? perhaps 64 according to other documentation), LSD (mechanism which tries to cache loop operations, but may fail if a window of 32 bytes contains to many uops, etc), instructions not fitting into pipelines stage, decoder window, instruction cache misses, etc.
I've spent much time reading about this, and one obvious takeaway is that: THERE IS A LOT of mechanisms which all interfere with each other and could be blamed.
Interestingly all of these mechanisms I read about, would be oblivious to a shift of code by 64 bytes, which might explain why the cycle repeats after 64-bytes.

Anyway, one thing looks important: alignment of the code.

So, I took a look at disassembled mysqld binary of proposal-2 and proposal-1 and tried to add 'nop's in exit() and enter() so that the two variants of code have "equal lengths". This is much more tricky than I thought, because of the way the code pieces interact with each other during compilation. For example, even though exchange() seems like just 3 operations (ldaxrb, stlxrb,cbnz) and say, load() is just one ldarb you might be tempted to simply add 2 nops to make them equal. However this doesn't take into account that for example the compiler might need to use more registers for the ldaxrb+stlxrb+cbznz combo, thus has to add some register-permuting moves around the code, etc.
So, by trial end error (and diffing outputs of objdump -d) I finally figured that to get most similar alignment I should add one nop to signal() in proposal-2 (to compensate for the removed `clear_waiters()` call), two nops in proposal-1's exit, and one nop in set_waiters().

Then I've run sysbench RW again and got:

pareto 128
9 20690 < 20853.00 < 20937 "mysql-trunk@aad0504e + proposal-2 exchange + 1 x nop in signal"
9 20769 < 20823.22 < 20880 "mysql-trunk@aad0504e + proposal-1 seq-cst + 2 x nop in exit + 1 x nop in set_waiters"
pareto 512
9 20180 < 20336.44 < 20469 "mysql-trunk@aad0504e + proposal-2 exchange + 1 x nop in signal"
9 20238 < 20305.00 < 20373 "mysql-trunk@aad0504e + proposal-1 seq-cst + 2 x nop in exit + 1 x nop in set_waiters"
uniform 128
9 20718 < 20975.22 < 21077 "mysql-trunk@aad0504e + proposal-2 exchange + 1 x nop in signal"
9 20899 < 20970.56 < 21042 "mysql-trunk@aad0504e + proposal-1 seq-cst + 2 x nop in exit + 1 x nop in set_waiters"
uniform 512
9 23567 < 23706.56 < 23987 "mysql-trunk@aad0504e + proposal-1 seq-cst + 2 x nop in exit + 1 x nop in set_waiters"
9 23556 < 23675.78 < 23956 "mysql-trunk@aad0504e + proposal-2 exchange + 1 x nop in signal"

Note how small are differences between averages (in particular for uniform 512 proposal-1 has won) compare it to results from [5 Aug 10:15] where they were on the order of 100+ TPS.
I'd say the two versions are essentially equal in performance now after aligning them the same way.
And, while one could argue that `nop` is actually "taking time" when "executed", one should note, that I've added more nops to proposal-1 than to proposal-2, and the only nop added to proposal-2 is on the slow rare path where it has to wake up someone. 

OK, so my overall conclusion after this whole journey is:
- removing fence definitely helps
- difference between Proposal 1 and Proposal 2 in sysbench testing is insignificant
- however it is a fact, that Proposal 1 uses operations faster than those in Proposal 2 (that is: store()+load() is faster than exchange()), it's just that this is not such a critical part of executing sysbench, and gets drowned by alignment effects anyway

The main benefit of Proposal 1 at this point is that there might exist workloads other than the RW sysbench tests I wrote, or environments other than the one I tested on, where it is indeed faster than Proposal 2 in measurable way (this is what synthetic tests and analysis of the involved opcodes suggest, and also what Krunal's tests suggest).

The main benefit of Proposal 2 at this point is that the proof of correctness is somewhat shorter/simpler and does not rely on os_event's implementation details.

The main takeaway for me from this whole investigation is that code alignment effects are huge, in one of synthetic tests I did I could observe differences as large as 0.708s vs 0m0.845s caused by just shifting the code by 4 bytes (yes, these are repeatable results, like I run it tens of times) which is 20% difference.

I've tried compiling with -falign-{functions,jumps,labels,loops}=128, but this is a really bad idea on ARM64, because the read-modify retry loops appearing frequently in the code get aligned by this, and then the nops needed for such alignment are actually executed on path leading to the loop. Clang has a nicer --align-all-nofallthru-blocks which I haven't tested yet, which avoids adding nops which might get executed.

I'm not yet sure how to modify my future testing practices to account for code alignment issues, but perhaps some ideas like Layout Randomization from https://www.youtube.com/watch?v=r-TLSBdHe1A would help.

That is something new, interesting, and surprising given the modern processor. The addition of NOP seems to affect the performance of simple things by 20% is huge and as I see difference could be seen on Skylake too.

We all knew the code bloat problem due to an increase/lengthy function but normally compiler would optimize it to with flags on.
(InnoDB uses O2 optimization not sure if O3 does some trick).

4 bytes shift gets in the instruction code could affect performance is something new.

I wonder sometimes is a new-gen processor and new tools making our life (as a developer) difficult to write the program. Caring about cache-alignment, memory-alignment, processor compatibility, now instruction alignment when we should be spending time coding logic :(

Thanks for sharing detailed report.

I see we are not the only ones who've noticed that LSE's implementation of compare_and_swap (via `casal` instruction) is slower than a read-modify loop
https://community.arm.com/developer/f/infrastructure-solutions/47321/casal-instruction-is-...

Posted by developer:
 
Fixed as of the upcoming 8.0.22 release, and here's the proposed changelog entry from the documentation team:

The EventMutex spin lock implementation was optimized.

Posted by developer:
 
Changelog entry revised as follows:

Fixed as of the upcoming 8.0.22 release, and here's the proposed changelog entry from the documentation team:

"The TTASEventMutex::exit function was optimized for ARM64. 

Thanks to Krunal Bauskar for the contribution."

Thank you Daniel, Jakub and especially Krunal !!!!!

Thanks MySQL team for considering the idea and support to help improve MySQL on ARM.

Thank you Krunal for your contribution that has been added to 8.0.22: https://lefred.be/content/mysql-8-0-22-thank-you-for-the-contributions/