Bug #96504 Refine atomics and barriers for weak memory order platform
Submitted: 12 Aug 2019 4:42 Modified: 12 Aug 2019 12:37
Reporter: Cai Yibo (OCA)
Status: Verified
Category: MySQL Server: InnoDB storage engine Severity: S5 (Performance)
Version: 8.0 OS: Any
Assigned to: CPU Architecture: ARM

[12 Aug 2019 4:42] Cai Yibo

After fixing a memory barrier bug on Arm [1], we studied the MySQL code related to the memory model (atomics and barriers) and found several possible optimizations. We plan to refine this code, which will introduce non-trivial changes, and would like to hear comments from the community first.

Some typical cases are listed below. Please review. Thanks.

Some memory barriers can be optimized

Some full memory barriers can be replaced with more relaxed ones.
E.g., in the bugfix "Insufficient memory barriers in the rw-lock implementation caused deadlocks on ARM" [1], we used wmb and rmb to enforce write and read ordering. These barriers could be replaced with load-acquire and store-release operations for better performance on Arm.
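
As a rough illustration of the load-acquire / store-release pattern mentioned above, here is a minimal sketch in standard C++11. The `Mailbox` type and the `publish` / `try_consume` functions are hypothetical names chosen for this example; they are not taken from the InnoDB rw-lock code.

```cpp
#include <atomic>

// Hypothetical type for illustration only; not part of the InnoDB sources.
struct Mailbox {
  int payload = 0;                 // plain data published through `ready`
  std::atomic<bool> ready{false};  // publication flag
};

// Writer side: the release store keeps the payload write ordered before
// the flag becomes visible, so no separate write barrier is needed.
inline void publish(Mailbox &m, int value) {
  m.payload = value;
  m.ready.store(true, std::memory_order_release);
}

// Reader side: the acquire load keeps the payload read ordered after the
// flag check, so no separate read barrier is needed.
inline bool try_consume(Mailbox &m, int &out) {
  if (!m.ready.load(std::memory_order_acquire)) {
    return false;  // nothing published yet
  }
  out = m.payload;
  return true;
}
```

On AArch64, compilers typically lower these two operations to stlr and ldar instructions instead of standalone barrier instructions, which is where the expected gain comes from.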

Memory model too strong for atomic operations

When MySQL is built with gcc, the legacy __sync built-ins, which imply a full barrier, are used by default for all atomic operations, rather than the recommended __atomic built-ins, which allow fine-grained control of memory order. Even where __atomic is used, it always requests the strongest ordering, sequential consistency. See [2] for an example.
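
To make the difference between the two families of built-ins concrete, here is a small sketch that assumes gcc or clang. The wrapper names `add_sync` and `add_atomic_seq_cst` are hypothetical and do not correspond to the actual InnoDB macros.

```cpp
#include <cstdint>

// Legacy built-in: no memory order argument, always acts as a full barrier.
inline std::uint64_t add_sync(std::uint64_t *p, std::uint64_t n) {
  return __sync_fetch_and_add(p, n);
}

// __atomic built-in: the third argument selects the memory order.
// Passing __ATOMIC_SEQ_CST everywhere gives up that control again.
inline std::uint64_t add_atomic_seq_cst(std::uint64_t *p, std::uint64_t n) {
  return __atomic_fetch_add(p, n, __ATOMIC_SEQ_CST);
}
```

Both functions return the value held before the addition; only the second form could be weakened later by choosing a different order constant.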

This is suboptimal for weak memory order platforms such as Arm or PPC. For most use cases a full memory barrier is overkill; we can leverage the C11 memory model to improve performance.
Some examples:
- For case [3], acquire/release ordering is more appropriate.
- For cases [4][5], only atomicity is required, so relaxed ordering should be enough.
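
For the atomicity-only case, a sketch of a counter with relaxed ordering might look as follows. `RelaxedCounter` is a hypothetical class written for this example; whether the counters in [4][5] can really tolerate relaxed ordering has to be checked against their callers.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical statistics counter; the name is illustrative only.
class RelaxedCounter {
 public:
  // Only the increment itself must be atomic; it does not order any
  // other memory accesses, so relaxed is sufficient.
  void inc(std::uint64_t n = 1) {
    count_.fetch_add(n, std::memory_order_relaxed);
  }

  // A relaxed read returns some value the counter held at some point,
  // which is usually acceptable for statistics.
  std::uint64_t get() const {
    return count_.load(std::memory_order_relaxed);
  }

 private:
  std::atomic<std::uint64_t> count_{0};
};
```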

Some memory orders are not accurate

For TAS (os_atomic_test_and_set) [6] and CAS (os_atomic_val_compare_and_swap) [7]:
- They force sequentially consistent ordering on Arm and PPC, which is not necessary.
- On x86, they use release order on success. However, they are called by tas_lock [8] and trylock [9] to acquire locks, which requires at least acquire order. The specified memory order is therefore inaccurate, though it causes no problem on x86, where a load-acquire is just a mov instruction [10].

By leveraging the C/C++11 memory model, we can use a consistent and clear memory order across all architectures.
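
A minimal sketch of what architecture-independent ordering could look like for a TAS/CAS based lock is shown below. `TasLock` and its member names are hypothetical and written only for illustration; this is not the InnoDB mutex implementation.

```cpp
#include <atomic>

// Hypothetical lock type for illustration; not the InnoDB implementation.
class TasLock {
 public:
  // TAS: acquire order on the exchange keeps the critical section from
  // being reordered before the lock is taken.
  bool try_lock_tas() {
    return !locked_.exchange(true, std::memory_order_acquire);
  }

  // CAS: acquire on success; a failed attempt publishes nothing and
  // protects nothing, so relaxed is enough on failure.
  bool try_lock_cas() {
    bool expected = false;
    return locked_.compare_exchange_strong(expected, true,
                                           std::memory_order_acquire,
                                           std::memory_order_relaxed);
  }

  // Release order makes writes from the critical section visible to the
  // next thread that acquires the lock.
  void unlock() { locked_.store(false, std::memory_order_release); }

 private:
  std::atomic<bool> locked_{false};
};
```

The same source then expresses the intended semantics on x86, Arm and PPC, and the compiler picks the cheapest instruction sequence for each target.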

[1] https://bugs.mysql.com/bug.php?id=94699

[2] https://github.com/mysql/mysql-server/blob/mysql-cluster-8.0.17/storage/innobase/include/o...

[3] https://github.com/mysql/mysql-server/blob/mysql-cluster-8.0.17/storage/innobase/include/o...

[4] https://github.com/mysql/mysql-server/blob/mysql-cluster-8.0.17/storage/innobase/include/s...

[5] https://github.com/mysql/mysql-server/blob/mysql-cluster-8.0.17/storage/innobase/buf/buf0b...

[6] https://github.com/mysql/mysql-server/blob/mysql-cluster-8.0.17/storage/innobase/include/o...

[7] https://github.com/mysql/mysql-server/blob/mysql-cluster-8.0.17/storage/innobase/include/o...

[8] https://github.com/mysql/mysql-server/blob/mysql-cluster-8.0.17/storage/innobase/include/i...

[9] https://github.com/mysql/mysql-server/blob/mysql-cluster-8.0.17/storage/innobase/include/i...

[10] https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

How to repeat:
[12 Aug 2019 12:37] Sinisa Milivojevic

Thank you for your report on the performance improvements for the ARM64 platform.

I have studied your report in detail and concluded that you are correct.

Verified as reported.
[12 Aug 2019 18:21] Omer Barnir
Set Architecture