Bug #114361 Unaligned Delegate's lock reduces performance
Submitted: 15 Mar 1:53 Modified: 18 Mar 1:57
Reporter: Xizhe Zhang (OCA) Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Replication Severity:S5 (Performance)
Version:8.0 OS:Any
Assigned to: CPU Architecture:Any
Tags: alignment, commit, lock

[15 Mar 1:53] Xizhe Zhang
Description:
Usually the Cache Line size is 64B. If the memory address of the lock is aligned according to 64B, then the lock struct can be completely loaded into a Cache Line of the CPU Cache. If the struct of a lock spans two Cache Lines, high-frequency access to the lock will reduce performance.

During our stress testing, we found that adding a global static variable that is not used at all in the code will cause performance degradation. We used perf to see the CPU hotspots before and after, and the differences are as follows:
# Event 'cycles'
#
# Baseline    Delta  Shared Object              Symbol
# ........  .......  .........................  .....................................................
#
     8.36%  +24.44%  [kernel.kallsyms]          [k] _raw_spin_lock
     2.34%   -0.97%  mysqld_jemalloc.0          [.] ut_delay
     1.64%   -0.45%  libc-2.17.so               [.] __memcpy_ssse3_back
     1.47%   -0.44%  mysqld_jemalloc.0          [.] mutex_enter_inline<PolicyMutex<TTASEventMutex<Gen
     1.28%   -0.41%  libpthread-2.17.so         [.] pthread_mutex_lock
# ...

As you can see, the biggest difference is _raw_spin_lock, and the call chain of the one with poorer performance is as follows:
-   32.79%    32.79%  mysqld           [kernel.kallsyms]          [k] _raw_spin_lock
   - _raw_spin_lock
      - 58.53% futex_wake
           do_futex
           sys_futex
         - system_call_fastpath
            - __lll_unlock_wake
               - 56.06% Delegate::read_lock
                  + 34.39% Trans_delegate::before_commit
                  + 22.81% Trans_delegate::trans_begin
                  + 21.83% Trans_delegate::before_dml
                  + 20.97% Trans_delegate::after_commit
               ...
      - 40.21% futex_wait_setup
           futex_wait
           do_futex
           sys_futex
         - system_call_fastpath
            - __lll_lock_wait
               - 54.55% Delegate::read_lock
                  + 34.19% Trans_delegate::before_commit
                  + 22.95% Trans_delegate::trans_begin
                  + 22.00% Trans_delegate::before_dml
                  + 20.86% Trans_delegate::after_commit
               ...

It can be found that the increase in CPU time occupied by _raw_spin_lock mainly comes from Trans_delegate. After our investigation, we found that this is because the memory of the object pointed by transaction_delegate is aligned according to 16B. When allocating memory during the initialization phase, the global static variables we defined affected the memory address of the object pointed by transaction_delegate. This causes the member variable lock of Trans_delegate to not be aligned according to 64B. The type of lock is mysql_rwlock_t and its size is 64B, and spanning the Cache Line makes access to the lock slower.

How to repeat:
Modifying in the source code to make transaction_delegate->lock spans two Cache Lines, which will show performance hot spots during high-concurrency stress testing.

Suggested fix:
The fix is actually very simple. Just align the Delegate::lock to 64B. The entire lock can be put into one Cache Line, so that the performance can be restored:

diff --git a/sql/rpl_handler.h b/sql/rpl_handler.h
index 7fcd63779e8..bce14673d43 100644
--- a/sql/rpl_handler.h
+++ b/sql/rpl_handler.h
@@ -211,9 +211,11 @@ class Delegate {
   /** List of registered observers */
   Observer_info_list observer_info_list;
   /**
-    A read/write lock to be used when not optimizing for static plugin config
+    A read/write lock to be used when not optimizing for static plugin config.
+    Align lock to CPU_LEVEL1_DCACHE_LINESIZE to ensure it can be loaded into
+    one Cache Line, reduce performance loss in high concurrency scenarios.
    */
-  mysql_rwlock_t lock;
+  alignas(CPU_LEVEL1_DCACHE_LINESIZE) mysql_rwlock_t lock;
   /**
     A shared-exclusive spin lock to be used when optimizing for static plugin
     config.
[15 Mar 10:54] MySQL Verification Team
Hi Mr. Zheng,

Thank you for your bug report.

We do not understand how adding a global static variable that is not used at all in the code can  cause performance degradation.

However, the rest of your analysis is impressive, including the profiling. Your remedy is simple and acceptable.

This is now a verified performance improvement report, with the patch that is added.

This is an improvement that is intended for 8.0 and higher versions.

Thank you for your contribution.
[15 Mar 10:56] MySQL Verification Team
Upon some further analysis, it could be applied to other locking mechanisms in MySQL, but this will require further, internal, analysis.
[15 Mar 13:07] Xizhe Zhang
Sorry for not clarifying the relationship between "global static variable" and "transaction_delegate->lock". The uninitialized "global static variable" is located in the BSS (Better Save Space) section. During compilation, the memory address of the variable can be determined and stored in the symbol table of the binary file.

The following is the initialization function of "transaction_delegate". Although the object is dynamically allocated, it uses the memory space of the static array "place_trans_mem". This array's memory is aligned according to 16B. When we define a "global static variable", its position in the BSS section is earlier than "place_trans_mem", increasing the address of "place_trans_mem". As a result, "transaction_delegate->lock" is not aligned according to 64B as described before.

int delegates_init() {
  // align as 16B
  alignas(Trans_delegate) static char place_trans_mem[sizeof(Trans_delegate)];
  ...
  transaction_delegate = new (place_trans_mem) Trans_delegate;
}
[18 Mar 2:55] Xizhe Zhang
Hello, Verification Team, this is my patch.

Attachment: delegate_lock.gitlog (application/octet-stream, text), 1.68 KiB.

[19 Mar 11:11] MySQL Verification Team
Hi Mr. Zhang,

Thank you for your patch.

We shall immediately make your patch available to our Development team.

Thanks again !!!!
[14 Aug 8:23] Libing Song
Hi MySQL Team,

The bug is still there. It is a one line fix which should be easy to fix.
[14 Aug 9:34] MySQL Verification Team
Hi Mr. Zhang,

We are in the midst of the vacation time and our Development is currently overloaded.

Hence, we all need some patience ......