MySQL Bugs: #113760: Insufficient memory barriers in rw-lock implementation caused crash on ARM arch

Bug #113760	Insufficient memory barriers in rw-lock implementation caused crash on ARM arch
Submitted:	25 Jan 2024 13:00	Modified:	25 Jan 2024 14:00
Reporter:	Brian Yue (OCA)
Status:	Unsupported
Category:	MySQL Server: InnoDB storage engine	Severity:	S3 (Non-critical)
Version:	5.7.44	OS:	Linux
Assigned to:		CPU Architecture:	ARM
Tags:	arm, crash, memory barrier

Description:
Hello,
  We recently hit several crashes of MySQL 5.7 on the ARM architecture, and eventually found and fixed the cause.
  The cause is similar to the previously fixed bug "Bug #29508001 MYSQL DEADLOCK AND BUGCHECK ON AARCH64 UNDER STRESS TEST", but in a different function. Three rw-lock functions share a similar implementation that is unsafe under ARM's weakly ordered memory model:
  1. rw_lock_x_lock_low in storage/innobase/sync/sync0rw.cc
  2. rw_lock_sx_lock_low in storage/innobase/sync/sync0rw.cc
  3. rw_lock_x_lock_func_nowait in storage/innobase/include/sync0rw.ic

  In MySQL 5.7, only rw_lock_x_lock_low and rw_lock_sx_lock_low were fixed by "Bug #29508001 MYSQL DEADLOCK AND BUGCHECK ON AARCH64 UNDER STRESS TEST"; rw_lock_x_lock_func_nowait was missed.

  However, in MySQL 8.0.20 rw_lock_x_lock_func_nowait was fixed by "Bug #30401416 RWLOCK:REFINE LOCK->RECURSIVE WITH C11 ATOMICS", which uses std::atomic in the modern C++ way:
```
    if (!pass && lock->recursive.load(std::memory_order_acquire) &&
        os_thread_eq(lock->writer_thread, os_thread_get_curr_id())) {
```
  But this change was not backported to MySQL 5.7, so a crash or deadlock can happen when rw-lock requests are made concurrently through rw_lock_x_lock_func_nowait. I guess this was an oversight. On my MySQL server, if the application keeps deleting many records, the DML thread and the background purge thread (or an ibuf merge operation) may modify the same buffer page concurrently, and the page ends up damaged because the rw-lock fails to protect it.
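
  To illustrate the ordering that the 8.0 code relies on, here is a minimal standalone sketch. It is not InnoDB code: DemoLock, demo_thread_id, demo_set_writer and demo_is_relock are hypothetical names, and the thread id is a plain integer passed in by the caller. The writer stores writer_thread first and then publishes it with a release store to recursive. The reader does an acquire load of recursive before it reads writer_thread, so it cannot see the flag set while reading a stale owner.
```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for os_thread_id_t (assumption, not the InnoDB type).
using demo_thread_id = unsigned long;

// Hypothetical reduction of the two rw_lock_t fields involved in the race.
struct DemoLock {
  std::atomic<bool> recursive{false};
  demo_thread_id writer_thread{0};  // plain field, published via `recursive`
};

// Writer side: record the owner, then publish it with release ordering.
inline void demo_set_writer(DemoLock &lock, demo_thread_id self) {
  lock.writer_thread = self;
  lock.recursive.store(true, std::memory_order_release);
}

// Reader side: the acquire load orders the later read of writer_thread
// after the flag check, mirroring the 8.0 condition quoted above.
inline bool demo_is_relock(const DemoLock &lock, demo_thread_id self) {
  return lock.recursive.load(std::memory_order_acquire) &&
         lock.writer_thread == self;
}
```
  On x86 a plain load would usually behave the same way because of its stronger hardware ordering, which is probably why the problem only shows up on ARM.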

How to repeat:
This issue is not easy to reproduce, but I believe the code analysis above is sufficient. Still, the steps that produced the issue for us are as follows:

(1) make a table with 10,000,000 records using sysbench
(2) keep deleting records in batches, 1000 records per transaction:
delete from sbtest1 limit 1000;
(3) keep trying ...

Suggested fix:
diff --git a/storage/innobase/include/sync0rw.ic b/storage/innobase/include/sync0rw.ic
index c48df97..5a62593 100644
--- a/storage/innobase/include/sync0rw.ic
+++ b/storage/innobase/include/sync0rw.ic
@@ -462,12 +462,27 @@ rw_lock_x_lock_func_nowait(
        mutex_exit(&(lock->mutex));

 #endif
-       if (success) {
-               rw_lock_set_writer_id_and_recursion_flag(lock, true);

-       } else if (lock->recursive
-                  && os_thread_eq(lock->writer_thread,
-                                  os_thread_get_curr_id())) {
+       /*
+               Same question as GDB-325546, use os_rmb to fit in 5.7* VERSION.
+               Fix in 8.0 VERSION see <Bug #30401416>.
+       */
+       bool recursive;
+       os_thread_id_t writer_thread;
+       if (!success)
+       {
+               recursive = lock->recursive;
+               os_rmb;
+               writer_thread = lock->writer_thread;
+       }
+
+       if (success)
+       {
+               rw_lock_set_writer_id_and_recursion_flag(lock, true);
+       }
+       else if (recursive && os_thread_eq(writer_thread,
+                                     os_thread_get_curr_id()))
+       {
                /* Relock: this lock_word modification is safe since no other
                threads can modify (lock, unlock, or reserve) lock_word while
                there is an exclusive writer and this is the writer thread. */
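
The read pattern in the patch above (load recursive, barrier, load writer_thread) can be sketched in portable C++ as follows. This is an illustration under assumptions, not the 5.7 code: DemoLock57, demo57_set_writer and demo57_is_relock are hypothetical names, std::atomic_thread_fence stands in for os_rmb, and a matching release fence on the writer side is assumed.
```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for os_thread_id_t (assumption, not the InnoDB type).
using demo_thread_id = unsigned long;

// Hypothetical reduction of the lock fields; relaxed atomics model the
// plain 5.7 fields without introducing a C++ data race.
struct DemoLock57 {
  std::atomic<bool> recursive{false};
  std::atomic<demo_thread_id> writer_thread{0};
};

// Writer side: owner first, release fence, then the flag.
inline void demo57_set_writer(DemoLock57 &lock, demo_thread_id self) {
  lock.writer_thread.store(self, std::memory_order_relaxed);
  std::atomic_thread_fence(std::memory_order_release);
  lock.recursive.store(true, std::memory_order_relaxed);
}

// Reader side: same order as the patch. The acquire fence plays the role
// of os_rmb, so writer_thread is read only after recursive.
inline bool demo57_is_relock(const DemoLock57 &lock, demo_thread_id self) {
  bool recursive = lock.recursive.load(std::memory_order_relaxed);
  std::atomic_thread_fence(std::memory_order_acquire);
  demo_thread_id writer = lock.writer_thread.load(std::memory_order_relaxed);
  return recursive && writer == self;
}
```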

Hi Mr. Yue,

Thank you very much for your bug report.

However, we have to inform you that it has been a while since we stopped any maintenance of the version 5.7.

If you can provide a test case that crashes our 8.0.36 or 8.3.0 server, then please, do provide a test case in pure SQL !!!!!

Thanks in advance.