Description:
Hello,
We recently hit several crashes of MySQL 5.7 on the ARM architecture, and we have found and fixed the cause.
The cause is similar to the previously fixed bug "Bug #29508001 MYSQL DEADLOCK AND BUGCHECK ON AARCH64 UNDER STRESS TEST", but it is in a different function. Three rw-lock functions share a similar implementation that is unsafe under ARM's weakly ordered memory model:
1. rw_lock_x_lock_low in storage/innobase/sync/sync0rw.cc
2. rw_lock_sx_lock_low in storage/innobase/sync/sync0rw.cc
3. rw_lock_x_lock_func_nowait in storage/innobase/include/sync0rw.ic
In MySQL 5.7, only rw_lock_x_lock_low and rw_lock_sx_lock_low were fixed by "Bug #29508001 MYSQL DEADLOCK AND BUGCHECK ON AARCH64 UNDER STRESS TEST"; rw_lock_x_lock_func_nowait was missed.
In MySQL 8.0.20, however, rw_lock_x_lock_func_nowait was fixed by "Bug #30401416 RWLOCK:REFINE LOCK->RECURSIVE WITH C11 ATOMICS", which uses std::atomic in modern C++ style:
```
if (!pass && lock->recursive.load(std::memory_order_acquire) &&
    os_thread_eq(lock->writer_thread, os_thread_get_curr_id())) {
```
However, this change was not backported to MySQL 5.7, so crashes or deadlocks can occur when concurrent rw-lock requests go through rw_lock_x_lock_func_nowait. I suspect this was an oversight. On my MySQL server, when the application keeps deleting many records, the DML thread and the background purge thread (or an ibuf merge operation) may end up modifying the same buffer page concurrently and corrupting it, because the rw-lock no longer guarantees mutual exclusion.
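To illustrate the ordering requirement described above, here is a minimal, self-contained sketch. It is not InnoDB code: `DemoLock`, `demo_set_writer`, `demo_is_relock` and `demo_is_relock_fenced` are hypothetical names, and `std::thread::id` stands in for `os_thread_id_t`. The assumption is that the writer publishes `writer_thread` before setting `recursive`, so the reader must not read `writer_thread` until it has observed `recursive` with acquire semantics. The second reader variant uses a standalone acquire fence, which is only a rough portable analogue of the read barrier used in the 5.7 patch.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical, simplified stand-in for the rw_lock_t fields involved.
struct DemoLock {
  std::thread::id writer_thread{};     // plain field, published via `recursive`
  std::atomic<bool> recursive{false};  // flag that guards `writer_thread`
};

// Writer side: store the owner id first, then set the flag with release
// ordering so the id is visible to any thread that observes the flag.
void demo_set_writer(DemoLock &lock) {
  assert(!lock.recursive.load(std::memory_order_relaxed));
  lock.writer_thread = std::this_thread::get_id();
  lock.recursive.store(true, std::memory_order_release);
}

// Reader side, in the style of the 8.0.20 excerpt: the acquire load keeps
// the later read of `writer_thread` from being reordered before it.
bool demo_is_relock(const DemoLock &lock) {
  return lock.recursive.load(std::memory_order_acquire) &&
         lock.writer_thread == std::this_thread::get_id();
}

// Reader side with a separate barrier: a relaxed load followed by an
// acquire fence gives the same ordering guarantee for the later read.
bool demo_is_relock_fenced(const DemoLock &lock) {
  const bool recursive = lock.recursive.load(std::memory_order_relaxed);
  std::atomic_thread_fence(std::memory_order_acquire);
  return recursive && lock.writer_thread == std::this_thread::get_id();
}
```

Without the acquire load or the fence, a weakly ordered CPU is allowed to read `writer_thread` before `recursive`, so the two values may come from different moments in time. That is the kind of inconsistency the relock check must rule out.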
How to repeat:
This issue is not easy to reproduce, but I believe the code analysis above is sufficient. Nevertheless, the steps we used to trigger it are as follows:
(1) make a table with 10,000,000 records using sysbench
(2) keep deleting records in batches, 1000 records per transaction:
delete from sbtest1 limit 1000;
(3) keep trying ...
Suggested fix:
diff --git a/storage/innobase/include/sync0rw.ic b/storage/innobase/include/sync0rw.ic
index c48df97..5a62593 100644
--- a/storage/innobase/include/sync0rw.ic
+++ b/storage/innobase/include/sync0rw.ic
@@ -462,12 +462,27 @@ rw_lock_x_lock_func_nowait(
mutex_exit(&(lock->mutex));
#endif
- if (success) {
- rw_lock_set_writer_id_and_recursion_flag(lock, true);
- } else if (lock->recursive
- && os_thread_eq(lock->writer_thread,
- os_thread_get_curr_id())) {
+ /*
+ Same question as GDB-325546, use os_rmb to fit in 5.7* VERSION.
+ Fix in 8.0 VERSION see <Bug #30401416>.
+ */
+ bool recursive;
+ os_thread_id_t writer_thread;
+ if (!success)
+ {
+ recursive = lock->recursive;
+ os_rmb;
+ writer_thread = lock->writer_thread;
+ }
+
+ if (success)
+ {
+ rw_lock_set_writer_id_and_recursion_flag(lock, true);
+ }
+ else if (recursive && os_thread_eq(writer_thread,
+ os_thread_get_curr_id()))
+ {
/* Relock: this lock_word modification is safe since no other
threads can modify (lock, unlock, or reserve) lock_word while
there is an exclusive writer and this is the writer thread. */