Bug #99219 Innodb freeze running REPLACE statements
Submitted: 9 Apr 2020 12:16 Modified: 10 Apr 2020 12:19
Reporter:          kfpanda kf
Status:            Not a Bug
Impact on me:      None
Category:          MySQL Server: InnoDB storage engine
Severity:          S2 (Serious)
Version:           5.7.27
OS:                Any
Assigned to:
CPU Architecture:  Any

[9 Apr 2020 12:16] kfpanda kf
Description:
Recently, the thread srv_error_monitor_thread was found stuck waiting for srv_innodb_monitor_mutex, even though the mutex was not held by any other thread. The problem is difficult to reproduce deterministically, but code analysis suggests there is a real issue.

The srv_error_monitor_thread should not acquire any locks, or at least should not block indefinitely waiting for one.

Bug #79185 shows a similar fix, but I ran into this problem in a later version.

Thread 2648 (Thread 0x7f2ef7bff700 (LWP 20799)):
#0  0x00007f302d3d1975 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000110ba11 in wait (this=<optimized out>) at /data/storage/innobase/os/os0event.cc:171
#2  os_event::wait_low (this=0x7f302b778478, reset_sig_count=<optimized out>) at /data/storage/innobase/os/os0event.cc:358
#3  0x000000000110c35a in os_event_wait_low (event=<optimized out>, reset_sig_count=<optimized out>) at /data/storage/innobase/os/os0event.cc:603
#4  0x00000000011ceb07 in sync_array_wait_event (arr=arr@entry=0x7f302b6a8c18, cell=@0x7f2ef7bfd3f0: 0x7f30295ee458) at /data/storage/innobase/sync/sync0arr.cc:475
#5  0x00000000011b82dc in TTASEventMutex<GenericPolicy>::wait (this=this@entry=0x22b8e80 <srv_innodb_monitor_mutex>, filename=filename@entry=0x1907908 "/data/storage/innobase/srv/srv0srv.cc", line=line@entry=1171, spin=4) at /data/storage/innobase/include/ut0mutex.ic:89
#6  0x00000000011bb94a in spin_and_try_lock (line=1171, filename=0x1907908 "/data/storage/innobase/srv/srv0srv.cc", max_delay=<optimized out>, max_spins=60, this=0x22b8e80 <srv_innodb_monitor_mutex>) at /data/storage/innobase/include/ib0mutex.h:850
#7  enter (line=1171, filename=0x1907908 "/data/storage/innobase/srv/srv0srv.cc", max_delay=<optimized out>, max_spins=<optimized out>, this=0x22b8e80 <srv_innodb_monitor_mutex>) at /data/storage/innobase/include/ib0mutex.h:707
#8  enter (name=0x1907908 "/data/storage/innobase/srv/srv0srv.cc", line=1171, n_delay=<optimized out>, n_spins=<optimized out>, this=0x22b8e80 <srv_innodb_monitor_mutex>) at /data/storage/innobase/include/ib0mutex.h:987
#9  srv_refresh_innodb_monitor_stats () at /data/storage/innobase/srv/srv0srv.cc:1171
#10 srv_error_monitor_thread (arg=<optimized out>) at /data/storage/innobase/srv/srv0srv.cc:1758
#11 0x00007f302d3cddf5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f302be8648d in clone () from /lib64/libc.so.6

How to repeat:
The problem is difficult to reproduce deterministically. It was found from the stack trace above and by code analysis; see the Description.

Suggested fix:
A simple way to fix it:

--- a/storage/innobase/srv/srv0srv.cc
+++ b/storage/innobase/srv/srv0srv.cc
@@ -1176,7 +1176,8 @@ void
 srv_refresh_innodb_monitor_stats(void)
 /*==================================*/
 {
-       mutex_enter(&srv_innodb_monitor_mutex);
+       if (mutex_enter_nowait(&srv_innodb_monitor_mutex))
+               return;
 
        srv_last_monitor_time = ut_time_monotonic();
[9 Apr 2020 13:03] MySQL Verification Team
Hi Mr. kf,

Thank you for your bug report.

However, I do not see a bug here. There are three functions that use this mutex: besides the one that you mentioned, there is also a function that prints the InnoDB status and one that exports it. Hence, three different threads could read or change the same data concurrently, which is why the mutex is so important.
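The concern can be illustrated with a small sketch (hypothetical stat fields, not the server's actual structures): the counters are updated with a read-modify-write, so every caller must hold the mutex or increments can be lost to a race between threads.

```cpp
#include <mutex>
#include <thread>

// Hypothetical shared monitor stats, analogous to the counters that the
// three InnoDB functions (refresh / print / export) all touch.
struct MonitorStats {
  std::mutex mu;
  long rows_read = 0;
};

// Each increment is a read-modify-write; the lock_guard makes the whole
// update atomic with respect to the other callers of this mutex.
void bump(MonitorStats &s, int times) {
  for (int i = 0; i < times; ++i) {
    std::lock_guard<std::mutex> g(s.mu);
    ++s.rows_read;
  }
}
```

If one caller simply skipped the mutex (as the suggested fix does for the refresh path), its reads and writes would no longer be synchronized with the other two functions.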

Hence, I do not think that simply returning would be an acceptable solution, certainly not based on the data that you have provided us.

What we would need is an analysis of which thread was holding this mutex and why it held it for so long. That information would be more helpful to both you and us. For that purpose you can use both the InnoDB status output and the Performance Schema.

Let us know what you have found out.
[9 Apr 2020 15:26] kfpanda kf
Because of the amount of content, I put the details in the attached file.

Note: I think this problem may be related to a missing memory barrier.
[9 Apr 2020 15:28] kfpanda kf
The detailed information

Attachment: detail.txt (text/plain), 13.71 KiB.

[10 Apr 2020 12:19] MySQL Verification Team
Hi Mr. kf,

Thank you for your reply.

However, your detailed stack traces only exhibit what I have described. There are three functions, each of which can be called by a different thread, which is why this mutex is very much necessary.

Regarding the memory fence, it is resolved in the OS kernel, not in user space.

Not a bug.