Bug #99230 The thread srv_error_monitor_thread hangs.
Submitted: 10 Apr 2020 14:04    Modified: 13 Apr 2020 12:28
Reporter:  kfpanda kf           Email Updates:
Status:    Not a Bug            Impact on me: None
Category:  MySQL Server: InnoDB storage engine    Severity: S2 (Serious)
Version:   5.7                  OS: Any
Assigned to:                    CPU Architecture: Any

[10 Apr 2020 14:04] kfpanda kf
Description:
Briefly, the thread srv_error_monitor_thread is waiting for the lock srv_innodb_monitor_mutex, but the lock is not held by any thread.
The reason I say that no thread holds srv_innodb_monitor_mutex is that the stack below shows __owner = 0.

I analyzed the functions that lock srv_innodb_monitor_mutex. Three functions use this lock: srv_refresh_innodb_monitor_stats,
srv_export_innodb_status, and srv_printf_innodb_monitor. All of them release the lock correctly.
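
For reference, the acquire/release pattern in these functions looks roughly like the following (a simplified sketch of srv_refresh_innodb_monitor_stats() from 5.7 srv0srv.cc; the body is abbreviated and only the locking structure is shown):

void
srv_refresh_innodb_monitor_stats(void)
{
	mutex_enter(&srv_innodb_monitor_mutex);	/* blocks until the mutex is free */

	srv_last_monitor_time = ut_time_monotonic();

	/* ... reset the various monitor counters ... */

	mutex_exit(&srv_innodb_monitor_mutex);	/* released on every code path */
}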

Therefore, I think the lock problem may be caused by a missing memory barrier.
The thread srv_error_monitor_thread is the thread that normally works around such memory-barrier problems with locks, but here it gets stuck before it can do so.
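
For context, my understanding of the error monitor loop (paraphrased and heavily abbreviated from srv_error_monitor_thread() in 5.7 srv0srv.cc; the real function does much more) is:

loop:
	/* roughly once a minute, reset the averages used by the InnoDB monitor */
	srv_refresh_innodb_monitor_stats();	/* frame #9 above: stuck here on srv_innodb_monitor_mutex */

	/* In case mutex_exit() is not a memory barrier, some threads may be left
	waiting although the semaphore was already released; wake them up. */
	sync_arr_wake_threads_if_sema_free();

	/* check for very long semaphore waits, sleep about a second, repeat */
	os_thread_sleep(1000000);
	goto loop;

So the thread that is supposed to wake up waiters stuck by a missed signal is itself blocked before it reaches that code.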

Below is the stack information
------------------------------------------
(gdb) thr 2648
[Switching to thread 2648 (Thread 0x7f2ef7bff700 (LWP 20799))]
#0  0x00007f302d3d1975 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f302d3d1975 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000110ba11 in wait (this=<optimized out>) at /data//storage/innobase/os/os0event.cc:171
#2  os_event::wait_low (this=0x7f302b778478, reset_sig_count=<optimized out>) at /data//storage/innobase/os/os0event.cc:358
#3  0x000000000110c35a in os_event_wait_low (event=<optimized out>, reset_sig_count=<optimized out>) at /data//storage/innobase/os/os0event.cc:603
#4  0x00000000011ceb07 in sync_array_wait_event (arr=arr@entry=0x7f302b6a8c18, cell=@0x7f2ef7bfd3f0: 0x7f30295ee458) at /data//storage/innobase/sync/sync0arr.cc:475
#5  0x00000000011b82dc in TTASEventMutex<GenericPolicy>::wait (this=this@entry=0x22b8e80 <srv_innodb_monitor_mutex>, filename=filename@entry=0x1907908 "/data//storage/innobase/srv/srv0srv.cc", 
    line=line@entry=1171, spin=4) at /data//storage/innobase/include/ut0mutex.ic:89
#6  0x00000000011bb94a in spin_and_try_lock (line=1171, filename=0x1907908 "/data//storage/innobase/srv/srv0srv.cc", max_delay=<optimized out>, max_spins=60, 
    this=0x22b8e80 <srv_innodb_monitor_mutex>) at /data//storage/innobase/include/ib0mutex.h:850
#7  enter (line=1171, filename=0x1907908 "/data//storage/innobase/srv/srv0srv.cc", max_delay=<optimized out>, max_spins=<optimized out>, this=0x22b8e80 <srv_innodb_monitor_mutex>)
    at /data//storage/innobase/include/ib0mutex.h:707
#8  enter (name=0x1907908 "/data//storage/innobase/srv/srv0srv.cc", line=1171, n_delay=<optimized out>, n_spins=<optimized out>, this=0x22b8e80 <srv_innodb_monitor_mutex>)
    at /data//storage/innobase/include/ib0mutex.h:987
#9  srv_refresh_innodb_monitor_stats () at /data//storage/innobase/srv/srv0srv.cc:1171
#10 srv_error_monitor_thread (arg=<optimized out>) at /data//storage/innobase/srv/srv0srv.cc:1758
#11 0x00007f302d3cddf5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f302be8648d in clone () from /lib64/libc.so.6
(gdb) f 6 
#6  0x00000000011bb94a in spin_and_try_lock (line=1171, filename=0x1907908 "/data//storage/innobase/srv/srv0srv.cc", max_delay=<optimized out>, max_spins=60, 
    this=0x22b8e80 <srv_innodb_monitor_mutex>) at /data//storage/innobase/include/ib0mutex.h:850
850	/data//storage/innobase/include/ib0mutex.h: No such file or directory.
(gdb) p *this
$2 = {m_lock_word = 1, m_waiters = 1, m_event = 0x7f302b778478, m_policy = {m_count = {m_spins = 0, m_waits = 0, m_calls = 0, m_enabled = false}, m_id = LATCH_ID_SRV_INNODB_MONITOR}}
(gdb) p *this->m_event
$3 = {m_set = false, signal_count = 1, mutex = {m_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 3, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = '\000' <repeats 12 times>, "\003", '\000' <repeats 26 times>, __align = 0}}, cond_var = {__data = {__lock = 0, __futex = 3, __total_seq = 3, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x7f302b778488, __nwaiters = 7, 
      __broadcast_seq = 0}, __size = "\000\000\000\000\003\000\000\000\003", '\000' <repeats 23 times>, "\210\204w+0\177\000\000\a\000\000\000\000\000\000", __align = 12884901888}, static cond_attr = {__size = "\002\000\000", 
    __align = 2}, static cond_attr_has_monotonic_clock = true, Python Exception <type 'exceptions.ValueError'> Cannot find type event_iter_t::_Node: 
static global_initialized = true, event_iter = }

==================================
I reported a related bug earlier, but perhaps it was not understood: Bug #79185.
The community has fixed similar bugs before: Bug #99219.

How to repeat:
The problem is difficult to reproduce, but from the stack and the code analysis it looks like something is wrong.

Suggested fix:
A simple way to fix it:

--- a/storage/innobase/srv/srv0srv.cc
+++ b/storage/innobase/srv/srv0srv.cc
@@ -1176,7 +1176,8 @@ void
 srv_refresh_innodb_monitor_stats(void)
 /*==================================*/
 {
-       mutex_enter(&srv_innodb_monitor_mutex);
+       if (mutex_enter_nowait(&srv_innodb_monitor_mutex))
+               return;
 
        srv_last_monitor_time = ut_time_monotonic();
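
To illustrate the intent: mutex_enter_nowait() is a non-blocking acquire (if I read the API correctly, it returns 0 on success and non-zero when the mutex is busy), so with this patch the error monitor thread would simply skip one refresh round instead of sleeping on the mutex. A minimal, generic sketch of the same pattern, using std::mutex rather than the InnoDB mutex API, and with hypothetical names:

#include <mutex>

std::mutex monitor_mutex;        // stands in for srv_innodb_monitor_mutex
long       last_monitor_time = 0; // stands in for srv_last_monitor_time

/* Hypothetical illustration only: refresh the stats if the mutex is free,
otherwise give up immediately instead of blocking the calling thread. */
void
refresh_monitor_stats_nowait(long now)
{
	if (!monitor_mutex.try_lock()) {
		return;		/* someone else holds the lock; skip this round */
	}

	last_monitor_time = now;
	/* ... reset counters ... */

	monitor_mutex.unlock();
}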
[10 Apr 2020 14:42] kfpanda kf
I reported a related bug earlier, but perhaps it was not understood: Bug #99219.
The community has fixed similar bugs before: Bug #79185.
[13 Apr 2020 12:28] MySQL Verification Team
Hi Mr. kf,

Thank you for your bug report.

First of all, you are mentioning bug #79185, which is not similar to what you are reporting.

You are reporting expected behaviour. When there is a large number of updates, the mutex that you are mentioning can be held for a longer time. During that time, if any other thread attempts to do the same, or to run SHOW ENGINE INNODB STATUS or similar, it has to wait. It is irrelevant whether that thread owns the mutex or not. The owner of the mutex is the thread that created it, but any other thread can use it. There are situations where a thread has to own the mutex, but this is not one of them.

In a multi-threaded program like our server, you have to use mutex locks in order to prevent two threads from reading and writing the same data at the same time. A memory fence does not help, as a memory fence only orders individual CPU instructions. Here, we need to protect entire blocks of code containing hundreds of CPU instructions or more. This is why we cannot remove mutex locks from our code.
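
As a generic illustration of that point (not server code): a fence can order individual loads and stores, but only a lock keeps a whole multi-statement block from being executed by two threads at once.

#include <mutex>

/* Hypothetical example: two counters that must stay consistent with each
other. A memory fence could order the two stores, but it cannot stop another
thread from entering this block concurrently; the mutex can. */
struct monitor_stats_t {
	std::mutex	mtx;
	long		n_reads = 0;
	long		n_writes = 0;

	void record(long reads, long writes)
	{
		std::lock_guard<std::mutex> guard(mtx);	/* protects the whole block */
		n_reads += reads;
		n_writes += writes;
	}
};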

In highly loaded environments, this leads to lock waits. We cannot prevent that, for the reasons explained above.

You have also not encountered an assert due to the wait on the semaphore. We suggest that you analyse your slow query log and see how long your DMLs are running. That would explain your situation.