MySQL Bugs: #70430: Race condition between purge coordinator and InnoDB monitor crash on shutdown

Bug #70430	Race condition between purge coordinator and InnoDB monitor crash on shutdown
Submitted:	26 Sep 2013 7:52	Modified:	28 Mar 2014 19:08
Reporter:	Laurynas Biveinis (OCA)	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S3 (Non-critical)
Version:	5.6	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
If InnoDB monitor is enabled, the following sequence is possible during the shutdown:

1) srv_monitor_thread enters its loop iteration;
2) shutdown is requested, srv_shutdown_state becomes SRV_SHUTDOWN_CLEANUP;
3) purge coordinator sees it, shuts down, sets purge_sys->state = PURGE_STATE_EXIT;
4) srv_monitor_thread resumes and crashes in srv_printf_innodb_monitor()/lock_print_info_summary() because of

	switch (purge_sys->state){
	case PURGE_STATE_EXIT:
	case PURGE_STATE_INIT:
		/* Should never be in this state while the system is running. */
		ut_error;

How to repeat:
Code analysis. Very intermittently seen on RQG.

Suggested fix:
Make InnoDB monitor accept stopped purge, or stop the monitor thread first.

Sample Percona Server 5.6.13 stacktraces:

crash:

Thread 1 (LWP 23301):
+bt
#0  0x000000363720c69c in pthread_kill () from /lib64/libpthread.so.0
#1  0x000000000066449e in handle_fatal_signal (sig=6) at /bzr/5.6-QA_opt/Percona-Server-5.6.13-rc60.5/sql/signal_handler.cc:254
#2  <signal handler called>
#3  0x0000003636e328a5 in raise () from /lib64/libc.so.6
#4  0x0000003636e34085 in abort () from /lib64/libc.so.6
#5  0x000000000096b648 in lock_print_info_summary (file=0x2beae90, nowait=<optimized out>) at /bzr/5.6-QA_opt/Percona-Server-5.6.13-rc60.5/storage/innobase/lock/lock0lock.cc:5212
#6  0x00000000009f6abb in srv_printf_innodb_monitor (file=0x2beae90, nowait=1, trx_start_pos=0x0, trx_end=0x0) at /bzr/5.6-QA_opt/Percona-Server-5.6.13-rc60.5/storage/innobase/srv/srv0srv.cc:1261
#7  0x00000000009f79a2 in srv_monitor_thread (arg=<optimized out>) at /bzr/5.6-QA_opt/Percona-Server-5.6.13-rc60.5/storage/innobase/srv/srv0srv.cc:1886
#8  0x0000003637207851 in start_thread () from /lib64/libpthread.so.0
#9  0x0000003636ee890d in clone () from /lib64/libc.so.6

shutdown:

Thread 35 (LWP 23549):
+bt
#0  0x0000003636ee14f3 in select () from /lib64/libc.so.6
#1  0x000000000099345f in os_thread_sleep (tm=<optimized out>) at /bzr/5.6-QA_opt/Percona-Server-5.6.13-rc60.5/storage/innobase/os/os0thread.cc:255
#2  0x000000000097b58e in logs_empty_and_mark_files_at_shutdown () at /bzr/5.6-QA_opt/Percona-Server-5.6.13-rc60.5/storage/innobase/log/log0log.cc:3274
#3  0x00000000009fb20f in innobase_shutdown_for_mysql () at /bzr/5.6-QA_opt/Percona-Server-5.6.13-rc60.5/storage/innobase/srv/srv0start.cc:2869
#4  0x000000000093d444 in innobase_end (hton=<optimized out>, type=<optimized out>) at /bzr/5.6-QA_opt/Percona-Server-5.6.13-rc60.5/storage/innobase/handler/ha_innodb.cc:3627
#5  0x00000000005aa42e in ha_finalize_handlerton (plugin=0x2bc4eb0) at /bzr/5.6-QA_opt/Percona-Server-5.6.13-rc60.5/sql/handler.cc:636
#6  0x00000000006f08fe in plugin_deinitialize (ref_check=true, plugin=0x2bc4eb0) at /bzr/5.6-QA_opt/Percona-Server-5.6.13-rc60.5/sql/sql_plugin.cc:937
#7  reap_plugins () at /bzr/5.6-QA_opt/Percona-Server-5.6.13-rc60.5/sql/sql_plugin.cc:1015
#8  0x00000000006f12d3 in plugin_shutdown () at /bzr/5.6-QA_opt/Percona-Server-5.6.13-rc60.5/sql/sql_plugin.cc:1700
#9  0x000000000059ea34 in clean_up (print_message=true) at /bzr/5.6-QA_opt/Percona-Server-5.6.13-rc60.5/sql/mysqld.cc:1840
#10 0x000000000059ee0e in unireg_end () at /bzr/5.6-QA_opt/Percona-Server-5.6.13-rc60.5/sql/mysqld.cc:1708
#11 0x00000000005a2038 in kill_server (sig_ptr=0x0) at /bzr/5.6-QA_opt/Percona-Server-5.6.13-rc60.5/sql/mysqld.cc:1636
#12 0x00000000005a223e in kill_server_thread (arg=<optimized out>) at /bzr/5.6-QA_opt/Percona-Server-5.6.13-rc60.5/sql/mysqld.cc:1659
#13 0x00000000009236aa in pfs_spawn_thread (arg=0x7f1988001b00) at /bzr/5.6-QA_opt/Percona-Server-5.6.13-rc60.5/storage/perfschema/pfs.cc:1858
#14 0x0000003637207851 in start_thread () from /lib64/libpthread.so.0
#15 0x0000003636ee890d in clone () from /lib64/libc.so.6

And the purge coordinator thread has already stopped.  Full stacktraces and additional data available at https://bugs.launchpad.net/percona-server/+bug/1230238

Thank you for the report.

I can not repeat described behavior if add condition to DECLARE_THREAD(srv_monitor_thread) waiting when purge_sys->state == PURGE_STATE_EXIT. Please provide repeatable test case, demonstrating the problem.

Source code change I used:

=== modified file 'storage/innobase/srv/srv0srv.cc'
--- storage/innobase/srv/srv0srv.cc	2013-11-29 12:56:54 +0000
+++ storage/innobase/srv/srv0srv.cc	2013-12-12 21:35:03 +0000
@@ -1536,6 +1536,7 @@
 	time_elapsed = difftime(current_time, last_monitor_time);
 
 	if (time_elapsed > 15) {
+while (purge_sys->state != PURGE_STATE_EXIT) {}
 		last_monitor_time = ut_time();
 
 		if (srv_print_innodb_monitor) {

Test case I used:

--source include/have_innodb.inc
create table innodb_monitor(f1 int) engine=innodb;
--sleep 20
select 1;
--shutdown_server

Sveta -

In your testcase purge is idle, thus the coordinator waiting for activity by sleeping on a non-timingout event, which is never signalled, hence the purge state is never set to PURGE_STATE_EXIT.

The easiest way to make the purge active there deterministically is this on the top of your patch:

=== modified file 'storage/innobase/srv/srv0srv.cc'
--- storage/innobase/srv/srv0srv.cc	2013-07-29 09:18:44 +0000
+++ storage/innobase/srv/srv0srv.cc	2013-12-13 11:05:22 +0000
@@ -2742,6 +2743,8 @@
 		/* If there are no records to purge or the last
 		purge didn't purge any records then wait for activity. */
 
+		n_total_purged = 1;
+
 		if (purge_sys->state == PURGE_STATE_STOP
 		    || n_total_purged == 0) {
 

Then running your testcase gives the expected result:

Thread 1 (Thread 0x7f4742146700 (LWP 24969)):
#0  0x00007f474fbd9f0c in __pthread_kill (threadid=<optimized out>, signo=6) at ../nptl/sysdeps/unix/sysv/linux/pthread_kill.c:62
#1  0x0000000000aa8ba9 in my_write_core (sig=6) at /home/laurynas/percona/src/5.6/mysys/stacktrace.c:422
#2  0x000000000071a4f3 in handle_fatal_signal (sig=6) at /home/laurynas/percona/src/5.6/sql/signal_handler.cc:248
#3  <signal handler called>
#4  0x00007f474f01df77 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#5  0x00007f474f0215e8 in __GI_abort () at abort.c:90
#6  0x0000000000bee043 in lock_print_info_summary (file=0x7f474f3a91a0 <_IO_2_1_stderr_>, nowait=1) at /home/laurynas/percona/src/5.6/storage/innobase/lock/lock0lock.cc:5184
#7  0x0000000000cd55d8 in srv_printf_innodb_monitor (file=0x7f474f3a91a0 <_IO_2_1_stderr_>, nowait=1, trx_start_pos=0x0, trx_end=0x0) at /home/laurynas/percona/src/5.6/storage/innobase/srv/srv0srv.cc:1175
#8  0x0000000000cd62a2 in srv_monitor_thread (arg=0x0) at /home/laurynas/percona/src/5.6/storage/innobase/srv/srv0srv.cc:1555
#9  0x00007f474fbd4f6e in start_thread (arg=0x7f4742146700) at pthread_create.c:311
#10 0x00007f474f0e19cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thank you for the feedback.

Verified as described.

Fixed as of 5.6.17, 5.7.4 and here's the changelog entry:

The "srv_monitor_thread" would crash in the "lock_print_info_summary()"
function due to a race condition between the "srv_monitor_thread" and
purge coordinator thread. 

Thank you for the bug report.

5.6$ bzr log -r 5753 -n0
------------------------------------------------------------
revno: 5753
committer: Yasufumi Kinoshita <yasufumi.kinoshita@oracle.com>
branch nick: mysql-5.6
timestamp: Thu 2014-01-16 11:34:22 +0900
message:
  Bug#17980590 : RACE CONDITION BETWEEN PURGE COORDINATOR AND INNODB MONITOR CRASH ON SHUTDOWN
  
  lock_print_info_summary() should treat purge_sys->state==PURGE_STATE_EXIT case, instead of ut_error crash.
  Because it might be called after purge_thread exited.
  
  Approved by Sunny Bains in rb#4209