MySQL Bugs: #62614: Server deadlock between RESET MASTER and PURGE BINARY LOGS

Bug #62614	Server deadlock between RESET MASTER and PURGE BINARY LOGS
Submitted:	4 Oct 2011 8:46	Modified:	4 Jul 2014 6:41
Reporter:	Kristian Nielsen	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S1 (Critical)
Version:	5.5.17, 5.6, 5.7	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
PURGE BINARY LOGS takes mutexes in the wrong order, which can cause the entire
server to deadlock:

bool MYSQL_BIN_LOG::reset_logs(THD* thd)
{
  /*
    The following mutex is needed to ensure that no threads call
    'delete thd' as we would then risk missing a 'rollback' from this
    thread. If the transaction involved MyISAM tables, it should go
    into binlog even on rollback.
  */
  mysql_mutex_lock(&LOCK_thread_count);

  mysql_mutex_lock(&LOCK_log);
  mysql_mutex_lock(&LOCK_index);

Other parts of the server, for example RESET MASTER, take locks in the
opposite order (first LOCK_index then LOCK_thread_count). Thus a deadlock is
possible.

This was reproduced in MySQL 5.5.17 with this bzr revision:

    revid:vasil.dimov@oracle.com-20110909071254-jb86bz87bx33zyje

Note that since the deadlock involves LOCK_thread_count, when it occurs it
blocks any new queries from running.

How to repeat:
Here is a test case that reproduces the deadlock, hanging the server hard.

Since this is a race condition, the test case uses DEBUG_SYNC to reliably
reproduce the problem. I was actually able to reproduce without DEBUG_SYNC on
my machine (just delete all use of DEBUG_SYNC from the test case), however
this may or may not work on another machine/os/whatever.

To use the DEBUG_SYNC testcase, apply this patch:

=== modified file 'sql/log.cc'
--- sql/log.cc	2011-07-03 23:48:19 +0000
+++ sql/log.cc	2011-10-04 07:28:36 +0000
@@ -38,6 +38,7 @@
 #include "rpl_filter.h"
 #include "rpl_rli.h"
 #include "sql_audit.h"
+#include "debug_sync.h"
 
 #include <my_dir.h>
 #include <stdarg.h>
@@ -3373,6 +3374,7 @@ bool MYSQL_BIN_LOG::reset_logs(THD* thd)
 
   ha_reset_logs(thd);
 
+  DEBUG_SYNC(thd, "reset_logs_before_LOCK_thread_count");
   /*
     The following mutex is needed to ensure that no threads call
     'delete thd' as we would then risk missing a 'rollback' from this
@@ -3381,6 +3383,8 @@ bool MYSQL_BIN_LOG::reset_logs(THD* thd)
   */
   mysql_mutex_lock(&LOCK_thread_count);
 
+  DEBUG_SYNC(thd, "reset_logs_after_LOCK_thread_count");
+
   /*
     We need to get both locks to be sure that no one is trying to
     write to the index log file.
@@ -3670,8 +3674,10 @@ int MYSQL_BIN_LOG::purge_logs(const char
   DBUG_ENTER("purge_logs");
   DBUG_PRINT("info",("to_log= %s",to_log));
 
+  DEBUG_SYNC(thd, "purge_logs_before_LOCK_index");
   if (need_mutex)
     mysql_mutex_lock(&LOCK_index);
+  DEBUG_SYNC(thd, "purge_logs_after_LOCK_index");
   if ((error=find_log_pos(&log_info, to_log, 0 /*no mutex*/))) 
   {
     sql_print_error("MYSQL_BIN_LOG::purge_logs was called with file %s not "

Then run this test case:

--source include/have_log_bin.inc
--source include/have_debug_sync.inc

# Test case to reproduce a deadlock in binlog handling.
# RESET MASTER locks first LOCK_index and then LOCK_thread_count
# PURGE BINARY LOGS locks first LOCK_thread_count and LOCK_index.
# This can lead to deadlock if the two commands run in parallel, with
# one thread waiting to get LOCK_index and the other waiting for
# LOCK_thread_count

--connect(con1,localhost,root,,)
--connect(con2,localhost,root,,)

connection default;
CREATE TABLE t1 (a INT PRIMARY KEY);
FLUSH LOGS;
INSERT INTO t1 VALUES (1);
FLUSH LOGS;
INSERT INTO t1 VALUES (2);
FLUSH LOGS;
SHOW BINARY LOGS;

# Start a PURGE BINARY LOGS, pausing just before it takes the first lock.
connection con1;
SET DEBUG_SYNC= "purge_logs_before_LOCK_index SIGNAL con1_ready1 WAIT_FOR con1_cont1";
SET DEBUG_SYNC= "purge_logs_after_LOCK_index SIGNAL con1_ready2 WAIT_FOR con1_cont2 TIMEOUT 5";
send PURGE BINARY LOGS TO "master-bin.000002";
connection default;
SET DEBUG_SYNC= "now WAIT_FOR con1_ready1";

# Start RESET MASTER, pausing just before it takes the first lock.
connection con2;
SET DEBUG_SYNC= "reset_logs_before_LOCK_thread_count SIGNAL con2_ready1 WAIT_FOR con2_cont1";
send RESET MASTER;
connection default;
SET DEBUG_SYNC= "now WAIT_FOR con2_ready1";

connection default;
# Let PURGE BINARY LOGS take the first lock
SET DEBUG_SYNC= "now SIGNAL con1_cont1";
SET DEBUG_SYNC= "now WAIT_FOR con1_ready2";
# Now let con2 continue. It will take LOCK_thread_count and then wait for
# con1 to release LOCK_index.
# After 5 seconds con1 will timeout and then wait for con2 to release
# LOCK_thread_count -> deadlock.
SET DEBUG_SYNC= "now SIGNAL con2_cont1";

connection con1;
reap;
connection con2;
reap;
DROP TABLE t1;

The test case will time out due to the deadlock. The "analyze timeout" will
fail due to LOCK_thread_count blocking all queries from running. Manually
attaching gdb while the test case is running and doing `thread apply all bt`
will show how it is deadlocked.

Suggested fix:
An easy fix is to change MYSQL_BIN_LOG::reset_logs() to only lock
LOCK_thread_count _after_ LOCK_index.

But it is not clear to me if this still ensures that we do not miss a
rollback, as per the comment in the function.

It seems this was introduced with this commit:

revision-id: sven.sandberg@oracle.com-20110526105043-wykkmpktyye2kav6
parent: sergey.glukhov@oracle.com-20110526100639-o3tcac0chgy9m9ij
committer: Sven Sandberg <sven.sandberg@oracle.com>
branch nick: 5.1
timestamp: Thu 2011-05-26 12:50:43 +0200
message:
  BUG#12574820: binlog.binlog_tmp_table timing out in daily and weekly trunk run
  Problem: MYSQL_BIN_LOG::reset_logs acquires mutexes in wrong order.
  The correct order is first LOCK_thread_count and then LOCK_log. This function
  does it the other way around. This leads to deadlock when run in parallel
  with a thread that takes the two locks in correct order. For example, a thread
  that disconnects will take the locks in the correct order.
  Fix: change order of the locks in MYSQL_BIN_LOG::reset_logs:
  first LOCK_thread_count and then LOCK_log.

But I think it is wrong that LOCK_thread_count should be locked before
LOCK_index/LOCK_log. The LOCK_thread_count is a short-lived lock, holding it
blocks _all_ threads from starting a new query. LOCK_log is a long log, it is
held while doing I/O.

I could not find where a disconnecting thread will lock LOCK_log while holding
LOCK_thread_count, however if there is such a place I would suggest the
correct fix is to switch the locking order in the disconnecting thread.

In any case there are more places where LOCK_index is taken before
LOCK_thread_count. For example MYSQL_BIN_LOG::purge_logs() which calls
log_in_use(); the former holds LOCK_index and the latter takes
LOCK_thread_count.

Still happens on todays 5.7.3 code:

Thread 1: reset master
Thread 2: purge master logs before now()

Thread 1 waits for thread 2
Thread 2 waits for thread 1

Thread 1:
---------
Owns LOCK_thread_count
Owns LOCK_log
Waits for LOCK_index

	ntdll.dll!ZwWaitForSingleObject
	ntdll.dll!RtlpWaitOnCriticalSection
	ntdll.dll!RtlEnterCriticalSection
	mysqld-debug.exe!inline_mysql_mutex_lock
	mysqld-debug.exe!MYSQL_BIN_LOG::reset_logs
	mysqld-debug.exe!reset_master
	mysqld-debug.exe!reload_acl_and_cache
	mysqld-debug.exe!mysql_execute_command
	mysqld-debug.exe!mysql_parse
	mysqld-debug.exe!dispatch_command
	mysqld-debug.exe!do_command
	mysqld-debug.exe!handle_connection
	mysqld-debug.exe!pfs_spawn_thread
	mysqld-debug.exe!pthread_start
	mysqld-debug.exe!_callthreadstartex
	mysqld-debug.exe!_threadstartex

Thread 2:
---------
Owns LOCK_index
Waits for LOCK_thread_count

	ntdll.dll!ZwWaitForSingleObject
	ntdll.dll!RtlpWaitOnCriticalSection
	ntdll.dll!RtlEnterCriticalSection
	mysqld-debug.exe!inline_mysql_mutex_lock
	mysqld-debug.exe!log_in_use
	mysqld-debug.exe!MYSQL_BIN_LOG::purge_logs_before_date
	mysqld-debug.exe!purge_master_logs_before_date
	mysqld-debug.exe!mysql_execute_command
	mysqld-debug.exe!mysql_parse
	mysqld-debug.exe!dispatch_command
	mysqld-debug.exe!do_command
	mysqld-debug.exe!handle_connection
	mysqld-debug.exe!pfs_spawn_thread
	mysqld-debug.exe!pthread_start
	mysqld-debug.exe!_callthreadstartex
	mysqld-debug.exe!_threadstartex

cross referencing another deadlock problem:
http://bugs.mysql.com/bug.php?id=69954

This issue is fixed as part of bug#69954
(which is fixed in  5.5.39, 5.6.20, and 5.7.5 )