| Bug #62614 | Server deadlock between RESET MASTER and PURGE BINARY LOGS | ||
|---|---|---|---|
| Submitted: | 4 Oct 2011 8:46 | Modified: | 4 Jul 2014 6:41 |
| Reporter: | Kristian Nielsen | Email Updates: | |
| Status: | Duplicate | Impact on me: | |
| Category: | MySQL Server: Replication | Severity: | S1 (Critical) |
| Version: | 5.5.17, 5.6, 5.7 | OS: | Any |
| Assigned to: | CPU Architecture: | Any | |
[4 Oct 2011 9:48]
Kristian Nielsen
It seems this was introduced with this commit: revision-id: sven.sandberg@oracle.com-20110526105043-wykkmpktyye2kav6 parent: sergey.glukhov@oracle.com-20110526100639-o3tcac0chgy9m9ij committer: Sven Sandberg <sven.sandberg@oracle.com> branch nick: 5.1 timestamp: Thu 2011-05-26 12:50:43 +0200 message: BUG#12574820: binlog.binlog_tmp_table timing out in daily and weekly trunk run Problem: MYSQL_BIN_LOG::reset_logs acquires mutexes in wrong order. The correct order is first LOCK_thread_count and then LOCK_log. This function does it the other way around. This leads to deadlock when run in parallel with a thread that takes the two locks in correct order. For example, a thread that disconnects will take the locks in the correct order. Fix: change order of the locks in MYSQL_BIN_LOG::reset_logs: first LOCK_thread_count and then LOCK_log. But I think it is wrong that LOCK_thread_count should be locked before LOCK_index/LOCK_log. The LOCK_thread_count is a short-lived lock, holding it blocks _all_ threads from starting a new query. LOCK_log is a long log, it is held while doing I/O. I could not find where a disconnecting thread will lock LOCK_log while holding LOCK_thread_count, however if there is such a place I would suggest the correct fix is to switch the locking order in the disconnecting thread. In any case there are more places where LOCK_index is taken before LOCK_thread_count. For example MYSQL_BIN_LOG::purge_logs() which calls log_in_use(); the former holds LOCK_index and the latter takes LOCK_thread_count.
[12 Sep 2013 18:53]
MySQL Verification Team
Still happens on todays 5.7.3 code: Thread 1: reset master Thread 2: purge master logs before now() Thread 1 waits for thread 2 Thread 2 waits for thread 1 Thread 1: --------- Owns LOCK_thread_count Owns LOCK_log Waits for LOCK_index ntdll.dll!ZwWaitForSingleObject ntdll.dll!RtlpWaitOnCriticalSection ntdll.dll!RtlEnterCriticalSection mysqld-debug.exe!inline_mysql_mutex_lock mysqld-debug.exe!MYSQL_BIN_LOG::reset_logs mysqld-debug.exe!reset_master mysqld-debug.exe!reload_acl_and_cache mysqld-debug.exe!mysql_execute_command mysqld-debug.exe!mysql_parse mysqld-debug.exe!dispatch_command mysqld-debug.exe!do_command mysqld-debug.exe!handle_connection mysqld-debug.exe!pfs_spawn_thread mysqld-debug.exe!pthread_start mysqld-debug.exe!_callthreadstartex mysqld-debug.exe!_threadstartex Thread 2: --------- Owns LOCK_index Waits for LOCK_thread_count ntdll.dll!ZwWaitForSingleObject ntdll.dll!RtlpWaitOnCriticalSection ntdll.dll!RtlEnterCriticalSection mysqld-debug.exe!inline_mysql_mutex_lock mysqld-debug.exe!log_in_use mysqld-debug.exe!MYSQL_BIN_LOG::purge_logs_before_date mysqld-debug.exe!purge_master_logs_before_date mysqld-debug.exe!mysql_execute_command mysqld-debug.exe!mysql_parse mysqld-debug.exe!dispatch_command mysqld-debug.exe!do_command mysqld-debug.exe!handle_connection mysqld-debug.exe!pfs_spawn_thread mysqld-debug.exe!pthread_start mysqld-debug.exe!_callthreadstartex mysqld-debug.exe!_threadstartex
[12 Sep 2013 18:58]
MySQL Verification Team
cross referencing another deadlock problem: http://bugs.mysql.com/bug.php?id=69954
[4 Jul 2014 6:41]
Venkatesh Duggirala
This issue is fixed as part of bug#69954 (which is fixed in 5.5.39, 5.6.20, and 5.7.5 )

Description: PURGE BINARY LOGS takes mutexes in the wrong order, which can cause the entire server to deadlock: bool MYSQL_BIN_LOG::reset_logs(THD* thd) { /* The following mutex is needed to ensure that no threads call 'delete thd' as we would then risk missing a 'rollback' from this thread. If the transaction involved MyISAM tables, it should go into binlog even on rollback. */ mysql_mutex_lock(&LOCK_thread_count); mysql_mutex_lock(&LOCK_log); mysql_mutex_lock(&LOCK_index); Other parts of the server, for example RESET MASTER, take locks in the opposite order (first LOCK_index then LOCK_thread_count). Thus a deadlock is possible. This was reproduced in MySQL 5.5.17 with this bzr revision: revid:vasil.dimov@oracle.com-20110909071254-jb86bz87bx33zyje Note that since the deadlock involves LOCK_thread_count, when it occurs it blocks any new queries from running. How to repeat: Here is a test case that reproduces the deadlock, hanging the server hard. Since this is a race condition, the test case uses DEBUG_SYNC to reliably reproduce the problem. I was actually able to reproduce without DEBUG_SYNC on my machine (just delete all use of DEBUG_SYNC from the test case), however this may or may not work on another machine/os/whatever. To use the DEBUG_SYNC testcase, apply this patch: === modified file 'sql/log.cc' --- sql/log.cc 2011-07-03 23:48:19 +0000 +++ sql/log.cc 2011-10-04 07:28:36 +0000 @@ -38,6 +38,7 @@ #include "rpl_filter.h" #include "rpl_rli.h" #include "sql_audit.h" +#include "debug_sync.h" #include <my_dir.h> #include <stdarg.h> @@ -3373,6 +3374,7 @@ bool MYSQL_BIN_LOG::reset_logs(THD* thd) ha_reset_logs(thd); + DEBUG_SYNC(thd, "reset_logs_before_LOCK_thread_count"); /* The following mutex is needed to ensure that no threads call 'delete thd' as we would then risk missing a 'rollback' from this @@ -3381,6 +3383,8 @@ bool MYSQL_BIN_LOG::reset_logs(THD* thd) */ mysql_mutex_lock(&LOCK_thread_count); + DEBUG_SYNC(thd, "reset_logs_after_LOCK_thread_count"); + /* We need to get both locks to be sure that no one is trying to write to the index log file. @@ -3670,8 +3674,10 @@ int MYSQL_BIN_LOG::purge_logs(const char DBUG_ENTER("purge_logs"); DBUG_PRINT("info",("to_log= %s",to_log)); + DEBUG_SYNC(thd, "purge_logs_before_LOCK_index"); if (need_mutex) mysql_mutex_lock(&LOCK_index); + DEBUG_SYNC(thd, "purge_logs_after_LOCK_index"); if ((error=find_log_pos(&log_info, to_log, 0 /*no mutex*/))) { sql_print_error("MYSQL_BIN_LOG::purge_logs was called with file %s not " Then run this test case: --source include/have_log_bin.inc --source include/have_debug_sync.inc # Test case to reproduce a deadlock in binlog handling. # RESET MASTER locks first LOCK_index and then LOCK_thread_count # PURGE BINARY LOGS locks first LOCK_thread_count and LOCK_index. # This can lead to deadlock if the two commands run in parallel, with # one thread waiting to get LOCK_index and the other waiting for # LOCK_thread_count --connect(con1,localhost,root,,) --connect(con2,localhost,root,,) connection default; CREATE TABLE t1 (a INT PRIMARY KEY); FLUSH LOGS; INSERT INTO t1 VALUES (1); FLUSH LOGS; INSERT INTO t1 VALUES (2); FLUSH LOGS; SHOW BINARY LOGS; # Start a PURGE BINARY LOGS, pausing just before it takes the first lock. connection con1; SET DEBUG_SYNC= "purge_logs_before_LOCK_index SIGNAL con1_ready1 WAIT_FOR con1_cont1"; SET DEBUG_SYNC= "purge_logs_after_LOCK_index SIGNAL con1_ready2 WAIT_FOR con1_cont2 TIMEOUT 5"; send PURGE BINARY LOGS TO "master-bin.000002"; connection default; SET DEBUG_SYNC= "now WAIT_FOR con1_ready1"; # Start RESET MASTER, pausing just before it takes the first lock. connection con2; SET DEBUG_SYNC= "reset_logs_before_LOCK_thread_count SIGNAL con2_ready1 WAIT_FOR con2_cont1"; send RESET MASTER; connection default; SET DEBUG_SYNC= "now WAIT_FOR con2_ready1"; connection default; # Let PURGE BINARY LOGS take the first lock SET DEBUG_SYNC= "now SIGNAL con1_cont1"; SET DEBUG_SYNC= "now WAIT_FOR con1_ready2"; # Now let con2 continue. It will take LOCK_thread_count and then wait for # con1 to release LOCK_index. # After 5 seconds con1 will timeout and then wait for con2 to release # LOCK_thread_count -> deadlock. SET DEBUG_SYNC= "now SIGNAL con2_cont1"; connection con1; reap; connection con2; reap; DROP TABLE t1; The test case will time out due to the deadlock. The "analyze timeout" will fail due to LOCK_thread_count blocking all queries from running. Manually attaching gdb while the test case is running and doing `thread apply all bt` will show how it is deadlocked. Suggested fix: An easy fix is to change MYSQL_BIN_LOG::reset_logs() to only lock LOCK_thread_count _after_ LOCK_index. But it is not clear to me if this still ensures that we do not miss a rollback, as per the comment in the function.