| Bug #56299 | mutex deadlock with COM_BINLOG_DUMP, binlog purge, and processlist/kill | ||
|---|---|---|---|
| Submitted: | 26 Aug 2010 16:16 | Modified: | 7 Nov 2011 17:24 |
| Reporter: | Shane Bester (Platinum Quality Contributor) | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Server: Locking | Severity: | S1 (Critical) |
| Version: | 5.1.50,5.5.5 | OS: | Any |
| Assigned to: | Sven Sandberg | CPU Architecture: | Any |
[27 Aug 2010 5:25]
MySQL Verification Team
testcase. run against a master with --max-binlog-size=4k. slave will reconnect a few times due to kill. hangup in a few seconds.
Attachment: bug56299.c (text/plain), 6.35 KiB.
[27 Aug 2010 5:31]
MySQL Verification Team
testcase will not work against 5.0 since it uses information_schema.processlist. master is running like this: mysqld --no-defaults --console --skip-gr --skip-na --log-bin --expire-logs-days=1 --server-id=1 --max-binlog-size=4096 slave is running like this: mysqld --no-defaults --server-id=2 --port=3307 --skip-gr --skip-na --console make sure the slave binlog dump thread is killed many times, and slave reconnects by itself.
[5 Nov 2010 17:00]
Sven Sandberg
The deadlock happens when the following four things happen at the same time:
- An old dump thread is waiting for the binlog to grow.
- The slave server that replicates from the old dump thread tries to reconnect.
During reconnection, the new dump thread tries to kill the old dump thread.
- A KILL statement tries to kill the old dump thread.
- An INSERT statement causes a binlog rotation.
The following lists the stacks for the four threads. Control flow go from top to bottom in the list, so the stack top is in the bottom. Lines that either do function calls or acquire a lock are listed.
old dump thread:
sql_parse:1440:dispatch_command
sql_repl.cc:668:mysql_binlog_send
log.cc:4879:MYSQL_BIN_LOG::wait_for_update
This thread is waiting without holding any mutexes. It has just set
THD::mysys_var::current_mutex = LOCK_log
INSERT thread:
sql_insert.cc:921:mysql_insert
sql_class.cc:3913:THD::binlog_query
log.cc:4263:MYSQL_BIN_LOG::write LOCK_log *
log.cc:4411:MYSQL_BIN_LOG::write
log.cc:4524:MYSQL_BIN_LOG::rotate_and_purge
log.cc:3631:MYSQL_BIN_LOG::purge_logs_before_date LOCK_index
log.cc:3639:MYSQL_BIN_LOG::purge_logs_before_date
sql_repl.cc:215:log_in_use LOCK_thread_count *
COM_BINLOG_DUMP thread:
sql_parse.cc:1435:dispatch_command
sql_repl.cc:1096:kill_zombie_dump_threads LOCK_thread_count *
sql_repl.cc:1106:kill_zombie_dump_threads old_dump_thd.LOCK_thd_data *
KILL thread:
sql_parse.cc:4031:mysql_execute_command
sql_parse.cc:7063:sql_kill
sql_parse.cc:7014:kill_one_thread old_dump_thd.LOCK_thd_data *
sql_parse.cc:7038:kill_one_thread
sql_class.cc:1092:old_dump_thd.awake old_dump_thd.mysys_var.mutex
sql_class.cc:1117:old_dump_thd.awake old_dump_thd.mysys_var.current_mutex
= LOCK_log *
Deadlock summary:
- INSERT takes LOCK_log before LOCK_thread_count
- COM_BINLOG_DUMP takes LOCK_thread_count before old_dump_thd.LOCK_thd_data
- KILL takes old_dump_thd.LOCK_thd_data before LOCK_log
Which thread has wrong lock order?
- It's probably necessary that kill_zombie_dump_threads takes
LOCK_thread_count before old_dump_thd.LOCK_thd_data, because that prevents
old_dump_thd from being deleted under the feet of kill_zombie_dump_threads.
- KILL is a generic statement that probably has a reason to take locks in
this order.
- So the lock order should be:
LOCK_thread_count, THD::LOCK_thd_data, LOCK_log
So it's the log rotation that violates lock order by taking LOCK_log before
LOCK_thread count. And in fact, it seems unnecessary to block creating or
destroying threads just because we rotate the log.
We can fix the bug as follows:
- Associate a counter with each binlog file.
- Increase and decrease the counter as dump threads start or stop reading a
log.
- In log_in_use, just read the counter instead of iterating through all active
threads in the system.
- Protect the counters with LOCK_index.
- Enforce this lock order:
LOCK_thread_count, THD::LOCK_thd_data, LOCK_log, LOCK_index
It seems that LOCK_index is always held anyways in all places when we need to access the counters, so we are only removing complexity, not adding complexity.
[7 Nov 2011 17:24]
Jon Stephens
Fixed in 5.6.4. Closed.
[7 Nov 2011 18:25]
Paul DuBois
Changelog entry: Deadlock could occur when these four things happened at the same time: 1) An old dump thread was waiting for the binary log to grow. 2) The slave server that replicates from the old dump thread tried to reconnect. During reconnection, the new dump thread tried to kill the old dump thread. 3) A KILL statement tried to kill the old dump thread. 4) An INSERT statement caused a binary log rotation.
[28 Mar 2012 11:08]
Simon Mudd
Even though 5.6 may be out "soon", 5.5 is GA. So how can you close this ticket when you fix the problem in the current "development version", without commenting on the problem that's reported in the GA MySQL 5.5 version? You even mark this as S1: Critical...
[28 Mar 2012 11:30]
MySQL Verification Team
I checked the code, and it really looks fixed in 5.5.19. Patch is visible here: http://lists.mysql.com/commits/141642 And listed here: http://dev.mysql.com/doc/refman/5.5/en/news-5-5-19.html
[29 Mar 2012 5:42]
Simon Mudd
Shane, thanks for checking and confirming that this is fixed in the current GA (5.5) version. That missing info is important and it is good it's now provided in this bug report.

Description: a master server gets deadlocked and no further connections/queries can be made. Mutexes involved: LOCK_thread_count, LOCK_thd_data, mysys_var->current_mutex 5.1.46sp1 stack traces for a sample deadlock: ------------ thread 5148: insert .. causing binlog rotation ------------ ntdll.dll!NtWaitForSingleObject ntdll.dll!RtlpWaitOnCriticalSection ntdll.dll!RtlEnterCriticalSection mysqld-debug.exe!log_in_use Line 213 mysqld-debug.exe!MYSQL_BIN_LOG::purge_logs_before_date Line 3639 mysqld-debug.exe!MYSQL_BIN_LOG::rotate_and_purge Line 4527 mysqld-debug.exe!MYSQL_BIN_LOG::write Line 4413 mysqld-debug.exe!THD::binlog_query Line 3912 mysqld-debug.exe!mysql_insert Line 924 mysqld-debug.exe!mysql_execute_command Line 3198 mysqld-debug.exe!mysql_parse Line 5990 mysqld-debug.exe!dispatch_command Line 1235 mysqld-debug.exe!do_command Line 874 mysqld-debug.exe!handle_one_connection Line 1134 mysqld-debug.exe!pthread_start Line 85 mysqld-debug.exe!_callthreadstart Line 295 mysqld-debug.exe!_threadstart Line 277 kernel32.dll!BaseThreadStart waiting for pthread_mutex_lock(&LOCK_thread_count); thread 2320 owns the mutex. ---------------- thread 2320: reconnecting slave doing COM_BINLOG_DUMP! ------------------ ntdll.dll!NtWaitForSingleObject ntdll.dll!RtlpWaitOnCriticalSection ntdll.dll!RtlEnterCriticalSection mysqld-debug.exe!kill_zombie_dump_threads Line 1103 mysqld-debug.exe!dispatch_command Line 1436 mysqld-debug.exe!do_command Line 874 mysqld-debug.exe!handle_one_connection Line 1134 mysqld-debug.exe!pthread_start Line 85 mysqld-debug.exe!_callthreadstart Line 295 mysqld-debug.exe!_threadstart Line 277 kernel32.dll!BaseThreadStart waiting for pthread_mutex_lock(&tmp->LOCK_thd_data); // Lock from delete thread 5040 holds the mutex. ------------- thread 5040: KILL .... --------------- ntdll.dll!NtWaitForSingleObject ntdll.dll!RtlpWaitOnCriticalSection ntdll.dll!RtlEnterCriticalSection mysqld-debug.exe!THD::awake Line 1117 mysqld-debug.exe!kill_one_thread Line 7039 mysqld-debug.exe!sql_kill Line 7063 mysqld-debug.exe!mysql_execute_command Line 4032 mysqld-debug.exe!mysql_parse Line 5990 mysqld-debug.exe!dispatch_command Line 1273 mysqld-debug.exe!do_command Line 874 mysqld-debug.exe!handle_one_connection Line 1134 mysqld-debug.exe!pthread_start Line 85 mysqld-debug.exe!_callthreadstart Line 295 mysqld-debug.exe!_threadstart Line 277 kernel32.dll!BaseThreadStart waiting for pthread_mutex_lock(mysys_var->current_mutex); thread 5148 holds the mutex. How to repeat: setup working master/slave pair, with --expire-logs-days=1 --max-binlog-size=4096 on master. in 4 threads, execute the following: --- insert into t1 values ('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'); show full processlist; select id from information_schema.processlist order by rand() limit 1 into @a; kill @a; --- the slave thread will keep reconnecting as needed. the binlog will be rotated frequently after4k inserts are reached.