Bug #60679 | Provide deadlock free way to dump mysql state during deadlock | ||
---|---|---|---|
Submitted: | 28 Mar 2011 16:34 | Modified: | 25 Dec 2011 19:50 |
Reporter: | Mark Callaghan | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: Locking | Severity: | S4 (Feature request) |
Version: | 5.1.47 | OS: | Any |
Assigned to: | CPU Architecture: | Any | |
Tags: | deadlock, Monitoring |
[28 Mar 2011 16:34]
Mark Callaghan
[28 Mar 2011 18:25]
Davi Arnaut
Should a hang (deadlock) in a single component bring down the entire server?
[28 Mar 2011 19:12]
Mark Callaghan
If the stuck threads have any of the big kernel locks in mysqld (LOCK_open, kernel_mutex, ...) then no more work will get done.
[28 Mar 2011 19:38]
Davi Arnaut
Quite a bit of work can be done without LOCK_open on 5.5, probably even more in the future. Another aspect is the case of user induced deadlocks (GET_LOCK, etc). It's not clear cut. I personally think it would be more interesting to have some kind of watchdog to detect hangs. We could periodically generate a per-thread interruption (e.g. timer signal) whose handler checks if the thread is doing any useful work. If the handler detects that the thread is stuck in some call stack, it can assume the thread is hung and generate a trap [*]. Anyway, I think the core problem is that we don't do a good job at dumping the state of the server. We somewhat fully rely on core files, which has its own set of problems. The server should be able to generate a mini dump while or without crashing the server so that we can have a stacktrace of all threads, which locks were being held, etc. * Trigger a user configurable action.
[28 Mar 2011 21:22]
Mark Callaghan
It is clear cut to me. GET_LOCK is not different than an InnoDB row lock. Both are dependent on users. The innodb watchdog thread is not confused by them and does not crash the server because of them. With respect to "the server should", there is no debugging information that lists lock owners/waiters provided by MySQL. InnoDB has had this for a long time. With respect to the use of core files one must realize that core files are going away at least until someone figures out how to exclude the InnoDB buffer pool from them. Most people do not allow many or any 100G core files in production. There are also platform specific issues on RHEL/CentOS that prevent the generation of core files because of user-switching done by mysql init scripts. InnoDB has deadlock detection that works without getting confused by long user-dependent waits on row locks. It detects hangs on internal mutexes and rw-locks over which users have no control. MySQL should have the same.
[28 Mar 2011 21:26]
Mark Callaghan
For the record I spent ~6 hours debugging http://bugs.mysql.com/bug.php?id=60682. I think MySQL can do more to shorten the debugging effort the next time I look at a deadlock.
[28 Mar 2011 21:32]
Mark Callaghan
These are from unmodified MySQL 5.1.47 Stack for thread stuck doing KILL Thread 2 (Thread 1230346560 (LWP 28363)): #0 0x00000031ef20c888 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0 #1 0x00000031ef208911 in _L_mutex_lock_1133 () from /lib64/libpthread.so.0 #2 0x00000031ef20886a in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x0000000000908301 in safe_mutex_lock (mp=0xebc780, try_lock=0 '\0', file=0x9fa378 "sql_class.cc", line=1154) at thr_mutex.c:149 #4 0x0000000000619ecf in THD::awake (this=0x3e686d0, state_to_set=THD::KILL_QUERY) at sql_class.cc:1154 #5 0x000000000063d7bf in kill_one_thread (thd=0x3eae310, id=3, only_kill_query=true) at sql_parse.cc:7038 #6 0x000000000063d866 in sql_kill (thd=0x3eae310, id=3, only_kill_query=true) at sql_parse.cc:7063 #7 0x0000000000649561 in mysql_execute_command (thd=0x3eae310) at sql_parse.cc:4031 #8 0x000000000064c899 in mysql_parse (thd=0x3eae310, inBuf=0x3e682c0 "KILL QUERY 3", length=12, found_semicolon=0x49558ee0) at sql_parse.cc:5986 #9 0x000000000064e155 in dispatch_command (command=COM_QUERY, thd=0x3eae310, packet=0x3eb0db1 "KILL QUERY 3", packet_length=12) at sql_parse.cc:1233 #10 0x000000000064f5b1 in do_command (thd=0x3eae310) at sql_parse.cc:874 #11 0x000000000063b89b in handle_one_connection (arg=0x3eae310) at sql_connect.cc:1134 #12 0x00000031ef2062f7 in start_thread () from /lib64/libpthread.so.0 #13 0x00000031ee6d1e3d in clone () from /lib64/libc.so.6 stack for thread running DROP TABLE Thread 3 (Thread 1230080320 (LWP 28362)): #0 0x00000031ef20a496 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x0000000000908674 in safe_cond_wait (cond=0x32d7648, mp=0x32d75d0, file=0x2aaaaadefa30 "os/os0sync.c", line=399) at thr_mutex.c:237 #2 0x00002aaaaad50235 in os_event_wait_low (event=0x32d75d0, reset_sig_count=1) at os/os0sync.c:399 #3 0x00002aaaaadb461a in sync_array_wait_event (arr=0x32d6920, index=2) at sync/sync0arr.c:451 #4 0x00002aaaaadb6777 in mutex_spin_wait (mutex=0x32d7588, file_name=0x2aaaaae000e8 "trx/trx0trx.c", line=200) at sync/sync0sync.c:609 #5 0x00002aaaaadcae46 in mutex_enter_func (mutex=0x32d7588, file_name=0x2aaaaae000e8 "trx/trx0trx.c", line=200) at ./include/sync0sync.ic:221 #6 0x00002aaaaadcad04 in trx_allocate_for_mysql () at trx/trx0trx.c:200 #7 0x00002aaaaad1dc14 in innobase_trx_allocate (thd=0x3e7dc50) at handler/ha_innodb.cc:1314 #8 0x00002aaaaad22cae in ha_innodb::delete_table (this=0x3e85370, name=0x49516090 "./test/t1") at handler/ha_innodb.cc:6906 #9 0x00000000007816fd in handler::ha_delete_table (this=0x3e85370, name=0x49516090 "./test/t1") at handler.cc:3373 #10 0x00000000007876f3 in ha_delete_table (thd=0x3e7dc50, table_type=0x32d9eb0, path=0x49516090 "./test/t1", db=0x3e85a60 "test", alias=0x3e8c910 "t1", generate_warning=true) at handler.cc :1996 #11 0x00000000007b27cc in mysql_rm_table_part2 (thd=0x3e7dc50, tables=0x3ec1310, if_exists=false, drop_temporary=false, drop_view=false, dont_log_query=false) at sql_table.cc:2059 #12 0x00000000007b2dc6 in mysql_rm_table (thd=0x3e7dc50, tables=0x3ec1310, if_exists=0 '\0', drop_temporary=0 '\0') at sql_table.cc:1833 #13 0x0000000000647318 in mysql_execute_command (thd=0x3e7dc50) at sql_parse.cc:3400 #14 0x000000000064c899 in mysql_parse (thd=0x3e7dc50, inBuf=0x3e81d40 "drop table t1", length=13, found_semicolon=0x49517ee0) at sql_parse.cc:5986 #15 0x000000000064e155 in dispatch_command (command=COM_QUERY, thd=0x3e7dc50, packet=0x3ea62f1 "drop table t1", packet_length=13) at sql_parse.cc:1233 #16 0x000000000064f5b1 in do_command (thd=0x3e7dc50) at sql_parse.cc:874 #17 0x000000000063b89b in handle_one_connection (arg=0x3e7dc50) at sql_connect.cc:1134 stack for SHOW INNODB STATUS Thread 4 (Thread 1229814080 (LWP 28361)): #0 0x00000031ef20c888 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0 #1 0x00000031ef208911 in _L_mutex_lock_1133 () from /lib64/libpthread.so.0 #2 0x00000031ef20886a in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x0000000000908301 in safe_mutex_lock (mp=0x3e69440, try_lock=0 '\0', file=0x9fa378 "sql_class.cc", line=413) at thr_mutex.c:149 #4 0x000000000061c350 in thd_security_context (thd=0x3e686d0, buffer=0x494d4800 "MySQL thread id 3, query id 17 localhost 127.0.0.1 root Waiting for table", length=1024, max_query_len=600 ) at sql_class.cc:413 #5 0x00002aaaaad21583 in innobase_mysql_print_thd (f=0x3885da0, thd=0x3e686d0, max_query_len=600) at handler/ha_innodb.cc:894 #6 0x00002aaaaadce313 in trx_print (f=0x3885da0, trx=0x3ebd818, max_query_len=600) at trx/trx0trx.c:1747 #7 0x00002aaaaad393cb in lock_print_info_all_transactions (file=0x3885da0) at lock/lock0lock.c:4462 #8 0x00002aaaaadae1a4 in srv_printf_innodb_monitor (file=0x3885da0, nowait=0, trx_start=0x494d5310, trx_end=0x494d5308) at srv/srv0srv.c:1761 #9 0x00002aaaaad1de0d in innodb_show_status (hton=0x32d9eb0, thd=0x3e73190, stat_print=0x77f52a <stat_print>) at handler/ha_innodb.cc:8671 #10 0x00002aaaaad1e104 in innobase_show_status (hton=0x32d9eb0, thd=0x3e73190, stat_print=0x77f52a <stat_print>, stat_type=HA_ENGINE_STATUS) at handler/ha_innodb.cc:8914 #11 0x0000000000780f12 in ha_show_status (thd=0x3e73190, db_type=0x32d9eb0, stat=HA_ENGINE_STATUS) at handler.cc:4448 #12 0x00000000006446a4 in mysql_execute_command (thd=0x3e73190) at sql_parse.cc:2482 #13 0x000000000064c899 in mysql_parse (thd=0x3e73190, inBuf=0x3ec1660 "SHOW INNODB STATUS", length=18, found_semicolon=0x494d6ee0) at sql_parse.cc:5986 #14 0x000000000064e155 in dispatch_command (command=COM_QUERY, thd=0x3e73190, packet=0x3e75c31 "", packet_length=18) at sql_parse.cc:1233 #15 0x000000000064f5b1 in do_command (thd=0x3e73190) at sql_parse.cc:874 #16 0x000000000063b89b in handle_one_connection (arg=0x3e73190) at sql_connect.cc:1134 Stack for SELECT Thread 5 (Thread 1229547840 (LWP 28360)): #0 0x00000031ef20c888 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0 #1 0x00000031ef208911 in _L_mutex_lock_1133 () from /lib64/libpthread.so.0 #2 0x00000031ef20886a in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x0000000000908301 in safe_mutex_lock (mp=0x3e66b88, try_lock=0 '\0', file=0xa5d235 "sql_base.cc", line=2195) at thr_mutex.c:149 #4 0x0000000000694f49 in wait_for_condition (thd=0x3e686d0, mutex=0xebc780, cond=0xebd280) at sql_base.cc:2195 #5 0x000000000069b324 in open_table (thd=0x3e686d0, table_list=0x3ec0ef0, mem_root=0x49493e30, refresh=0x49493ebf, flags=0) at sql_base.cc:2847 #6 0x000000000069c751 in open_tables (thd=0x3e686d0, start=0x49493f30, counter=0x49493f74, flags=0) at sql_base.cc:4605 #7 0x000000000069d0db in open_and_lock_tables_derived (thd=0x3e686d0, tables=0x3ec0ef0, derived=true) at sql_base.cc:5023 #8 0x0000000000650802 in open_and_lock_tables (thd=0x3e686d0, tables=0x3ec0ef0) at mysql_priv.h:1564 #9 0x0000000000642ee3 in execute_sqlcom_select (thd=0x3e686d0, all_tables=0x3ec0ef0) at sql_parse.cc:5032 #10 0x0000000000643dae in mysql_execute_command (thd=0x3e686d0) at sql_parse.cc:2263 #11 0x000000000064c899 in mysql_parse (thd=0x3e686d0, inBuf=0x3e66a90 "select i from t2", length=16, found_semicolon=0x49495ee0) at sql_parse.cc:5986 #12 0x000000000064e155 in dispatch_command (command=COM_QUERY, thd=0x3e686d0, packet=0x3e6b171 "select i from t2", packet_length=16) at sql_parse.cc:1233 #13 0x000000000064f5b1 in do_command (thd=0x3e686d0) at sql_parse.cc:874 #14 0x000000000063b89b in handle_one_connection (arg=0x3e686d0) at sql_connect.cc:1134
[28 Mar 2011 21:35]
Mark Callaghan
There was another problem that shows up in a variation of this deadlock: connection doing row_drop_table_for_mysql -- has dictionary lock -- waiting for kernel_mutex connection doing SELECT from renamed table (without deadlock statement should error) -- has LOCK_open -- waiting for dictionary lock in dict_table_get connection doing SHOW INNODB STATUS -- has kernel_mutex -- waiting for LOCK_thd_data in thd_security_context connection doing KILL tid -- has LOCK_thd_data -- waiting for LOCK_open purge_thread -- waiting in row_mysql_free_data_dictionary pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,rw_lock_s_lock_spin,row_mysql_freeze_data_dictionary_func,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone -- waiting on rw_lock_s_lock_func(&dict_operation_lock, ... pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,dict_table_get,ha_innobase::open_internal,handler::ha_open,open_table_from_share,open_unireg_entry,open_table,open_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone -- waiting on dict->sys_mutex __lll_mutex_lock_wait,_L_mutex_lock_1133,pthread_mutex_lock,thd_security_context,innobase_mysql_print_thd,lock_print_info_all_transactions,srv_printf_innodb_transaction,innodb_show_status,innobase_show_status,ha_show_status,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone __lll_mutex_lock_wait,_L_mutex_lock_1133,pthread_mutex_lock,THD::awake,kill_one_thread,sql_kill,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone
[28 Mar 2011 21:36]
Mark Callaghan
The last 2 updates don't belong in this bug
[28 Mar 2011 22:39]
Davi Arnaut
> It is clear cut to me. GET_LOCK is not different than an InnoDB row lock. Both are > dependent on users. The innodb watchdog thread is not confused by them and > does not crash the server because of them. OK. Perhaps you think that InnoDB's behavior is good. I disagree.
[28 Mar 2011 22:45]
Davi Arnaut
> With respect to the use of core files one must realize that core files are going away at > least until someone figures out how to exclude the InnoDB buffer pool from them. Perhaps Facebook can contribute a patch implementing MAP_NOCORE to the Linux kernel, or augment coredump_filter.
[31 Mar 2011 17:29]
James Day
InnoDB handling being good or bad depends on context and what you need. For places with excellent failover capability a complete server kill and loss of service can be best because it allows a flap-free signal that failover is required. Worth trying to kill all possibly involved threads successively to see if the problem can be cleared that way first, perhaps - generally better to try that first IMO if the server architecture had a way for a storage engine to tell the MySQL layer to kill a thread. It's not perfect but there has been one nice result from the InnoDB handling: almost every trip of the watchdog thread is now a false alarm and it's been modified to make those less likely.* Too dramatic but it was very effective at getting things looked at and fixed. * http://lists.mysql.com/commits/134235 likely to be in next 5.1/5.5/+ builds.
[1 Apr 2011 14:53]
Marc ALFF
Hi Mark. " With respect to "the server should", there is no debugging information that lists lock owners/waiters provided by MySQL. InnoDB has had this for a long time. " I have to disagree here, this is true only for 5.1 Starting with the performance schema in 5.5: - TABLE performance_schema.mutex_instances shows which thread hold a lock on a mutex. - TABLE performance_schema.events_waits_current shows which mutex (rwlock, condition) a thread is waiting for. When critical locks such as LOCK_open / LOCK_thread_count are still free, and even if there are some deadlocks within innob, a query should still be able to connect and execute a select on the performance_schema tables, and give results. This is because very few locks are needed to get to that code. What is probably missing, for cases when the server is badly deadlocked (LOCK_open / LOCK_thread_count), is to possibly dump the same information from a kill -15, to see in a dump file the current lock and waits. Starting with mysql-trunk (5.6), table locks are also instrumented, so waits to get a table lock will also be visible in table performance_schema.events_waits_current. While this data is not complete, and more can be instrumented and exposed, this is a huge step forward already compared to 5.1 Independently of the lock data exposed, detecting dead locks is only a partial solution. A better way is to prevent deadlocks in the first place, if possible, by enforcing less inter dependencies on different locks, and by enforcing a lock free design, with for example a documented and enforced lock order. See for example http://forge.mysql.com/worklog/task.php?id=3262 and in particular option 4. Regards, -- Marc
[1 Apr 2011 15:13]
Mark Callaghan
Marc, Note that only true for 5.1 is only true for <= 5.1 which means a huge majority of MySQL in production. Is P_S.mutex_instances always enabled including when you run benchmarks? If not then it isn't a solution.