Bug #63203 Possible deadlock due to lock order violation
Submitted: 11 Nov 2011 10:20 Modified: 30 Dec 2013 4:41
Reporter: Laurynas Biveinis (OCA) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Locking Severity:S2 (Serious)
Version:5.1.61, 5.5 OS:Any
Assigned to: CPU Architecture:Any

[11 Nov 2011 10:20] Laurynas Biveinis
Description:
Function show_status_array in sql_show.cc acquires the LOCK_global_system_variables lock.  This function is called by fill_variables, which acquires the LOCK_system_variables_hash lock around the call.

But this lock order is inversed from the previous lock order which is established by THD::init() called from plugin_init.  It acquires the LOCK_global_system_variables first thing in the method, then calls plugin_thdvar_init which in turn calls cleanup_variables which acquires the LOCK_system_variables_hash lock.  Which is a reverse order of the above.

This lock order violation is confirmed (or, actually found) by helgrind:
==27235== Thread #12: lock order "0xF14460 before 0xF141E0" violated
==27235==    at 0x4C2A1CB: pthread_mutex_lock (hg_intercepts.c:496)
==27235==    by 0x74A32E: show_status_array(THD*, char const*, st_mysql_show_var*, enum_var_type, system_status_var*, char const*, st_table*, bool, Item*) (sql_show.cc:2266)
==27235==    by 0x74A894: fill_variables(THD*, TABLE_LIST*, Item*) (sql_show.cc:5522)
==27235==    by 0x74E4DF: get_schema_tables_result(JOIN*, enum_schema_table_state) (sql_show.cc:6238)
==27235==    by 0x6A5E0C: JOIN::exec() (sql_select.cc:1863)
==27235==    by 0x6A7D72: mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_select_lex_unit*, st_select_lex*) (sql_select.cc:2553)
==27235==    by 0x6A87FC: handle_select(THD*, st_lex*, select_result*, unsigned long) (sql_select.cc:269)
==27235==    by 0x63A3E3: execute_sqlcom_select(THD*, TABLE_LIST*) (sql_parse.cc:5179)
==27235==    by 0x6435EA: mysql_execute_command(THD*) (sql_parse.cc:2309)
==27235==    by 0x64740B: _Z11mysql_parseP3THDPcjPPKc.part.127 (sql_parse.cc:6096)
==27235==    by 0x648F6C: dispatch_command(enum_server_command, THD*, char*, unsigned int) (sql_parse.cc:6045)
==27235==    by 0x6491E3: do_command(THD*) (sql_parse.cc:891)
==27235==   Required order was established by acquisition of lock at 0xF14460
==27235==    at 0x4C2A1CB: pthread_mutex_lock (hg_intercepts.c:496)
==27235==    by 0x61F605: THD::init() (sql_class.cc:841)
==27235==    by 0x6204D0: THD::THD() (sql_class.cc:707)
==27235==    by 0x7A503E: plugin_init(int*, char**, int) (sql_plugin.cc:1395)
==27235==    by 0x632436: init_server_components() (mysqld.cc:4035)
==27235==    by 0x561388: main (mysqld.cc:4504)
==27235==   followed by a later acquisition of lock at 0xF141E0
==27235==    at 0x4C2AFD5: pthread_rwlock_rdlock (hg_intercepts.c:1447)
==27235==    by 0x7A2EDA: cleanup_variables(THD*, system_variables*) (sql_plugin.cc:2566)
==27235==    by 0x7A6667: plugin_thdvar_init(THD*) (sql_plugin.cc:2523)
==27235==    by 0x61F60D: THD::init() (sql_class.cc:842)
==27235==    by 0x6204D0: THD::THD() (sql_class.cc:707)
==27235==    by 0x7A503E: plugin_init(int*, char**, int) (sql_plugin.cc:1395)
==27235==    by 0x632436: init_server_components() (mysqld.cc:4035)
==27235==    by 0x561388: main (mysqld.cc:4504)

How to repeat:
To repeat with helgrind it is enough to start server, no special my.cnf options.  Shutdown server.

Suggested fix:
Choose the correct order and stick to it.
[11 Nov 2011 10:27] Laurynas Biveinis
Deadlock seems possible if a new THD needs to be created for an incoming connection (i.e. thread cache full) at the same time as INFORMATION_SCHEMA.GLOBAL_VARIABLES is queried by another thread.
[15 Nov 2011 7:23] Valeriy Kravchuk
Verified just as described using Helgrind:

==2432== Thread #13: lock order "0x11AA1C8 before 0x119A6A0" violated
==2432==    at 0x4A06FCC: pthread_mutex_lock (hg_intercepts.c:464)
==2432==    by 0xB345B3: safe_mutex_lock (thr_mutex.c:151)
==2432==    by 0x8C9F0C: plugin_status(st_mysql_lex_string const*, int) (sql_plugin.cc:623)
==2432==    by 0x8CA019: sys_var_have_plugin::get_option() (sql_plugin.cc:647)
==2432==    by 0x6E7AEB: sys_var_have_option::value_ptr(THD*, enum_var_type, st_mysql_lex_string*) (set_var.h:1068)
==2432==    by 0x846995: show_status_array(THD*, char const*, st_mysql_show_var*, enum_var_type, system_status_var*, char const*, st_table*, bool, Item*) (sql_show.cc:2272)
==2432==    by 0x853431: fill_variables(THD*, TABLE_LIST*, Item*) (sql_show.cc:5522)
==2432==    by 0x855B5F: get_schema_tables_result(JOIN*, enum_schema_table_state) (sql_show.cc:6238)
==2432==    by 0x739AE9: JOIN::exec() (sql_select.cc:1863)
==2432==    by 0x73C313: mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_select_lex_unit*, st_select_lex*) (sql_select.cc:2553)
==2432==    by 0x733D2B: handle_select(THD*, st_lex*, select_result*, unsigned long) (sql_select.cc:269)
==2432==    by 0x6D2CEB: execute_sqlcom_select(THD*, TABLE_LIST*) (sql_parse.cc:5179)
==2432==   Required order was established by acquisition of lock at 0x11AA1C8
==2432==    at 0x4A06FCC: pthread_mutex_lock (hg_intercepts.c:464)
==2432==    by 0xB345B3: safe_mutex_lock (thr_mutex.c:151)
==2432==    by 0x8CF5EB: plugin_thdvar_cleanup(THD*) (sql_plugin.cc:2603)
==2432==    by 0x69C768: THD::~THD() (sql_class.cc:1018)
==2432==    by 0x69C9EB: THD::~THD() (sql_class.cc:1042)
==2432==    by 0x8CC5AE: plugin_load(st_mem_root*, int*, char**) (sql_plugin.cc:1463)
==2432==    by 0x8CBD58: plugin_init(int*, char**, int) (sql_plugin.cc:1256)
==2432==    by 0x6B9B07: init_server_components() (mysqld.cc:4035)
==2432==    by 0x6BA5DC: main (mysqld.cc:4504)
==2432==   followed by a later acquisition of lock at 0x119A6A0
==2432==    at 0x4A07D40: pthread_rwlock_rdlock (hg_intercepts.c:1414)
==2432==    by 0x8CF3D4: cleanup_variables(THD*, system_variables*) (sql_plugin.cc:2566)
==2432==    by 0x8CF61F: plugin_thdvar_cleanup(THD*) (sql_plugin.cc:2606)
==2432==    by 0x69C768: THD::~THD() (sql_class.cc:1018)
==2432==    by 0x69C9EB: THD::~THD() (sql_class.cc:1042)
==2432==    by 0x8CC5AE: plugin_load(st_mem_root*, int*, char**) (sql_plugin.cc:1463)
==2432==    by 0x8CBD58: plugin_init(int*, char**, int) (sql_plugin.cc:1256)
==2432==    by 0x6B9B07: init_server_components() (mysqld.cc:4035)
==2432==    by 0x6BA5DC: main (mysqld.cc:4504)

One has to connect to server and execute SHOW VBARIABLES to identify the problem.
[15 Nov 2011 8:27] Laurynas Biveinis
Valeriy -

I believe the helgrind stack trace in your comment does not exactly confirm the bug I reported but rather is another related bug.

My report concerns with LOCK_global_system_variables and LOCK_system_variables_hash.

Your stacktrace shows error with LOCK_system_variables_hash and LOCK_plugin.
[2 Jan 2012 10:15] Laurynas Biveinis
Also present on 5.5:

==7196== Thread #17: lock order "0xF690E0 before 0xF68D60" violated
==7196==    at 0x4C2A1CB: pthread_mutex_lock (hg_intercepts.c:496)
==7196==    by 0x5C6592: show_status_array(THD*, char const*, st_mysql_show_var*, enum_var_type, system_status_var*, char const*, TABLE*, bool, Item*) (mysql_thread.h:617)
==7196==    by 0x5C6C23: fill_variables(THD*, TABLE_LIST*, Item*) (sql_show.cc:6724)
==7196==    by 0x5CD005: get_schema_tables_result(JOIN*, enum_schema_table_state) (sql_show.cc:7420)
==7196==    by 0x5B579C: JOIN::exec() (sql_select.cc:1927)
==7196==    by 0x5B73E9: mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_select_lex_unit*, st_select_lex*) (sql_select.cc:2600)
==7196==    by 0x5B7EB3: handle_select(THD*, LEX*, select_result*, unsigned long) (sql_select.cc:312)
==7196==    by 0x57530A: execute_sqlcom_select(THD*, TABLE_LIST*) (sql_parse.cc:4735)
==7196==    by 0x57BB1D: mysql_execute_command(THD*) (sql_parse.cc:2279)
==7196==    by 0x716B7B: sp_instr_stmt::exec_core(THD*, unsigned int*) (sp_head.cc:3191)
==7196==    by 0x71C4E2: sp_lex_keeper::reset_lex_and_exec_core(THD*, unsigned int*, bool, sp_instr*) (sp_head.cc:2988)
==7196==    by 0x71C95C: sp_instr_stmt::execute(THD*, unsigned int*) (sp_head.cc:3123)
==7196==   Required order was established by acquisition of lock at 0xF690E0
==7196==    at 0x4C2A1CB: pthread_mutex_lock (hg_intercepts.c:496)
==7196==    by 0x558323: THD::init() (mysql_thread.h:617)
==7196==    by 0x55A2AC: THD::THD() (sql_class.cc:1022)
==7196==    by 0x586F0F: plugin_load(st_mem_root*, int*, char**) (sql_plugin.cc:1443)
==7196==    by 0x589D60: plugin_init(int*, char**, int) (sql_plugin.cc:1356)
==7196==    by 0x5136B4: init_server_components() (mysqld.cc:4009)
==7196==    by 0x517902: mysqld_main(int, char**) (mysqld.cc:4683)
==7196==    by 0x605830C: (below main) (libc-start.c:226)
==7196==   followed by a later acquisition of lock at 0xF68D60
==7196==    at 0x4C2AFD5: pthread_rwlock_rdlock (hg_intercepts.c:1447)
==7196==    by 0x587EDA: cleanup_variables(THD*, system_variables*) (mysql_thread.h:758)
==7196==    by 0x58804B: plugin_thdvar_init(THD*) (sql_plugin.cc:2641)
==7196==    by 0x55833D: THD::init() (sql_class.cc:1342)
==7196==    by 0x55A2AC: THD::THD() (sql_class.cc:1022)
==7196==    by 0x586F0F: plugin_load(st_mem_root*, int*, char**) (sql_plugin.cc:1443)
==7196==    by 0x589D60: plugin_init(int*, char**, int) (sql_plugin.cc:1356)
==7196==    by 0x5136B4: init_server_components() (mysqld.cc:4009)
==7196==    by 0x517902: mysqld_main(int, char**) (mysqld.cc:4683)
==7196==    by 0x605830C: (below main) (libc-start.c:226)
[10 Sep 2012 5:32] Laurynas Biveinis
Any update on this bug? The bug status still readsĀ "needs triage". Has it not been triaged yet?

Thanks.
[10 Jan 2013 14:44] Erlend Dahl
5.5.19 removed LOCK_system_variables_hash in cleanup_variables function, hence marking as closed.
[10 Jan 2013 14:45] Erlend Dahl
See bug#56652
[30 Dec 2013 4:41] Laurynas Biveinis
See bug 71236.  It appears to match the original Helgrind error reported here, thus the original bug appears to be still present.  Re-closing as duplicate of 71236 (which has a reproducible testcase for all versions).