Bug #47711 infinite hang and 100% cpu usage with concurrent engine=merge alters
Submitted: 29 Sep 2009 12:20 Modified: 29 Sep 2009 12:58
Reporter: Shane Bester (Platinum Quality Contributor) Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Merge storage engine Severity:S1 (Critical)
Version:5.0.85, 5.1.39 OS:Any
Assigned to: CPU Architecture:Any

[29 Sep 2009 12:20] Shane Bester
Description:
concurrently altering a table and making it merge engine leads to thread hanging and consuming 100% cpu usage;

 Query   |   63 | After create | alter table `t1` engine=merge |
 Query   |   62 | After create | alter table `t1` engine=merge |

How to repeat:
delimiter ;

flush tables;
drop table if exists t1;
create table `t1` (`c1` int) engine=myisam;
drop procedure if exists p1;
delimiter $
create procedure p1(num int)
begin
 declare i int default '0';
 repeat
  alter table `t1` engine=merge;
  select i as cnt;
  set i:=i+1;
 until i>num end repeat;
end $

delimiter ;

call p1(10000); #call this in two connections concurrently.
[29 Sep 2009 12:26] MySQL Verification Team
thread stacks of the 1 hung and 1 looping thread.

Attachment: bug47711_thread_stacks.txt (text/plain), 6.09 KiB.

[29 Sep 2009 12:58] MySQL Verification Team
Thank you for the bug report. Verified as described on Ubuntu 64-bit.
[21 Oct 2009 15:23] Martin Dimitrov
I was also able to reproduce this bug, and tried to do some debugging as well.

I printed out a trace, showing that the two threads are stuck in an infinite loop in function remove_table_from_cache(). 

Thread 1
9933972 T@-1264641: | | | | | | | | | | | | >hash_first
9933973 T@-1264641: | | | | | | | | | | | | | exit: found key at 6
9933974 T@-1264641: | | | | | | | | | | | | <hash_first
9933975 T@-1264641: | | | | | | | | | | | | info: Found another active instance of the table
9933976 T@-1264641: | | | | | | | | | | | | >mysql_lock_abort_for_thread
9933977 T@-1264641: | | | | | | | | | | | | | >get_lock_data
9933978 T@-1264641: | | | | | | | | | | | | | | >my_malloc
9933979 T@-1264641: | | | | | | | | | | | | | | | my: size: 20  my_flags: 0
9933980 T@-1264641: | | | | | | | | | | | | | | | exit: ptr: 0x9c0fa48
9933981 T@-1264641: | | | | | | | | | | | | | | <my_malloc
9933982 T@-1264641: | | | | | | | | | | | | | <get_lock_data
9933983 T@-1264641: | | | | | | | | | | | | | >my_free
9933984 T@-1264641: | | | | | | | | | | | | | | my: ptr: 0x9c0fa48
9933985 T@-1264641: | | | | | | | | | | | | | <my_free
9933986 T@-1264641: | | | | | | | | | | | | <mysql_lock_abort_for_thread
9933987 T@-1264641: | | | | | | | | | | | | info: Waiting for refresh signal

Thread 2
9933988 T@-1264837: | | | | | | | | | | | | >hash_first
9933989 T@-1264837: | | | | | | | | | | | | | exit: found key at 6
9933990 T@-1264837: | | | | | | | | | | | | <hash_first
9933991 T@-1264837: | | | | | | | | | | | | info: Found another active instance of the table
9933992 T@-1264837: | | | | | | | | | | | | >mysql_lock_abort_for_thread
9933993 T@-1264837: | | | | | | | | | | | | | >get_lock_data
9933994 T@-1264837: | | | | | | | | | | | | | | >my_malloc
9933995 T@-1264837: | | | | | | | | | | | | | | | my: size: 20  my_flags: 0
9933996 T@-1264837: | | | | | | | | | | | | | | | exit: ptr: 0x9c0fa48
9933997 T@-1264837: | | | | | | | | | | | | | | <my_malloc
9933998 T@-1264837: | | | | | | | | | | | | | <get_lock_data
9933999 T@-1264837: | | | | | | | | | | | | | >my_free
9934000 T@-1264837: | | | | | | | | | | | | | | my: ptr: 0x9c0fa48
9934001 T@-1264837: | | | | | | | | | | | | | <my_free
9934002 T@-1264837: | | | | | | | | | | | | <mysql_lock_abort_for_thread
9934003 T@-1264837: | | | | | | | | | | | | info: Waiting for refresh signal
 
And this repeats into infinity... 

In the PREREQUISITES for remove_table_from_cache(), it is mentioned that the thread must own LOCK_open upon entering this function. Thus I became curious how come both threads are stuck in this function if they need the mutex LOCK_open. Interestingly, around line sql_base.cc:6197 we release the LOCK_open lock, thus allowing the other thread to advance (at this time the other thread was waiting for LOCK_open in function mysql_create_table(), see my stack trace below).  

6196           set_timespec(abstime, 10);
6197           pthread_cond_timedwait(&COND_refresh, &LOCK_open, &abstime);

This was confirmed using GDB, which shows that both threads alternate holding LOCK_open, each one releasing it after line 6197 and allowing the other one to grab it. 

Hardware watchpoint 7: LOCK_open->thread

Old value = 3031100304
New value = 3030903696
safe_cond_timedwait (cond=0x8665160, mp=0x8664c80, abstime=0xb4a7b0f0, file=0x84d4700 "sql_base.cc", line=6197) at thr_mutex.c:284

-------------
stack traces: 

[Switching to Thread -1265439856 (LWP 31561)]

Breakpoint 1, remove_table_from_cache (thd=0xa4f4140, db=0xa4c3360 "test", table_name=0xa4c31d0 "t1", flags=2) at sql_base.cc:6139
6139              DBUG_PRINT("info", ("Found another active instance of the table"));
(gdb) bt
#0  remove_table_from_cache (thd=0xa4f4140, db=0xa4c3360 "test", table_name=0xa4c31d0 "t1", flags=2) at sql_base.cc:6139
#1  0x081dc5fe in wait_while_table_is_used (thd=0xa4f4140, table=0xa4c49f0, function=HA_EXTRA_PREPARE_FOR_DELETE) at sql_table.cc:1969
#2  0x081dcc11 in close_cached_table (thd=0xa4f4140, table=0xa4c49f0) at sql_table.cc:1994
#3  0x081e4759 in mysql_alter_table (thd=0xa4f4140, new_db=0xa510f80 "test", new_name=0xa5121a8 "t1", create_info=0xb492c85c, table_list=0xa5121d0, alter_info=0xb492c8dc, order_num=0, order=0x0,
    ignore=false) at sql_table.cc:3985
#4  0x0810547a in mysql_execute_command (thd=0xa4f4140) at sql_parse.cc:3547
#5  0x0823b601 in sp_instr_stmt::exec_core (this=0xa512370, thd=0xa4f4140, nextp=0xb492ce10) at sp_head.cc:2698
#6  0x0823b7d4 in sp_lex_keeper::reset_lex_and_exec_core (this=0xa512398, thd=0xa4f4140, nextp=0xb492ce10, open_tables=false, instr=0xa512370) at sp_head.cc:2543
#7  0x08242b7a in sp_instr_stmt::execute (this=0xa512370, thd=0xa4f4140, nextp=0xb492ce10) at sp_head.cc:2651
#8  0x08240210 in sp_head::execute (this=0xa510ca0, thd=0xa4f4140) at sp_head.cc:1154
#9  0x08240b74 in sp_head::execute_procedure (this=0xa510ca0, thd=0xa4f4140, args=0xa4f533c) at sp_head.cc:1815
#10 0x08106387 in mysql_execute_command (thd=0xa4f4140) at sql_parse.cc:4949
#11 0x0810832a in mysql_parse (thd=0xa4f4140, inBuf=0xa4fd658 "call p1(10)", length=11, found_semicolon=0xb492dfa8) at sql_parse.cc:6437
#12 0x08108961 in dispatch_command (command=COM_QUERY, thd=0xa4f4140, packet=0xa4f5629 "call p1(10)", packet_length=12) at sql_parse.cc:1959
#13 0x0810aa0d in handle_one_connection (arg=0xa4f4140) at sql_parse.cc:1640
#14 0x083d76da in start_thread ()
#15 0x0841418e in clone ()

(gdb) info threads
* 13 Thread -1265439856 (LWP 31561)  remove_table_from_cache (thd=0xa4f4140, db=0xa4c3360 "test", table_name=0xa4c31d0 "t1", flags=2) at sql_base.cc:6139
  12 Thread -1265243248 (LWP 31559)  0x083dae24 in __lll_lock_wait ()

(gdb) thread 12
[Switching to thread 12 (Thread -1265243248 (LWP 31559))]#0  0x083dae24 in __lll_lock_wait ()
(gdb) bt
#0  0x083dae24 in __lll_lock_wait ()
#1  0x083d8ef5 in _L_lock_836 ()
#2  0x083d8e36 in pthread_mutex_lock ()
#3  0x083adb65 in safe_mutex_lock (mp=0x8664c80, try_lock=0 '\0', file=0x84eb5c0 "sql_table.cc", line=1739) at thr_mutex.c:149
#4  0x081e07ad in mysql_create_table (thd=0xa4b9b50, db=0xa503028 "test", table_name=0xb495beec "#sql-7b38_1", create_info=0xb495c85c, alter_info=0xb495bf3c, internal_tmp_table=true,
    select_field_count=0) at sql_table.cc:1739
#5  0x081e40c6 in mysql_alter_table (thd=0xa4b9b50, new_db=0xa503028 "test", new_name=0xa504250 "t1", create_info=0xb495c85c, table_list=0xa504278, alter_info=0xb495c8dc, order_num=0, order=0x0,
    ignore=false) at sql_table.cc:3786
#6  0x0810547a in mysql_execute_command (thd=0xa4b9b50) at sql_parse.cc:3547
#7  0x0823b601 in sp_instr_stmt::exec_core (this=0xa504418, thd=0xa4b9b50, nextp=0xb495ce10) at sp_head.cc:2698
#8  0x0823b7d4 in sp_lex_keeper::reset_lex_and_exec_core (this=0xa504440, thd=0xa4b9b50, nextp=0xb495ce10, open_tables=false, instr=0xa504418) at sp_head.cc:2543
#9  0x08242b7a in sp_instr_stmt::execute (this=0xa504418, thd=0xa4b9b50, nextp=0xb495ce10) at sp_head.cc:2651
#10 0x08240210 in sp_head::execute (this=0xa502d48, thd=0xa4b9b50) at sp_head.cc:1154
#11 0x08240b74 in sp_head::execute_procedure (this=0xa502d48, thd=0xa4b9b50, args=0xa4bad4c) at sp_head.cc:1815
#12 0x08106387 in mysql_execute_command (thd=0xa4b9b50) at sql_parse.cc:4949
#13 0x0810832a in mysql_parse (thd=0xa4b9b50, inBuf=0xa4f2138 "call p1(1000)", length=13, found_semicolon=0xb495dfa8) at sql_parse.cc:6437
#14 0x08108961 in dispatch_command (command=COM_QUERY, thd=0xa4b9b50, packet=0xa4ea109 "", packet_length=14) at sql_parse.cc:1959
#15 0x0810aa0d in handle_one_connection (arg=0xa4b9b50) at sql_parse.cc:1640
#16 0x083d76da in start_thread ()
#17 0x0841418e in clone ()

---------------------

I hope this additional information helps. I am not very familiar with this code, and I still cannot understand the rest of this problem.