Description:
In our test environment, STOP SLAVE sometimes never completes. Using gdb, we found that the thread executing STOP SLAVE had stopped the IO thread, but the SQL thread was blocked at the following position:
Thread 2 (Thread 0x45520950 (LWP 7248)):
#0 Relay_log_info::is_in_group (this=0x47e00000000) at rpl_rli.h:411
#1 <function called from gdb>
#2 0x00007f821eb71d29 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#3 0x0000000000d01e50 in safe_cond_wait (cond=0xca0e2a0, mp=0xca0dcf8,
file=0xe6afff "log.cc", line=4639) at thr_mutex.c:237
#4 0x00000000008001b4 in MYSQL_BIN_LOG::wait_for_update (this=0xca0dcf0, thd=0xca46158,
is_slave=true) at log.cc:4639
#5 0x00000000008eecef in next_event (rli=0xca0d860) at slave.cc:4215
#6 0x00000000008f35e8 in exec_relay_log_event (thd=0xca46158, rli=0xca0d860) at slave.cc:2242
#7 0x00000000008f435d in handle_slave_sql (arg=0xca0c470) at slave.cc:3023
#8 0x00007f821eb6dfc7 in start_thread () from /lib/libpthread.so.0
#9 0x00007f821d8d25ad in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()
The SQL thread had executed all binlog events read by the IO thread, but it was still waiting for more events. After some debugging, we think the problem may be the following code in the function sql_slave_killed:
if (rli->abort_slave && rli->is_in_group() &&
thd->transaction.all.modified_non_trans_table)
DBUG_RETURN(0);
We found that rli->abort_slave was 1, rli->is_in_group() was true, and thd->transaction.all.modified_non_trans_table was true. So the SQL thread was in the middle of executing a transaction, and this transaction had modified a non-transactional table (this is expected, since we use non-transactional tables). Because stopping in the middle of such a transaction is not safe, the SQL thread decided to continue, hoping to complete the transaction. However, because the IO thread had already been stopped, the SQL thread could not get any more binlog events, so it hangs forever.
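To make the reasoning above concrete, here is a small standalone model of the condition. It is only a sketch, not the server code: SqlThreadState, sql_slave_killed_model and would_hang are hypothetical names, and the model ignores the other kill conditions (such as server shutdown or a killed thread) that the real function also checks.

```cpp
// Simplified model of the check quoted above; NOT the actual MySQL source.
// All type and function names here are hypothetical stand-ins.
struct SqlThreadState {
  bool abort_slave;               // set when STOP SLAVE asks the SQL thread to stop
  bool in_group;                  // stands in for rli->is_in_group()
  bool modified_non_trans_table;  // stands in for
                                  // thd->transaction.all.modified_non_trans_table
};

// Mirrors the quoted condition: report "not killed" when stopping now would
// leave a partially applied change in a non-transactional table.
bool sql_slave_killed_model(const SqlThreadState &s) {
  if (s.abort_slave && s.in_group && s.modified_non_trans_table)
    return false;  // refuse to stop; try to finish the current group first
  return s.abort_slave;
}

// Assumption of this sketch: the SQL thread waits indefinitely when it has
// been asked to stop, refuses to, has no unread relay log events left, and
// the IO thread is no longer running to fetch more.
bool would_hang(const SqlThreadState &s, int unread_relay_events,
                bool io_thread_running) {
  return s.abort_slave && !sql_slave_killed_model(s) &&
         unread_relay_events == 0 && !io_thread_running;
}
```

With the values we saw in gdb (all three flags true, no unread events, IO thread stopped), would_hang returns true in this model. If the IO thread were still running, or if the transaction had not touched a non-transactional table, it returns false.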
How to repeat:
Run replication with a large number of mixed transactions that modify both transactional and non-transactional tables, and issue STOP SLAVE while they are being applied. However, it is hard to reproduce; most of the time STOP SLAVE completes normally.
Suggested fix:
Stop the SQL thread before stopping the IO thread?