MySQL Bugs: #71047: shutdown hang if semisync is enabled and slave abort

Bug #71047	shutdown hang if semisync is enabled and slave abort
Submitted:	2 Dec 2013 4:50	Modified:	29 Aug 2014 16:20
Reporter:	zhai weixiang (OCA)	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	5.7.2, 5.7.3-m13	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Repeat:

Step 1. enable semisync and set a big value for rpl_semi_sync_master_timeout.

Step 2. kill io thread of slave, so the threads on master will keep on WAITING ACK FROM SLAVE.

Step 3.shutdown the master ---hang

More seriously, the  io thread of slave can't connect to master during step 3. And the only way we can do  is to kill the instance.

And the backtrace from my test:

     98 pthread_cond_wait,inline_mysql_cond_wait,Stage_manager::enroll_for,MYSQL_BIN_LOG::change_stage,MYSQL_BIN_LOG::ordered_commit,MYSQL_BIN_LOG::commit,ha_commit_trans,trans_commit_stmt,mysql_execute_command,mysql_parse,dispatch_command,handle_connection,pfs_spawn_thread,start_thread,clone
     18 pthread_cond_wait,os_event::wait,os_event::wait_low,os_aio_simulated_handle,fil_aio_wait,io_handler_thread,start_thread,clone
      3 pthread_cond_wait,os_event::wait,os_event::wait_low,srv_worker_thread,start_thread,clone
      1 sigwait,signal_hand,pfs_spawn_thread,start_thread,clone
      1 pthread_cond_wait,os_event::wait,os_event::wait_low,srv_purge_coordinator_suspend,srv_purge_coordinator_thread,start_thread,clone
      1 pthread_cond_wait,os_event::wait,os_event::wait_low,buf_dump_thread,start_thread,clone
      1 pthread_cond_wait,inline_mysql_cond_wait,mysqld_main,__libc_start_main,_start
      1 pthread_cond_wait,inline_mysql_cond_wait,close_connections,kill_server,kill_server_thread,pfs_spawn_thread,start_thread,clone
      1 pthread_cond_timedwait,os_event::timed_wait,os_event::wait_time_low,srv_monitor_thread,start_thread,clone
      1 pthread_cond_timedwait,os_event::timed_wait,os_event::wait_time_low,srv_error_monitor_thread,start_thread,clone
      1 pthread_cond_timedwait,os_event::timed_wait,os_event::wait_time_low,lock_wait_timeout_thread,start_thread,clone
      1 pthread_cond_timedwait,os_event::timed_wait,os_event::wait_time_low,ib_wqueue_timedwait,fts_optimize_thread,start_thread,clone
      1 pthread_cond_timedwait,os_event::timed_wait,os_event::wait_time_low,dict_stats_thread,start_thread,clone
      1 pthread_cond_timedwait,os_event::timed_wait,os_event::wait_time_low,buf_flush_page_cleaner_thread,start_thread,clone
      1 pthread_cond_timedwait,inline_mysql_cond_timedwait,ReplSemiSyncMaster::cond_timewait,ReplSemiSyncMaster::commitTrx,Binlog_storage_delegate::after_sync,call_after_sync_hook,MYSQL_BIN_LOG::ordered_commit,MYSQL_BIN_LOG::commit,ha_commit_trans,trans_commit_stmt,mysql_execute_command,mysql_parse,dispatch_command,handle_connection,pfs_spawn_thread,start_thread,clone
      1 nanosleep,os_thread_sleep,srv_master_sleep,srv_master_thread,start_thread,clone
      1 __lll_lock_wait,_L_lock_995,pthread_mutex_lock,inline_mysql_mutex_lock,MYSQL_BIN_LOG::change_stage,MYSQL_BIN_LOG::ordered_commit,MYSQL_BIN_LOG::commit,ha_commit_trans,trans_commit_stmt,mysql_execute_command,mysql_parse,dispatch_command,handle_connection,pfs_spawn_thread,start_thread,clone

How to repeat:
easy to repeat.

Suggested fix:
N/A

Hello Weixiang,

Thank you for the bug report and test case.
Verified as described.

Thanks,
Umesh

// 5.7.3-m13 is also affected

Stacktrace from pmp

      1 pthread_join,mysqld_main,__libc_start_main,_start
      1 pthread_cond_wait@@GLIBC_2.3.2,wait,reset_sig_count=1),buf_dump_thread,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,wait,reset_sig_count=16),srv_purge_coordinator_suspend,optimized,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,inline_mysql_cond_wait,at,close_connections,optimized,pfs_spawn_thread,start_thread,clone
      1 pthread_cond_timedwait@@GLIBC_2.3.2,timed_wait,time_in_usec=<value,srv_monitor_thread,start_thread,clone
      1 pthread_cond_timedwait@@GLIBC_2.3.2,timed_wait,time_in_usec=<value,srv_error_monitor_thread,start_thread,clone
      1 pthread_cond_timedwait@@GLIBC_2.3.2,timed_wait,time_in_usec=<value,page_cleaner_sleep_if_needed,optimized,start_thread,clone
      1 pthread_cond_timedwait@@GLIBC_2.3.2,timed_wait,time_in_usec=<value,lock_wait_timeout_thread,start_thread,clone
      1 pthread_cond_timedwait@@GLIBC_2.3.2,timed_wait,time_in_usec=<value,ib_wqueue_timedwait,fts_optimize_thread,start_thread,clone
      1 pthread_cond_timedwait@@GLIBC_2.3.2,timed_wait,time_in_usec=<value,dict_stats_thread,start_thread,clone
      1 pthread_cond_timedwait@@GLIBC_2.3.2,inline_mysql_cond_timedwait,wait_time=0x7fe5411cb490),ReplSemiSyncMaster::commitTrx,Binlog_storage_delegate::after_sync,call_after_sync_hook,MYSQL_BIN_LOG::ordered_commit,MYSQL_BIN_LOG::commit,ha_commit_trans,trans_commit_stmt,mysql_execute_command,mysql_parse,dispatch_command,handle_connection,pfs_spawn_thread,start_thread,clone
      1 ??,os_aio_linux_collect,message1=0x7fe546218e58,,fil_aio_wait,io_handler_thread,start_thread,clone
      1 ??,os_aio_linux_collect,message1=0x7fe545a17e58,,fil_aio_wait,io_handler_thread,start_thread,clone
      1 ??,os_aio_linux_collect,message1=0x7fe545216e58,,fil_aio_wait,io_handler_thread,start_thread,clone
      1 ??,os_aio_linux_collect,message1=0x7fe544a15e58,,fil_aio_wait,io_handler_thread,start_thread,clone
      1 ??,os_aio_linux_collect,message1=0x7fe544214e58,,fil_aio_wait,io_handler_thread,start_thread,clone
      1 ??,os_aio_linux_collect,message1=0x7fe543a13e58,,fil_aio_wait,io_handler_thread,start_thread,clone
      1 ??,os_aio_linux_collect,message1=0x7fe543212e58,,fil_aio_wait,io_handler_thread,start_thread,clone
      1 ??,os_aio_linux_collect,message1=0x7fe542a11e58,,fil_aio_wait,io_handler_thread,start_thread,clone
      1 ??,os_aio_linux_collect,message1=0x7fe542210e58,,fil_aio_wait,io_handler_thread,start_thread,clone
      1 ??,os_aio_linux_collect,message1=0x7fe541a0fe58,,fil_aio_wait,io_handler_thread,start_thread,clone
      1 nanosleep,os_thread_sleep,srv_master_sleep,optimized,start_thread,clone

Posted by developer:
 
Weixiang,

Thanks for reporting the bug.

Would you please tell us if 'rpl_semi_sync_master_wait_no_slave' is ON? By default it is on,
it means semisync master should wait for acknowledgement even there is no semisync slave.

If it is ON, then please switch the variable to OFF and try again to see if the problem still there.

Yes, I am using default value of rpl_semi_sync_master_wait_no_slave. 
And  everything goes well if I turned off this option.

Posted by developer:
 
Hi Luis,

Unfortunately, listen socket is already closed, the server cannot accept any connection since then.
So there is no way to replicate the left events to slave.
However, for the case that there is a dump thread. it will wait for all events to be replicated and then shut down.
This fix doesn't impact the case.

The following was added to the 5.6.21 and 5.7.5 changelog with commit 4097:
On a master that is using semisynchronous replication, where rpl_semi_sync_master_wait_no_slave is enabled and rpl_semi_sync_master_timeout is set to long timeout, killing the I/O thread could cause the server to hang on shutdown. This fix ensures that if the dump thread finds that there no semisynchronous slaves connected to the master, the setting of rpl_semi_sync_master_wait_no_slave is ignored and the shutdown proceeds correctly.

revno: 6112
committer: Sujatha Sivakumar <sujatha.sivakumar@oracle.com>
branch nick: Bug17879675_mysql-5.6
timestamp: Mon 2014-08-11 12:06:08 +0530
message:
  Bug#17879675:SHUTDOWN HANG IF SEMISYNC IS ENABLED AND SLAVE
  ABORT
  
  Problem:
  ========
  Step 1. enable semisync and set a big value for
  rpl_semi_sync_master_timeout.
  
  Step 2. kill io thread of slave, so the threads on master
  will keep on WAITING ACK FROM SLAVE.
  
  Step 3. shutdown the master ---hang
  
  Analysis:
  ========
  On master when semisync is enabled the transaction thread
  will wait for an ack from the slave in a conditional timed
  wait. Since the IO thread is already killed and on master
  a huge timeout value is set transaction thread will continue
  to wait for an ack for a long time. At this time if shutdown
  command is issued dump thread will try to go down. But dump
  thread when goes down it has a check if user has enabled
  rpl_semi_sync_master_wait_no_slave or not. If it is enabled
  dump thread will go down without switching off semi sync.
  Transaction thread will continue to wait for an ack even
  when slave count drops to zero. Because of this shutdown
  also will wait for the transaction thread to go down and
  shutdown hangs.
  
  Fix:
  ===
  At the time of shutdown if dump thread finds that there no
  semi sync slaves that are connected to master it will switch
  off semisync even when rpl_semi_sync_master_wait_no_slave
  is set to on. i.e if shutdown finds no slaves are there to
  provide ack it will not honour
  rpl_semi_sync_master_wait_no_slave and it will switch off
  semisync. The transaction thread when woken up realises the
  semi sync is switched off it will not wait any more for ack
  it will continue and shutdown will be complete. When semi
  sync is forcefully shutdown at the time of waiting for an
  ack a warning is written to error log to convey to users
  that it was forceful shutdown.
  
  There can be a scenario where dump thread is already gone
  and transaction thread still waits for an ack. Hence a
  similar fix is implemented for the waiting transaction
  thread. I.e at the time of shutdown the thread gets a
  wakeup signal and it checks if shutdown is in progress
  if no slaves are connected then it will switch off the
  semi sync.

Posted by developer:
 
Reopened by merge, no new changelog entries needed. Set back to Closed.