MySQL Bugs: #89247: Deadlock with MTS when slave_preserve_commit

Bug #89247	Deadlock with MTS when slave_preserve_commit_order = ON.
Submitted:	15 Jan 2018 14:39	Modified:	1 Sep 2021 13:10
Reporter:	Jean-François Gagné	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	8.0.3, 5.7.20	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Hi,

I have a "deadlock" with MTS (slave_parallel_workers = 32) when running with slave_preserve_commit_order = ON.  If I run the same binlog (after restoring a backup) with slave_preserve_commit_order = OFF, everything works fine.  The master is a 8.0.3 Intermediate Master that is using Write Set to identify parallelism (transaction_write_set_extraction = XXHASH64 and binlog_transaction_dependency_tracking = WRITESET).

I am writing deadlock above because when the slave blocks and if I try to run "STOP SLAVE", it hangs (for at least 20 minutes after which I lost patience).  I tried killing the "Waiting" threads from "SHOW PROCESSLIST" but it did not help.  I then tried killing the coordinator thread (Waiting for workers to exit) and the IO THREAD (Waiting for the slave SQL thread to free enough relay log space) but it did not help.  I had to kill -9 mysqld.

I have the following in the error logs:

[...]
2018-01-15T12:45:50.690505Z 12365 [Note] [004182] Multi-threaded slave statistics for channel '': seconds elapsed = 142; events assigned = 83579905; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 1663110876300 waited (count) when Workers occupied = 131673 waited when Workers occupied = 226387714600
2018-01-15T12:46:10.695187Z 12397 [ERROR] [000000] Slave SQL for channel '': worker thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable. Error_code: 0
2018-01-15T12:46:10.695320Z 12365 [ERROR] [001756] Slave SQL for channel '': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756
2018-01-15T12:46:10.695361Z 12365 [Note] [004210] Slave SQL thread for channel '' exiting, replication stopped in log 'binlog.000338' at position 40421875
2018-01-15T12:46:10.698774Z 12396 [ERROR] [000000] Slave SQL for channel '': worker thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable. Error_code: 0
[...]

And the last line is repeated 4 more times.

Also in the error log, I then have many lines like below for Workers 9, 7, 5, 20, 22, 23, 11, 2, 15, 18, 22, 20, 5, 18, 2, 11, 9, 23, 7 15, 22, 5, 11, ...  Each worker has a different binlog position.

2018-01-15T12:47:01.374221Z 12374 [Warning] [001205] Slave SQL for channel '': Worker N failed executing transaction 'ANONYMOUS' at master log binlog.000338, end_log_pos 40429704; Could not execute Delete_rows event on table schema.table; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log binlog.000338, end_log_pos 40429704, Error_code: 1205

Stil in the error log, I then get another line about "retried transaction 10 time(s) in vain":

2018-01-15T12:48:43.386433Z 12367 [ERROR] [001205] Slave SQL for channel '': worker thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable. Error_code: 1205

And more lines about "Worker N failed" with HA_ERR_LOCK_WAIT_TIMEOUT, always in the error log.

I have the following in "SHOW SLAVE STATUS": see sss.txt.  Notice the error message but Slave_SQL_Running still to Yes.

I have the following in "SHOW FULL PROCESSLIST": see sfp.txt.  Notice that we do not have the 32 workers in the SHOW PROCESSLIST.  It looks that some threads finished but others are stuck.

I do not have a test case to share, this is coming from production data, but it happens systematically at the same binlog position (I am at my 4th reproduction of the problem after restoring a backup).

This looks very similar to Bug#86079.

Thanks for looking into that,

JFG

How to repeat:
No simple test case yet.

SHOW SLAVE STATUS

Attachment: sss.txt (text/plain), 2.80 KiB.

SHOW FULL PROCESSLIST

Attachment: sfp.txt (text/plain), 7.19 KiB.

I reproduced the problem on 5.7.20 (replicating from 8.0.3).

The error message is saying: "Consider raising the value of the slave_transaction_retries variable", so I tried.  With 32 workers and slave_transaction_retries = 50, there is no deadlock.  However, the fact that there is a deadlock (no possibility to stop/start the slave) with 10 retries should still be addressed.

Hi,
I managed to reproduce this one but it's not easy on slow system. Looks like the faster your system is the easier this is to reproduce. I had to lower the retry number to reproduce faster, with higher retry number I can't reproduce.

Verifying it
thanks for the report
Bogdan

I don't really believe this should be S2 since increasing retry number works around it but I'll leave it as is and let replication team decide :)

all best
bogdan

Great that you were able to reproduce Bogdan, I have to say that I was not expecting this to be verified so quickly: you rock !

About S2, because the way to get out of this is a "kill -9", I thought it was deserving more than S3.

Thanks for you work on this,

JFG

Hi,

I reproduced it rather quickly and easily (second try), maybe I was just lucky before the long weekend :D .. let's see how will it go again on Tuesday :) 

all best
Bogdan

Verified on 5.7.22

in my.cnf: 

slave_parallel_type = LOGICAL_CLOCK
slave_parallel_workers = 16
slave_preserve_commit_order = 1

I encountered this problem in MGR of version 5.7.22, and mysqld hangs whenever I issued  'stop grou_replication' or 'mysqladmin shutdown'.

the only workaround is to 'kill -9' and then issue 'set global slave_parallel_workers=1' before 'start group_replication'

---- my.cnf ----
slave_parallel_type = LOGICAL_CLOCK
slave_parallel_workers = 16
slave_preserve_commit_order = 1

pt-pmp info:

     10 pthread_cond_wait,os_event_wait_low,os_aio_handler,fil_aio_wait,io_handler_thread,start_thread(libpthread.so.0),clone(libc.so.6)
     10 pthread_cond_wait,Commit_order_manager::wait_for_its_turn,MYSQL_BIN_LOG::ordered_commit,MYSQL_BIN_LOG::commit,ha_commit_trans,trans_commit,Xid_log_event::do_commit,Xid_apply_log_event::do_apply_event_worker,slave_worker_exec_job_group,handle_slave_worker,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      3 pthread_cond_wait,Per_thread_connection_handler::block_until_new_connection,handle_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      3 pthread_cond_wait,os_event_wait_low,srv_worker_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      3 pthread_cond_wait,os_event_wait_low,lock_wait_suspend_thread,row_mysql_handle_errors,::??,row_update_for_mysql,ha_innobase::update_row,handler::ha_update_row,Update_rows_log_event::do_exec_row,Rows_log_event::do_apply_row,Rows_log_event::do_index_scan_and_update,Rows_log_event::do_apply_event,Slave_worker::read_and_apply_events,Slave_worker::retry_transaction,slave_worker_exec_job_group,handle_slave_worker,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      3 pthread_cond_wait,os_event_wait_low,lock_wait_suspend_thread,row_mysql_handle_errors,::??,row_update_for_mysql,ha_innobase::update_row,handler::ha_update_row,Update_rows_log_event::do_exec_row,Rows_log_event::do_apply_row,Rows_log_event::do_index_scan_and_update,Rows_log_event::do_apply_event,slave_worker_exec_job_group,handle_slave_worker,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      3 pthread_cond_wait,os_event_wait_low,buf_flush_page_cleaner_worker,start_thread(libpthread.so.0),clone(libc.so.6)
      1 sigwait(libpthread.so.0),signal_hand,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 sigwaitinfo(libc.so.6),::??,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 sched_yield(libc.so.6),Mts_submode_logical_clock::get_least_occupied_worker,Log_event::get_slave_worker,Log_event::apply_event,apply_event_and_update_pos,handle_slave_sql,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,os_event_wait_low,srv_purge_coordinator_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,os_event_wait_low,buf_resize_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,os_event_wait_low,buf_dump_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,Gcs_xcom_engine::process(group_replication.so),process_notification_thread(group_replication.so),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,compress_gtid_table,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,Applier_module::applier_thread_handle(group_replication.so),group_replication::??(group_replication.so),pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait,os_event::wait_time_low,srv_monitor_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait,os_event::wait_time_low,srv_error_monitor_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait,os_event::wait_time_low,lock_wait_timeout_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait,os_event::wait_time_low,ib_wqueue_timedwait,fts_optimize_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait,os_event::wait_time_low,dict_stats_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait,os_event::wait_time_low,buf_flush_page_cleaner_coordinator,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,Certifier_broadcast_thread::dispatcher(group_replication.so),group_replication::??(group_replication.so),pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 poll(libc.so.6),Mysqld_socket_listener::listen_for_connection_event,mysqld_main,__libc_start_main(libc.so.6),_start
      1 poll(libc.so.6),group_replication::??(group_replication.so),task_loop(group_replication.so),xcom_taskmain2(group_replication.so),Gcs_xcom_proxy_impl::xcom_init(group_replication.so),group_replication::??(group_replication.so),start_thread(libpthread.so.0),clone(libc.so.6)
      1 nanosleep(libpthread.so.0),os_thread_sleep,srv_master_thread,start_thread(libpthread.so.0),clone(libc.so.6)

Also seen in 5.7.28.

Also seen in 5.7.29

Looks like all the below bugs are duplicates of each other.

Bug #87796 Commit_order_manager can't terminate MTS worker properly when deadlock happens
Bug #89247 Deadlock with MTS when slave_preserve_commit_order = ON.
Bug #95249 stop slave permanently blocked
Bug #99440 Threads of MTS Slave randomly stuck

Here is the root cause for the bug.

When --slave-preserve-commit-order is enabled on slave and if the
waiting thread has locked the rows which are needed by the thread
executing the previous transaction(as per their order in the relay log),
then the innodb deadlock detection algorithm detects the deadlock
between workers and will ask the waiting thread to rollback (only if its
sequence number is lesser than that of the waiting thread).

When this happens, the waiting thread wakes up from the cond_wait(SPCO)
and it gets to know that it was asked to rollback by its preceding
transaction as it was holding a lock that is needed by the other
transaction to progress. It then rolls back its transaction so that the
the preceding transaction can be committed and retries the transaction.

The above logic sometimes caused the worker thread to miss the signals
resulting in the replica server to hang. One of such hang is mentioned
below.

Consider a replica server which is configured with
slave_parallel_workers=3, slave_parallel_type=LOGICAL_CLOCK,
slave_preserve_commit_order=1 and slave_transaction_retries=0. When MTS
is enabled, it is quite possible that workers execute out of order
causing the below state.

Worker 1 - Processing the events of Transaction T1
Worker 2 - Executed Transaction T2 and is waiting for T1 to commit.
Worker 3 - Processing the events of Transaction T3

1. If T1 and T2 are modifying same rows in InnodB, then the worker 1
   detects deadlock and asks worker 2 to rollback by signalling.
2. Worker 2 wakes up from the cond_wait. It gets to know that it was
   asked to roll back by the other transaction and returns with an
   error.
3. Worker 2 rolls back the transaction and comes to the retry part of
   the code and checks the value of slave_transaction_retries. Since it
   is 0, it returns from the handle_slave_worker loop and enters the
   error handling part.
4. As part of error handling, Worker 2 notifies the co-ordinator that it
   is exiting, and then calls report_rollback() function to unregister
   itself from the SPCO queue.
5. While executing report_rollback(), Worker 2 will again enter
   wait_for_its_turn(). But before entering the wait, it checks the
   commit_order_deadlock flag. Since the flag is already set, Worker2
   immediately returns from the function with error.
6. Co-ordinator thread gets this information and sets the
   rli->abort_slave=1 to stop replication and waits till all workers
   exit.
7. Worker 2 exits. There is no worker 2 from here onwards.

   Now the status is,
   Worker 1 - Processing the events of Transaction T1
   Worker 2 - Not running.
   Worker 3 - Processing the events of Transaction T3

8. Now the worker 1 proceeds and executes the transaction and enters the
   Commit_order_manager::wait_for_its_turn.
9. Worker 1 finds out that the previous worker(Worker 2) failed because
   of an error.
10. Worker 1 signals next transaction/worker to proceed.
11. Worker 3 executes the transaction and enters the
    Commit_order_manager::wait_for_its_turn.
12. Worker 1 rolls back the transaction and eventually exits.
13. There will be no one to signal Worker 3 and thus waits forever.

This resulted in a system hang as the co-ordinator thread will be
waiting for the worker thread to finish and the worker thread will be
waiting for the signal to proceed with the commit.

mysql> show processlist; 
+----+-------------+-----------------+------+---------+------+-----------------------------------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined | 
+----+-------------+-----------------+------+---------+------+-----------------------------------------+
| 2 | root | localhost:55708 | test | Query | 0 | starting | show processlist | 0 | 0 | 
| 3 | system user | | NULL | Connect | 107 | Waiting for master to send event | NULL | 0 | 0 | 
| 4 | system user | | NULL | Connect | 77 | Waiting for workers to exit | NULL | 0 | 0 | 
| 7 | system user | | NULL | Connect | 84 | Waiting for preceding transaction to commit | NULL | 0 | 0 | 
+----+-------------+-----------------+------+---------+------+-----------------------------------------+

Contributed patch on top of 5.7.31 (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: 0001-MTS-deadlock-with-slave_preserve_commit_order.patch (text/x-patch), 37.06 KiB.

Contributed patch on top of 8.0.21 (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: 0001-MTS-deadlock-with-slave_preserve_commit_order-8.0.patch (text/x-patch), 37.04 KiB.

Internal bug is closed as duplicate of Bug #87796, thus closing this as a duplicate of Bug #87796.

regards,
Umesh