Bug #87796 Commit_order_manager can't terminate MTS worker properly when deadlock happens
Submitted: 19 Sep 2017 5:11 Modified: 30 Nov 2020 16:44
Reporter: Seunguck Lee Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S1 (Critical)
Version:5.7.19 OS:Linux (Linux localhost 3.10.0-327.28.2.el7.x86_64 #1 SMP Wed Aug 3 11:11:39 UTC 2016 x)
Assigned to: CPU Architecture:Any

[19 Sep 2017 5:11] Seunguck Lee
Description:
Below binary log events are part of actual binary logevents on production server.
(Update statements are modified for security issue, But gap lock for partial pk update logic is same with real-case)
----------------------------------------
# at 21263840
#170914  9:28:16 server id 1738162  end_log_pos 21263905 CRC32 0xce1c3887       GTID    last_committed=31944    sequence_number=31946
update mts_test set fd1=now() where id=22

# at 21264263
#170914  9:28:16 server id 1738162  end_log_pos 21264328 CRC32 0x3b128e0d       GTID    last_committed=31944    sequence_number=31947
update mts_test set fd1=now() where id=19

# at 21264686
#170914  9:28:16 server id 1738162  end_log_pos 21264751 CRC32 0xf0595ace       GTID    last_committed=31946    sequence_number=31948
update mts_test set fd1=now() where id=23

# at 21265109
#170914  9:28:16 server id 1738162  end_log_pos 21265174 CRC32 0xf1997cc0       GTID    last_committed=31946    sequence_number=31949
update mts_test set fd1=now() where id=25

# at 21265532
#170914  9:28:16 server id 1738162  end_log_pos 21265597 CRC32 0x6d520c84       GTID    last_committed=31946    sequence_number=31950
update mts_test set fd1=now() where id=21

# at 21266752
#170914  9:28:16 server id 1738162  end_log_pos 21266817 CRC32 0xbdc2015f       GTID    last_committed=31947    sequence_number=31953
update mts_test set fd1=now() where id=24
----------------------------------------

In MTS, above binary log events can be run in this order (after commit of sequence_number=31947).
    1) update mts_test set fd1=now() where id=25   => Commit order 2
    2) update mts_test set fd1=now() where id=21   => Commit order 3
    3) update mts_test set fd1=now() where id=24   => Commit order 4
    4) update mts_test set fd1=now() where id=23   => Commit order 1

But with this order, worker-4 will wait for worker-3 and worker-3 will wait for worker-1.
Worker-4 thread will call thd_report_row_lock_wait() with wait_for="Worker-3".
Then Worker-3 will be terminiated for MTS commit order deadlock in Commit_order_manager::wait_for_its_turn(https://github.com/mysql/mysql-server/blob/61e2172a1885172aec9814fa7ece5a3176547634/sql/rp...).

Unfourtunately, In this scenario,
Worker-3 can not be terminated because Worker-3 can not call Commit_order_manager::wait_for_its_turn() because Worker-3 still waiting for Worker-1.
So Worker-3 and Worker-4 are terminate query for innodb_lock_wait_timeout. 
More worse, once this case happen MTS can't be proceed because update order is not changed for Worker-3 and Worker-4 forever.
So Worker-3 and Worker-4 will terminated by "Lock wait timeout exceeded" error and retry until slave_transaction_retries times.
And eventually, MTS slave is stopped.

Of course, binary log event can be run different order and no problems. But this case happens too often to us.

How to repeat:
-- MySQL configuration --
slave_parallel_workers=8
slave_parallel_type=LOGICAL_CLOCK
slave_preserve_commit_order=1
binlog_format=[STATEMENT | MIXED]
transaction_isolation=REPEATABLE-READ
enforce_gtid_consistency=ON
gtid_mode=ON

-- Prepare test data --
CREATE TABLE mts_test (
  id BIGINT NOT NULL,
  ts DATETIME NOT NULL,
  fd1 DATETIME,
  PRIMARY KEY(id, ts)
);

prepare(){
    for(var idx=0; idx<MAX_TEST_ID; idx++){
        exec("insert into mts_test values (NULL, now(), now())");
    }
}

-- Run test case --
var last_id = 1;

getNextId(){
    if(last_id>=MAX_TEST_ID)
        return 0;
	return last_id++;
}

test(){
    var id = getNextId();
    while(id>0){
	    ThreadPool.exec("update mts_test set fd1=now() where id=" + id);
	    id = getNextId();
    }
}

This case is very timing issue, so you may not see the case.
And our real service, acutally table has 60 partitions, so update might be slower than this case. (update statement does not have partition key condition).

Suggested fix:
I think "sequence_number=31953" event should not start until last_committed=31946 groups are completed. But this may make MTS slower.

or Change Commit_order_manager::wait_for_its_turn() to terminate worker which is waiting for another lock.
[28 Sep 2017 4:28] MySQL Verification Team
Hi,

I managed to reproduce this but it's not consistent and I can't reproduce it on-demand. Not 100% sure what's going on but will push this trough as verified.

Thank for the report
Bogdan
[24 Feb 2020 8:06] kfpanda kf
Also seen in 5.7.27
[24 Sep 2020 7:48] Venkatesh Prasad Venugopal
Looks like all the below bugs are duplicates of each other.

Bug #87796 Commit_order_manager can't terminate MTS worker properly when deadlock happens
Bug #89247 Deadlock with MTS when slave_preserve_commit_order = ON.
Bug #95249 stop slave permanently blocked
Bug #99440 Threads of MTS Slave randomly stuck

Here is the root cause for the bug.

When --slave-preserve-commit-order is enabled on slave and if the
waiting thread has locked the rows which are needed by the thread
executing the previous transaction(as per their order in the relay log),
then the innodb deadlock detection algorithm detects the deadlock
between workers and will ask the waiting thread to rollback (only if its
sequence number is lesser than that of the waiting thread).

When this happens, the waiting thread wakes up from the cond_wait(SPCO)
and it gets to know that it was asked to rollback by its preceding
transaction as it was holding a lock that is needed by the other
transaction to progress. It then rolls back its transaction so that the
the preceding transaction can be committed and retries the transaction.

The above logic sometimes caused the worker thread to miss the signals
resulting in the replica server to hang. One of such hang is mentioned
below.

Consider a replica server which is configured with
slave_parallel_workers=3, slave_parallel_type=LOGICAL_CLOCK,
slave_preserve_commit_order=1 and slave_transaction_retries=0. When MTS
is enabled, it is quite possible that workers execute out of order
causing the below state.

Worker 1 - Processing the events of Transaction T1
Worker 2 - Executed Transaction T2 and is waiting for T1 to commit.
Worker 3 - Processing the events of Transaction T3

1. If T1 and T2 are modifying same rows in InnodB, then the worker 1
   detects deadlock and asks worker 2 to rollback by signalling.
2. Worker 2 wakes up from the cond_wait. It gets to know that it was
   asked to roll back by the other transaction and returns with an
   error.
3. Worker 2 rolls back the transaction and comes to the retry part of
   the code and checks the value of slave_transaction_retries. Since it
   is 0, it returns from the handle_slave_worker loop and enters the
   error handling part.
4. As part of error handling, Worker 2 notifies the co-ordinator that it
   is exiting, and then calls report_rollback() function to unregister
   itself from the SPCO queue.
5. While executing report_rollback(), Worker 2 will again enter
   wait_for_its_turn(). But before entering the wait, it checks the
   commit_order_deadlock flag. Since the flag is already set, Worker2
   immediately returns from the function with error.
6. Co-ordinator thread gets this information and sets the
   rli->abort_slave=1 to stop replication and waits till all workers
   exit.
7. Worker 2 exits. There is no worker 2 from here onwards.

   Now the status is,
   Worker 1 - Processing the events of Transaction T1
   Worker 2 - Not running.
   Worker 3 - Processing the events of Transaction T3

8. Now the worker 1 proceeds and executes the transaction and enters the
   Commit_order_manager::wait_for_its_turn.
9. Worker 1 finds out that the previous worker(Worker 2) failed because
   of an error.
10. Worker 1 signals next transaction/worker to proceed.
11. Worker 3 executes the transaction and enters the
    Commit_order_manager::wait_for_its_turn.
12. Worker 1 rolls back the transaction and eventually exits.
13. There will be no one to signal Worker 3 and thus waits forever.

This resulted in a system hang as the co-ordinator thread will be
waiting for the worker thread to finish and the worker thread will be
waiting for the signal to proceed with the commit.

mysql> show processlist; 
+----+-------------+-----------------+------+---------+------+-----------------------------------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined | 
+----+-------------+-----------------+------+---------+------+-----------------------------------------+
| 2 | root | localhost:55708 | test | Query | 0 | starting | show processlist | 0 | 0 | 
| 3 | system user | | NULL | Connect | 107 | Waiting for master to send event | NULL | 0 | 0 | 
| 4 | system user | | NULL | Connect | 77 | Waiting for workers to exit | NULL | 0 | 0 | 
| 7 | system user | | NULL | Connect | 84 | Waiting for preceding transaction to commit | NULL | 0 | 0 | 
+----+-------------+-----------------+------+---------+------+-----------------------------------------+
[24 Sep 2020 7:54] Venkatesh Prasad Venugopal
Contributed patch on top of 5.7.31

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: 0001-MTS-deadlock-with-slave_preserve_commit_order.patch (text/x-patch), 37.06 KiB.

[24 Sep 2020 18:11] OCA Admin
Contribution submitted via Github - Bug #87796 Commit_order_manager can't terminate MTS worker properly when deadloc 
(*) Contribution by Venkatesh Prasad Venugopal (Github venkatesh-prasad-v, mysql-server/pull/308#issuecomment-698472987): I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: git_patch_492231026.txt (text/plain), 37.21 KiB.

[29 Sep 2020 5:20] Venkatesh Prasad Venugopal
Contributed patch on top of 8.0.21

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: 0001-MTS-deadlock-with-slave_preserve_commit_order-8.0.patch (text/x-patch), 37.04 KiB.

[30 Nov 2020 16:44] Margaret Fisher
Posted by developer:
 
Changelog entry added for MySQL 8.0.23:

On a multi-threaded replica where the commit order is preserved, worker threads must wait for all transactions that occur earlier in the relay log to commit before committing their own transactions. If a deadlock occurs because a thread waiting to commit a transaction later in the commit order has locked rows needed by a transaction earlier in the commit order, a deadlock detection algorithm signals the waiting thread to roll back its transaction. Previously, if transaction retries were not available, the worker thread that rolled back its transaction would exit immediately without signalling other worker threads in the commit order, which could stall replication. A worker thread in this situation now waits for its turn to call the rollback function, which means it signals the other threads correctly.
[7 Dec 2020 23:28] Jean-François Gagné
Hi Margaret, I see this will be fixed in 8.0.23, can we hope for a fix in 5.7 ?  Thanks, JFG
[19 Feb 2021 21:26] Margaret Fisher
Posted by developer:
 
Changelog entry now also added for MySQL 5.7.33.
[1 Sep 2021 13:10] MySQL Verification Team
Bug #89247 marked as duplicate of this one.