Bug #87796 Commit_order_manager can't terminate MTS worker properly when deadlock happens
Submitted: 19 Sep 5:11 Modified: 28 Sep 4:28
Reporter: Seunguck Lee Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Replication Severity:S1 (Critical)
Version:5.7.19 OS:Linux (Linux localhost 3.10.0-327.28.2.el7.x86_64 #1 SMP Wed Aug 3 11:11:39 UTC 2016 x)
Assigned to:

[19 Sep 5:11] Seunguck Lee
Description:
Below binary log events are part of actual binary logevents on production server.
(Update statements are modified for security issue, But gap lock for partial pk update logic is same with real-case)
----------------------------------------
# at 21263840
#170914  9:28:16 server id 1738162  end_log_pos 21263905 CRC32 0xce1c3887       GTID    last_committed=31944    sequence_number=31946
update mts_test set fd1=now() where id=22

# at 21264263
#170914  9:28:16 server id 1738162  end_log_pos 21264328 CRC32 0x3b128e0d       GTID    last_committed=31944    sequence_number=31947
update mts_test set fd1=now() where id=19

# at 21264686
#170914  9:28:16 server id 1738162  end_log_pos 21264751 CRC32 0xf0595ace       GTID    last_committed=31946    sequence_number=31948
update mts_test set fd1=now() where id=23

# at 21265109
#170914  9:28:16 server id 1738162  end_log_pos 21265174 CRC32 0xf1997cc0       GTID    last_committed=31946    sequence_number=31949
update mts_test set fd1=now() where id=25

# at 21265532
#170914  9:28:16 server id 1738162  end_log_pos 21265597 CRC32 0x6d520c84       GTID    last_committed=31946    sequence_number=31950
update mts_test set fd1=now() where id=21

# at 21266752
#170914  9:28:16 server id 1738162  end_log_pos 21266817 CRC32 0xbdc2015f       GTID    last_committed=31947    sequence_number=31953
update mts_test set fd1=now() where id=24
----------------------------------------

In MTS, above binary log events can be run in this order (after commit of sequence_number=31947).
    1) update mts_test set fd1=now() where id=25   => Commit order 2
    2) update mts_test set fd1=now() where id=21   => Commit order 3
    3) update mts_test set fd1=now() where id=24   => Commit order 4
    4) update mts_test set fd1=now() where id=23   => Commit order 1

But with this order, worker-4 will wait for worker-3 and worker-3 will wait for worker-1.
Worker-4 thread will call thd_report_row_lock_wait() with wait_for="Worker-3".
Then Worker-3 will be terminiated for MTS commit order deadlock in Commit_order_manager::wait_for_its_turn(https://github.com/mysql/mysql-server/blob/61e2172a1885172aec9814fa7ece5a3176547634/sql/rp...).

Unfourtunately, In this scenario,
Worker-3 can not be terminated because Worker-3 can not call Commit_order_manager::wait_for_its_turn() because Worker-3 still waiting for Worker-1.
So Worker-3 and Worker-4 are terminate query for innodb_lock_wait_timeout. 
More worse, once this case happen MTS can't be proceed because update order is not changed for Worker-3 and Worker-4 forever.
So Worker-3 and Worker-4 will terminated by "Lock wait timeout exceeded" error and retry until slave_transaction_retries times.
And eventually, MTS slave is stopped.

Of course, binary log event can be run different order and no problems. But this case happens too often to us.

How to repeat:
-- MySQL configuration --
slave_parallel_workers=8
slave_parallel_type=LOGICAL_CLOCK
slave_preserve_commit_order=1
binlog_format=[STATEMENT | MIXED]
transaction_isolation=REPEATABLE-READ
enforce_gtid_consistency=ON
gtid_mode=ON

-- Prepare test data --
CREATE TABLE mts_test (
  id BIGINT NOT NULL,
  ts DATETIME NOT NULL,
  fd1 DATETIME,
  PRIMARY KEY(id, ts)
);

prepare(){
    for(var idx=0; idx<MAX_TEST_ID; idx++){
        exec("insert into mts_test values (NULL, now(), now())");
    }
}

-- Run test case --
var last_id = 1;

getNextId(){
    if(last_id>=MAX_TEST_ID)
        return 0;
	return last_id++;
}

test(){
    var id = getNextId();
    while(id>0){
	    ThreadPool.exec("update mts_test set fd1=now() where id=" + id);
	    id = getNextId();
    }
}

This case is very timing issue, so you may not see the case.
And our real service, acutally table has 60 partitions, so update might be slower than this case. (update statement does not have partition key condition).

Suggested fix:
I think "sequence_number=31953" event should not start until last_committed=31946 groups are completed. But this may make MTS slower.

or Change Commit_order_manager::wait_for_its_turn() to terminate worker which is waiting for another lock.
[28 Sep 4:28] Bogdan Kecman
Hi,

I managed to reproduce this but it's not consistent and I can't reproduce it on-demand. Not 100% sure what's going on but will push this trough as verified.

Thank for the report
Bogdan