Bug #87489 MTS ordered commit may cause deadlock with "FLUSH TABLES WITH READ LOCK"of slave
Submitted: 21 Aug 2017 8:15 Modified: 14 Sep 2017 14:05
Reporter: Eunbin Baek Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.7.16 OS:Any
Assigned to: Bogdan Kecman CPU Architecture:Any
Tags: deadlock, MTS, slave_preserve_commit_order

[21 Aug 2017 8:15] Eunbin Baek
Description:
We are running MySQL slave server 5.7.16 with MTS enabled.
Our replication configuration is as follows:
enforce_gtid_consistency=ON
gtid_mode=ON
slave_parallel_workers=8
slave_parallel_type=LOGICAL_CLOCK
slave_preserve_commit_order=1
binlog_order_commits=1
log_slave_updates=ON

Sometimes replication gets stalled with "FLUSH TABLES WITH READ LOCK" command.

Below is how the processlist and InnoDB status look in that case:
+---------+-------------+-----------------+--------+---------+----------+---------------------------------------------+------------------------------------------------------------------------------------------------------+
| Id      | User        | Host            | db     | Command | Time     | State                                       | Info                                                                                                 |
+---------+-------------+-----------------+--------+---------+----------+---------------------------------------------+------------------------------------------------------------------------------------------------------+
|  725369 | system user |                 | NULL   | Connect | 14825828 | Waiting for master to send event            | NULL                                                                                                 |
| 6995159 | system user |                 | NULL   | Connect |     7208 | Waiting for dependent transaction to commit | NULL                                                                                                 |
| 6995160 | system user |                 | test   | Connect |     7208 | Waiting for global read lock                | update `tb1` set `status` = 'waiting', `col2` = null, `col3 |
| 6995161 | system user |                 | test   | Connect |     7208 | Waiting for global read lock                | update `tb1` set `status` = 'waiting', `col2` = null, `col3 |
| 6995162 | system user |                 | NULL   | Connect |     7208 | Waiting for commit lock                     | NULL                                                                                                 |
| 6995163 | system user |                 | NULL   | Connect |     7208 | Waiting for preceding transaction to commit | NULL                                                                                                 |
| 6995164 | system user |                 | NULL   | Connect |     7208 | Waiting for preceding transaction to commit | NULL                                                                                                 |
| 6995165 | system user |                 | test   | Connect |     7208 | Waiting for global read lock                | update `tb1` set `status` = 'waiting', `col2` = null, `col3 |
| 6995166 | system user |                 | test   | Connect |     7208 | Waiting for global read lock                | update `tb1` set `status` = 'waiting', `col2` = null, `col3 |
| 6995167 | system user |                 | test   | Connect |     7208 | Waiting for global read lock                | update `tb1` set `status` = 'waiting', `col2` = null, `col3 |
| 8034088 | dbmon       | 127.0.0.1:49568 | NULL   | Sleep   |       31 |                                             | NULL                                                                                                 |
| 8141536 | admin       | 127.0.0.1:54602 | NULL   | Query   |     7208 | Waiting for commit lock                     | FLUSH TABLES WITH READ LOCK                                                                          |
| 8145557 | admin       | localhost       | NULL   | Query   |        0 | starting                                    | show processlist                                                                                     |
+---------+-------------+-----------------+--------+---------+----------+---------------------------------------------+------------------------------------------------------------------------------------------------------+

---TRANSACTION 13299932771, ACTIVE 7170 sec
2 lock struct(s), heap size 1136, 1 row lock(s), undo log entries 1
MySQL thread id 6995162, OS thread handle 140424943101696, query id 13353809683 Waiting for commit lock
TABLE LOCK table `test`.`tb1` trx id 13299932771 lock mode IX
RECORD LOCKS space id 389 page no 2643 n bits 192 index PRIMARY of table `test`.`tb1` trx id 13299932771 lock_mode X locks rec but not gap
Record lock, heap no 43 PHYSICAL RECORD: n_fields 27; compact format; info bits 0

---TRANSACTION 13299932768, ACTIVE (PREPARED) 7170 sec
2 lock struct(s), heap size 1136, 1 row lock(s), undo log entries 1
MySQL thread id 6995163, OS thread handle 140424942573312, query id 13353809685 Waiting for preceding transaction to commit
TABLE LOCK table `test`.`tb1` trx id 13299932768 lock mode IX
RECORD LOCKS space id 389 page no 1880 n bits 152 index PRIMARY of table `test`.`tb1` trx id 13299932768 lock_mode X locks rec but not gap
Record lock, heap no 30 PHYSICAL RECORD: n_fields 27; compact format; info bits 0

---TRANSACTION 13299932767, ACTIVE (PREPARED) 7170 sec
2 lock struct(s), heap size 1136, 1 row lock(s), undo log entries 1
MySQL thread id 6995164, OS thread handle 140424941516544, query id 13353809687 Waiting for preceding transaction to commit
TABLE LOCK table `test`.`tb1` trx id 13299932767 lock mode IX
RECORD LOCKS space id 389 page no 2685 n bits 176 index PRIMARY of table `test`.`tb1` trx id 13299932767 lock_mode X locks rec but not gap
Record lock, heap no 58 PHYSICAL RECORD: n_fields 27; compact format; info bits 0

I think this situation can occur at any time in MTS with slave_preserve_commit_order = 1.
"FLUSH TABLES WITH READ LOCK" is used for MySQL backup, so this situation can occur at least once in a day.
I hope this problem will be solved soon.

How to repeat:
It's timing issue, so might take time to reproduce.

Master : run UPDATE with multi-threads
Slave : run "FLUSH TABLES WITH READ LOCK" repeatedly

Suggested fix:
Detect this deadlock situation internally, and execute a rollback or throw an error with stopping replication.
[23 Aug 2017 14:01] Bogdan Kecman
Hi,

While trying to reproduce (unsuccessful so far) I have to ask, why do you execute flush tables with read lock?

Another question, the slave that fails, it is running on bare metal or vm ?

all best
Bogdan
[24 Aug 2017 2:50] Eunbin Baek
Hi,
The slave is running on physical machine.
We are using Percona XtraBackup on slave server for data backup.
so, flush tables with read lock was executed by XtraBackup while backup had been running on the slave.
[14 Sep 2017 14:05] Bogdan Kecman
Hi,

I tried this with multiple mysql binaries and I can't get this issue to happen again. Running a flush tables with read lock on master or on slave works as expected. Is this oracle binary you are using or percona's binary you are having issues with? Could you maybe share a complete config from master and slave to try with setup more similar to yours.

all best
Bogdan
[7 Feb 5:30] Prasad N
I am running into same issue with mysql 5.7.21 but when executing  SET GLOBAL super_read_only=ON  command on the slave instead of FTWRL.

I have enabled multi threaded slaves 30 in number. My Show processlist shows the following : truncated to keep focus:

| Id  | User        | Host            | db     | Command | Time  | State                                       | Info                                                                                                             
+-----+-------------+-----------------+--------+---------+-------+---------------------------------------------+----------------------------------------------
|   7 | system user |                 | NULL   | Connect | 40907 | Waiting for dependent transaction to commit | NULL                                                                                                             
|   8 | system user |                 | NULL   | Connect | 40993 | Waiting for an event from Coordinator       | NULL                                                                                                             
|   9 | system user |                 | NULL   | Connect | 40993 | Waiting for an event from Coordinator       | NULL                                                                                                             
|  10 | system user |                 | LOADDB | Connect | 40993 | Waiting for global read lock                | INSERT INTO LOAD_TABLE (UUID,COUNTER,STATUS) |
|  11 | system user |                 | NULL   | Connect | 40993 | Waiting for preceding transaction to commit | NULL                                                                                                             |
|  12 | system user |                 | NULL   | Connect | 40993 | Waiting for preceding transaction to commit | NULL                                                                                                             
|  13 | system user |                 | NULL   | Connect | 40993 | Waiting for preceding transaction to commit | NULL                                                                                                             |
|  14 | system user |                 | NULL   | Connect | 40993 | Waiting for preceding transaction to commit | NULL                                                                                                             |
|  15 | system user |                 | NULL   | Connect | 40994 | Waiting for an event from Coordinator       | NULL                                                                                                             |
  31 | system user |                 | NULL   | Connect | 44441 | Waiting for an event from Coordinator       | NULL                                                                                                             |                                                                                                             |
| 226 | system user |                 | NULL   | Connect | 43075 | Waiting for master to send event            | NULL                                                                                                             |
| 430 | sgroot      | 127.0.0.1:45190 | NULL   | Query   | 40907 | Waiting for commit lock                     | SET GLOBAL super_read_only=ON                                                                                    |
| 802 | sgroot      | localhost       | NULL   | Query   |     0 | starting                                    | show full processlist                                                                                            |

1: Can you provide an explanation of why this is happening ?
2: Any workarounds to safely execute my SET GLOBAL super_read_only=ON 
3: Fix the issue.
[15 Feb 21:45] Bogdan Kecman
@Prasad, are you using Oracle binaries or? Because I cannot reproduce this with Oracle binaries.

all best
Bogdan