Bug #93089 parallel workers+slave_preserve_commit_order+flushtables with read lock deadlock
Submitted: 6 Nov 2018 3:37 Modified: 7 Nov 2018 7:45
Reporter: ashe sun (OCA) Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.7 OS:Any
Assigned to: MySQL Verification Team CPU Architecture:Any

[6 Nov 2018 3:37] ashe sun
Description:
slave config:
mysql> show global variables like '%slave%'
    -> ;
+-------------------------------------------+-----------------------+
| Variable_name                             | Value                 |
+-------------------------------------------+-----------------------+
| init_slave                                |                       |
| log_slave_updates                         | ON                    |
| log_slow_slave_statements                 | ON                    |
| rpl_semi_sync_master_wait_for_slave_count | 1                     |
| rpl_semi_sync_master_wait_no_slave        | ON                    |
| rpl_semi_sync_slave_enabled               | OFF                   |
| rpl_semi_sync_slave_trace_level           | 32                    |
| rpl_stop_slave_timeout                    | 31536000              |
| slave_allow_batching                      | OFF                   |
| slave_checkpoint_group                    | 512                   |
| slave_checkpoint_period                   | 300                   |
| slave_compressed_protocol                 | OFF                   |
| slave_exec_mode                           | STRICT                |
| slave_load_tmpdir                         | /data/mysql/mysql_tmp |
| slave_max_allowed_packet                  | 1073741824            |
| slave_net_timeout                         | 30                    |
| slave_parallel_type                       | LOGICAL_CLOCK         |
| slave_parallel_workers                    | 16                    |
| slave_pending_jobs_size_max               | 16777216              |
| slave_preserve_commit_order               | ON                    |
| slave_rows_search_algorithms              | INDEX_SCAN,HASH_SCAN  |
| slave_skip_errors                         | OFF                   |
| slave_sql_verify_checksum                 | ON                    |
| slave_transaction_retries                 | 128                   |
| slave_type_conversions                    |                       |
| sql_slave_skip_counter                    | 0                     |
+-------------------------------------------+-----------------------+

dead lock:
mysql> show processlist;l
+----+-------------+-----------+--------------------+---------+------+--------------------------------------------------------+--------------------------------------------+
| Id | User        | Host      | db                 | Command | Time | State                                                  | Info                                       |
+----+-------------+-----------+--------------------+---------+------+--------------------------------------------------------+--------------------------------------------+
|  3 | root        | localhost | test_dead_lock     | Query   | 2264 | Waiting for commit lock                                | flush tables with read lock                |
|  4 | system user |           | NULL               | Connect | 2708 | Waiting for master to send event                       | NULL                                       |
|  5 | system user |           | NULL               | Connect | 2289 | Slave has read all relay log; waiting for more updates | NULL                                       |
|  6 | system user |           | NULL               | Connect | 2319 | Waiting for global read lock                           | insert into ashe(id,name) values(40,'aaa') |
|  7 | system user |           | NULL               | Connect | 2308 | Waiting for preceding transaction to commit            | NULL                                       |
|  8 | system user |           | NULL               | Connect | 2708 | Waiting for an event from Coordinator                  | NULL                                       |
|  9 | system user |           | NULL               | Connect | 2708 | Waiting for an event from Coordinator                  | NULL                                       |
| 10 | system user |           | NULL               | Connect | 2708 | Waiting for an event from Coordinator                  | NULL                                       |
| 11 | system user |           | NULL               | Connect | 2708 | Waiting for an event from Coordinator                  | NULL                                       |
| 12 | system user |           | NULL               | Connect | 2708 | Waiting for an event from Coordinator                  | NULL                                       |
| 13 | system user |           | NULL               | Connect | 2708 | Waiting for an event from Coordinator                  | NULL                                       |
| 14 | system user |           | NULL               | Connect | 2708 | Waiting for an event from Coordinator                  | NULL                                       |
| 15 | system user |           | NULL               | Connect | 2708 | Waiting for an event from Coordinator                  | NULL                                       |
| 16 | system user |           | NULL               | Connect | 2708 | Waiting for an event from Coordinator                  | NULL                                       |

deadlock:
flush tables with read lock --> thread 7 --> thread 6 --> flush tables with read lock

How to repeat:
step 0
on master 
create table ashe(id int not null auto_increment primary key,name varchar(10));

step1
on slave
gdb attach slave with none-stop mode and set breakpoint at function open_and_lock_tables
 
step2

2.1 Generate  2 transaction that can be apply on slave in parallel.

session1                                                               session2
begin;                                                                   begin;
insert into ashe values(40,'aaa');                       insert into ashe values(41,'aaa');
commit;                                                                commit

step 3
in gdb
3.1 let first worker thread with transaction 1 stopped in open_and_lock_tables
3.2 let second worker thread with transaction 2 continue,and it will  wait_for_its_turn(trx1)

step 3
on slave
3.1 flush tables with read lock .

step 4
in gdb
let first worker go or exit gdb.

That's all. thanks.

Suggested fix:
1. Block  coordinator befor flush tables with read lock.

2.Remind users stop slave sql_thread befor flush tables with read lock.

3.Detect dead lock conditions when flush tables with read lock and exit and Remind users err message.
[6 Nov 2018 3:39] ashe sun
2.1 Generate  2 transaction that can be apply on slave in parallel. (do insert on master.)
[6 Nov 2018 10:27] MySQL Verification Team
Hi,

Thanks for submission. Can you reproduce this issue without using GDB ?

thanks
bogdan
[6 Nov 2018 13:01] MySQL Verification Team
reminds me of https://bugs.mysql.com/bug.php?id=89247
[7 Nov 2018 1:52] gang chen
Hi Bogdan

This bug can not repeat by debug_sync inner mysqld ? I am not sure about that ,but I will try.

 maybe you can run sysbench oltp test on master ,and send "flush tables with read lock" to slave again and again,good lock to you.
[7 Nov 2018 7:45] ashe sun
Hi Bogdan
 
I have read all documents of Debug Sync Facility,and there is no way to control slave worker threads. But an easy way is here. (Tell me how to control internal threads using Debug Sync Facility if you got it)

This is my way to repeat.
Prepare your environment with master-slave replication and open parallel slave sql workers.

step 0
on master 
create table ashe(id int not null auto_increment primary key,name varchar(10));
create table delay_table(id int);

step1

1.1 Generate  2 transaction that can be apply on slave in parallel.
session1                                                               session2
set binlog_format=statement;                            set binlog_format=statement;
begin;                                                                   begin;
insert into tt select sleep(10);                            
insert into ashe values(40,'aaa');                       insert into ashe values(41,'aaa');
commit;                                                                commit

step 2
on slave
flush tables with read lock .

and then you will see that on slave instance.

mysql> show processlist;
+----+-------------+-----------+----------------+---------+------+--------------------------------------------------------+---------------------------------------+
| Id | User        | Host      | db             | Command | Time | State                                                  | Info                                  |
+----+-------------+-----------+----------------+---------+------+--------------------------------------------------------+---------------------------------------+
|  3 | root        | localhost | NULL           | Query   |   13 | Waiting for commit lock                                | flush  tables with read lock          |
| 22 | root        | localhost | NULL           | Query   |    0 | starting                                               | show processlist                      |
| 41 | root        | localhost | NULL           | Sleep   | 1493 |                                                        | NULL                                  |
| 48 | system user |           | NULL           | Connect |  688 | Waiting for master to send event                       | NULL                                  |
| 49 | system user |           | NULL           | Connect |   20 | Slave has read all relay log; waiting for more updates | NULL                                  |
| 50 | system user |           | test_dead_lock | Connect |   25 | Waiting for global read lock                           | insert into ashe(name) values('aaaa') |
| 51 | system user |           | NULL           | Connect |   33 | Waiting for preceding transaction to commit            | NULL                                  |
| 52 | system user |           | NULL           | Connect |  657 | Waiting for an event from Coordinator                  | NULL                                  |