Description:
When a master crashes while writing a replication event to its binary log, and the slave has already started copying that event to its relay log but did not receive it completely, a STOP SLAVE on a multi-threaded slave takes much longer than expected.
In the test case below, it takes over one minute. Such a long delay may be a problem for failover automation scripts.
The issue does not happen when slave_parallel_workers=0.
How to repeat:
Create a simple 5.7 master-slave pair (I used dbdeployer here) and enable parallel replication on the slave:
slave_parallel_workers=2
slave_parallel_type=LOGICAL_CLOCK
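Both settings can also be applied without a restart, since they are dynamic global variables in 5.7; note that slave_parallel_type can only be changed while the replication threads are stopped. A sketch:

slave1 > STOP SLAVE;
slave1 > SET GLOBAL slave_parallel_workers = 2;
slave1 > SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';
slave1 > START SLAVE;

The my.cnf entries above are still needed for the settings to survive a slave restart.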
Create one big table with sysbench and an empty copy of it:
$ sysbench /usr/share/sysbench/oltp_read_write.lua --mysql_storage_engine=innodb --table-size=100000 --mysql-db=test --mysql-user=msandbox --mysql-password=msandbox --db-driver=mysql --mysql-socket=/tmp/mysql_sandbox19226.sock prepare
sysbench 1.0.17 (using bundled LuaJIT 2.1.0-beta2)
Creating table 'sbtest1'...
Inserting 100000 records into 'sbtest1'
Creating a secondary index on 'sbtest1'...
master [localhost:19226] {msandbox} (test) > create table sbtest10 like sbtest1;
Query OK, 0 rows affected (0.75 sec)
Then, execute INSERT INTO ... SELECT to populate the second copy, and send a kill signal to the master immediately after:
$ rsandbox_ps5_7_25/m test -e "insert into sbtest10 select * from sbtest1"; kill -9 $(pgrep -f mysql_sandbox19226);
The slave is trying to reconnect:
slave1 [localhost:19227] {msandbox} ((none)) > show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Reconnecting after a failed master event read
Master_Host: 127.0.0.1
Master_User: rsandbox
Master_Port: 19226
Connect_Retry: 60
Master_Log_File: mysql-bin.000011
Read_Master_Log_Pos: 11476121
Relay_Log_File: mysql-relay.000030
Relay_Log_Pos: 367
Relay_Master_Log_File: mysql-bin.000011
Slave_IO_Running: Connecting
Slave_SQL_Running: Yes
(...)
and we can see that Relay_Log_Space has increased, which shows that part of that transaction was delivered to the relay log. An attempt to stop the slave is greatly delayed:
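To watch Relay_Log_Space without scanning the full SHOW SLAVE STATUS output, a one-liner like this can be used (s1 is the slave wrapper script that dbdeployer creates next to the m script used above):

$ rsandbox_ps5_7_25/s1 -e "show slave status\G" | grep -E "Slave_IO_State|Relay_Log_Space"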
slave1 [localhost:19227] {msandbox} ((none)) > stop slave;
Query OK, 0 rows affected (1 min 2.66 sec)
From another session, it looks like this:
slave1 [localhost:19227] {msandbox} ((none)) > show processlist;
+-----+-------------+-----------+------+---------+------+-----------------------------------------------+------------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+-----+-------------+-----------+------+---------+------+-----------------------------------------------+------------------+-----------+---------------+
| 39 | msandbox | localhost | NULL | Query | 63 | Killing slave | stop slave | 0 | 0 |
| 109 | system user | | NULL | Connect | 82 | Reconnecting after a failed master event read | NULL | 0 | 0 |
| 110 | system user | | NULL | Connect | 72 | Waiting for workers to exit | NULL | 0 | 0 |
| 111 | system user | | NULL | Connect | 77 | Executing event | NULL | 0 | 0 |
| 113 | msandbox | localhost | NULL | Query | 0 | starting | show processlist | 0 | 0 |
+-----+-------------+-----------+------+---------+------+-----------------------------------------------+------------------+-----------+---------------+
5 rows in set (0.00 sec)
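While STOP SLAVE hangs, the state of the applier workers can also be inspected through performance_schema (table available in 5.7):

slave1 [localhost:19227] {msandbox} ((none)) > select worker_id, thread_id, service_state
    -> from performance_schema.replication_applier_status_by_worker;

This makes it easier to see which worker is still busy applying the partially received transaction while the coordinator reports "Waiting for workers to exit".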
The problem is fully reproducible this way.
Suggested fix:
The STOP SLAVE command should return as fast as it does for a single-threaded slave, especially since in the test above there was only one transaction being executed. Alternatively, there should at least be a variable that lets us adjust this timeout (if I am not missing an existing one here).
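For reference, the closest existing knob I am aware of is rpl_stop_slave_timeout, but as I understand it, it only bounds how long the STOP SLAVE statement itself waits before returning control to the client (the stop then continues in the background), rather than making the worker threads exit sooner, so it may not address this case:

slave1 > SET GLOBAL rpl_stop_slave_timeout = 10;
slave1 > STOP SLAVE;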