Description:
When a master crashes while writing a replication event to its binary log, and the slave has already started copying that event to its relay log but did not receive it completely, a STOP SLAVE on a multi-threaded slave takes much longer than expected.
In the test case below, it takes over one minute. Such a long delay may be a problem for failover automation scripts.
The issue does not happen when slave_parallel_workers=0.
How to repeat:
Create a simple 5.7 master-slave pair (I used dbdeployer here) and enable parallel replication on the slave:
slave_parallel_workers=2
slave_parallel_type=LOGICAL_CLOCK
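Both settings can also be applied without a restart, since they are dynamic global variables in 5.7; note that slave_parallel_type can only be changed while the replication threads are stopped. A sketch:

slave1 > STOP SLAVE;
slave1 > SET GLOBAL slave_parallel_workers = 2;
slave1 > SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';
slave1 > START SLAVE;

The my.cnf entries above are still needed for the settings to survive a slave restart.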
Create one big table with sysbench and an empty copy of it:
$ sysbench /usr/share/sysbench/oltp_read_write.lua --mysql_storage_engine=innodb --table-size=100000 --mysql-db=test --mysql-user=msandbox --mysql-password=msandbox --db-driver=mysql --mysql-socket=/tmp/mysql_sandbox19226.sock prepare
sysbench 1.0.17 (using bundled LuaJIT 2.1.0-beta2)
Creating table 'sbtest1'...
Inserting 100000 records into 'sbtest1'
Creating a secondary index on 'sbtest1'...
master [localhost:19226] {msandbox} (test) > create table sbtest10 like sbtest1;
Query OK, 0 rows affected (0.75 sec)
Then, execute INSERT INTO ... SELECT to populate the second copy, and send a kill signal to the master immediately after:
$ rsandbox_ps5_7_25/m test -e "insert into sbtest10 select * from sbtest1"; kill -9 $(pgrep -f mysql_sandbox19226);
The slave is trying to reconnect:
slave1 [localhost:19227] {msandbox} ((none)) > show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Reconnecting after a failed master event read
Master_Host: 127.0.0.1
Master_User: rsandbox
Master_Port: 19226
Connect_Retry: 60
Master_Log_File: mysql-bin.000011
Read_Master_Log_Pos: 11476121
Relay_Log_File: mysql-relay.000030
Relay_Log_Pos: 367
Relay_Master_Log_File: mysql-bin.000011
Slave_IO_Running: Connecting
Slave_SQL_Running: Yes
(...)
and we can see that Relay_Log_Space has increased, which shows that part of that transaction was delivered to the relay log. An attempt to stop the slave is greatly delayed:
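To watch Relay_Log_Space without scanning the full SHOW SLAVE STATUS output, a one-liner like this can be used (s1 is the slave wrapper script that dbdeployer creates next to the m script used above):

$ rsandbox_ps5_7_25/s1 -e "show slave status\G" | grep -E "Slave_IO_State|Relay_Log_Space"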
slave1 [localhost:19227] {msandbox} ((none)) > stop slave;
Query OK, 0 rows affected (1 min 2.66 sec)
From another session, it looks like this:
slave1 [localhost:19227] {msandbox} ((none)) > show processlist;
+-----+-------------+-----------+------+---------+------+-----------------------------------------------+------------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+-----+-------------+-----------+------+---------+------+-----------------------------------------------+------------------+-----------+---------------+
| 39 | msandbox | localhost | NULL | Query | 63 | Killing slave | stop slave | 0 | 0 |
| 109 | system user | | NULL | Connect | 82 | Reconnecting after a failed master event read | NULL | 0 | 0 |
| 110 | system user | | NULL | Connect | 72 | Waiting for workers to exit | NULL | 0 | 0 |
| 111 | system user | | NULL | Connect | 77 | Executing event | NULL | 0 | 0 |
| 113 | msandbox | localhost | NULL | Query | 0 | starting | show processlist | 0 | 0 |
+-----+-------------+-----------+------+---------+------+-----------------------------------------------+------------------+-----------+---------------+
5 rows in set (0.00 sec)
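While STOP SLAVE hangs, the state of the applier workers can also be inspected through performance_schema (table available in 5.7):

slave1 [localhost:19227] {msandbox} ((none)) > select worker_id, thread_id, service_state
    -> from performance_schema.replication_applier_status_by_worker;

This makes it easier to see which worker is still busy applying the partially received transaction while the coordinator reports "Waiting for workers to exit".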
The problem is fully reproducible this way.
Suggested fix:
The STOP SLAVE command should return as fast as it does for a single-threaded slave, especially since in the test above there was only one transaction being executed. Alternatively, there should at least be a variable that lets us adjust this timeout (if I am not missing an existing one here).
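For reference, the closest existing knob I am aware of is rpl_stop_slave_timeout, but as I understand it, it only bounds how long the STOP SLAVE statement itself waits before returning control to the client (the stop then continues in the background), rather than making the worker threads exit sooner, so it may not address this case:

slave1 > SET GLOBAL rpl_stop_slave_timeout = 10;
slave1 > STOP SLAVE;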