Bug #75525 MTS STOP SLAVE takes way too long (when worker threads are slow)
Submitted: 16 Jan 2015 8:59 Modified: 1 Jul 2015 15:11
Reporter: Andrii Nikitin
Status: Closed
Category: MySQL Server: Replication Severity: S2 (Serious)
Version: 5.6.22 OS: Any
Assigned to: CPU Architecture: Any

[16 Jan 2015 8:59] Andrii Nikitin
Description:
STOP SLAVE waits for the worker threads to finish processing their queues, which can take a long time (20 minutes or more).
SHOW SLAVE STATUS is blocked as well until STOP SLAVE completes.

How to repeat:
1. Set up multi-threaded replication with slave_parallel_workers=2
2. On Master:
create database d1; create table d1.a(i int) engine=innodb;
create database d2; create table d2.a(i int) engine=innodb;

3. On Slave only (to emulate slow processing of SQL statements):
create trigger d1.iai after insert on d1.a for each row do sleep(1);
create trigger d2.iai after insert on d2.a for each row do sleep(1);

4. Generate data:
select concat("insert into d", floor(rand()+1.5), ".a values(1);") from mysql.help_topic a, mysql.help_topic b limit 1000 into outfile 'datagen.sql';

5. Load the generated data into the master:
> source datagen.sql

6. When the load completes, execute STOP SLAVE

7. Observe that STOP SLAVE hangs for ~5 min and SHOW SLAVE STATUS hangs as well (workers continue until they catch up)

slave1 [localhost] > stop slave;
Query OK, 0 rows affected (4 min 41.31 sec)

A few processlist outputs (no other connections, no InnoDB locks, etc.):
|  5 | xxxxxxxx    | localhost | mysql | Query   |    0 | init                                          | show processlist |
|  6 | xxxxxxxx    | localhost | NULL  | Query   |    6 | Killing slave                                 | stop slave       |
|  7 | system user |           | NULL  | Connect |  101 | Waiting for master to send event              | NULL             |
|  8 | system user |           | NULL  | Connect |    6 | Waiting for Slave Worker to release partition | NULL             |
|  9 | xxxxxxxx    | localhost | d2    | Connect |  104 | User sleep                                    | do sleep(1)      |
| 10 | xxxxxxxx    | localhost | d1    | Connect |  103 | User sleep                                    | do sleep(1)      |

|  5 | xxxxxxxx    | localhost | mysql | Query   |    0 | init                                          | show processlist |
|  6 | xxxxxxxx    | localhost | NULL  | Query   |  222 | Killing slave                                 | stop slave       |
|  7 | system user |           | NULL  | Connect |  317 | Waiting for master to send event              | NULL             |
|  8 | system user |           | NULL  | Connect |  222 | Waiting for Slave Worker to release partition | NULL             |
|  9 | xxxxxxxx    | localhost | d2    | Connect |  272 | User sleep                                    | do sleep(1)      |
| 10 | xxxxxxxx    | localhost | d1    | Connect |  276 | User sleep                                    | do sleep(1)      |

Suggested fix:
STOP SLAVE must complete quickly, even if workers are slow.
E.g. introduce a max_worker_lag parameter.
[16 Jan 2015 9:01] Andrii Nikitin
Please note: with slower workers (a larger sleep() argument), STOP SLAVE would hang even longer.
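A rough way to see why the wait scales with the sleep() argument, as a minimal Python sketch. It assumes a simplified model in which each queued insert costs exactly the trigger's sleep time and workers drain their queues in parallel, so the longest queue dominates. The function name and the queue depths are hypothetical, not measurements from this report.

```python
def estimated_stop_wait(queued_per_worker, seconds_per_row):
    """Estimate how long STOP SLAVE blocks under the pre-fix behaviour.

    queued_per_worker: number of inserts still queued on each worker
                       (hypothetical values, not taken from the report).
    seconds_per_row:   the sleep() argument used in the slave-only trigger.
    Workers run in parallel, so the longest queue determines the wait.
    """
    return max(queued_per_worker, default=0) * seconds_per_row


# Hypothetical queue depths for two workers at the moment STOP SLAVE is issued.
queues = [280, 250]
wait_sleep_1 = estimated_stop_wait(queues, 1)
wait_sleep_5 = estimated_stop_wait(queues, 5)
print(wait_sleep_1, wait_sleep_5)
```

The model ignores per-event overhead and any work already in progress, so it only shows the trend: the same backlog with a five times longer sleep keeps STOP SLAVE blocked about five times longer.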
[24 Feb 2015 17:00] Andrei Elkin
Workers do not check for a gap-free execution history. Currently the only thing that matters is whether the assignment queue is empty.
A marker in each Worker's private queue could be considered, so that reaching a marked event triggers the Worker's exit.
The marked event in each Worker's private queue would be selected so as to leave the execution history without gaps.
[1 Jul 2015 15:03] David Moss
The following was noted in the 5.7.8 and 5.6.26 changelog:
When using a multi-threaded slave, each worker thread has its own queue of transactions to process. In previous MySQL versions, STOP SLAVE waited for all workers to process their entire queue. This logic has been changed so that STOP SLAVE first finds the newest transaction that was committed by any worker thread. Then, it waits for all workers to complete transactions older than that. Newer transactions are not processed. The new logic allows STOP SLAVE to complete faster in case some worker queues contain multiple transactions.
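The difference between the two stopping rules in the changelog entry can be illustrated with a minimal Python sketch. This is a toy model, not the server's code: transactions are plain integers ordered by their position in the relay log, each worker's queue lists the transactions it has not yet committed, and the names old_stop_wait_set and new_stop_wait_set are made up for this example.

```python
def old_stop_wait_set(queues):
    """Previous behaviour: every worker processes its entire queue."""
    return {worker: list(pending) for worker, pending in queues.items()}


def new_stop_wait_set(queues, committed):
    """Behaviour described in the changelog (simplified model).

    Find the newest transaction already committed by any worker, then wait
    only for queued transactions older than it. Newer ones are not processed.
    """
    if not committed:
        # Nothing committed yet, so there is nothing older to wait for.
        return {worker: [] for worker in queues}
    newest_committed = max(committed)
    return {
        worker: [trx for trx in pending if trx < newest_committed]
        for worker, pending in queues.items()
    }


# Hypothetical state when STOP SLAVE is issued.
committed = {1, 2, 4}                            # already committed
queues = {"w1": [3, 5, 7, 9], "w2": [6, 8, 10]}  # still pending per worker

old = old_stop_wait_set(queues)
new = new_stop_wait_set(queues, committed)

# After the new rule finishes, the executed history has no gaps up to the
# newest committed transaction.
executed = committed | {trx for pending in new.values() for trx in pending}
gap_free = executed == set(range(1, max(committed) + 1))

print(sum(len(p) for p in old.values()), "transactions to wait for (old rule)")
print(sum(len(p) for p in new.values()), "transactions to wait for (new rule)")
```

In this example the old rule waits for all seven pending transactions, while the new rule waits only for transaction 3, after which transactions 1 through 4 have all been applied.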