MySQL Bugs: #93649: STOP SLAVE SQL_THREAD deadlocks if done while holding LOCK INSTANCE FOR BACKUP

Bug #93649	STOP SLAVE SQL_THREAD deadlocks if done while holding LOCK INSTANCE FOR BACKUP
Submitted:	18 Dec 2018 3:23	Modified:	12 Apr 2019 10:11
Reporter:	Sergei Glushchenko	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	8.0.13	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
STOP SLAVE SQL_THREAD being executed while holding LOCK INSTANCE FOR BACKUP may hang forever. It turns out that sql thread is waiting on backup lock and STOP SLAVE will wait for sql thread to complete current statement which results in deadlock.

How to repeat:
I put together simple MTR test case (you may need to run it couple of times to get the deadlock):

--source include/master-slave.inc

--connection master

let $n=10;

while ($n)
{
  eval CREATE TABLE t$n (a INT);
  dec $n;
}

--connection slave

LOCK INSTANCE FOR BACKUP;
STOP SLAVE SQL_THREAD;
START SLAVE SQL_THREAD;
UNLOCK INSTANCE;

--connection master

let $n=10;

while ($n)
{
  eval DROP TABLE t$n;
  dec $n;
}

--source include/sync_slave_sql_with_master.inc
--source include/rpl_end.inc

If you check slave status, it is 'Slave_SQL_Running_State: Waiting for backup lock'. And

mysql> show processlist;
+----+-----------------+-----------------+------+---------+------+----------------------------------+-------------------------+-----------+---------------+
| Id | User            | Host            | db   | Command | Time | State                            | Info                    | Rows_sent | Rows_examined |
+----+-----------------+-----------------+------+---------+------+----------------------------------+-------------------------+-----------+---------------+
|  4 | event_scheduler | localhost       | NULL | Daemon  |  318 | Waiting on empty queue           | NULL                    |         0 |             0 |
|  9 | root            | localhost:61541 | test | Sleep   |  317 |                                  | NULL                    |         1 |             0 |
| 10 | root            | localhost:61542 | test | Sleep   |  317 |                                  | NULL                    |         1 |             0 |
| 11 | system user     |                 | NULL | Connect |  317 | Waiting for master to send event | NULL                    |         0 |             0 |
| 12 | system user     |                 | test | Query   |  317 | Waiting for backup lock          | CREATE TABLE t3 (a INT) |         0 |             0 |
| 13 | root            | localhost:61548 | test | Query   |  316 | Killing slave                    | STOP SLAVE SQL_THREAD   |         0 |             0 |
| 14 | root            | localhost:61549 | test | Sleep   |  317 |                                  | NULL                    |         1 |             0 |
| 15 | root            | localhost       | NULL | Query   |    0 | starting                         | show processlist        |         0 |             0 |
+----+-----------------+-----------------+------+---------+------+----------------------------------+-------------------------+-----------+---------------+

Suggested fix:
If you replace LOCK INSTANCE FOR BACKUP with FLUSH TABLES WITH READ LOCK, you'll get following error: "Can't execute the given command because you have active locked tables or an active transaction". Consider making STOP SLAVE to fail with the same error in case of LOCK INSTANCE FOR BACKUP too.

Hi,

Thanks for report and test case. Took bit more then "few" times but I verified the bug. 

Kind regards
Bogdan

Posted by developer:
 
Changelog entry added for MySQL 8.0.17:
If a LOCK INSTANCE FOR BACKUP statement was used to acquire an instance-level backup lock, then a  STOP SLAVE statement was issued, a deadlock could be created with the SQL thread waiting on the backup lock and the STOP SLAVE statement waiting on the SQL thread to complete its current action. To prevent this situation, the STOP SLAVE process now tries to acquire the backup lock before proceeding, and returns an error if the lock cannot be acquired.

Posted by developer:
 
Changelog entry reinstated.

Hello, after this bug was fixed, we encountered another problem
Our MySQL cluster is such a master database, two slave databases, and semi synchronization is enabled. One of the slave databases is executing clone, which is the donor.
At this time, the master database failure triggers the automatic switch logic. However, when the slave database of clone executes stop slave, it fails, and the high availability switch program exits abnormally. That is to say, the high availability switchover cannot be successfully performed during the clone, while the clone may take a long time, which is unacceptable. It is recommended not to fix this bug or fix it gracefully.