Bug #89229 FLUSH PRIVILEGES may cause MTS deadlock
Submitted: 15 Jan 2018 3:18
Reporter: Libing Song Email Updates:
Status: Verified Impact on me:
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version:5.7 OS:Any
Assigned to: CPU Architecture:Any

[15 Jan 2018 3:18] Libing Song
FLUSH PRIVILEGES commit its transaction and release the table locks before binlog. That means a following GRANT statement could be binlogged in the same commit window with the FLUSH PRIVILEGES statement. Thus the FLUSH PRIVILEGES and the GRANT can be applied parallel on slave. That may cause a deadlock if slave-preserve-commit-order is ON and GRANT gets the locks before FLUSH PRIVILEGES. 

The deadlock looks like:
- GRANT (applied before FLUSH PRIVILEGES and the locks will be hold until commit)
  waiting for FLUSH PRIVILEGES to commit
- FLUSH PRIVILEGES(blocked when acquiring the locks.)
  waiting for GRANT to release the locks. 

How to repeat:
--source include/have_binlog_format_row.inc
--connect(con1, localhost, root)
--connect(con2, localhost, root)
--connection default

CREATE USER slb@localhost;

SET debug_sync= "bgc_after_sync_stage_before_commit_stage SIGNAL after_sync WAIT_FOR go_commit";

--connection con1

sleep 1;

show processlist;

SET debug_sync='now WAIT_FOR after_sync';
show processlist;
SET debug_sync= "bgc_after_flush_stage_before_sync_stage SIGNAL after_flush WAIT_FOR go_commit1";
--send GRANT ALL PRIVILEGES ON *.* TO 'slb'@localhost

--connection con2
SET debug_sync= 'now WAIT_FOR after_flush';
show processlist;
SET debug_sync='now SIGNAL go_commit';
show processlist;
SET debug_sync='now SIGNAL go_commit1';
show processlist;

--connection default
--connection con1

using mysqlbinlog to check the events, FLUSH PRIVILEGES and GRANT have same last_committed number
[15 Mar 2021 9:53] WANG GUANGYOU
hit the issue.
[20 Aug 2022 10:04] Tsubasa Tanaka
I faced this issue in 8.0.19 and I can reproduce this easily by using binlog_group_commit_sync_delay = 1000000.

And I can't reproduce yet (at least) 8.0.28 and later.
[1 Sep 2022 4:40] Tsubasa Tanaka
Is it fixed by WL#13574 ?

[16 Sep 2022 14:02] Sven Sandberg
Posted by developer:
There are two parts of this bug:

 1. The source server does not determine that FLUSH PRIVILEGES conflicts with GRANT, because FLUSH PRIVILEGES releases a conflicting lock early (violates two-phase locking). Therefore, it marks them as non-conflicting in the binary log. Therefore, the replica is able to execute these statements in parallel, which leads to a deadlock when replica-preserve-commit-order is used.

 2. Up until 8.0.23/WL#13574, replicas were unable to detect and resolve deadlocks where one of the parties of the deadlock was a worker thread waiting for preceding workers to commit (according to replica-preserve-commit-order). This was fixed in 8.0.23/WL#13574, so deadlocks are now detected. So the more recent transaction (GRANT in this case) is forced to rollback, which unblocks the older transaction (FLUSH in this case) so that it can proceed, and then the GRANT statement is retried according to replica_transaction_retries.

The test case in the 'how to repeat' section shows only problem 1 (and it is implicit that it results in problem 2). So I'd say the described bug is not fixed, although the defect has been mitigated. It would still be better from the replication perspective that FLUSH PRIVILEGES (and all other statements) followed two-phase locking. (so that retries are not necessary)

So, let's keep this bug open even if the symptoms are less severe now.