MySQL Bugs: #95612: replication breakage due to Can not lock user management caches for processing

Bug #95612	replication breakage due to Can not lock user management caches for processing
Submitted:	3 Jun 2019 12:22	Modified:	21 Aug 2019 17:07
Reporter:	Simon Mudd (OCA)	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	8.0.16	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	GTID, parallel, replication, writeset

Description:
I have seen an issue where I'm using 8.0.16 and replicating grants from a master to a slave. In this case using parallel replication with WRITESET

SQL Error seen:

Slave SQL Running: No, SQL Error 3547: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 2 failed executing transaction '07ea2fc1-859c-11e9-80e0-525400460e96:31' at master log binlog.000001, end_log_pos 7839. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any. (10 workers)

And running select * from performance_schema.replication_applier_status_by_worker;

gives a clearer error:

Worker 2 failed executing transaction '07ea2fc1-859c-11e9-80e0-525400460e96:31' at master log binlog.000001, end_log_pos 7839; Error 'Can not lock user management caches for processing.' on query. Default database: ''. Query: 'DROP USER IF EXISTS 'xxx'@'yyy''

Server cconfiguration:

log-bin = ../log/binlog
binlog_format = ROW
binlog_row_image = minimal
transaction_write_set_extraction = XXHASH64
binlog_transaction_dependency_tracking = WRITESET
slave_parallel_workers = 10
slave_parallel_type = LOGICAL_CLOCK

Note: I have seen other related breakages which I may report as a separate bug.

How to repeat:
Run lots of grant statements doing something like:

# to trigger error above
DROP USER IF EXISTS .....

# also run
CREATE USER IF NOT EXISTS .....
ALTER USER .... (to set credentials)
GRANT <whatever> on <db>.* TO .... -- usually here 3-10 statements.

Either way DROP USER IF EXISTS should _always_ work, irrespective of the users in the system.

Suggested fix:
For this specific error I think that a retry would make sense. I'm not sure that's happening now.
If it is then it's not very apparent.

Hello Simon,

Thank you for the report.

regards,
Umesh

Test results - 8.0.16

Attachment: 95612_8.0.16.results (application/octet-stream, text), 12.20 KiB.

Workaround is easy: start slave usually works. Clearly doing this manually or via scripting is tedious.

Bug #95778 marked as duplicate of this one

Posted by developer:
 
Fixed in 8.0.18.

Retrying a failed access-control statement could permit another
thread to acquire a lock on the access-control cache during a window
when metadata locks where released and reacquired, resulting in a
deadlock. The locks are now not released during the retry operation.