Bug #95612 replication breakage due to Can not lock user management caches for processing
Submitted: 3 Jun 12:22 Modified: 21 Aug 17:07
Reporter: Simon Mudd (OCA) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version:8.0.16 OS:Any
Assigned to: CPU Architecture:Any
Tags: GTID, parallel, replication, writeset

[3 Jun 12:22] Simon Mudd
Description:
I have seen an issue where I'm using 8.0.16 and replicating grants from a master to a slave. In this case using parallel replication with WRITESET

SQL Error seen:

Slave SQL Running: No, SQL Error 3547: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 2 failed executing transaction '07ea2fc1-859c-11e9-80e0-525400460e96:31' at master log binlog.000001, end_log_pos 7839. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any. (10 workers)

And running select * from performance_schema.replication_applier_status_by_worker;

gives a clearer error:

Worker 2 failed executing transaction '07ea2fc1-859c-11e9-80e0-525400460e96:31' at master log binlog.000001, end_log_pos 7839; Error 'Can not lock user management caches for processing.' on query. Default database: ''. Query: 'DROP USER IF EXISTS 'xxx'@'yyy''

Server cconfiguration:

log-bin = ../log/binlog 
binlog_format = ROW 
binlog_row_image = minimal 
transaction_write_set_extraction = XXHASH64 
binlog_transaction_dependency_tracking = WRITESET 
slave_parallel_workers = 10 
slave_parallel_type = LOGICAL_CLOCK 

Note: I have seen other related breakages which I may report as a separate bug.

How to repeat:
Run lots of grant statements doing something like:

# to trigger error above 
DROP USER IF EXISTS ..... 

# also run  
CREATE USER IF NOT EXISTS ..... 
ALTER USER .... (to set credentials) 
GRANT <whatever> on <db>.* TO .... -- usually here 3-10 statements. 

Either way DROP USER IF EXISTS should _always_ work, irrespective of the users in the system.

Suggested fix:
For this specific error I think that a retry would make sense. I'm not sure that's happening now.
If it is then it's not very apparent.
[5 Jun 8:01] Umesh Shastry
Hello Simon,

Thank you for the report.

regards,
Umesh
[5 Jun 8:01] Umesh Shastry
Test results - 8.0.16

Attachment: 95612_8.0.16.results (application/octet-stream, text), 12.20 KiB.

[8 Jun 6:13] Simon Mudd
Workaround is easy: start slave usually works. Clearly doing this manually or via scripting is tedious.
[17 Jun 5:22] Umesh Shastry
Bug #95778 marked as duplicate of this one
[21 Aug 17:07] Paul Dubois
Posted by developer:
 
Fixed in 8.0.18.

Retrying a failed access-control statement could permit another
thread to acquire a lock on the access-control cache during a window
when metadata locks where released and reacquired, resulting in a
deadlock. The locks are now not released during the retry operation.