Bug #104619 long MVCC snapshots will cause commit stalls via GTID persistor periodic compres
Submitted: 13 Aug 2021 14:45 Modified: 16 Aug 2021 11:25
Reporter: Domas Mituzas Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S3 (Non-critical)
Version:8.0 OS:Any
Assigned to: CPU Architecture:Any

[13 Aug 2021 14:45] Domas Mituzas
Description:
From time to time one will see operations slow down or lock up, and following will happen:

---TRANSACTION 160445247003, ACTIVE 0 sec fetching rows
mysql tables in use 1, locked 1
2052 lock struct(s), heap size 270544, 436840 row lock(s)
MySQL thread id 0, OS thread handle 139837308401408, query id 0 Compressing gtid_executed table

Thats because there's a background activity which tries to "compress" the gtid_executed table, meanwhile acquiring locks (possibly millions of them) and stalling both InnoDB locking kernel and commit path.

This is also a problem because this behavior is not tunable (that is, only half of the logic is tunable, and it seems to be needed for Clone functionality, which may be needed for Clone plugin, but it isn't loaded.

As gtid_executed gets row versions for every transaction, any read from that table is expensive and cannot be done in prod environment easily - so it needs to be based off a flattened snapshot (or Bug#74919 from 2014 has to be addressed) - or be excluded from purge holdout. 

How to repeat:
establish a long snapshot on a busy OLTP system
watch for stalls as history list length reaches >10M \o/

Suggested fix:
allow reducing write rate to gtid_executed table (e.g. once per log rotation) or figure out a way to do this without relying on locking the hidden rows.
[16 Aug 2021 11:25] MySQL Verification Team
Hi Domas,

Thanks for the report and the feature request. 

all best
Bogdan
[18 Aug 2021 11:25] MySQL Verification Team
Hi Domas,

Are you running this on 8.0.23 or newer? There is improvement on the compression part fix there (31599938) so we'd like to be sure you are running 8.0.23 or later.

With regards to "That's because there's a background activity which tries to "compress" the gtid_executed table, meanwhile acquiring locks (possibly millions of them)and stalling both InnoDB locking kernel and commit path." sustaining team would like to know if you can give us bit more data on this observation, how did you come to this conclusion, any way you can help us come to the same conclusion.

Thanks
Bogdan