Bug #100118 Server doesn't restart because of too many gaps in the mysql.gtid_executed table
Submitted: 6 Jul 2020 5:44  Modified: 21 Dec 2020 17:49
Reporter: Venkatesh Prasad Venugopal
Status: Closed
Category: MySQL Server: InnoDB storage engine  Severity: S2 (Serious)
Version: 8.0, 8.0.20, 8.0.21  OS: Any
Assigned to:  CPU Architecture: Any
Tags: Clone, Contribution, regression

[6 Jul 2020 5:44] Venkatesh Prasad Venugopal
Description:
Since the introduction of a dedicated thread for persisting GTIDs of InnoDB transactions (https://dev.mysql.com/worklog/task/?id=9211), the behavior for updating the mysql.gtid_executed table is as follows:

- If the binary log is enabled, the mysql.gtid_executed table is updated on the next binlog rotation (either by FLUSH LOGS or a server restart).
- If the binary log is disabled, or log_slave_updates is disabled (for slave threads), then:
    - If it is an InnoDB transaction, the update is left to the GTID persister thread.
    - Otherwise, the GTID is written directly into the mysql.gtid_executed table.
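As a quick check (a sketch, not part of the original report), the following query shows which of the above paths a given server takes; all three are standard MySQL 8.0 system variables:

-- log_bin = 0 (or log_slave_updates = 0 on a replica) means InnoDB
-- GTIDs are funneled through the GTID persister thread.
SELECT @@global.log_bin,
       @@global.log_slave_updates,
       @@global.gtid_executed_compression_period;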

This logic works like a charm if updates are only on InnoDB. However, it causes a disaster on a binlogless slave if the overall workload has both transactional and non-transactional updates (including empty transactions), for the reasons described below.

When the workload has both transactional and non-transactional updates, then, as per the above-mentioned behavior, transactional (InnoDB) updates keep adding GTIDs to the flush list maintained by the GTID persister thread until some threshold is reached, while non-transactional updates insert their GTIDs directly into the mysql.gtid_executed table. So gaps are expected in the mysql.gtid_executed table. The table will look more or less like this:

mysql> select * from (select * from mysql.gtid_executed order by interval_end desc LIMIT 5) as T order by interval_start; -- selecting the last 5 rows
+--------------------------------------+----------------+--------------+
| source_uuid                          | interval_start | interval_end |
+--------------------------------------+----------------+--------------+
| e6ac64af-b6e0-11ea-9f7a-74d83e29c093 |         210335 |       210335 |
| e6ac64af-b6e0-11ea-9f7a-74d83e29c093 |         210343 |       210343 | -> No 336-342
| e6ac64af-b6e0-11ea-9f7a-74d83e29c093 |         210344 |       210344 |
| e6ac64af-b6e0-11ea-9f7a-74d83e29c093 |         210347 |       210347 | -> No 345-346
| e6ac64af-b6e0-11ea-9f7a-74d83e29c093 |         210349 |       210349 | -> No 348
+--------------------------------------+----------------+--------------+
5 rows in set (6.85 sec)

From the above behavior, it is evident that all missing GTIDs belong to InnoDB transactions and are sitting in the GTID persister's list, which has not yet been flushed to the main table. This is not an issue as long as the persister thread periodically updates the table and fills the missing gaps. But these gaps become critical when the GTID persister fails to update the table in a timely manner, causing the table size to grow to millions of rows.
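For illustration (not from the original report), a query along these lines can enumerate the gaps directly; it assumes MySQL 8.0 window functions:

-- List gaps in mysql.gtid_executed: rows whose interval_start is more
-- than one past the previous row's interval_end for the same UUID.
SELECT source_uuid,
       prev_end + 1       AS gap_start,
       interval_start - 1 AS gap_end
FROM (
  SELECT source_uuid, interval_start, interval_end,
         LAG(interval_end) OVER (
           PARTITION BY source_uuid ORDER BY interval_start
         ) AS prev_end
  FROM mysql.gtid_executed
) AS t
WHERE prev_end IS NOT NULL
  AND interval_start > prev_end + 1;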

As per the current design, when the GTID persister thread reaches the threshold (either once per 1k transactions or every 1 second), it flushes its list to the mysql.gtid_executed table and tries to compress the table (see the Clone_persist_gtid::flush_gtids() function). It succeeds in compressing only the first few rows (because it has just filled the gaps at the beginning of the table by merging transactional updates) and fails to compress the remaining rows (because of the gaps introduced by non-transactional updates).

On every attempt to merge the table, it merges only a few consecutive rows and leaves the other rows as is, thereby taking more time to scan the full table. By the time it finishes the table scan, the slave applier thread will have inserted a few more entries at the end of the table. As a result, once the GTID persister thread starts scanning the table, it is likely that the scan never stops.

Likewise, the time taken by the persister thread to compress the table is even greater when the slave is MTS-enabled. When multiple slave worker threads are updating the mysql.gtid_executed table, there is still only one thread compressing it. As a result, the GTID persister thread falls behind and the gaps keep increasing. On a production server, the gap count can reach millions within a few minutes, with the persister thread still trying to compress the table despite having millions of entries in its flush list.
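As a side note (a sketch, not from the report), the background threads involved can be watched via performance_schema; the documented name of the compression thread is thread/sql/compress_gtid_table, though the LIKE pattern below is a loose assumption:

-- Observe GTID-related background threads while the workload runs.
SELECT thread_id, name, processlist_state
FROM performance_schema.threads
WHERE name LIKE '%gtid%';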

When I checked one of the production servers while the persister thread was compressing, I could see that, because of the above issue, there were more than 100 million GTIDs unflushed to the table:

(gdb) p m_gtids[1]
$3 = std::vector of length 103692763, capacity 134217728 = {{
    _M_elems = "58596ffd-5bd8-11e9-baea-509a4c6350c5:5329268711", '\000' <repeats 16 times>
  }, {
    _M_elems = "58596ffd-5bd8-11e9-baea-509a4c6350c5:5329268716", '\000' <repeats 16 times>
  }, {
..

and the table had 301 million rows and was still growing:

mysql> select format(count(0),0) from mysql.gtid_executed;
+--------------------+
| format(count(0),0) |
+--------------------+
| 301,509,128        |
+--------------------+
1 row in set (47.56 sec)

This is a critical bug: it can cause resource issues by consuming ever more memory to store millions of unflushed GTIDs. The memory growth is unbounded, and the server can eventually be killed by the OOM killer. With such a huge list, in case of a server crash, all in-memory contents of the persister thread's flush list are lost. On the next server restart, InnoDB rebuilds the flush list by reading GTIDs from the undo logs. But because of the huge number of unflushed GTIDs, even rebuilding the list takes a long time, exceeds the default timeout of 5 minutes, and prints the below error to the error log.

2020-03-09T10:38:40.806020Z 0 [Note] [MY-011975] [InnoDB] Waiting for Clone GTID thread 
2020-03-09T10:38:40.808668Z 0 [ERROR] [MY-011975] [InnoDB] Wait for GTID thread to start timed out 

After this, if InnoDB recovery is successful, the server startup thread proceeds and tries to read the executed GTIDs from the table. At the same time, the GTID persister thread, which was started as part of the server restart, also starts reading the GTIDs from the table by calling Gtid_table_persistor::fetch_gtids().

Since Gtid_table_persistor::fetch_gtids() takes the global_sid_lock, only one thread proceeds at a time, and this causes the server restart to take hours. It is more or less a denial of service, as clients cannot connect to the server even many hours after the restart.

If the recovery fails, the server's main thread takes the shutdown path, eventually hits the below assertion, and leaves the server completely unusable even after many restarts.

2020-06-18T06:51:54.881474Z 0 [ERROR] [MY-013183] [InnoDB] Assertion failure: srv0start.cc:3474:trx_sys_any_active_transactions() == 0 thread 139789738445184
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
....
InnoDB: about forcing recovery.
07:05:06 UTC - mysqld got signal 6 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x800000
/usr/sbin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x2e) [0x1e823be]
/usr/sbin/mysqld(handle_fatal_signal+0x2f3) [0x10f69a3]
/lib64/libpthread.so.0(+0xf5d0) [0x7f23554e45d0]
/lib64/libc.so.6(gsignal+0x37) [0x7f2353795207]
/lib64/libc.so.6(abort+0x148) [0x7f23537968f8]
/usr/sbin/mysqld(ut_dbg_assertion_failed(char const*, char const*, unsigned long)+0x2cf) [0x214116f]
/usr/sbin/mysqld(srv_pre_dd_shutdown()+0x5f3) [0x20e5473]
/usr/sbin/mysqld() [0xd04c29]
/usr/sbin/mysqld(plugin_foreach_with_mask(THD*, bool (**)(THD*, st_plugin_int*, void*), int, unsigned int, void*)+0x1c5) [0xfe0825]
/usr/sbin/mysqld(plugin_foreach_with_mask(THD*, bool (*)(THD*, st_plugin_int*, void*), int, unsigned int, void*)+0x1d) [0xfe0a4d]
/usr/sbin/mysqld() [0xe74ebd]
/usr/sbin/mysqld(mysqld_main(int, char**)+0x5217) [0xe82467]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f23537813d5]
/usr/sbin/mysqld() [0xcfa917]

How to repeat:
Apply the attached patch on top of 8.0.20, build it as a debug binary, and run the rpl_gtid_create_gtid_gaps.test by running:

./mtr rpl_gtid_create_gtid_gaps.test --mem --testcase-timeout=9999

While the test case is running, connect a client to the slave server and observe the increasing row count in the mysql.gtid_executed table:

$ while true; do <path>/mysql -uroot -h127.0.0.1 -P13001 -e 'select format(count(0),0) from mysql.gtid_executed;'; sleep 2; done

Suggested fix:
Don't allow the GTID persister thread to compress the mysql.gtid_executed table; instead, have it signal the GTID compressor thread, unless there is an explicit compression request.
[6 Jul 2020 5:45] Venkatesh Prasad Venugopal
Test case to reproduce the gaps

Attachment: 0001-MTR-test-for-creating-gaps-in-the-GTID-executed-tabl.patch (text/x-patch), 6.86 KiB.

[6 Jul 2020 5:47] Venkatesh Prasad Venugopal
The below stack traces show that both the compressor and the persister threads wait on the global_sid_lock when calling Gtid_table_persistor::fetch_gtids().

Server main thread calling Gtid_table_persistor::fetch_gtids().
(gdb) bt
#0  pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:85
#1  0x0000000001b27efc in native_rw_wrlock (rwp=0x7f9e7bffe478) at /root/PS-6990/include/thr_rwlock.h:102
#2  inline_mysql_rwlock_wrlock (src_file=0x2623de8 "/root/PS-6990/sql/rpl_gtid.h", src_line=464, that=0x7f9e7bffe478)
    at /root/PS-6990/include/mysql/psi/mysql_rwlock.h:381
#3  wrlock (this=0x7f9e7bffe470) at /root/PS-6990/sql/rpl_gtid.h:464
#4  Gtid_table_persistor::fetch_gtids (this=0x7f9e7bfff090, gtid_set=0x7f9e9444e948) at /root/PS-6990/sql/rpl_gtid_persist.cc:689
#5  0x0000000001b3048b in Gtid_state::read_gtid_executed_from_table (this=<optimized out>) at /root/PS-6990/sql/rpl_gtid_state.cc:747
#6  0x0000000000e816d5 in mysqld_main (argc=<optimized out>, argv=<optimized out>) at /root/PS-6990/sql/mysqld.cc:6933
#7  0x00007f9e965473d5 in __libc_start_main (main=0xcce430 <main(int, char**)>, argc=1, argv=0x7fffd0ffaca8, init=<optimized out>, 
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffd0ffac98) at ../csu/libc-start.c:266
#8  0x0000000000cfa977 in _start ()

GTID persister calling Gtid_table_persistor::fetch_gtids().
(gdb) bt
#0  next (this=<synthetic pointer>) at /root/PS-6990/sql/rpl_gtid.h:1804
#1  Gtid_set::add_gno_interval (this=this@entry=0x7f9e9444e948, ivitp=ivitp@entry=0x7fffd0ffa468, start=start@entry=5329743022, end=5329743028, lock=lock@entry=0x7fffd0ffa470) at /root/PS-6990/sql/rpl_gtid_set.cc:333
#2  0x0000000001b2bcae in Gtid_set::add_gtid_text (this=this@entry=0x7f9e9444e948, text=0x7f9e7cbf1e58 "58596ffd-5bd8-11e9-baea-509a4c6350c5:5329743022-5329743027", anonymous=anonymous@entry=0x0, starts_with_plus=starts_with_plus@entry=0x0) at /root/PS-6990/sql/rpl_gtid_set.cc:546
#3  0x0000000001b27f37 in Gtid_table_persistor::fetch_gtids (this=0x7f9e7bfff090, gtid_set=0x7f9e9444e948) at /root/PS-6990/sql/rpl_gtid_persist.cc:690
#4  0x0000000001b3048b in Gtid_state::read_gtid_executed_from_table (this=<optimized out>) at /root/PS-6990/sql/rpl_gtid_state.cc:747
#5  0x0000000000e816d5 in mysqld_main (argc=<optimized out>, argv=<optimized out>) at /root/PS-6990/sql/mysqld.cc:6933
#6  0x00007f9e965473d5 in __libc_start_main (main=0xcce430 <main(int, char**)>, argc=1, argv=0x7fffd0ffaca8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffd0ffac98) at ../csu/libc-start.c:266
#7  0x0000000000cfa977 in _start ()
[8 Jul 2020 12:00] MySQL Verification Team
Hello Venu,

Thank you for the report and feedback.
I've applied your patch on top of 8.0.20 and ran the provided mtr test case, but I am still not seeing any issues even after multiple attempts (rather, the mtr test case times out no matter how big a value is set). Could you please provide the exact cmake options used in your environment, along with the gcc version or any other info which is missing? Thank you.

-- patch on top of 8.0.20 source

export LD_LIBRARY_PATH=/export/umesh/utils/gcc-9.2/lib64
export CC=/export/umesh/utils/gcc-9.2/bin/gcc
export CPP=/export/umesh/utils/gcc-9.2/bin/cpp
export CXX=/export/umesh/utils/gcc-9.2/bin/c++

rm -rf bld/
mkdir bld && cd bld
rm -rf CMakeCache.txt
/export/umesh/utils/cmake-3.14.4/bin/cmake ../mysql-8.0.20 \
-DBUILD_CONFIG=mysql_release                  \
-DCMAKE_INSTALL_PREFIX=$PWD                   \
-DWITH_BOOST=../mysql-8.0.20/boost                         \
-DCMAKE_BUILD_TYPE=debug -DWITH_UNIT_TESTS=OFF -DWITH_ROUTER=OFF

make -j 32
make install
cd mysql-test

./mtr rpl_gtid_create_gtid_gaps.test --mem --testcase-timeout=9999
Logging: ./mtr  rpl_gtid_create_gtid_gaps.test --mem --testcase-timeout=9999
MySQL Version 8.0.20
Checking supported features
 - Binaries are debug compiled
Using 'all' suites
Collecting tests
 - Adding combinations for rpl_gtid
Checking leftover processes
 - found old pid 15430 in 'mysqld.2.pid', killing it...
   ok!
 - found old pid 15428 in 'mysqld.1.pid', killing it...
   ok!
Removing old var directory
Creating var directory '/export/umesh/server/source/bugs/src_build/fb_builds/100118/bld/mysql-test/var'
 - symlinking 'var' to '/dev/shm/var_auto_Lhmk'
Installing system database
Using parallel: 1

==============================================================================
                  TEST NAME                       RESULT  TIME (ms) COMMENT
------------------------------------------------------------------------------
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
The servers were restarted 0 times
The servers were reinitialized 0 times
Spent 0.000 of 18019 seconds executing testcases

Timeout: All 0 tests were successful.

Test suite timeout! Terminating...
mysql-test-run: *** ERROR: Test suite aborted
[umshastr@hod03]/export/umesh/server/source/bugs/src_build/fb_builds/100118/bld/mysql-test: ./mtr rpl_gtid_create_gtid_gaps.test --mem --testcase-timeout=99999
Logging: ./mtr  rpl_gtid_create_gtid_gaps.test --mem --testcase-timeout=99999
MySQL Version 8.0.20
Checking supported features
 - Binaries are debug compiled
Using 'all' suites
Collecting tests
 - Adding combinations for rpl_gtid
Checking leftover processes
 - found old pid 32407 in 'mysqld.2.pid', killing it...
   process did not exist!
 - found old pid 32405 in 'mysqld.1.pid', killing it...
   process did not exist!
Removing old var directory
Creating var directory '/export/umesh/server/source/bugs/src_build/fb_builds/100118/bld/mysql-test/var'
 - symlinking 'var' to '/dev/shm/var_auto_ed86'
Installing system database
Using parallel: 1

==============================================================================
                  TEST NAME                       RESULT  TIME (ms) COMMENT
------------------------------------------------------------------------------
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
The servers were restarted 0 times
The servers were reinitialized 0 times
Spent 0.000 of 18017 seconds executing testcases

Timeout: All 0 tests were successful.

Test suite timeout! Terminating...
mysql-test-run: *** ERROR: Test suite aborted
[umshastr@hod03]/export/umesh/server/source/bugs/src_build/fb_builds/100118/bld/mysql-test: ./mtr rpl_gtid_create_gtid_gaps.test --mem --testcase-timeout=500000
Logging: ./mtr  rpl_gtid_create_gtid_gaps.test --mem --testcase-timeout=500000
MySQL Version 8.0.20
Checking supported features
 - Binaries are debug compiled
Using 'all' suites
Collecting tests
 - Adding combinations for rpl_gtid
Checking leftover processes
 - found old pid 28259 in 'mysqld.2.pid', killing it...
   process did not exist!
 - found old pid 28257 in 'mysqld.1.pid', killing it...
   process did not exist!
Removing old var directory
Creating var directory '/export/umesh/server/source/bugs/src_build/fb_builds/100118/bld/mysql-test/var'
 - symlinking 'var' to '/dev/shm/var_auto_vwb2'
Installing system database
Using parallel: 1

==============================================================================
                  TEST NAME                       RESULT  TIME (ms) COMMENT
------------------------------------------------------------------------------
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps
The servers were restarted 0 times
The servers were reinitialized 0 times
Spent 0.000 of 18017 seconds executing testcases

Timeout: All 0 tests were successful.

Test suite timeout! Terminating...
mysql-test-run: *** ERROR: Test suite aborted

regards,
Umesh
[8 Jul 2020 14:24] Venkatesh Prasad Venugopal
Hi Umesh,

I tested with GCC 9.3 with the below cmake options:

$ cmake -DBUILD_CONFIG=mysql_release -DWITH_BOOST=$HOME/utilities/boost/ -DCMAKE_BUILD_TYPE=debug -DWITH_UNIT_TESTS=OFF -DWITH_ROUTER=OFF ../

and I was able to reproduce the bug with the same commands you used.

Did you try to query the mysql.gtid_executed table while the test case is running?

To see the bug, we should connect a client to the slave server from a different terminal session and observe the increasing count in the mysql.gtid_executed table. 

It can be done by executing the below command:

$ while true; do mysql -uroot -h127.0.0.1 -P13001 -e 'select format(count(0),0) from mysql.gtid_executed;'; sleep 2; done

Regards,
Venu
[9 Jul 2020 10:12] MySQL Verification Team
Thank you, Venu.
Let me quickly build again and come back to you if I have further issues.

regards,
Umesh
[9 Jul 2020 12:21] MySQL Verification Team
Thank you Venu.

regards,
Umesh
[20 Aug 2020 7:23] Venkatesh Prasad Venugopal
Contributed patch on top of 8.0.21

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: 0001-Bug-100118-Server-doesn-t-restart-because-of-too-man.patch (text/x-patch), 42.26 KiB.

[20 Aug 2020 7:32] MySQL Verification Team
Thank you Venu for the contribution.

regards,
Umesh
[20 Aug 2020 17:57] OCA Admin
Contribution submitted via Github - Bug#100118 Server doesn't restart because of too many gaps in the mysql.gtid_exe 
(*) Contribution by Venkatesh Prasad Venugopal (Github venkatesh-prasad-v, mysql-server/pull/301#issuecomment-675871904): I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: git_patch_444561040.txt (text/plain), 42.37 KiB.

[8 Dec 2020 10:34] Yakir Gibraltar
I'm able to reproduce the issue with log_slave_updates and log_bin disabled as well:
```
abc625:(none)> select @@log_slave_updates, @@log_bin;
+---------------------+-----------+
| @@log_slave_updates | @@log_bin |
+---------------------+-----------+
|                   0 |         0 |
+---------------------+-----------+
1 row in set (0.00 sec)
abc625:(none)>  select format(count(0),0) from mysql.gtid_executed;
+--------------------+
| format(count(0),0) |
+--------------------+
| 31,335,250         |
+--------------------+
1 row in set (5.55 sec)
```
And the MySQL restart failed:
[ERROR] [MY-011975] [InnoDB] Wait for GTID thread to start timed out

Thank you, Yakir.
[21 Dec 2020 17:49] Daniel Price
Posted by developer:
 
Fixed as of the upcoming 8.0.23 release, and here's the proposed changelog entry from the documentation team:

In a replication scenario involving a replica with binary logging or
log_slave_updates disabled, the server failed to start due to an excessive
number of gaps in the mysql.gtid_executed table. The gaps occurred for
workloads that included both InnoDB and non-InnoDB transactions. GTIDs for
InnoDB transactions are flushed to the mysql.gtid_executed table by the
GTID persister thread, which runs periodically, while GTIDs for non-InnoDB
transactions are written to the mysql.gtid_executed table directly
by replica server threads. The GTID persister thread fell behind as it
cycled through merging entries and compressing the mysql.gtid_executed
table. As a result, the size of the GTID flush list for InnoDB
transactions grew over time along with the number of gaps in the
mysql.gtid_executed table, eventually causing a server failure and
subsequent startup failures. To address this issue, the GTID persister
thread now writes GTIDs for both InnoDB and non-InnoDB transactions, and
foreground commits are forced to wait if the GTID persister thread falls
behind. Also, the gtid_executed_compression_period default setting was
changed from 1000 to 0 to disable explicit compression of the
mysql.gtid_executed table by default.
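As a practical aside (a sketch, not part of the changelog entry), gtid_executed_compression_period is the standard system variable mentioned above and can be inspected or changed at runtime:

-- Inspect the compression period; 0 is the new default,
-- 1000 was the old one.
SELECT @@global.gtid_executed_compression_period;

-- The variable is dynamic and can be set without a restart:
SET GLOBAL gtid_executed_compression_period = 0;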

Thanks to Venkatesh Prasad for the contribution.
[20 Apr 2021 7:37] Frederic Descamps
Hello, 

Just a small update: it's now part of 8.0.24, and the contribution was part of the fix.

Thank you for your contribution!