Bug #100118 | Server doesn't restart because of too many gaps in the mysql.gtid_executed table | ||
---|---|---|---|
Submitted: | 6 Jul 2020 5:44 | Modified: | 21 Dec 2020 17:49 |
Reporter: | Venkatesh Prasad Venugopal | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: InnoDB storage engine | Severity: | S2 (Serious) |
Version: | 8.0, 8.0.20, 8.0.21 | OS: | Any |
Assigned to: | CPU Architecture: | Any | |
Tags: | Clone, Contribution, regression |
[6 Jul 2020 5:44]
Venkatesh Prasad Venugopal
[6 Jul 2020 5:45]
Venkatesh Prasad Venugopal
Test case to reproduce the gaps
Attachment: 0001-MTR-test-for-creating-gaps-in-the-GTID-executed-tabl.patch (text/x-patch), 6.86 KiB.
[6 Jul 2020 5:47]
Venkatesh Prasad Venugopal
Below stak trace shows that both the compressor and the persister thread waits on the global_sid_lock when calling Gtid_table_persistor::fetch_gtids(). Server main thread calling Gtid_table_persistor::fetch_gtids(). (gdb) bt #0 pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:85 #1 0x0000000001b27efc in native_rw_wrlock (rwp=0x7f9e7bffe478) at /root/PS-6990/include/thr_rwlock.h:102 #2 inline_mysql_rwlock_wrlock (src_file=0x2623de8 "/root/PS-6990/sql/rpl_gtid.h", src_line=464, that=0x7f9e7bffe478) at /root/PS-6990/include/mysql/psi/mysql_rwlock.h:381 #3 wrlock (this=0x7f9e7bffe470) at /root/PS-6990/sql/rpl_gtid.h:464 #4 Gtid_table_persistor::fetch_gtids (this=0x7f9e7bfff090, gtid_set=0x7f9e9444e948) at /root/PS-6990/sql/rpl_gtid_persist.cc:689 #5 0x0000000001b3048b in Gtid_state::read_gtid_executed_from_table (this=<optimized out>) at /root/PS-6990/sql/rpl_gtid_state.cc:747 #6 0x0000000000e816d5 in mysqld_main (argc=<optimized out>, argv=<optimized out>) at /root/PS-6990/sql/mysqld.cc:6933 #7 0x00007f9e965473d5 in __libc_start_main (main=0xcce430 <main(int, char**)>, argc=1, argv=0x7fffd0ffaca8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffd0ffac98) at ../csu/libc-start.c:266 #8 0x0000000000cfa977 in _start () GTID persister calling Gtid_table_persistor::fetch_gtids(). (gdb) bt #0 next (this=<synthetic pointer>) at /root/PS-6990/sql/rpl_gtid.h:1804 #1 Gtid_set::add_gno_interval (this=this@entry=0x7f9e9444e948, ivitp=ivitp@entry=0x7fffd0ffa468, start=start@entry=5329743022, end=5329743028, lock=lock@entry=0x7fffd0ffa470) at /root/PS-6990/sql/rpl_gtid_set.cc:333 #2 0x0000000001b2bcae in Gtid_set::add_gtid_text (this=this@entry=0x7f9e9444e948, text=0x7f9e7cbf1e58 "58596ffd-5bd8-11e9-baea-509a4c6350c5:5329743022-5329743027", anonymous=anonymous@entry=0x0, starts_with_plus=starts_with_plus@entry=0x0) at /root/PS-6990/sql/rpl_gtid_set.cc:546 #3 0x0000000001b27f37 in Gtid_table_persistor::fetch_gtids (this=0x7f9e7bfff090, gtid_set=0x7f9e9444e948) at /root/PS-6990/sql/rpl_gtid_persist.cc:690 #4 0x0000000001b3048b in Gtid_state::read_gtid_executed_from_table (this=<optimized out>) at /root/PS-6990/sql/rpl_gtid_state.cc:747 #5 0x0000000000e816d5 in mysqld_main (argc=<optimized out>, argv=<optimized out>) at /root/PS-6990/sql/mysqld.cc:6933 #6 0x00007f9e965473d5 in __libc_start_main (main=0xcce430 <main(int, char**)>, argc=1, argv=0x7fffd0ffaca8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffd0ffac98) at ../csu/libc-start.c:266 #7 0x0000000000cfa977 in _start ()
[8 Jul 2020 12:00]
MySQL Verification Team
Hello Venu, Thank you for the report and feedback. I've applied your patch on top of 8.0.20 and ran provided mtr test case but still not seeing any issues even after multiple attempts(rather mtr test case timing out no matter how big value set). Could you please provide exact cmake option used in your environment along with gcc version or any info which is missing? Thank you. - -- patch on top of 8.0.20 source export LD_LIBRARY_PATH=/export/umesh/utils/gcc-9.2/lib64 export CC=/export/umesh/utils/gcc-9.2/bin/gcc export CPP=/export/umesh/utils/gcc-9.2/bin/cpp export CXX=/export/umesh/utils/gcc-9.2/bin/c++ rm -rf bld/ mkdir bld && cd bld rm -rf CMakeCache.txt /export/umesh/utils/cmake-3.14.4/bin/cmake ../mysql-8.0.20 \ -DBUILD_CONFIG=mysql_release \ -DCMAKE_INSTALL_PREFIX=$PWD \ -DWITH_BOOST=../mysql-8.0.20/boost \ -DCMAKE_BUILD_TYPE=debug -DWITH_UNIT_TESTS=OFF -DWITH_ROUTER=OFF make -j 32 make install cd mysql-test ./mtr rpl_gtid_create_gtid_gaps.test --mem --testcase-timeout=9999 Logging: ./mtr rpl_gtid_create_gtid_gaps.test --mem --testcase-timeout=9999 MySQL Version 8.0.20 Checking supported features - Binaries are debug compiled Using 'all' suites Collecting tests - Adding combinations for rpl_gtid Checking leftover processes - found old pid 15430 in 'mysqld.2.pid', killing it... ok! - found old pid 15428 in 'mysqld.1.pid', killing it... ok! Removing old var directory Creating var directory '/export/umesh/server/source/bugs/src_build/fb_builds/100118/bld/mysql-test/var' - symlinking 'var' to '/dev/shm/var_auto_Lhmk' Installing system database Using parallel: 1 ============================================================================== TEST NAME RESULT TIME (ms) COMMENT ------------------------------------------------------------------------------ worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps The servers were restarted 0 times The servers were reinitialized 0 times Spent 0.000 of 18019 seconds executing testcases Timeout: All 0 tests were successful. Test suite timeout! Terminating... mysql-test-run: *** ERROR: Test suite aborted [umshastr@hod03]/export/umesh/server/source/bugs/src_build/fb_builds/100118/bld/mysql-test: ./mtr rpl_gtid_create_gtid_gaps.test --mem --testcase-timeout=99999 Logging: ./mtr rpl_gtid_create_gtid_gaps.test --mem --testcase-timeout=99999 MySQL Version 8.0.20 Checking supported features - Binaries are debug compiled Using 'all' suites Collecting tests - Adding combinations for rpl_gtid Checking leftover processes - found old pid 32407 in 'mysqld.2.pid', killing it... process did not exist! - found old pid 32405 in 'mysqld.1.pid', killing it... process did not exist! Removing old var directory Creating var directory '/export/umesh/server/source/bugs/src_build/fb_builds/100118/bld/mysql-test/var' - symlinking 'var' to '/dev/shm/var_auto_ed86' Installing system database Using parallel: 1 ============================================================================== TEST NAME RESULT TIME (ms) COMMENT ------------------------------------------------------------------------------ worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps The servers were restarted 0 times The servers were reinitialized 0 times Spent 0.000 of 18017 seconds executing testcases Timeout: All 0 tests were successful. Test suite timeout! Terminating... mysql-test-run: *** ERROR: Test suite aborted [umshastr@hod03]/export/umesh/server/source/bugs/src_build/fb_builds/100118/bld/mysql-test: ./mtr rpl_gtid_create_gtid_gaps.test --mem --testcase-timeout=500000 Logging: ./mtr rpl_gtid_create_gtid_gaps.test --mem --testcase-timeout=500000 MySQL Version 8.0.20 Checking supported features - Binaries are debug compiled Using 'all' suites Collecting tests - Adding combinations for rpl_gtid Checking leftover processes - found old pid 28259 in 'mysqld.2.pid', killing it... process did not exist! - found old pid 28257 in 'mysqld.1.pid', killing it... process did not exist! Removing old var directory Creating var directory '/export/umesh/server/source/bugs/src_build/fb_builds/100118/bld/mysql-test/var' - symlinking 'var' to '/dev/shm/var_auto_vwb2' Installing system database Using parallel: 1 ============================================================================== TEST NAME RESULT TIME (ms) COMMENT ------------------------------------------------------------------------------ worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps worker[1] Test still running: rpl_gtid.rpl_gtid_create_gtid_gaps The servers were restarted 0 times The servers were reinitialized 0 times Spent 0.000 of 18017 seconds executing testcases Timeout: All 0 tests were successful. Test suite timeout! Terminating... mysql-test-run: *** ERROR: Test suite aborted regards, Umesh
[8 Jul 2020 14:24]
Venkatesh Prasad Venugopal
Hi Umesh, I tested with GCC 9.3 with below cmake options $ cmake -DBUILD_CONFIG=mysql_release -DWITH_BOOST=$HOME/utilities/boost/ -DCMAKE_BUILD_TYPE=debug -DWITH_UNIT_TESTS=OFF -DWITH_ROUTER=OFF ../ and I was able to reproduce the bug with the same commands you used. Did you try to query the mysql.gtid_executed table when the testcase is running? To see the bug, we should connect a client to the slave server from a different terminal session and observe the increasing count in the mysql.gtid_executed table. It can be done by just executing below command. $ while true; do mysql -uroot -h127.0.0.1 -P13001 -e 'select format(count(0),0) from mysql.gtid_executed;'; sleep 2; done Regards, Venu
[9 Jul 2020 10:12]
MySQL Verification Team
Thank you, Venu. Let me quickly build again and come back to you if I have further issues. regards, Umesh
[9 Jul 2020 12:21]
MySQL Verification Team
Thank you Venu. regards, Umesh
[20 Aug 2020 7:23]
Venkatesh Prasad Venugopal
Contributed patch on top of 8.0.21 (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.
Contribution: 0001-Bug-100118-Server-doesn-t-restart-because-of-too-man.patch (text/x-patch), 42.26 KiB.
[20 Aug 2020 7:32]
MySQL Verification Team
Thank you Venu for the contribution. regards, Umesh
[20 Aug 2020 17:57]
OCA Admin
Contribution submitted via Github - Bug#100118 Server doesn't restart because of too many gaps in the mysql.gtid_exe (*) Contribution by Venkatesh Prasad Venugopal (Github venkatesh-prasad-v, mysql-server/pull/301#issuecomment-675871904): I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.
Contribution: git_patch_444561040.txt (text/plain), 42.37 KiB.
[8 Dec 2020 10:34]
Yakir Gibraltar
I'm able to reproduce the issue also with log_slave_updates and log_bin disabled: ``` abc625:(none)> select @@log_slave_updates, @@log_bin; +---------------------+-----------+ | @@log_slave_updates | @@log_bin | +---------------------+-----------+ | 0 | 0 | +---------------------+-----------+ 1 row in set (0.00 sec) abc625:(none)> select format(count(0),0) from mysql.gtid_executed; +--------------------+ | format(count(0),0) | +--------------------+ | 31,335,250 | +--------------------+ 1 row in set (5.55 sec) ``` And restart to MySQL failed: [ERROR] [MY-011975] [InnoDB] Wait for GTID thread to start timed out Thank you, Yakir.
[21 Dec 2020 17:49]
Daniel Price
Posted by developer: Fixed as of the upcoming 8.0.23 release, and here's the proposed changelog entry from the documentation team: In a replication scenario involving a replica with binary logging or log_slave_updates disabled, the server failed to start due to an excessive number of gaps in the mysql.gtid_executed table. The gaps occurred for workloads that included both InnoDB and non-InnoDB transactions. GTIDs for InnoDB transactions are flushed to the mysql.gtid_executed table by the GTID persister thread, which runs periodically, while GTIDs for non-InnoDB transactions are written to the to the mysql.gtid_executed table directly by replica server threads. The GTID persister thread fell behind as it cycled through merging entries and compressing the mysql.gtid_executed table. As a result, the size of the GTID flush list for InnoDB transactions grew over time along with the number of gaps in the mysql.gtid_executed table, eventually causing a server failure and subsequent startup failures. To address this issue, the GTID persister thread now writes GTIDs for both InnoDB and non-InnoDB transactions, and foreground commits are forced to wait if the GTID persister thread falls behind. Also, the gtid_executed_compression_period default setting was changed from 1000 to 0 to disabled explicit compression of the mysql.gtid_executed table by default. Thanks to Venkatesh Prasad for the contribution.
[20 Apr 2021 7:37]
Frederic Descamps
Hello, Just a small update, it's now part of 8.0.24 and the contribution was part of the fix. Thank you for your contribution !