Bug #102175 | log_status has wrong binary_log_position of gtid_executed | ||
---|---|---|---|
Submitted: | 7 Jan 2021 12:11 | Modified: | 14 Mar 15:24 |
Reporter: | Rahul Malik | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S2 (Serious) |
Version: | 8.0.22 | OS: | Any |
Assigned to: | CPU Architecture: | Any | |
Tags: | Contribution |
[7 Jan 2021 12:11]
Rahul Malik
[7 Jan 2021 12:21]
Rahul Malik
it is breaking replication.
[7 Jan 2021 13:41]
Venkatesh Prasad Venugopal
When client threads update the executed_gtids gtid set, it doesn't take write lock on the global_sid_lock (https://github.com/percona/percona-server/blob/8.0/sql/rpl_gtid_state.cc#L174) but instead takes read_lock for updating the structure. When log_status query tries to take write lock on global_sid_lock, it will wait until all reader threads to finish updating the executed gtids structure. But these client threads, after updating the executed_gtids are allowed to execute, till they wait for the lock on the binary log to update the log position. Since, the required lock is held by the log_status query, we will always get the consistent binlog file and the position, but we may not always get the consistent executed_gtids if there are already some threads updating it when log_status query tries to acquire the write lock on the global_sid_lock.
[11 Jan 2021 14:12]
MySQL Verification Team
Hi, I'm having an issue reproducing this. The explanation of what happens makes sense but looks like it's not just write load that's important for reproducing this. Quick question - what kind of hw are you using to reproduce this consistently? thanks Bogdan
[27 Jan 2021 11:02]
Rahul Malik
I can easily reproduce on Ubuntu 18.04.4 LTS with config Intel(R) Xeon(R) CPU E3-1230 v3 @ 3.30GHz 8 core with 32 GB Let me know if you need more details
[27 Jan 2021 17:51]
MySQL Verification Team
Hi, > I can easily reproduce on Ubuntu 18.04.4 LTS with > config Intel(R) Xeon(R) CPU E3-1230 v3 @ 3.30GHz Are you sure you are using Oracle APT repository when reproducing this. I tried on bit smaller machine (i7-6700 CPU @ 3.40GHz, Debian 10.7) and could not reproduce.
[28 Jan 2021 0:50]
Rahul Malik
Using mysql 8.0.23 with commit id 7ed30a748964c009d4909cb8b4b22036ebdef239 (HEAD -> 8.0, tag: mysql-cluster-8.0.23, tag: mysql-8.0.23, origin/HEAD, origin/8.0) Author: mysql-builder@oracle.com <> Date: Tue Dec 15 09:25:02 2020 +0100
[28 Jan 2021 11:46]
Venkatesh Prasad Venugopal
Hi Bogdan! This issue is easily reproducible by adding sleep(1) just after the client thread takes the global_sid_lock to update the gtid as part of binlog group commit. === diff begin === diff --git a/sql/rpl_gtid_state.cc b/sql/rpl_gtid_state.cc index 3f5288dab4d..296583b51ed 100644 --- a/sql/rpl_gtid_state.cc +++ b/sql/rpl_gtid_state.cc @@ -163,6 +163,7 @@ void Gtid_state::update_commit_group(THD *first_thd) { DEBUG_SYNC(first_thd, "update_gtid_state_before_global_sid_lock"); global_sid_lock->rdlock(); DEBUG_SYNC(first_thd, "update_gtid_state_after_global_sid_lock"); + sleep(1); update_gtids_impl_lock_sidnos(first_thd); === diff end === Steps to reproduce: 1. Apply the above patch on 8.0.23 code and build the server. 2. Start the server $ ./mtr --start --mem --suite=binlog_gtid 3. Run sysbench prepare to generate initial data Prepare: sysbench /usr/share/sysbench/oltp_insert.lua --mysql-db=test --mysql-user=root --threads=24 --time=1500 --report-interval=1 --mysql-host=127.0.0.1 --mysql-port=13000 --events=0 --db-driver=mysql prepare Run: sysbench /usr/share/sysbench/oltp_insert.lua --mysql-db=test --mysql-user=root --threads=24 --time=1500 --report-interval=1 --mysql-host=127.0.0.1 --mysql-port=13000 --events=0 --db-driver=mysql run 4. While sysbench is running on one session, connect to server and execute select LOCAL from performance_schema.log_status\G 5. Compare the value of gtid_executed returned from the perfschema against the output of mysqlbinlog and SHOW BINLOG EVENTS. Below is what I got during my tests on my Ubuntu Laptop with Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz, 6 cores with 64G RAM. Here, the perfschema query says that the transaction with GTID number 30 ends at position 1921863. But, the transaction ends at 1921388. mysql> select LOCAL from performance_schema.log_status\G *************************** 1. row *************************** LOCAL: {"gtid_executed": "0f9b691a-612b-11eb-bbf5-74d83e29c093:1-30", "binary_log_file": "binlog.000001", "binary_log_position": 1921863} 1 row in set (3.51 sec) SHOW BINLOG EVENTS; | binlog.000001 | 1920913 | Gtid | 1 | 1920992 | SET @@SESSION.GTID_NEXT= '0f9b691a-612b-11eb-bbf5-74d83e29c093:30'| | binlog.000001 | 1920992 | Query | 1 | 1921067 | BEGIN| | binlog.000001 | 1921067 | Table_map | 1 | 1921132 | table_id: 106 (test.sbtest1)| | binlog.000001 | 1921132 | Write_rows | 1 | 1921357 | table_id: 106 flags: STMT_END_F| | binlog.000001 | 1921357 | Xid | 1 | 1921388 | COMMIT /* xid=53 */| | binlog.000001 | 1921388 | Gtid | 1 | 1921467 | SET @@SESSION.GTID_NEXT= '0f9b691a-612b-11eb-bbf5-74d83e29c093:31'| | binlog.000001 | 1921467 | Query | 1 | 1921542 | BEGIN| | binlog.000001 | 1921542 | Table_map | 1 | 1921607 | table_id: 106 (test.sbtest1)| | binlog.000001 | 1921607 | Write_rows | 1 | 1921832 | table_id: 106 flags: STMT_END_F| | binlog.000001 | 1921832 | Xid | 1 | 1921863 | COMMIT /* xid=60 */ Here is another sample, the perfschema query says that the GTID number 30 ends at binlog position 1932788. But, the actual transaction at that position is GTID 54, but not 31. mysql> select LOCAL from performance_schema.log_status\G *************************** 1. row *************************** LOCAL: {"gtid_executed": "0f9b691a-612b-11eb-bbf5-74d83e29c093:1-31", "binary_log_file": "binlog.000001", "binary_log_position": 1932788} 1 row in set (3.29 sec) SHOW BINLOG EVENTS; | binlog.000001 | 1932313 | Gtid | 1 | 1932392 | SET @@SESSION.GTID_NEXT= '0f9b691a-612b-11eb-bbf5-74d83e29c093:54'| | binlog.000001 | 1932392 | Query | 1 | 1932467 | BEGIN| | binlog.000001 | 1932467 | Table_map | 1 | 1932532 | table_id: 106 (test.sbtest1)| | binlog.000001 | 1932532 | Write_rows | 1 | 1932757 | table_id: 106 flags: STMT_END_F| | binlog.000001 | 1932757 | Xid | 1 | 1932788 | COMMIT /* xid=62 */ Regards, Venkatesh
[29 Jan 2021 7:17]
MySQL Verification Team
Hi Venkatesh, Yes this helps reproducing. I see how this could happen in real life scenario, just could not get my test setup to get there :) verified All best Bogdan
[29 Jul 2021 15:03]
Venkatesh Prasad Venugopal
Hi Bogdan, Any updates on this?
[21 Sep 2021 16:41]
Marcelo Altmann
Here is a description of the problem A commit has 3 stages: 1. Flushing - protected by LOCK_log 2. Syncing - we release LOCK_log and acquire LOCK_sync 3. Committing - we release LOCK_sync and acquire LOCK_commit. In this stage we also acquire global_sid_lock and update gtid_executed. When querying log_status table, we acquire LOCK_log for binlog at Log_resource_binlog_wrapper::lock() and global_sid_lock for gtid_executed at Log_resource_gtid_state_wrapper::lock(). In a high level, Gtid works as follow: 1. a THD ask for the next GTID no for the server ID it's executing the transaction. 2. it generates the binlog event with said GTID. 3. at stage 3 if THD owns any gtid, it updates gtid_executed The problem comes when a query to ps.log_status comes after stage 1 of binlog commit. At this stage, commit has released LOCK_log, binlg file and position are already updated, Log_resource_binlog_wrapper::lock() will move forward. Then while binlog is still processing stage 2, Log_resource_gtid_state_wrapper::lock() will be invoked and will acquire global_sid_lock. At this point stage 3 of commit will not be able to update gtid_executed leaving the log_status query with a mismatch between binlog position (new values) vs gtid_executed (old values). Proposed Fix: Increase the lock at log_status table to also take LOCK_sync and LOCK_commit if GTID wrapper has been invoked. This way the query to log_status will have to wait until gtid_executed has been fully updated guaranteeing consistency with binlog position.
[21 Sep 2021 16:42]
Marcelo Altmann
Patch generated based on tag mysql-8.0.26 (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.
Contribution: bug_102175.patch (application/octet-stream, text), 9.86 KiB.
[13 Sep 2023 9:33]
MySQL Verification Team
Bug #112331 marked as duplicate of this one
[14 Mar 15:24]
Jon Stephens
Documented fix as follows in the MySQL 8.0.42, 8.4.5, and 9.3.0 changelogs: When the server was under a heavy write load, the binary log position of gtid_executed as shown in the Performance Schema log_status table did not match that of the gtid shown in the binary log file. We fix this by increasing the scope of the lock on the log_status table when querying it to ensure that transactions in the commit pipeline are completed. This ensures that a query against the log_status table waits until gtid_executed has been fully updated, thereby guaranteeing consistency with the position in the binary log. Closed.