Bug #103636 | Slave hangs with slave_preserve_commit_order On | ||
---|---|---|---|
Submitted: | 10 May 2021 7:23 | Modified: | 12 Nov 2021 16:02 |
Reporter: | zhai weixiang (OCA) | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S3 (Non-critical) |
Version: | 8.0.24 | OS: | Any |
Assigned to: | | CPU Architecture: | Any |
[10 May 2021 7:23]
zhai weixiang
[13 May 2021 12:17]
MySQL Verification Team
Hi, Have you maybe tried reproducing with 8.0.25? I am running a test for longer than 24 hours with 8.0.25 and I'm not able to reproduce this. Please test with 8.0.25. All best Bogdan
[15 May 2021 4:35]
zhai weixiang
I finally got a chance to gdb the stack by reducing the timeout value. Let's look at the function Commit_order_manager::finish_one:

    auto this_seq_nr{0};

so this_seq_nr is presumably deduced as int. Now let's check gdb while the problem happens:

    (gdb) p this_seq_nr
    $3 = -2147482990
    (gdb) p next_seq_nr
    $4 = -2147482989
    (gdb) p sizeof(this_seq_nr)
    $5 = 4
    (gdb) p (unsigned int) this_seq_nr
    $6 = 2147484306
    (gdb) p (unsigned long long) this_seq_nr
    $7 = 18446744071562068626
    (gdb) p next_seq_nr
    $8 = -2147482989
    (gdb) p sizeof(next_seq_nr)
    $9 = 4
    (gdb) p (unsigned long long) next_seq_nr
    $10 = 18446744071562068627

When this->m_workers[next_worker].freeze_commit_sequence_nr is invoked, next_seq_nr is converted to unsigned long long, so it is not the expected value and the call returns false. As a result, the following worker is never woken up.

I'll keep testing to verify my guess.
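The conversion chain described above can be reproduced outside the server. The following is a minimal sketch, not the server code: `sequence_type` is assumed to be `unsigned long long` (as the gdb casts suggest), and `fake_pop` and `next_seq_nr_as_seen_by_callee` are hypothetical stand-ins for `m_workers.pop()` and the argument passed to `freeze_commit_sequence_nr`. It assumes a two's-complement platform with a 32-bit `int`.

```cpp
#include <cassert>
#include <tuple>
#include <type_traits>
#include <utility>

// Assumption: stand-in for cs::apply::Commit_order_queue::sequence_type.
using sequence_type = unsigned long long;

// Hypothetical stand-in for m_workers.pop(): (worker id, sequence number).
inline std::pair<int, sequence_type> fake_pop(sequence_type seq_nr) {
  return {0, seq_nr};
}

// Mirrors the pre-patch declarations and returns the value a callee taking
// sequence_type would receive for the "next" sequence number.
inline sequence_type next_seq_nr_as_seen_by_callee(sequence_type queued) {
  auto this_seq_nr{0};  // deduced as int, not sequence_type
  static_assert(std::is_same<decltype(this_seq_nr), int>::value,
                "auto x{0} deduces int");
  auto this_worker{0};
  // The 64-bit sequence number is narrowed to a 32-bit int here.
  std::tie(this_worker, this_seq_nr) = fake_pop(queued);
  (void)this_worker;
  auto next_seq_nr = this_seq_nr + 1;  // still int
  // Converting a negative int to unsigned long long sign-extends it.
  return static_cast<sequence_type>(next_seq_nr);
}

// Checks the sketch against the values printed in the gdb session.
inline void check_against_gdb_session() {
  // 2147484306 is the (unsigned int) view of this_seq_nr in gdb.
  assert(next_seq_nr_as_seen_by_callee(2147484306ULL) ==
         18446744071562068627ULL);
}
```

With a queued sequence number of 2147484306, the callee would expect 2147484307 but receives 18446744071562068627, which matches `$10` in the session.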
[16 May 2021 3:55]
zhai weixiang
After running one day, the hang disappears. Note that I used a very powerful machine, so the int overflow can happen within 24 hours under heavy workload. The following patch may solve the problem:

    diff --git a/sql/rpl_slave_commit_order_manager.cc b/sql/rpl_slave_commit_order_manager.cc
    index afce898..54ec05e 100644
    --- a/sql/rpl_slave_commit_order_manager.cc
    +++ b/sql/rpl_slave_commit_order_manager.cc
    @@ -267,10 +267,10 @@ void Commit_order_manager::finish_one(Slave_worker *worker) {
       assert(this->m_workers.front() == worker->id);
       assert(!this->m_workers.is_empty());
    -  auto this_seq_nr{0};
    +  cs::apply::Commit_order_queue::sequence_type this_seq_nr = 0;
       auto this_worker{cs::apply::Commit_order_queue::NO_WORKER};
       std::tie(this_worker, this_seq_nr) = this->m_workers.pop();
    -  auto next_seq_nr = this_seq_nr + 1;
    +  cs::apply::Commit_order_queue::sequence_type next_seq_nr = this_seq_nr + 1;
       assert(worker->id == this_worker);
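To illustrate why declaring the variables with the queue's own sequence type avoids the problem, here is a small standalone sketch. It is not the server code: `sequence_type` is assumed to be `unsigned long long`, and `fake_pop_fixed` and `next_seq_nr_with_patch` are hypothetical names used only for this example.

```cpp
#include <cassert>
#include <tuple>
#include <utility>

// Assumption: stand-in for cs::apply::Commit_order_queue::sequence_type.
using sequence_type = unsigned long long;

// Hypothetical stand-in for m_workers.pop(): (worker id, sequence number).
inline std::pair<int, sequence_type> fake_pop_fixed(sequence_type seq_nr) {
  return {0, seq_nr};
}

// Mirrors the patched declarations: both variables keep the full width
// of the sequence type, so nothing is narrowed or sign-extended.
inline sequence_type next_seq_nr_with_patch(sequence_type queued) {
  sequence_type this_seq_nr = 0;
  auto this_worker{0};
  std::tie(this_worker, this_seq_nr) = fake_pop_fixed(queued);
  (void)this_worker;
  sequence_type next_seq_nr = this_seq_nr + 1;
  return next_seq_nr;
}

// Values just past INT_MAX now increment as expected.
inline void check_patched_behaviour() {
  assert(next_seq_nr_with_patch(2147483647ULL) == 2147483648ULL);
  assert(next_seq_nr_with_patch(2147484306ULL) == 2147484307ULL);
}
```

Because the arithmetic is done on an unsigned 64-bit value, crossing 2^31 no longer changes the sign of the sequence number.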
[17 May 2021 9:02]
MySQL Verification Team
Hi, Thanks for the update to the report. I'm verifying it. all best Bogdan
[12 Nov 2021 16:02]
Margaret Fisher
Posted by developer: Changelog entry added for MySQL 8.0.28: If a replica server with the system variable replica_preserve_commit_order = 1 set was used under intensive load for a long period, the instance could run out of commit order sequence tickets. Incorrect behavior after the maximum value was exceeded caused the applier to hang and the applier worker threads to wait indefinitely on the commit order queue. The commit order sequence ticket generator now wraps around correctly. Thanks to Zhai Weixiang for the contribution.
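As an illustration of what "wraps around correctly" can mean for a ticket generator, the sketch below shows one generic approach: an unsigned counter that skips a reserved sentinel value when it wraps. This is an assumption-laden example, not the MySQL implementation; `Ticket_generator`, `kNoTicket`, and the choice of 0 as the reserved value are all hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <limits>

using ticket_type = std::uint64_t;

// Hypothetical reserved value meaning "no ticket assigned".
constexpr ticket_type kNoTicket = 0;

// Hypothetical generator: hands out increasing tickets and, after the
// maximum value, wraps around while skipping the reserved sentinel.
class Ticket_generator {
 public:
  explicit Ticket_generator(ticket_type start = 1) : m_next{start} {}

  ticket_type next() {
    // Unsigned atomic arithmetic wraps modulo 2^64 by definition.
    ticket_type t = m_next.fetch_add(1, std::memory_order_relaxed);
    while (t == kNoTicket) {
      t = m_next.fetch_add(1, std::memory_order_relaxed);
    }
    return t;
  }

  // Ticket expected to follow `t`, applying the same skip rule.
  static ticket_type successor(ticket_type t) {
    ++t;
    return t == kNoTicket ? t + 1 : t;
  }

 private:
  std::atomic<ticket_type> m_next;
};

// Walks the generator across the wrap-around point.
inline void check_wrap_around() {
  const ticket_type max = std::numeric_limits<ticket_type>::max();
  Ticket_generator gen{max};
  const ticket_type last = gen.next();
  assert(last == max);
  assert(gen.next() == Ticket_generator::successor(last));
}
```

The point of `successor` is that whoever computes "the next expected ticket" has to apply the same wrap rule as the generator; otherwise the two sides disagree at the boundary.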
[28 Apr 2023 2:57]
kiran kumar salla
Is this bug fixed? If so, in which version? Could you please help with the details? We hit this bug on version 8.0.26.
[28 Apr 2023 5:55]
MySQL Verification Team
Please use 8.0.33.