MySQL Bugs: #89582: Node may not switch to ONLINE under consistent load

Bug #89582	Node may not switch to ONLINE under consistent load
Submitted:	8 Feb 2018 5:04	Modified:	19 Dec 2018 12:16
Reporter:	Zhenghu Wen (OCA)	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Group Replication	Severity:	S2 (Serious)
Version:	8.0.4	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	recovery

Description:
According to https://bugs.mysql.com/bug.php?id=87631 and https://github.com/mysql/mysql-server/commit/82150e1a1d03a5482b97290d99b8e0e683972be6 , in MySQL 8.0.4 a node should be able to switch to ONLINE even under a consistent load of transactions, either once the number of applied transactions exceeds the initial certifier queue size, or once the certifier queue is empty.

But in my test the node still could not switch to ONLINE. I added some debug messages and found that the node was waiting for the MTS replication Coordinator thread to enter the stage "Slave has read all relay log; waiting for more updates" and for the Worker threads to enter the stage "Waiting for an event from Coordinator".

Under a consistent load of transactions, these threads rarely reach those stages. I think this is why the node could not switch to ONLINE.

How to repeat:
1. Create a group with 3 members (node1, node2, node3) in single-primary mode.
2. Prepare the database sbtest with sysbench (10 tables, each with 5000000 records).
3. Run the oltp test with many threads on the primary, without a time limit.
4. Sleep 5 minutes, then kill -9 the primary node, restart it, and start group_replication.
5. Switch sysbench to the new primary node and keep it running.
6. Wait until the old primary node is ONLINE, then go to step 4.

Measure the time it takes for the old primary node to become ONLINE. In my test environment, with 32 oltp threads the node could switch to ONLINE, but the time it took varied widely between runs. With 128 threads, the node had still not switched to ONLINE after 24 hours. If I kill the sysbench workload, or set group_replication_flow_control_applier_threshold and group_replication_flow_control_certifier_threshold to 1, the node becomes ONLINE very quickly.
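For anyone repeating the measurement, the timing can be scripted. The following is only a minimal sketch and was not part of the original test: the helper name seconds_until_online and its parameters are made up for illustration, and fetch_state is assumed to be a callable that runs the query below on the rejoining node with whatever MySQL client is available and returns the MEMBER_STATE string.

```python
import time
from typing import Callable, Optional

# Query to run on the rejoining node; it returns that node's own state
# (e.g. RECOVERING or ONLINE) from the group membership table.
MEMBER_STATE_QUERY = (
    "SELECT MEMBER_STATE FROM performance_schema.replication_group_members "
    "WHERE MEMBER_ID = @@server_uuid"
)


def seconds_until_online(
    fetch_state: Callable[[], str],
    timeout_s: float,
    poll_interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll fetch_state() until it returns 'ONLINE'.

    Returns the elapsed seconds, or None if timeout_s passes first.
    fetch_state is a hypothetical hook: it should execute
    MEMBER_STATE_QUERY and return the single value it yields.
    """
    start = time.monotonic()
    while True:
        if fetch_state() == "ONLINE":
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            return None
        sleep(poll_interval_s)
```

Calling this right after START GROUP_REPLICATION on the restarted node, with a timeout of 24 hours (86400 seconds), would correspond to the observation window used above.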

Suggested fix:
Do not wait for the Coordinator thread to reach the stage "Slave has read all relay log; waiting for more updates" and for the Worker threads to reach the stage "Waiting for an event from Coordinator".

After applying the attached patch, the node switches to ONLINE predictably.

Attachment: 89582.patch (application/octet-stream, text), 838 bytes.

add oca (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: 89582.patch (application/octet-stream, text), 838 bytes.

Posted by developer:
 
Thank you for your feedback. This has been fixed in upcoming versions, and the following was added to the 8.0.14 changelog:
When a member joined a group that had a constant peak load, the member might not be able to move from the RECOVERING to the ONLINE state. The cause was that:

- The member was waiting in a loop for the complete queue of transactions that arrived during recovery to be applied, while new transactions were still arriving.

- Even when the complete queue had been applied, the member was also checking that the applier was paused, which is unlikely to happen in a continuous peak workload.

Now, when the recovery completion policy is waiting for transactions to be applied, the member first waits until one of the following conditions is fulfilled:

- The transactions to apply fit within the flow control configuration. In other words, the transactions to be applied can be applied during the next flow control iteration.

- No transactions are being queued or applied, in the case of an empty recovery queue.

Then, the member waits for the currently queued transactions in the group_replication_applier channel to be applied, before the member state changes to ONLINE.
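
The difference between the two completion policies can be illustrated with a toy model. This is only a sketch under stated assumptions and is not the server code: time is cut into fixed ticks, the applier queue is FIFO, arrivals and applier capacity per tick are constant, and "fits within the flow control configuration" is reduced to a single hypothetical threshold on the queue length. All function and parameter names are made up for illustration.

```python
from typing import Callable, Optional

Policy = Callable[[int, int, int, int], bool]


def old_policy(queued: int, arrived: int, applied: int, threshold: int) -> bool:
    # Pre-fix behaviour as described above: the queue must be empty
    # and the applier must be idle (nothing queued or applied this tick).
    return queued == 0 and arrived == 0 and applied == 0


def new_policy(queued: int, arrived: int, applied: int, threshold: int) -> bool:
    # First wait of the fixed behaviour: either the backlog fits within the
    # (assumed) flow control threshold, or the member is completely idle.
    # In this toy model the idle case implies the first one; it is kept
    # separate only to mirror the two conditions in the changelog.
    fits_flow_control = queued <= threshold
    idle = queued == 0 and arrived == 0 and applied == 0
    return fits_flow_control or idle


def ticks_until_online(policy: Policy, backlog: int, arrivals: int,
                       capacity: int, threshold: int,
                       max_ticks: int = 10_000) -> Optional[int]:
    """Return the tick at which the member would go ONLINE, or None."""
    queued = backlog
    for tick in range(1, max_ticks + 1):
        queued += arrivals                 # new transactions keep arriving
        applied = min(queued, capacity)    # applier works at full capacity
        queued -= applied
        if policy(queued, arrivals, applied, threshold):
            # Final step: apply only the currently queued snapshot.
            # With a FIFO queue this takes ceil(queued / capacity) ticks.
            return tick + -(-queued // capacity)
    return None
```

With a backlog of 1000, 100 arrivals and a capacity of 120 per tick, and a threshold of 500, ticks_until_online(old_policy, 1000, 100, 120, 500) returns None because the applier is never idle, while the same call with new_policy returns 30. With no load (arrivals set to 0), the old policy also completes, after 10 ticks.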