Bug #87631 LOAD DELAYS SWITCHING A GR NODE FROM RECOVERING TO ONLINE
Submitted: 31 Aug 2017 16:06 Modified: 9 Jan 2018 16:45
Reporter: Vitor Oliveira
Status: Closed
Category:MySQL Server: Group Replication Severity:S2 (Serious)
Version:8.0.3 OS:Any
Assigned to: CPU Architecture:Any

[31 Aug 2017 16:06] Vitor Oliveira
Description:
When a node joins a group replication group, it must perform recovery, in which transactions from a donor member are applied locally for state transfer.
Once the node has applied all the transactions, it must change state from RECOVERING to ONLINE to become a full-fledged member of the group.

According to http://mysqlhighavailability.com/distributed-recovery-behind-the-scenes/
"When the member queued transactions reach zero and its stored data is equal to the other members, its public state changes to online."

When a node has finished applying all the transactions in the back-log, it tests if there are still messages in the queue, and will not switch to ONLINE until there are none. 

But if there is a constant load of transactions, a node may never get the opportunity to switch to ONLINE even though it is ready, simply because there are always a few messages in circulation.

On 5.7.20 this behaviour seems different: the nodes still seem to be able to join, so it may be a problem introduced in 8.0 only.

How to repeat:
1. Create a group with 3 members and execute sysbench prepare
2. Take one member out of the group with STOP GROUP_REPLICATION
3. Execute sysbench update index in the background without time limit
4. Bring the node back into the group with START GROUP_REPLICATION
5. Measure the time it takes for the node to become ONLINE
6. If it takes too long, kill the sysbench load; the node should then become ONLINE almost immediately
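
Step 5 could be measured with a small polling helper such as the sketch below. It is not part of the original report: `time_until_online` and its `get_state` callable are hypothetical names, and `get_state` is assumed to return the joining member's state, for example by running `SELECT MEMBER_STATE FROM performance_schema.replication_group_members WHERE MEMBER_ID = @@server_uuid` on that node through whatever client library is at hand.

```python
import time
from typing import Callable, Optional


def time_until_online(
    get_state: Callable[[], str],
    poll_interval: float = 1.0,
    timeout: float = 600.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll get_state() until it returns "ONLINE".

    Returns the elapsed seconds, or None if the timeout expires first.
    get_state is a hypothetical callable supplied by the caller; clock
    and sleep are injectable so the helper can be exercised without
    waiting in real time.
    """
    start = clock()
    while True:
        if get_state() == "ONLINE":
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(poll_interval)
```

Running this once with the sysbench load active and once after killing it (step 6) would give the two figures to compare.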

Suggested fix:
Change the heuristic from waiting for the message queue to be empty to something like: switch to ONLINE once the number of events in the relay log is no lower than it was in the previous cycle, using 1s cycles.
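
To illustrate the difference between the two heuristics, here is a minimal sketch that works on a list of pending-event counts sampled once per cycle. It is an illustration only, not the server's implementation: both function names and the sample numbers are made up, and treating an empty queue as an immediate switch in the second function is an added assumption.

```python
from typing import Iterable, Optional


def first_cycle_queue_empty(samples: Iterable[int]) -> Optional[int]:
    """Current behaviour as described: switch only when nothing is pending."""
    for cycle, pending in enumerate(samples):
        if pending == 0:
            return cycle
    return None


def first_cycle_not_shrinking(samples: Iterable[int]) -> Optional[int]:
    """Suggested behaviour: switch once the pending count stops decreasing.

    Assumption (not in the report): an empty queue also triggers the switch.
    """
    previous: Optional[int] = None
    for cycle, pending in enumerate(samples):
        if pending == 0 or (previous is not None and pending >= previous):
            return cycle
        previous = pending
    return None


# Made-up samples: the backlog drains, then hovers around ten events
# because the background load keeps producing new transactions.
samples = [500, 200, 40, 12, 9, 11, 10, 12]
print(first_cycle_queue_empty(samples))    # None
print(first_cycle_not_shrinking(samples))  # 5
```

With these samples the empty-queue test never fires, while the suggested test fires at cycle 5, the first cycle in which the count did not drop.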
[9 Jan 2018 16:45] David Moss
Posted by developer:
 
Thank you for your feedback. This has been fixed in upcoming versions, and the following was added to the 8.0.4 / 10849 changelog:
In a group where a joining member consistently received transactions, the joining member could sometimes not enter the ONLINE state. This was due to the way the incoming queue of messages was tested.
[11 Jan 2018 15:37] David Moss
Posted by developer:
 
Please ignore the mistake in the previous message; this was added to the 8.0.4 changelog only.
[1 Feb 2018 14:01] David Moss
Now published.