Bug #111727 MGR failure detection fails due to a stale leftover m_expels_in_progress entry
Submitted: 12 Jul 2023 3:18 Modified: 12 Jul 2023 22:28
Reporter: genze wu (OCA)
Status: Verified
Category: MySQL Server: Group Replication    Severity: S3 (Non-critical)
Version: 8.0.28, 8.0.33    OS: Any
Assigned to:    CPU Architecture: Any

[12 Jul 2023 3:18] genze wu
Description:
When a node crashes in an MGR cluster, it is removed by the failure detection mechanism.

In detail, the process is (a toy trace follows the list):

     detector_task() in the XCom layer detects the dead node and sends a local view
=>   GCS suspects the node
=>   GCS sends a remove-node message
=>   XCom removes the node in handle_remove_node()
=>   detector_task() detects the removal and sends a local view (with the two remaining nodes)
=>   GCS removes the node according to that local view.
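
To make the bookkeeping concrete, here is a toy trace of this happy path in a three-node group. All node names are invented and a plain std::set stands in for m_expels_in_progress; this only illustrates where the pending expel is recorded and where it is normally forgotten, not the real XCom/GCS code.

#include <cassert>
#include <cstdio>
#include <set>
#include <string>

int main() {
  std::set<std::string> expels_in_progress;  // stand-in for m_expels_in_progress

  std::puts("xcom: detector_task() sees N3 dead, delivers local view {N1,N2,N3}");
  std::puts("gcs : run_process_suspicions() suspects N3");
  expels_in_progress.insert("N3");  // pending expel recorded when remove is sent
  std::puts("gcs : remove-node message sent for N3");
  std::puts("xcom: handle_remove_node(N3) drops N3 from the site");
  std::puts("xcom: detector_task() delivers local view {N1,N2}");
  expels_in_progress.erase("N3");   // forgotten when GCS processes the two-node view
  assert(expels_in_progress.empty());  // healthy end state: no leftover expel
}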

If a node crashes and restarts quickly, e.g. after an OOM kill, it may try to rejoin the cluster before the removal process has finished. To avoid this, XCom simply ignores the node's attempt to rejoin until the old incarnation has been removed successfully.

But if the rejoin attempt happens just after XCom has removed the node in handle_remove_node(), and before detector_task() has sent the local view, the local view that detector_task() sends to GCS will already be the post-rejoin view (with three nodes); the two-node view that would make GCS remove the node is never sent.
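
The race can be sketched the same way as the trace above; the only difference is that the rejoin slips in between handle_remove_node() and detector_task()'s next run, so the view that would clear the pending expel never arrives. Again, everything except the identifiers quoted from the report is illustrative.

#include <cassert>
#include <cstdio>
#include <set>
#include <string>

int main() {
  std::set<std::string> expels_in_progress;  // stand-in for m_expels_in_progress

  std::puts("gcs : suspects N3, records expel, sends remove-node");
  expels_in_progress.insert("N3");
  std::puts("xcom: handle_remove_node(N3) drops N3 from the site");
  // The restarted N3 rejoins here: its old incarnation is already gone
  // from the site definition, so XCom no longer ignores the attempt.
  std::puts("xcom: restarted N3 is re-added before detector_task() runs");
  std::puts("xcom: detector_task() delivers local view {N1,N2,N3}");
  // GCS only clears the expel on the two-node view, which this
  // interleaving never produces, so the entry leaks.
  assert(expels_in_progress.count("N3") == 1);
}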

This bug leaves a stale leftover in m_expels_in_progress in GCS. GCS records the pending expel in m_expels_in_progress in run_process_suspicions(), and forgets it in process_view() when handling the two-node local view. Here it is never forgotten, because the two-node local view is never sent. The leftover entry makes the next failure detection fail, because GCS concludes the cluster does not have a majority: m_has_majority is computed as (nodes_to_suspect + size of m_expels_in_progress) * 2 < total number of nodes.
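
A minimal sketch of that majority check, using illustrative parameter names rather than the exact GCS signature, shows why one stale entry is enough to wedge a three-node group:

#include <cassert>
#include <cstddef>

// Majority rule quoted above:
// (nodes_to_suspect + pending expels) * 2 < total number of nodes.
bool has_majority(std::size_t nodes_to_suspect,
                  std::size_t expels_in_progress,
                  std::size_t total_nodes) {
  return (nodes_to_suspect + expels_in_progress) * 2 < total_nodes;
}

int main() {
  // Healthy 3-node group: expelling one suspect is allowed (2 < 3).
  assert(has_majority(1, 0, 3));
  // Same group with one stale leftover expel: (1 + 1) * 2 < 3 is false,
  // so the new suspect is never expelled and the group hangs.
  assert(!has_majority(1, 1, 3));
}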

Thus, the next time another node crashes, failure detection fails, no primary can be elected, and the cluster hangs for writes.

How to repeat:
It is hard to reproduce deterministically, since XCom has no DEBUG_EXECUTE_IF hooks. We found and reproduced this bug by repeatedly killing and restarting the primary of a single-primary MGR cluster.

Whenever the removal and the re-add happened within one second, the next failure detection failed and no primary could be elected.

If needed, we can provide the GCS_DEBUG_TRACE and error log; they contain the logs needed to diagnose the issue.

Suggested fix:
GCS already cleans m_suspicions using alive_nodes and left_nodes in process_view(); m_expels_in_progress can be cleaned there as well (see the sketch below).
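
A self-contained sketch of that idea follows, with a hypothetical type and a plain std::set standing in for the GCS members named above; it is only an illustration of the suggested cleanup, not a patch against the actual source.

#include <set>
#include <string>

// Hypothetical stand-in for the GCS suspicions manager; only the
// m_expels_in_progress cleanup suggested above is shown.
struct SuspicionsManager {
  std::set<std::string> m_expels_in_progress;

  void process_view(const std::set<std::string> &alive_nodes,
                    const std::set<std::string> &left_nodes) {
    // ... existing cleanup of m_suspicions with alive_nodes/left_nodes ...

    // Suggested addition: also forget pending expels whose target is
    // alive again (it rejoined) or has definitively left, so a missed
    // two-node view cannot leave a stale entry behind.
    for (auto it = m_expels_in_progress.begin();
         it != m_expels_in_progress.end();) {
      if (alive_nodes.count(*it) != 0 || left_nodes.count(*it) != 0)
        it = m_expels_in_progress.erase(it);
      else
        ++it;
    }
  }
};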
[12 Jul 2023 22:28] MySQL Verification Team
Hi,

I could not reproduce this, but in theory I see how it can happen, so I am verifying the report. Thank you for reporting.