MySQL Bugs: #116368: Attempt automatic recovery from a complete cluster failure as nodes restart

Bug #116368	Attempt automatic recovery from a complete cluster failure as nodes restart
Submitted:	16 Oct 2024 10:54	Modified:	18 Oct 2024 8:59
Reporter:	Simon Mudd (OCA)
Status:	Verified
Category:	MySQL Server: Group Replication	Severity:	S4 (Feature request)
Version:	8.0, 8.4, 9.x	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	windmill

Description:
I have occasionally seen a GR cluster crash with all members failing at the same time, triggered by the same underlying issue. Bugs happen.

However, when GR nodes restart they do not attempt to re-form the cluster automatically with the members they were previously aware of.

This requires manual intervention.

I run multiple GR clusters and have seen complete cluster failures that required manual intervention to bring the cluster up again. Such intervention takes time and prolongs the outage, so it is not desirable. Tooling in the shell allows this to be handled, but I believe the cluster members should attempt to rebuild the cluster automatically in such circumstances.

[ I am not sure whether I have already filed this; if so, I cannot find the feature request. ]

How to repeat:
Kill all members of a cluster at the same time or trigger some common failure.
Notice that once all members have gone down, the cluster does not come up automatically when you simply restart all of them.

Manual action is needed, something like:

mysqlsh root@node1 --password=<password> -- dba reboot-cluster-from-complete-outage <cluster name> --rejoinInstances='node1,node2,node3'

Suggested fix:
Ideally, when nodes recover they should reach out to the other members they were previously aware of and try to rebuild the cluster, continuing only if they can do so in a consistent and safe manner.

Such behaviour may lead to at least a quorum of the previous nodes agreeing on a common state and continuing from it. Some members may disagree on that state and would thus be expelled from the newly built cluster.

If a quorum cannot agree on a consistent common state, then cluster recovery clearly needs manual intervention.
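
To illustrate the kind of decision I have in mind, here is a rough sketch in Python. It is not how GR works today and not a proposed implementation; all names (plan_recovery, RecoveryPlan) are hypothetical, and a member's state is simplified to a plain set of transaction identifiers rather than a real GTID set. The assumption is that a member is "consistent" with a seed if everything it has applied is also present on the seed, so it could catch up from it.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set, Tuple


@dataclass(frozen=True)
class RecoveryPlan:
    seed: str                  # member the cluster would be rebuilt from
    rejoin: Tuple[str, ...]    # members that can catch up from the seed
    expelled: Tuple[str, ...]  # reachable members whose state has diverged


def plan_recovery(previous_members: Iterable[str],
                  reported: Dict[str, Set[str]]) -> Optional[RecoveryPlan]:
    """Return a plan, or None when manual intervention is needed.

    previous_members: the membership known before the outage.
    reported: applied transactions per member that is reachable again
              (a simplified stand-in for real GTID sets).
    """
    members = set(previous_members)
    majority = len(members) // 2 + 1

    # Ignore anything that was not part of the previous membership.
    reachable = {m: frozenset(s) for m, s in reported.items() if m in members}
    if len(reachable) < majority:
        return None  # not enough members are back to form a quorum

    # Try the most advanced members first; ties are broken by name,
    # which is an arbitrary choice made only for this sketch.
    candidates = sorted(reachable, key=lambda m: (-len(reachable[m]), m))
    for seed in candidates:
        consistent = [m for m in sorted(reachable)
                      if reachable[m] <= reachable[seed]]
        if len(consistent) >= majority:
            return RecoveryPlan(
                seed=seed,
                rejoin=tuple(m for m in consistent if m != seed),
                expelled=tuple(m for m in sorted(reachable)
                               if m not in consistent),
            )
    return None  # no quorum agrees on a common state


if __name__ == "__main__":
    plan = plan_recovery(
        {"node1", "node2", "node3"},
        {
            "node1": {"a:1", "a:2", "a:3"},
            "node2": {"a:1", "a:2"},
            "node3": {"a:1", "b:1"},  # has a transaction nobody else saw
        },
    )
    print(plan)
```

In this example node1 and node2 form a quorum that agrees on a common history, so node1 would be used as the seed, node2 would rejoin and catch up, and node3 would be left out. If every member had reported a different, incompatible state, the function would return None, which corresponds to the manual intervention case.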

Successful auto-recovery from a complete cluster failure would reduce downtime, increase availability, and avoid the need for manual recovery, thus providing a better experience for users.

Hello Simon,

Thank you for the feature request!

regards,
Umesh