Description:
I have occasionally seen a GR cluster crash with all members failing at the same time, triggered by a common underlying issue. Bugs happen.
However, when GR nodes restart they do not automatically attempt to re-form the cluster with the members they were previously aware of.
This requires manual intervention.
I run multiple GR clusters and have seen complete cluster failures that required manual intervention to bring the cluster back up. Such intervention takes time and so prolongs the outage, which is undesirable. Tooling in the shell allows this to be handled, but I believe the cluster members should attempt to rebuild the cluster automatically in such circumstances.
[ I am not sure whether I have already filed this; if so, I cannot find the feature request. ]
How to repeat:
Kill all members of a cluster at the same time or trigger some common failure.
Notice that once all members have gone down, simply restarting them does not bring the cluster back up automatically.
Manual action is needed, something like:
mysqlsh root@node1 --password=<password> -- dba reboot-cluster-from-complete-outage <cluster name> --rejoinInstances='node1,node2,node3'
Suggested fix:
Ideally, when the nodes recover they should reach out to the other members they were previously aware of and try to rebuild the cluster, continuing only if they can do so in a consistent and safe manner.
Such behaviour could lead to at least a quorum of the previous nodes agreeing on a common state and continuing from that agreed state. Some members may disagree with that common state and would then be expelled from the newly built cluster.
If the quorum cannot agree on a consistent common state, then cluster recovery clearly needs manual intervention.
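To make the decision described above more concrete, here is a rough Python sketch of how it might be modelled. This is not how Group Replication behaves today and not a proposed implementation. The function name `plan_recovery`, the use of plain integer sets as a stand-in for real GTID sets, and the rule that a member is "consistent" when its transactions are a subset of the chosen seed's are all assumptions made purely for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple


def plan_recovery(
    known_members: List[str],
    reported: Dict[str, Set[int]],
) -> Optional[Tuple[str, List[str], List[str]]]:
    """Decide whether a cluster could be rebuilt without manual action.

    known_members: the members the nodes remembered before the outage.
    reported: for each member that has come back, the set of transaction
        ids it has applied (hypothetical stand-in for a GTID set).

    Returns (seed, rejoin, expelled), or None if manual intervention
    is needed.
    """
    quorum = len(known_members) // 2 + 1

    # Only members that were part of the previous cluster may take part.
    candidates = {m: s for m, s in reported.items() if m in known_members}
    if len(candidates) < quorum:
        return None

    # Try the most advanced members first as the seed; ties are broken
    # by name so the result is deterministic.
    for seed in sorted(candidates, key=lambda m: (-len(candidates[m]), m)):
        # A member agrees with the seed if it has no transaction the
        # seed lacks, so it can catch up from the seed.
        consistent = [m for m in candidates if candidates[m] <= candidates[seed]]
        if len(consistent) >= quorum:
            rejoin = sorted(m for m in consistent if m != seed)
            expelled = sorted(m for m in candidates if m not in consistent)
            return seed, rejoin, expelled

    # No quorum agrees on a common state.
    return None
```

For example, with three known members where node1 and node2 report transactions {1, 2, 3} and node3 reports {1, 2, 4}, the sketch picks node1 as the seed, rejoins node2 and expels node3. If all three members report mutually diverging sets, it returns None, which corresponds to the case where manual intervention is still required.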
Successful auto-recovery from complete cluster failure would reduce downtime, increase availability and avoid the need for manual recovery, and thus provide a better experience for users.