Bug #90793 Recovering GR cluster from complete outage fails when SSL is disabled
Submitted: 8 May 2018 13:05 Modified: 28 Jun 2018 14:39
Reporter: Andrew Pryde
Status: Closed
Category:Shell AdminAPI InnoDB Cluster / ReplicaSet Severity:S2 (Serious)
Version:8.0.11 OS:Any (Official Docker image)
Assigned to: CPU Architecture:Any
Tags: mysqlsh

[8 May 2018 13:05] Andrew Pryde
Description:
Calling dba.reboot_cluster_from_complete_outage("MySQLCluster") in mysql-shell on a cluster that was created with dba.create_cluster("MySQLCluster", {"memberSslMode": "DISABLED"}) fails to recover the cluster, because mysql-server restarts with SSL enabled.

How to repeat:
1. dba.create_cluster("MySQLCluster", {"memberSslMode": "DISABLED"})
2. Terminate all instances in cluster 
3. dba.reboot_cluster_from_complete_outage("MySQLCluster")
4. Note that the cluster does not come back online because SSL has been re-enabled.
[8 May 2018 13:20] Andrew Pryde
Updated title
[8 May 2018 14:22] Miguel Araujo
Posted by developer:
 
Issue
-----

The issue is caused by dba.rebootClusterFromCompleteOutage() not honouring the *PERSISTED* values of the Group Replication SSL settings and not passing them to the internal addInstance() call made for the seed instance.

Fix
---

The fix is to read the persisted values of group_replication_recovery_use_ssl and group_replication_ssl_mode on the target instance and pass them to the internal addInstance() call.
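
The idea behind the fix can be pictured with a small sketch in plain JavaScript. This is not the actual Shell implementation: the function name resolveMemberSslMode and its input shape (a map of persisted variable names to their values, such as might be read from performance_schema.persisted_variables) are hypothetical, and the fallback from group_replication_recovery_use_ssl to a mode is an assumption.

```javascript
// Hypothetical helper (not part of the AdminAPI): derive the SSL mode to use
// for the seed instance from its persisted Group Replication settings.
// `persisted` maps variable names to persisted string values.
function resolveMemberSslMode(persisted) {
  // Prefer the explicit persisted SSL mode when there is one.
  const sslMode = persisted["group_replication_ssl_mode"];
  if (typeof sslMode === "string" && sslMode !== "") {
    return sslMode.toUpperCase();
  }

  // Assumed fallback: infer the mode from the recovery channel setting.
  const recoveryUseSsl = persisted["group_replication_recovery_use_ssl"];
  if (typeof recoveryUseSsl === "string") {
    const value = recoveryUseSsl.toUpperCase();
    if (value === "ON" || value === "1") return "REQUIRED";
    if (value === "OFF" || value === "0") return "DISABLED";
  }

  // Nothing persisted: let the caller keep its own default.
  return undefined;
}

// Example: a cluster created with SSL disabled keeps it disabled on reboot.
const seedOptions = {
  memberSslMode: resolveMemberSslMode({
    group_replication_ssl_mode: "DISABLED",
    group_replication_recovery_use_ssl: "OFF",
  }),
};
console.log(seedOptions);
```

Returning undefined when nothing is persisted lets the caller fall back to whatever default it already uses, rather than forcing a mode the instance never had.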

Workaround
----------

1) Connect to the seed instance (in this example, 'localhost:3330') and execute the following:

RESET PERSIST group_replication_bootstrap_group;

SET GLOBAL group_replication_bootstrap_group=ON;

START GROUP_REPLICATION;

The cluster should now be back online. To double-check:

MySQL   localhost:3330   SQL  \js
Switching to JavaScript mode...
MySQL   localhost:3330   JS  var c = dba.getCluster()
MySQL   localhost:3330   JS  c.status()
{
    "clusterName": "fooBar", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "localhost:3330", 
        "ssl": "REQUIRED", 
        "status": "OK_NO_TOLERANCE", 
        "statusText": "Cluster is NOT tolerant to any failures. 2 members are not active", 
        "topology": {
            "localhost:3310": {
                "address": "localhost:3310", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "(MISSING)"
            }, 
            "localhost:3320": {
                "address": "localhost:3320", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "(MISSING)"
            }, 
            "localhost:3330": {
                "address": "localhost:3330", 
                "mode": "R/W", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "ONLINE"
            }
        }
    }, 
    "groupInformationSourceMember": "mysql://root@localhost:3330"
}
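
To see at a glance which members still need to be rejoined, the status output above can be filtered for entries that are not ONLINE. The following is a minimal sketch in plain JavaScript; membersNotOnline is a hypothetical helper name, and it assumes the object has the same shape as the c.status() output shown here (defaultReplicaSet.topology keyed by address).

```javascript
// Hypothetical helper: list the addresses of members whose status is not
// "ONLINE", given an object shaped like the c.status() output above.
function membersNotOnline(status) {
  const topology = status.defaultReplicaSet.topology;
  return Object.keys(topology)
    .filter((address) => topology[address].status !== "ONLINE")
    .sort();
}

// Trimmed copy of the status shown above, keeping only the fields used here.
const sampleStatus = {
  clusterName: "fooBar",
  defaultReplicaSet: {
    topology: {
      "localhost:3310": { address: "localhost:3310", status: "(MISSING)" },
      "localhost:3320": { address: "localhost:3320", status: "(MISSING)" },
      "localhost:3330": { address: "localhost:3330", status: "ONLINE" },
    },
  },
};

console.log(membersNotOnline(sampleStatus));
```

For the status above, the two addresses this returns are the instances that step 2 rejoins.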

2) Rejoin the remaining instances

MySQL   localhost:3330   JS  c.rejoinInstance("root@localhost:3310")
Rejoining the instance to the InnoDB cluster. Depending on the original
problem that made the instance unavailable, the rejoin operation might not be
successful and further manual steps will be needed to fix the underlying
problem.

Please monitor the output of the rejoin operation and take necessary action if
the instance cannot rejoin.

Please provide the password for 'root@localhost:3310': ***
Rejoining instance to the cluster ...

The instance 'localhost:3310' was successfully rejoined on the cluster.

MySQL   localhost:3330   JS  c.rejoinInstance("root@localhost:3320")
Rejoining the instance to the InnoDB cluster. Depending on the original
problem that made the instance unavailable, the rejoin operation might not be
successful and further manual steps will be needed to fix the underlying
problem.

Please monitor the output of the rejoin operation and take necessary action if
the instance cannot rejoin.

Please provide the password for 'root@localhost:3320': ***
Rejoining instance to the cluster ...

The instance 'localhost:3320' was successfully rejoined on the cluster.

MySQL   localhost:3330   JS  c.status()
{
    "clusterName": "fooBar", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "localhost:3330", 
        "ssl": "REQUIRED", 
        "status": "OK", 
        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", 
        "topology": {
            "localhost:3310": {
                "address": "localhost:3310", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "ONLINE"
            }, 
            "localhost:3320": {
                "address": "localhost:3320", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "ONLINE"
            }, 
            "localhost:3330": {
                "address": "localhost:3330", 
                "mode": "R/W", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "ONLINE"
            }
        }
    }, 
    "groupInformationSourceMember": "mysql://root@localhost:3330"
}
[28 Jun 2018 14:39] David Moss
Posted by developer:
 
Thank you for your feedback. This has been fixed in upcoming versions, and the following was added to the 8.0.12 changelog:
In the event of a whole cluster stopping unexpectedly, upon reboot the memberSslMode was not preserved. In a cluster where SSL had been disabled, upon issuing dba.rebootClusterFromCompleteOutage() this could prevent instances from rejoining the cluster.