Bug #90438 MySQL InnoDB Cluster (mysql-shell AdminAPI) fails to rejoin instances
Submitted: 14 Apr 2018 1:58     Modified: 21 May 2018 10:46
Reporter: Kenny Gryp
Status: Closed
Category: Shell AdminAPI InnoDB Cluster / ReplicaSet     Severity: S3 (Non-critical)
Version: 8.0.4, 5.7.21     OS: CentOS (7)
Assigned to: MySQL Verification Team     CPU Architecture: Any

[14 Apr 2018 1:58] Kenny Gryp
Description:
When setting up a Group Replication cluster with the AdminAPI, there are several cases in which rejoining an instance, whether manually or automatically, fails.

How to repeat:

1. create a cluster

mysql-js> dba.createCluster('charleroi')
A new InnoDB cluster will be created on instance 'root@192.168.70.2:3306'.

Creating InnoDB cluster 'charleroi' on 'root@192.168.70.2:3306'...
Adding Seed Instance...

Cluster successfully created. Use Cluster.addInstance() to add MySQL instances.
At least 3 instances are needed for the cluster to be able to withstand up to
one server failure.

<Cluster:charleroi>

2. add 2 other instances to the cluster

mysql-js> cluster=dba.getCluster()

mysql-js> cluster.addInstance('root@192.168.70.3:3306')
A new instance will be added to the InnoDB cluster. Depending on the amount of
data on the cluster this might take from a few seconds to several hours.

Please provide the password for 'root@192.168.70.3:3306': 
Adding instance to the cluster ...

The instance 'root@192.168.70.3:3306' was successfully added to the cluster.

mysql-js> cluster.addInstance('root@192.168.70.4:3306')
A new instance will be added to the InnoDB cluster. Depending on the amount of
data on the cluster this might take from a few seconds to several hours.

Please provide the password for 'root@192.168.70.4:3306': 
Adding instance to the cluster ...

The instance 'root@192.168.70.4:3306' was successfully added to the cluster.

3. check the cluster status; all is good

mysql-js> cluster.status()
{
    "clusterName": "charleroi", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "192.168.70.2:3306", 
        "status": "OK", 
        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", 
        "topology": {
            "192.168.70.2:3306": {
                "address": "192.168.70.2:3306", 
                "mode": "R/W", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "ONLINE"
            }, 
            "192.168.70.3:3306": {
                "address": "192.168.70.3:3306", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "ONLINE"
            }, 
            "192.168.70.4:3306": {
                "address": "192.168.70.4:3306", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "ONLINE"
            }
        }
    }
}

4. check the seeds configuration on each node; note that node1, the seed instance, has an empty seed list:

mysql> select @@hostname, @@group_replication_group_seeds;
+------------+---------------------------------+
| @@hostname | @@group_replication_group_seeds |
+------------+---------------------------------+
| node1      |                                 |
+------------+---------------------------------+
1 row in set (0.00 sec)

mysql> select @@hostname, @@group_replication_group_seeds;
+------------+---------------------------------+
| @@hostname | @@group_replication_group_seeds |
+------------+---------------------------------+
| node2      | 192.168.70.2:13306              |
+------------+---------------------------------+
1 row in set (0.00 sec)

mysql>  select @@hostname, @@group_replication_group_seeds;
+------------+---------------------------------+
| @@hostname | @@group_replication_group_seeds |
+------------+---------------------------------+
| node3      | 192.168.70.2:13306              |
+------------+---------------------------------+
1 row in set (0.00 sec)

5. on node1, stop and start group replication manually:

mysql> stop group_replication;
Query OK, 0 rows affected (0.00 sec)

mysql> start group_replication;
2018-04-14T01:23:05.830963Z 36 [Warning] [MY-011254] Plugin group_replication reported: '[GCS] Automatically adding IPv4 localhost address to the whitelist. It is mandatory that it is added.'
2018-04-14T01:23:05.834178Z 51 [System] [MY-010597] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_applier' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2018-04-14T01:23:05.851062Z 36 [ERROR] [MY-011254] Plugin group_replication reported: '[GCS] Unable to join the group: peers not configured. '
2018-04-14T01:23:05.852067Z 36 [ERROR] [MY-011254] Plugin group_replication reported: 'Error on group communication engine start'
2018-04-14T01:23:05.852284Z 36 [ERROR] [MY-011254] Plugin group_replication reported: '[GCS] The member is leaving a group without being on one.'
ERROR 3097 (HY000): The START GROUP_REPLICATION command failed as there was an error when joining the communication group.
mysql> 

PROBLEM 1: This fails, as expected, because group_replication_group_seeds is empty on node1.
PROBLEM 2: node2 and node3 will also fail to rejoin, because their only seed points to a member that is no longer part of the group. A manual workaround is sketched below.
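
One possible manual workaround (a sketch, assuming the surviving members are still reachable on the Group Replication ports used in this reproduction) is to point the failing node's seed list at the remaining members before restarting Group Replication:

-- on node1: seed from the members that are still in the group
mysql> SET GLOBAL group_replication_group_seeds = '192.168.70.3:13306,192.168.70.4:13306';
mysql> START GROUP_REPLICATION;

The same approach, with the seed list adjusted accordingly, would apply to node2 and node3 should they ever need to rejoin.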

6. try to have node1 rejoin the cluster with MySQL Shell

mysql-js> cluster.status();
{
    "clusterName": "charleroi", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "192.168.70.3:3306", 
        "status": "OK_NO_TOLERANCE", 
        "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active", 
        "topology": {
            "192.168.70.2:3306": {
                "address": "192.168.70.2:3306", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "(MISSING)"
            }, 
            "192.168.70.3:3306": {
                "address": "192.168.70.3:3306", 
                "mode": "R/W", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "ONLINE"
            }, 
            "192.168.70.4:3306": {
                "address": "192.168.70.4:3306", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "ONLINE"
            }
        }
    }
}

mysql-js> cluster.rejoinInstance('root@192.168.70.2:3306')
Rejoining the instance to the InnoDB cluster. Depending on the original
problem that made the instance unavailable, the rejoin operation might not be
successful and further manual steps will be needed to fix the underlying
problem.

Please monitor the output of the rejoin operation and take necessary action if
the instance cannot rejoin.

Please provide the password for 'root@192.168.70.2:3306': 
Rejoining instance to the cluster ...

Cluster.rejoinInstance: ERROR: 
Group Replication join failed.
ERROR: Group Replication plugin failed to start. Server error log contains the following errors: 
 2018-04-14T01:49:56.642418Z 0 [ERROR] [MY-011254] Plugin group_replication reported: '[GCS] Error connecting to all peers. Member join failed. Local port: 13306'
2018-04-14T01:49:56.656751Z 0 [ERROR] [MY-011254] Plugin group_replication reported: '[GCS] The member was unable to join the group. Local port: 13306'
2018-04-14T01:50:56.519066Z 65 [ERROR] [MY-011254] Plugin group_replication reported: 'Timeout on wait for view after joining group'
2018-04-14T01:50:56.519141Z 65 [ERROR] [MY-011254] Plugin group_replication reported: '[GCS] The member is leaving a group without being on one.'

ERROR: Error joining instance to cluster: '192.168.70.2:3306' - Query failed. MySQL Error (3092): The server is not configured properly to be an active member of the group. Please see more details on error log.. Query: START group_replication (RuntimeError)

in node1's error log:
2018-04-14T01:49:56.505785Z 65 [Warning] [MY-011254] Plugin group_replication reported: '[GCS] Automatically adding IPv4 localhost address to the whitelist. It is mandatory that it is added.'
2018-04-14T01:49:56.507633Z 67 [System] [MY-010597] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_applier' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2018-04-14T01:49:56.642418Z 0 [ERROR] [MY-011254] Plugin group_replication reported: '[GCS] Error connecting to all peers. Member join failed. Local port: 13306'
2018-04-14T01:49:56.653242Z 0 [Warning] [MY-011254] Plugin group_replication reported: '[GCS] read failed'
2018-04-14T01:49:56.656751Z 0 [ERROR] [MY-011254] Plugin group_replication reported: '[GCS] The member was unable to join the group. Local port: 13306'
2018-04-14T01:50:56.519066Z 65 [ERROR] [MY-011254] Plugin group_replication reported: 'Timeout on wait for view after joining group'
2018-04-14T01:50:56.519141Z 65 [ERROR] [MY-011254] Plugin group_replication reported: '[GCS] The member is leaving a group without being on one.'

mysql> select @@hostname, @@group_replication_group_seeds;
+------------+---------------------------------+
| @@hostname | @@group_replication_group_seeds |
+------------+---------------------------------+
| node1      | 192.168.70.2:13306              |
+------------+---------------------------------+
1 row in set (0.00 sec)

PROBLEM 3: often (but not always) node1 ends up with its own address as its only seed, so it tries to join the group through itself, which cannot work.
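
A quick way to spot this condition (using the standard Group Replication system variables) is to compare the instance's own address against its seed list:

mysql> select @@group_replication_local_address, @@group_replication_group_seeds;

If the local address is the only entry in the seed list, the node can only try to join through itself, which produces the "Error connecting to all peers" failure shown above.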

Suggested fix:
The AdminAPI should manage group_replication_group_seeds properly: whenever it configures an instance, it should populate the seed list with the addresses of all cluster members, and update the existing members' lists as the topology changes.
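
For this three-node example, following the suggestion above would mean writing something like the following on every member (SET PERSIST requires MySQL 8.0; on 5.7 the value would have to be set with SET GLOBAL and mirrored in my.cnf):

mysql> SET PERSIST group_replication_group_seeds = '192.168.70.2:13306,192.168.70.3:13306,192.168.70.4:13306';

With all members listed on every node, any instance can rejoin through whichever peers are still online.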
[21 May 2018 10:46] Miguel Araujo
Hi Kenny,

This issue has been fixed in the latest Shell release: 8.0.11 GA.

Thank you and best regards!