Bug #97645 Cannot rejoinInstance after a group replication member being expelled
Submitted: 14 Nov 2019 16:06    Modified: 15 Dec 2019 17:37
Reporter: Eric Yan
Status: No Feedback
Category: MySQL Server: Group Replication    Severity: S3 (Non-critical)
Version: 8.0.18    OS: CentOS
Assigned to: Bogdan Kecman    CPU Architecture: Any

[14 Nov 2019 16:06] Eric Yan
Description:
When trying to manually rejoin an expelled instance back to the group replication cluster, we get this error:

SystemError: RuntimeError: Cluster.rejoin_instance: The instance 'db-1001:3306' does not belong to the ReplicaSet: 'default'

We can, however, rejoin the expelled member by:

STOP GROUP_REPLICATION;
START GROUP_REPLICATION;

How to repeat:
We have a group replication cluster of 5 nodes:

root@db-1001 [(none)]> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 238fea26-ff16-11e9-b2ea-d06726c86300 | db-2001     |        3306 | ONLINE       | SECONDARY   | 8.0.18         |
| group_replication_applier | 71cba15b-ff15-11e9-9047-8030e005f300 | db-1002     |        3306 | ONLINE       | PRIMARY     | 8.0.18         |
| group_replication_applier | 92fc2903-ff14-11e9-86a4-8030e006a5a0 | db-1001     |        3306 | ONLINE       | SECONDARY   | 8.0.18         |
| group_replication_applier | d7354d4f-ff16-11e9-9b25-8030e0060690 | db-6001     |        3306 | ONLINE       | SECONDARY   | 8.0.18         |
| group_replication_applier | edca9e98-ff16-11e9-afb4-d06726c87550 | db-2002     |        3306 | ONLINE       | SECONDARY   | 8.0.18         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
5 rows in set (0.00 sec)

Now we block connections on db-1001 to simulate a network glitch. Shortly after the network comes back, the member's state changes to ERROR, which is expected:

2019-11-14T13:20:29.478612Z 0 [ERROR] [MY-011505] [Repl] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
2019-11-14T13:20:29.478729Z 0 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'

Since we have autoRejoin disabled, we need to manually rejoin the expelled member back via MySQL Shell:

MySQL Shell 8.0.18

Copyright (c) 2016, 2019, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates.
Other names may be trademarks of their respective owners.

Type '\help' or '\?' for help; '\quit' to exit.
WARNING: Using a password on the command line interface can be insecure.
Creating a session to 'gr_admin@db-1002'
Your MySQL connection id is 879636 (X protocol)
Server version: 8.0.18 MySQL Community Server - GPL
No default schema selected; type \use <schema> to set one.
 MySQL  db-1002:33060+ ssl  Py > cluster = dba.get_cluster()
 MySQL  db-1002:33060+ ssl  Py > cluster.describe()
{
    "clusterName": "gr__bk_eu__test__3", 
    "defaultReplicaSet": {
        "name": "default", 
        "topology": [
            {
                "address": "db-1002:3306", 
                "label": "db-1002:3306", 
                "role": "HA"
            }, 
            {
                "address": "db-2001:3306", 
                "label": "db-2001:3306", 
                "role": "HA"
            }, 
            {
                "address": "db-2002:3306", 
                "label": "db-2002:3306", 
                "role": "HA"
            }, 
            {
                "address": "db-6001:3306", 
                "label": "db-6001:3306", 
                "role": "HA"
            }
        ], 
        "topologyMode": "Single-Primary"
    }
}
 MySQL  db-1002:33060+ ssl  Py > cluster.rejoin_instance("gr_admin@db-1001:3306")
Rejoining the instance to the InnoDB cluster. Depending on the original
problem that made the instance unavailable, the rejoin operation might not be
successful and further manual steps will be needed to fix the underlying
problem.

Please monitor the output of the rejoin operation and take necessary action if
the instance cannot rejoin.

Rejoining instance to the cluster ...

ERROR: Failed to erase the password: Unknown or unsupported command: erase
Please provide the password for 'gr_admin@db-1001:3306': *************************
Save password for 'gr_admin@db-1001:3306'? [Y]es/[N]o/Ne[v]er (default No): 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
SystemError: RuntimeError: Cluster.rejoin_instance: The instance 'db-1001:3306' does not belong to the ReplicaSet: 'default'.

Suggested fix:
It should be possible to use Cluster.rejoinInstance() to rejoin the expelled instance without restarting Group Replication.
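(Editorial note: the server-side auto-rejoin mentioned above is controlled by the group_replication_autorejoin_tries system variable, available since MySQL 8.0.16. A minimal sketch of checking and enabling it; the value 3 is only illustrative, and the report states it was intentionally left disabled:)

root@db-1001 [(none)]> SHOW VARIABLES LIKE 'group_replication_autorejoin_tries';
root@db-1001 [(none)]> SET GLOBAL group_replication_autorejoin_tries = 3;

With a non-zero value, an expelled member attempts that many automatic rejoins before settling into the ERROR state.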
[15 Nov 2019 17:32] Miguel Araujo
Hi Eric/Simon,

The output of cluster.describe() you added in this bug report does not include `db-1001`, so that instance is not part of the cluster's metadata.

For that reason, attempting to rejoin `db-1001` to the cluster using cluster.rejoinInstance() fails.

You might think: "OK, it's not in the metadata, so I'll use cluster.addInstance() instead" - but that would fail too, because the command detects that the instance is part of the Group Replication group but not part of the metadata. An error like the following would be thrown:

ERROR: Instance 'localhost:3330' is part of the Group Replication group but is not in the metadata. Please use <Cluster>.rescan() to update the metadata.

Using cluster.rescan() ensures that the metadata is updated accordingly.
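(Editorial note: the flow described above can be sketched in MySQL Shell, Python mode; hostnames follow the report, interactive confirmation prompts are omitted, and rejoin_instance() is only needed if the instance is still not ONLINE after the metadata is fixed:)

 MySQL  db-1002:33060+ ssl  Py > cluster = dba.get_cluster()
 MySQL  db-1002:33060+ ssl  Py > cluster.rescan()        # reconcile metadata with the actual GR group
 MySQL  db-1002:33060+ ssl  Py > cluster.rejoin_instance("gr_admin@db-1001:3306")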

With just this information we cannot assess if there's a bug somewhere.

Please provide a more thorough description of the issue and setup. For example, what were the steps to reach this point?
[16 Dec 2019 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".