Bug #100036 Unable to fetch live group_replication member data from any server in replicaset
Submitted: 29 Jun 2020 14:28    Modified: 9 Aug 2020 11:20
Reporter: Snehal Bhavsar        Status: No Feedback
Category: MySQL Router          Severity: S1 (Critical)
Version:                        OS: CentOS
Assigned to: MySQL Verification Team    CPU Architecture: Any

[29 Jun 2020 14:28] Snehal Bhavsar
Description:
Hello All,

We have found another issue with InnoDB Cluster. When only one of the three nodes is left in the cluster and we clone the other instances so they can rejoin, MySQL Router blocks client connections from the application to the cluster while the cloning is in progress. In our view this should not happen: a single node is still available and, as the primary, it should be able to serve all operations even while it is acting as the clone donor for another instance.

ERROR LOGS of MySQL Router: 
2020-06-29 22:01:35 metadata_cache WARNING [7f6cb821e700] xxx.xx.xxx.xxx:1122 is not part of quorum for replicaset 'default'
2020-06-29 22:01:35 metadata_cache ERROR [7f6cb821e700] Unable to fetch live group_replication member data from any server in replicaset 'default'
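
For reference, the membership state that Group Replication itself reports can be checked directly on the surviving node with the standard query below; this is a generic check, not output captured from our systems.

SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
  FROM performance_schema.replication_group_members;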

How to repeat:
Only a single node out of three is available in the cluster, as shown below:

 MySQL  xxx.xx.xx.xxx:1122 ssl  JS > a.status()
{
    "clusterName": "InnoDBCluster", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "xxx.xx.xx.xxx:1122", 
        "ssl": "REQUIRED", 
        "status": "OK_NO_TOLERANCE", 
        "statusText": "Cluster is NOT tolerant to any failures. 2 members are not active", 
        "topology": {
            "xxx.xx.xx.xxx:1122": {
                "address": "xxx.xx.xx.xxx:1122", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "(MISSING)"
            }, 
            "xxx.xx.xx.xxx:1122": {
                "address": "xxx.xx.xx.xxx:1122", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "(MISSING)"
            }, 
            "xxx.xx.xx.xxx:1122": {
                "address": "xxx.xx.xx.xxx:1122", 
                "mode": "R/W", 
                "readReplicas": {}, 
                "replicationLag": null, 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.18"
            }
        }, 
        "topologyMode": "Single-Primary"
    }, 
    "groupInformationSourceMember": "xxx.xx.xx.xxx:1122"

Now we have to force-remove each of these missing nodes and add them back by cloning; the sequence we use is sketched below. While the clone process is in progress we face client connection issues, because MySQL Router rejects client connections.
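
For clarity, the force-remove/re-add flow we use is roughly the following AdminAPI sequence, run in MySQL Shell against the surviving primary (the host:port values are placeholders):

// connect MySQL Shell to the surviving primary, then:
var cluster = dba.getCluster();

// force-remove a member that shows up as (MISSING)
cluster.removeInstance("xxx.xx.xx.xxx:1122", {force: true});

// add it back, provisioning the instance with the clone plugin
cluster.addInstance("xxx.xx.xx.xxx:1122", {recoveryMethod: "clone"});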

MySQL Router Error Logs:

2020-06-29 22:01:35 metadata_cache WARNING [7f6cb821e700] xxx.xx.xx.xxx:1122 is not part of quorum for replicaset 'default'
2020-06-29 22:01:35 metadata_cache ERROR [7f6cb821e700] Unable to fetch live group_replication member data from any server in replicaset 'default'
[7 Jul 2020 11:29] MySQL Verification Team
Hi,

Can you please share the configuration from the servers?

Also, how did this group end up with only one server? Did you shut down two of them (properly), or did something else happen?

Thanks
Bogdan
[8 Jul 2020 8:54] Snehal Bhavsar
No, we did not shut down these nodes. These two servers go missing from the cluster every time because of the error below, which we believe is another bug, this time in the writeset handling:

2020-06-27T09:32:38.523253Z 18 [ERROR] [MY-010584] [Repl] Slave SQL for channel 'group_replication_applier': Worker 4 failed executing transaction 'e295b724-53c8-11ea-80c8-fa163efa4b49:381194653'; Could not execute Delete_rows event on table xxxxxxxx.QRTZ_TRIGGERS; Cannot delete or update a parent row: a foreign key constraint fails (`xxxxxxxx`.`QRTZ_CRON_TRIGGERS`, CONSTRAINT `QRTZ_CRON_TRIGGERS_ibfk_1` FOREIGN KEY (`SCHED_NAME`, `TRIGGER_NAME`, `TRIGGER_GROUP`) REFERENCES `QRTZ_TRIGGERS` (`SCHED_NAME`, `TRIGGER_NAME`, `), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED, Error_code: MY-001451

2020-06-27T09:32:38.523843Z 14 [ERROR] [MY-011451] [Repl] Plugin group_replication reported: 'The applier thread execution was aborted. Unable to process more transactions, this member will now leave the group.'

2020-06-27T09:32:38.527530Z 11 [ERROR] [MY-011452] [Repl] Plugin group_replication reported: 'Fatal error during execution on the Applier process of Group Replication. The server will now leave the group.'

2020-06-27T09:32:38.530952Z 11 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'

2020-06-27T09:32:38.542099Z 14 [ERROR] [MY-010586] [Repl] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'FIRST' position 0
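
If it helps, the applier error can also be inspected via performance_schema on the affected node; this is a generic query, assuming the standard channel name:

SELECT WORKER_ID, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE, LAST_ERROR_TIMESTAMP
  FROM performance_schema.replication_applier_status_by_worker
 WHERE CHANNEL_NAME = 'group_replication_applier';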
[8 Jul 2020 16:12] MySQL Verification Team
Hi,

And the configs? Can you share them?

Thanks
Bogdan
[9 Jul 2020 11:20] MySQL Verification Team
Hi,

When 2 out of the 3 nodes went AWOL you lost the majority (quorum), so it is by design that writes are not allowed.

Are you following the defined procedure to restore the cluster (unblocking the group first)? To me it looks like you are not.

https://dev.mysql.com/doc/refman/8.0/en/mysql-innodb-cluster-working-with-cluster.html#res...
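
A rough sketch of the documented flow, run from MySQL Shell on the surviving member (the account and host:port below are placeholders, not taken from your setup):

// get the cluster object on the member that still holds the metadata
var cluster = dba.getCluster();

// restore the group's quorum using the reachable partition first
cluster.forceQuorumUsingPartitionOf("clusterAdmin@xxx.xx.xx.xxx:1122");

// only after quorum is back, remove and re-add the failed members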

kind regards
Bogdan
[15 Jul 2020 11:35] Andrzej Religa
Hi All,

I was looking into this from the MySQL Router perspective.
I could not reproduce it by simply creating a 3-node cluster and doing "STOP GROUP_REPLICATION" on the 2 RO nodes. I can still use both the RW and RO ports after that, so there has to be more to it.
One potential reason for an error message like that in the log could be that the instance UUID stored in the cluster metadata and the one reported by the instance itself have become different for some reason. I would need the output of the following queries to confirm that:

select @@server_uuid;
select * from instances \G

BR,
Andrzej
[15 Jul 2020 11:41] Andrzej Religa
Forgot to mention that for 

select * from instances \G

one needs to:

use mysql_innodb_cluster_metadata;
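
Equivalently, both queries can be run in one go with the table schema-qualified, so no USE is needed:

select @@server_uuid;
select * from mysql_innodb_cluster_metadata.instances \G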

--
BR,
Andrzej
[10 Aug 2020 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".