Bug #109202 Unable to fetch live group_replication member data from any server in cluster
Submitted: 25 Nov 2022 2:46      Modified: 2 Dec 2022 7:24
Reporter: zetang zeng            Status: Verified
Category: MySQL Router           Severity: S4 (Feature request)
Version: 8.0.30                  OS: CentOS
Assigned to:                     CPU Architecture: Any

[25 Nov 2022 2:46] zetang zeng
Description:
Soon after the cluster enters the NO_QUORUM state, the Router goes down.

How to repeat:
Steps to reproduce:

- deploy a 3-node cluster & Router
- check the status; the result is OK

/opt/tiger/mysql-shell/bin/mysqlsh --uri "xxxx:xxxx@localhost:6446" -- cluster status
{
    "clusterName": "myCluster",
    "defaultReplicaSet": {
        "name": "default",
        "primary": "192.168.1.172:3406",
        "ssl": "REQUIRED",
        "status": "OK",
        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
        "topology": {
            "192.168.1.121:3406": {
                "address": "192.168.1.121:3406",
                "memberRole": "SECONDARY",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE",
                "version": "5.7.40"
            },
            "192.168.1.172:3406": {
                "address": "192.168.1.172:3406",
                "memberRole": "PRIMARY",
                "mode": "R/W",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE",
                "version": "5.7.40"
            },
            "192.168.1.26:3406": {
                "address": "192.168.1.26:3406",
                "memberRole": "SECONDARY",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE",
                "version": "5.7.40"
            }
        },
        "topologyMode": "Single-Primary"
    },
    "groupInformationSourceMember": "192.168.1.172:3406"
}

- on 192.168.1.121: sudo pkill -9 mysqld_safe && sudo pkill -9 mysqld
- on 192.168.1.26: sudo pkill -9 mysqld_safe && sudo pkill -9 mysqld
- on 192.168.1.172: check status again

/opt/tiger/mysql-shell/bin/mysqlsh --uri "xxxx:xxxx@localhost:6446" -- cluster status
WARNING: Cluster has no quorum and cannot process write transactions: Group has no quorum
{
    "clusterName": "myCluster",
    "defaultReplicaSet": {
        "name": "default",
        "primary": "192.168.1.172:3406",
        "ssl": "REQUIRED",
        "status": "NO_QUORUM",
        "statusText": "Cluster has no quorum as visible from '192.168.1.172:3406' and cannot process write transactions. 2 members are not active.",
        "topology": {
            "192.168.1.121:3406": {
                "address": "192.168.1.121:3406",
                "memberRole": "SECONDARY",
                "memberState": "(MISSING)",
                "mode": "n/a",
                "readReplicas": {},
                "role": "HA",
                "shellConnectError": "MySQL Error 2003: Could not open connection to '192.168.1.121:3406': Can't connect to MySQL server on '192.168.1.121:3406' (111)",
                "status": "UNREACHABLE"
            },
            "192.168.1.172:3406": {
                "address": "192.168.1.172:3406",
                "memberRole": "PRIMARY",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE",
                "version": "5.7.40"
            },
            "192.168.1.26:3406": {
                "address": "192.168.1.26:3406",
                "memberRole": "SECONDARY",
                "mode": "n/a",
                "readReplicas": {},
                "role": "HA",
                "shellConnectError": "MySQL Error 2003: Could not open connection to '192.168.1.26:3406': Can't connect to MySQL server on '192.168.1.26:3406' (111)",
                "status": "(MISSING)"
            }
        },
        "topologyMode": "Single-Primary"
    },
    "groupInformationSourceMember": "192.168.1.172:3406"
}

- a few seconds later: check status again

/opt/tiger/mysql-shell/bin/mysqlsh --uri "xxxx:xxxx@localhost:6446" -- cluster status
WARNING: Using a password on the command line interface can be insecure.
MySQL Error 2003 (HY000): Can't connect to MySQL server on 'localhost:6446' (111)

- the MySQL Router log:

2022-11-24 11:05:21 metadata_cache WARNING [7f228b043700] Member 192.168.1.121:3406 (4b08ec0c-6a4f-11ed-8de7-00163e3e48c8) defined in metadata not found in actual Group Replication
2022-11-24 11:05:21 metadata_cache WARNING [7f228b043700] 192.168.1.172:3406 is not part of quorum for cluster 'myCluster'
2022-11-24 11:05:21 metadata_cache WARNING [7f228b043700] Member 192.168.1.172:3406 (4b0e84e1-6a4f-11ed-8049-00163e4c0d19) defined in metadata not found in actual Group Replication
2022-11-24 11:05:21 metadata_cache WARNING [7f228b043700] 192.168.1.26:3406 is not part of quorum for cluster 'myCluster'
2022-11-24 11:05:21 metadata_cache WARNING [7f228b043700] Member 192.168.1.26:3406 (4b3ce4bb-6a4f-11ed-9ef0-00163e0f5188) defined in metadata not found in actual Group Replication
2022-11-24 11:05:21 metadata_cache WARNING [7f228b043700] 192.168.1.121:3406 is not part of quorum for cluster 'myCluster'
2022-11-24 11:05:21 metadata_cache ERROR [7f228b043700] Unable to fetch live group_replication member data from any server in cluster 'myCluster'

Suggested fix:
In the no-quorum state, the cluster should still be readable.
[29 Nov 2022 1:11] MySQL Verification Team
Hi Zetang,

Thank you for the report. I did manage to reproduce this as described.

kind regards
[1 Dec 2022 11:35] Pawel Mroszczyk
Posted by developer:
 
As far as I can tell, this is actually working as designed.  When the cluster loses quorum, any reads from it are questionable.  It may in fact be better for the Router to stop routing (close its ports), so that the application can fail over to another Router connected to a healthy cluster, or at least realise it is working on unreliable data.

Regardless of that, if the user is sure this behaviour is right for them, we could try to add such functionality.  Marking it as a feature request.

PS.
GR data is actually being collected from the last server standing.  The warnings that Bogdan pasted just indicate that the expected state differs from the real state of the cluster.  Maybe the warning messages should have been clearer; I only know better because I wrote the code that generated them.  Here's what really happened:

There is a loop in the code which runs after GR data is read.  Among other things, it compares the expected list of nodes (as defined by the metadata provided by the Metadata Cache) against the actual list of available nodes (provided by Group Replication).  Since there is no quorum, the second list is empty, so every node comes up as missing and a warning is generated for each one.
If the Router had failed to extract GR data, it would never make it far enough to execute this check; it would fail earlier, here (all code is in metadata_cache/src/cluster_metadata_gr.cc):

    606       // this node's perspective: give status of all nodes you see
    607       std::map<std::string, GroupReplicationMember> member_status =
    608           fetch_group_replication_members(
    609               *gr_member_connection,
    610               single_primary_mode);  // throws metadata_cache::metadata_error

which would then be caught here:

    652     } catch (const metadata_cache::metadata_error &e) {
    653       log_warning(
    654           "Unable to fetch live group_replication member data from %s from "
    655           "cluster '%s': %s",
    656           mi_addr.c_str(), target_cluster.c_str(), e.what());
    657       continue;  // faulty server, next!
    658     } catch (const std::exception &e) {

thus producing an error "Unable to fetch live group_replication member data from ... etc"

The actual code that's logging the warnings is called from here:

    615       // check status of all nodes; updates instances
    616       // ------------------vvvvvvvvvvvvvvvvvv
    617       bool metadata_gr_discrepancy{false};
    618       const auto status = check_cluster_status(cluster.members, member_status,
    619                                                metadata_gr_discrepancy);

and the log printing happens inside check_cluster_status():

    751   // we do two things here:
    752   // 1. for all `instances`, set .mode according to corresponding .status found
    753   // in `member_status`
    754   // 2. count nodes which are part of quorum (online/recovering nodes)
    ...
    ...
    758   for (auto &member : instances) {
    759     auto status = member_status.find(member.mysql_server_uuid);
    760     const bool node_in_gr = status != member_status.end();
    ...
    ...
    765     if (node_in_gr) {
    ...
    ...
    793     } else {
    ...
    ...
    798       log_custom(log_level,
    799                  "Member %s:%d (%s) defined in metadata not found in actual "
    800                  "Group Replication",
    801                  member.host.c_str(), member.port,
    802                  member.mysql_server_uuid.c_str());
    803     }
    804   }
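
For illustration, here is a minimal standalone sketch of the comparison described above (simplified names and structures, NOT the actual Router code).  Run against an empty GR view, which is what a no-quorum group looks like at this point, it prints the same warning for every metadata-defined node:

    // Standalone sketch: every instance defined in the metadata is looked up
    // in the member list reported by Group Replication; anything not found
    // triggers the "defined in metadata not found in actual Group
    // Replication" warning.
    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    struct ManagedInstance {          // simplified stand-in for the metadata entry
      std::string mysql_server_uuid;
      std::string host;
      int port;
    };

    struct GroupReplicationMember {   // simplified stand-in for the GR view entry
      std::string state;
    };

    void check_cluster_status(
        const std::vector<ManagedInstance> &instances,
        const std::map<std::string, GroupReplicationMember> &member_status) {
      for (const auto &member : instances) {
        const auto status = member_status.find(member.mysql_server_uuid);
        const bool node_in_gr = (status != member_status.end());
        if (!node_in_gr) {
          // With no quorum the GR view is empty, so this fires for every node.
          std::printf(
              "Member %s:%d (%s) defined in metadata not found in actual "
              "Group Replication\n",
              member.host.c_str(), member.port,
              member.mysql_server_uuid.c_str());
        }
      }
    }

    int main() {
      // Three nodes defined in the metadata (UUIDs taken from the log above)...
      const std::vector<ManagedInstance> metadata = {
          {"4b08ec0c-6a4f-11ed-8de7-00163e3e48c8", "192.168.1.121", 3406},
          {"4b0e84e1-6a4f-11ed-8049-00163e4c0d19", "192.168.1.172", 3406},
          {"4b3ce4bb-6a4f-11ed-9ef0-00163e0f5188", "192.168.1.26", 3406}};
      // ...but an empty GR view, as reported when the group has no quorum.
      const std::map<std::string, GroupReplicationMember> gr_view;
      check_cluster_status(metadata, gr_view);
    }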
[1 Dec 2022 11:51] MySQL Verification Team
Thanks for the update, Pawel.
It would be good if we could:
 - document that even read-only access is unavailable when there is no quorum
 - make the behaviour configurable (allow or disallow read-only access when quorum is lost); an illustration follows below
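
For illustration only, such a knob might look something like this in mysqlrouter.conf (the option name and value below are hypothetical; no such setting exists in 8.0.30):

    [routing:myCluster_ro]
    bind_port=6447
    # hypothetical option: keep routing read-only traffic to reachable
    # members even when the group has no quorum
    allow_read_without_quorum=1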

Thanks
[2 Dec 2022 7:24] zetang zeng
> When the cluster loses quorum, any reads from it are questionable.  It may in fact be better for the Router to stop routing (close its ports), so that the application can fail over to another Router connected to a healthy cluster, or at least realise it is working on unreliable data.

I understand what you describe, but maybe it is better to let MySQL Cluster make this decision. The Router should be transparent: just route me to the MySQL Cluster and let me communicate with it.