Bug #109202 | Unable to fetch live group_replication member data from any server in cluster | | |
---|---|---|---|
Submitted: | 25 Nov 2022 2:46 | Modified: | 2 Dec 2022 7:24 |
Reporter: | zetang zeng | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Router | Severity: | S4 (Feature request) |
Version: | 8.0.30 | OS: | CentOS |
Assigned to: | | CPU Architecture: | Any |
[25 Nov 2022 2:46]
zetang zeng
[29 Nov 2022 1:11]
MySQL Verification Team
Hi Zetang,

Thank you for the report. I did manage to reproduce this as described.

Kind regards
[1 Dec 2022 11:35]
Pawel Mroszczyk
Posted by developer:

As far as I can tell, this is actually working by design. When the cluster loses quorum, any reads from it are questionable. It might, for example, be better for the Router to fail to route (close the ports), so that the application can fail over to another Router connected to a healthy cluster, or at least realise it is working on unreliable data. But regardless of that, if the user is sure that this is the behaviour that's right for him, we could try to add such functionality. Marking it as a feature request.

PS. GR data is actually being collected from the last server standing. The warnings that Bogdan pasted merely advise that the expected state differs from the real state of the cluster. Maybe the warning messages should have been clearer; I happen to know better because I wrote the code that generated them. Here's what really happened:

There is a loop in the code which is activated after GR data is read. Among other things, it compares the expected list of nodes (as defined by the metadata provided by the Metadata Cache) against the actual list of available nodes (provided by Group Replication). Since there's no quorum, the second list is empty, so every node comes up as missing, and a warning is generated for each one.
If the Router had failed to extract GR data, it would never make it far enough to execute this check; it would fail earlier here (all code appears in metadata_cache/src/cluster_metadata_gr.cc):

```cpp
606   // this node's perspective: give status of all nodes you see
607   std::map<std::string, GroupReplicationMember> member_status =
608       fetch_group_replication_members(
609           *gr_member_connection,
610           single_primary_mode);  // throws metadata_cache::metadata_error
```

which would then be caught here:

```cpp
652   } catch (const metadata_cache::metadata_error &e) {
653     log_warning(
654         "Unable to fetch live group_replication member data from %s from "
655         "cluster '%s': %s",
656         mi_addr.c_str(), target_cluster.c_str(), e.what());
657     continue;  // faulty server, next!
658   } catch (const std::exception &e) {
```

thus producing the error "Unable to fetch live group_replication member data from ... etc".

The code that actually logs the warnings is called from here:

```cpp
615   // check status of all nodes; updates instances
616   // ------------------vvvvvvvvvvvvvvvvvv
617   bool metadata_gr_discrepancy{false};
618   const auto status = check_cluster_status(cluster.members, member_status,
619                                            metadata_gr_discrepancy);
```

and the log printing happens inside check_cluster_status():

```cpp
751   // we do two things here:
752   // 1. for all `instances`, set .mode according to corresponding .status found
753   //    in `member_status`
754   // 2. count nodes which are part of quorum (online/recovering nodes)
      ...
758   for (auto &member : instances) {
759     auto status = member_status.find(member.mysql_server_uuid);
760     const bool node_in_gr = status != member_status.end();
      ...
765     if (node_in_gr) {
      ...
793     } else {
      ...
798       log_custom(log_level,
799                  "Member %s:%d (%s) defined in metadata not found in actual "
800                  "Group Replication",
801                  member.host.c_str(), member.port,
802                  member.mysql_server_uuid.c_str());
803     }
804   }
```
[1 Dec 2022 11:51]
MySQL Verification Team
Thanks for the update, Pawel. It would be good if we could:

- document that even read-only access is unavailable when there is no quorum
- make the behaviour configurable (allow or disallow read-only access when quorum is lost)

Thanks
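A configurable behaviour of this kind might look something like the following mysqlrouter.conf fragment. This is purely an illustrative sketch of the feature request: the option name `quorum_lost_allow_reads` and its values are invented for this example and are not an existing Router option.

```ini
[metadata_cache:bootstrap]
cluster_type=gr
router_id=1
user=mysql_router1_user
ttl=0.5
# Hypothetical knob (not a real option): whether the Router keeps
# routing read-only traffic to the last reachable member after the
# cluster has lost quorum, instead of closing its ports.
quorum_lost_allow_reads=1
```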
[2 Dec 2022 7:24]
zetang zeng
> When the cluster loses quorum, any reads to it are questionable. For example, perhaps it might be better that the Router fails to route (close the ports), so that the application can, for example, fail over to another Router connected to a healthy cluster. Or realise it's working on unreliable data.

I understand what you describe, but maybe it is better for MySQL Cluster to make this decision. The Router should be transparent: just route me to MySQL Cluster and let me communicate with it.