Bug #118704: When a network anomaly occurs, the MGR (MySQL Group Replication) cluster remains in the UNREACHABLE state for an extended period
Submitted: 23 Jul 7:32    Modified: 6 Aug 5:43
Reporter: tangjie gong
Status: Can't repeat
Category: MySQL Server: Group Replication    Severity: S3 (Non-critical)
Version: 8.0.43    OS: Oracle Linux (redhat7.x)
Assigned to: MySQL Verification Team    CPU Architecture: x86

[23 Jul 7:32] tangjie gong
Description:
A new MGR (MySQL Group Replication) cluster was set up using MySQL Shell on MySQL 8.0.43, consisting of three nodes: mysql1, mysql2, and mysql3.
With mysql1 as the primary node, executing iptables -A OUTPUT -d mysql2 -j DROP on the mysql1 server to block outgoing traffic from mysql1 to mysql2 results in mysql1 being evicted from the cluster, leaving a new cluster composed of mysql2 and mysql3.
However, executing iptables -A OUTPUT -d mysql3 -j DROP on the mysql1 server to block outgoing traffic from mysql1 to mysql3 produces the following anomalies:
From mysql1's perspective, its own status and mysql3's status are ONLINE, while mysql2's status is UNREACHABLE.
From mysql2's perspective, its own status and mysql3's status are ONLINE, while mysql1's status is UNREACHABLE.
From mysql3's perspective, all nodes are ONLINE.
Additionally, the error log of mysql1 continuously reports the following error:
[Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'

During testing, if the above situation does not occur, set mysql2 or mysql3 as the primary node and retry. The test can be repeated for at most 6 rounds (three possible primaries, each blocking traffic to one of its two secondaries), the only variable being which host the primary's iptables -A OUTPUT -d <host> -j DROP rule targets. The phenomenon is therefore not tied to mysql1 being the primary node.
In total, there are three possible outcomes:
1. The cluster remains in the UNREACHABLE state for an extended period without any node being expelled; master-slave data synchronization continues to work normally during this time.
2. The primary node is expelled, triggering a master-slave switchover.
3. A slave node is expelled without triggering a switchover.
When a node is expelled, the error log shows 'Error pushing message into group communication engine.', followed by a message indicating that the node is set to ERROR due to network issues: [Repl] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'

The versions tested include 8.0.23, 8.0.28, 8.0.32, and 8.0.43.

How to repeat:
An MGR (MySQL Group Replication) cluster already exists; it was set up using MySQL Shell with default parameters.

mysql> select MEMBER_ID,MEMBER_HOST,MEMBER_STATE,MEMBER_ROLE from performance_schema.replication_group_members;
+--------------------------------------+-------------+--------------+-------------+
| MEMBER_ID                            | MEMBER_HOST | MEMBER_STATE | MEMBER_ROLE |
+--------------------------------------+-------------+--------------+-------------+
| 409821a9-6789-11f0-a939-0050568ba84a | mysql2      | ONLINE       | SECONDARY   |
| 60515c68-6789-11f0-93ca-0050568b194e | mysql1      | ONLINE       | PRIMARY     |
| d89ade68-678b-11f0-94f9-000c29d5f729 | mysql3      | ONLINE       | SECONDARY   |
+--------------------------------------+-------------+--------------+-------------+
3 rows in set (0.00 sec)

Execute `iptables -A OUTPUT -d mysql3 -j DROP` on the primary node's server.
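To restore connectivity between test rounds, the rule can be deleted again; a minimal sketch, assuming the same hostname argument that was used when the rule was added:

# Run on the primary's server to restore primary -> mysql3 traffic
iptables -D OUTPUT -d mysql3 -j DROP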

mysql1:
mysql> select MEMBER_ID,MEMBER_HOST,MEMBER_STATE,MEMBER_ROLE from performance_schema.replication_group_members;
+--------------------------------------+-------------+--------------+-------------+
| MEMBER_ID                            | MEMBER_HOST | MEMBER_STATE | MEMBER_ROLE |
+--------------------------------------+-------------+--------------+-------------+
| 409821a9-6789-11f0-a939-0050568ba84a | mysql2      | ONLINE       | SECONDARY   |
| 60515c68-6789-11f0-93ca-0050568b194e | mysql1      | ONLINE       | PRIMARY     |
| d89ade68-678b-11f0-94f9-000c29d5f729 | mysql3      | UNREACHABLE  | SECONDARY   |
+--------------------------------------+-------------+--------------+-------------+
3 rows in set (0.00 sec)

mysql2:
mysql> select MEMBER_ID,MEMBER_HOST,MEMBER_STATE,MEMBER_ROLE from performance_schema.replication_group_members;
+--------------------------------------+-------------+--------------+-------------+
| MEMBER_ID                            | MEMBER_HOST | MEMBER_STATE | MEMBER_ROLE |
+--------------------------------------+-------------+--------------+-------------+
| 409821a9-6789-11f0-a939-0050568ba84a | mysql2      | ONLINE       | SECONDARY   |
| 60515c68-6789-11f0-93ca-0050568b194e | mysql1      | ONLINE       | PRIMARY     |
| d89ade68-678b-11f0-94f9-000c29d5f729 | mysql3      | ONLINE       | SECONDARY   |
+--------------------------------------+-------------+--------------+-------------+
3 rows in set (0.00 sec)

mysql3:
mysql> select MEMBER_ID,MEMBER_HOST,MEMBER_STATE,MEMBER_ROLE from performance_schema.replication_group_members;
+--------------------------------------+-------------+--------------+-------------+
| MEMBER_ID                            | MEMBER_HOST | MEMBER_STATE | MEMBER_ROLE |
+--------------------------------------+-------------+--------------+-------------+
| 409821a9-6789-11f0-a939-0050568ba84a | mysql2      | ONLINE       | SECONDARY   |
| 60515c68-6789-11f0-93ca-0050568b194e | mysql1      | UNREACHABLE  | PRIMARY     |
| d89ade68-678b-11f0-94f9-000c29d5f729 | mysql3      | ONLINE       | SECONDARY   |
+--------------------------------------+-------------+--------------+-------------+
3 rows in set (0.00 sec)

The other two scenarios can be reliably reproduced by isolating different slave nodes under different primary nodes, as sketched below.
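For the next round, the primary can be rotated without rebuilding the cluster. A sketch, with the UUID taken from mysql2's row in the table above; the credentials are placeholders, and group_replication_set_as_primary() is available in all the tested versions:

# Run against any ONLINE member to make mysql2 the new primary
mysql -h mysql2 -u root -p -e \
  "SELECT group_replication_set_as_primary('409821a9-6789-11f0-a939-0050568ba84a');"

Then re-apply iptables -A OUTPUT -d <host> -j DROP on the new primary's server, targeting a different secondary.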

Suggested fix:
In this scenario test, network failures from the primary node to a single slave node are simulated using iptables. For example, with nodes A, B, and C: if A blocks its traffic to C, two potential majorities (AB and BC) could theoretically form. Under normal circumstances, the system should determine whether A or C is at fault and form a valid majority such as AB or BC; that is the expected behavior.
However, there may be anomalies that prevent the system from identifying whether A or C is faulty. In such cases, no node is expelled from the cluster.
[29 Jul 9:28] tangjie gong
There are three nodes: A, B, and C. When A is isolated from B, if node C's node number ("my node identified") is 0, then C acts as the killer node in both the AC and BC scenarios; in that case, no expulsion is triggered. The relevant source code is as follows:

// Only the killer node issues expels for suspects whose suspicion timed out.
if (m_is_killer_node) {
  MYSQL_GCS_LOG_TRACE(
      "process_suspicions: Expelling suspects that timed out!");
  bool const removed =
      m_proxy->xcom_remove_nodes(nodes_to_remove, m_gid_hash);
  if (removed && !nodes_to_remember_expel.empty()) {
    m_expels_in_progress.remember_expels_issued(m_config_id,
                                                nodes_to_remember_expel);
  }
} else if (force_remove) {
  // A non-killer node that must leave expels itself instead.
  assert(!m_is_killer_node);
  MYSQL_GCS_LOG_TRACE("process_suspicions: Expelling myself!");
  bool const removed = m_proxy->xcom_remove_node(*m_my_info, m_gid_hash);
  if (!removed) {
    // Failed to remove myself from the group so will install leave view
    m_control_if->install_leave_view(Gcs_view::MEMBER_EXPELLED);
  }
}

In this case, remaining in the UNREACHABLE state appears to be expected behavior.
[6 Aug 5:43] MySQL Verification Team
Hi,

I cannot reproduce this using a proper 3-node setup with a proper network disconnect.

What I believe happens in your case is that this behavior arises from asymmetric network partitioning and the design of MGR’s group membership management: 

- By blocking only outgoing packets from primary to a secondary, only one-way communication is broken, not both.
- MGR relies on consistent communication for quorums and state.
- If a node can receive but cannot transmit to another, differing nodes can form divergent views of group membership (split-brain warning sign).

As a result, the blocked secondary may still send messages to the primary (since traffic is not blocked in that direction), so the primary still sees it as ONLINE, but the reverse is not true. This leads to inconsistent MEMBER_STATE output from the perspective of different nodes.
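For comparison, a symmetric partition, which is closer to a real link failure, blocks both directions. A sketch of the difference, run on the primary's server:

# One-way break (what the report tests): only primary -> mysql3 traffic is dropped
iptables -A OUTPUT -d mysql3 -j DROP

# Two-way break (closer to pulling the cable): drop both directions
iptables -A OUTPUT -d mysql3 -j DROP
iptables -A INPUT  -s mysql3 -j DROP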

Since this is not a behavior we are trying to detect or handle, I cannot accept it as a bug.
When I test the same setup but physically take down the network instead of using iptables, everything behaves as expected.

Thank you for using MySQL Server