Bug #84727 | GR: Partitioned Node Should Get Updated Status and not accept writes | ||
---|---|---|---|
Submitted: | 31 Jan 2017 8:16 | Modified: | 26 Jun 2017 15:59 |
Reporter: | Kenny Gryp | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Group Replication | Severity: | S2 (Serious) |
Version: | 5.7.17 | OS: | Any |
Assigned to: | | CPU Architecture: | Any |
[31 Jan 2017 8:16]
Kenny Gryp
[31 Jan 2017 8:58]
MySQL Verification Team
Hello Kenny Gryp, Thank you for the report. Thanks, Umesh
[31 Jan 2017 16:08]
Nuno Carvalho
Posted by developer:

Hi Kenny,

Thank you for the bug report. A member may lose its connection to the group for a short period and then be able to reconnect. That is why we do not react promptly to that event. If the unreachable state persists, we should react. However, as you must be aware, if there are ongoing transactions when connectivity is lost, we cannot set super_read_only=1, because it will deadlock. More on this at https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_read_only

You may say: then roll back all ongoing transactions and then set super_read_only=1. That would open a window in which a new client may execute a new transaction, and we are back in the same situation.

This event cannot be solved without coordination with an external agent. To allow that, we will improve how we report these events.

Best regards,
Nuno Carvalho
[31 Jan 2017 22:39]
Kenny Gryp
- In my tests, the queries hang forever, so there's definitely a rollback that should happen in a reasonable amount of time and then super_read_only should be enabled.
- Note that PFS still shows that this node is ONLINE and the other nodes are marked UNREACHABLE. That's the only information I can gather. I don't know if it's part of the primary partition or not.
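The membership view described above can be read from the Performance Schema table `performance_schema.replication_group_members`. The query below is a rough sketch of how a DBA or external agent might guess whether the local member has lost the majority. It rests on an assumption that is not stated in this report: that every member the local node cannot contact is listed as UNREACHABLE. The column aliases are illustrative names.

```sql
-- Run on the member you suspect is partitioned.
-- Heuristic only: a member keeps the majority when it can reach more than
-- half of the group, so it has lost it when the UNREACHABLE count is at
-- least half of the total.
SELECT
    COUNT(*)                                            AS total_members,
    SUM(MEMBER_STATE = 'UNREACHABLE')                   AS unreachable_members,
    SUM(MEMBER_STATE = 'UNREACHABLE') >= COUNT(*) / 2   AS likely_without_majority
FROM performance_schema.replication_group_members;
```

In a five-member group where the local node sees three members as UNREACHABLE, `likely_without_majority` would be 1. This only reflects the local member's view and does not replace coordination with the rest of the group.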
[25 May 2017 11:25]
Pedro Gomes
Posted by developer:

A new patch was pushed that introduces a new tool for DBAs to deal with network partitions in which part of the group is stuck in a minority.

The scenario is: in a group of 5 servers (S1, S2, S3, S4, S5), if there is a disconnection between S1, S2 and S3, S4, S5, the first group is now in a minority, i.e., it can't contact more than half of the group. While the second group remains running, the first one gets stuck waiting for a network reconnection. All transactions in this minority are stuck until a STOP GROUP_REPLICATION command is issued on these members.

What DBAs can now do is configure a timeout using the new option *group_replication_unreachable_majority_timeout*:

Variable Scope: Global
Dynamic Variable: Yes
Permitted Values: 0 - 31536000 seconds
Default Value: 0

When configured to 0, the member will wait forever for the network to be restored, so the default value means the old behavior is maintained by default. If configured to 60 seconds, for example, the servers in a minority (S1 and S2 in the above example) will leave the group after 60 seconds and error out. All pending transactions will be rolled back and the server will move to ERROR.

WARNING: If you have a symmetric group, for example just two nodes (S1, S2), and there is a partition with no majority, all members will shut down and enter an error state after the configured timeout.
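As an illustration of the option described above, the following is a minimal sketch of setting and checking the timeout on a member. It assumes a server version that includes the patch and a running Group Replication plugin. The value 60 is taken from the example in the comment and is not a recommendation.

```sql
-- Make a member stuck in a minority leave the group after 60 seconds
-- instead of waiting forever (0, the default, keeps the old behavior).
SET GLOBAL group_replication_unreachable_majority_timeout = 60;

-- Confirm the current value.
SELECT @@GLOBAL.group_replication_unreachable_majority_timeout;
```

Because the variable is global and dynamic, it would need to be set on each member that should leave the group when partitioned. Note the warning above about symmetric groups before enabling it.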
[26 Jun 2017 15:59]
David Moss
Posted by developer: Thank you for your feedback. This has been fixed in upcoming versions, and the following was added to the 5.7.19 and 8.0.2 change logs: When there was a network partition and a member was in a minority, all queries to that member blocked. To improve this situation, the group_replication_unreachable_majority_timeout variable has been added, which enables you to configure how long members in a minority wait to regain contact with a member in the majority before leaving the group.