Bug #84727 GR: Partitioned Node Should Get Updated Status and not accept writes
Submitted: 31 Jan 2017 8:16
Modified: 26 Jun 2017 15:59
Reporter: Kenny Gryp
Status: Closed
Category: MySQL Server: Group Replication
Severity: S2 (Serious)
Version: 5.7.17
OS: Any
Assigned to:
CPU Architecture: Any

[31 Jan 2017 8:16] Kenny Gryp
Description:

When a node loses connectivity with the other nodes in its group (that is, it is network partitioned), you get output like this:

mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 72149827-e1cc-11e6-9daf-08002789cd2e | gr-1        |        3306 | UNREACHABLE  |
| group_replication_applier | 740e1fd2-e1cc-11e6-a8ec-08002789cd2e | gr-2        |        3306 | UNREACHABLE  |
| group_replication_applier | 74dc6ab2-e1cc-11e6-92aa-08002789cd2e | gr-3        |        3306 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.00 sec)

However, the partitioned node never switches to super_read_only=ON:

mysql> show global variables like 'super_read_only';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| super_read_only | OFF   |
+-----------------+-------+
1 row in set (0.08 sec)

As a result, the MySQL server on that node still accepts any write, but the write then hangs forever:

*************************** 4. row ***************************
     Id: 2194
   User: root
   Host: localhost
     db: lefredissimo
Command: Query
   Time: 181
  State: query end
   Info: insert into lefredissimo.heloise values (10000, 'manual')

This means that when you want to use Group Replication in production:
- it's hard to detect whether a node is partitioned just by reading the super_read_only variable, as advised
- writes will hang, which causes the application to hang, which in many cases will bring down application servers because they are all waiting on hung queries
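
One way an external check could tell whether the local node still belongs to a majority is to count member states from performance_schema.replication_group_members instead of reading super_read_only. The following is a minimal Python sketch, assuming the rows have already been fetched as (MEMBER_HOST, MEMBER_STATE) tuples. The function name in_majority_partition is hypothetical, and counting only ONLINE members as reachable is a simplifying assumption:

```python
def in_majority_partition(rows):
    """Return True if strictly more than half of the listed members are ONLINE.

    rows: iterable of (member_host, member_state) tuples, as seen by the
    local node in performance_schema.replication_group_members.
    Assumption: only ONLINE members count as reachable.
    """
    rows = list(rows)
    if not rows:
        return False  # the node is not part of any group
    online = sum(1 for _host, state in rows if state == "ONLINE")
    return online * 2 > len(rows)


# The view reported by the partitioned node gr-3 in this report:
view_from_gr3 = [
    ("gr-1", "UNREACHABLE"),
    ("gr-2", "UNREACHABLE"),
    ("gr-3", "ONLINE"),
]
print(in_majority_partition(view_from_gr3))  # False: 1 of 3 is a minority
```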

How to repeat:
To reproduce:

- get a cluster with 3 nodes
- run 'ifconfig eth0 down' on one node to bring down its network interface
- check the partitioned node; it will show the status described above.

Suggested fix:

I want the node to go into super_read_only=1 when it is network partitioned, so that:

1. when I have group_replication_single_primary_mode=off, I can figure out that a node is not primary and should not be getting writes
2. writes will no longer be accepted, instead of hanging and causing application problems.
[31 Jan 2017 8:16] Kenny Gryp
.
[31 Jan 2017 8:58] Umesh Shastry
Hello Kenny Gryp,

Thank you for the report.

Thanks,
Umesh
[31 Jan 2017 16:08] Nuno Carvalho
Posted by developer:
 
Hi Kenny,

Thank you for the bug report. A member may lose its connection to the group for a short period and then be able to reconnect.
That is why we do not react promptly to that event.

However, if that unreachable state persists, we should react. As you must be aware, though, if there are ongoing transactions when connectivity is lost, we cannot set super_read_only=1: it will deadlock. More on this at https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_read_only
You may say: then roll back all ongoing transactions and set super_read_only=1 afterwards. That would open a window in which a new client may execute a new transaction, and we would be back in the same situation.

This event cannot be solved without coordination with an external agent. To allow that, we will improve how we report these events.

Best regards,
Nuno Carvalho
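
As an illustration of the kind of external agent mentioned above, the sketch below reacts only once the loss of majority has persisted for a grace period, so that short disconnections are ignored. Everything in it is a hypothetical design and not part of MySQL: the class name PartitionWatcher, the grace_seconds parameter, and the idea of feeding in one timestamp and one boolean per poll.

```python
class PartitionWatcher:
    """Tracks how long a node has been without a majority (hypothetical helper)."""

    def __init__(self, grace_seconds):
        self.grace_seconds = grace_seconds
        self.minority_since = None  # time of the first poll that saw a minority

    def observe(self, now, has_majority):
        """Record one poll; return True when the agent should react."""
        if has_majority:
            self.minority_since = None  # connectivity is back, reset the clock
            return False
        if self.minority_since is None:
            self.minority_since = now
        return now - self.minority_since >= self.grace_seconds


watcher = PartitionWatcher(grace_seconds=30)
print(watcher.observe(0, True))    # False
print(watcher.observe(10, False))  # False: minority just started
print(watcher.observe(20, True))   # False: short blip, clock reset
print(watcher.observe(30, False))  # False: a new minority period starts
print(watcher.observe(60, False))  # True: in a minority for 30 seconds
```

What the agent does once observe() returns True (for example, taking the node out of the write path) is left open here, since that depends on the deployment.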
[31 Jan 2017 22:39] Kenny Gryp
- In my tests, the queries hang forever, so there's definitely a rollback that should happen in a reasonable amount of time and then super_read_only should be enabled.

- Note that PFS still shows that this node is ONLINE and the other nodes are marked UNREACHABLE. That's the only information I can gather. I don't know if it's part of the primary partition or not.
[25 May 2017 11:25] Pedro Gomes
Posted by developer:
 
A new patch was pushed that introduces a new tool for DBAs to deal with network partitions in which part of the group is stuck in a minority.

The scenario is:
In a group of 5 servers (S1, S2, S3, S4, S5), if there is a disconnection between S1, S2 and S3, S4, S5, the first group is now in a minority, i.e., it can't contact more than half of the group.

While the second group remains running, the first one gets stuck waiting for a network re-connection.
All transactions in this minority are stuck until a STOP GROUP_REPLICATION command is issued on these members.

DBAs can now configure a timeout using the new option:

*group_replication_unreachable_majority_timeout*

Variable Scope       : Global
Dynamic Variable     : Yes
Permitted Values     : 0 - 31536000 seconds
Default Value        : 0

When configured to 0, the member waits forever for the network to be restored. The default value therefore preserves the old behavior.

If configured to 60 seconds, for example, the servers in a minority (S1 and S2 in the above example) will leave the group and error out after 60 seconds.
All pending transactions will be rolled back and the server will move to the ERROR state.
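
Because the variable is global and dynamic, it can be changed at runtime with SET GLOBAL. As a hedged sketch, a deployment script could validate the value against the permitted range before sending the statement. The helper name set_timeout_statement is hypothetical, and the sketch only builds the statement text; executing it against a server is left to whatever client the script already uses:

```python
MAX_TIMEOUT_SECONDS = 31536000  # upper bound of the permitted range above


def set_timeout_statement(seconds):
    """Build the SET GLOBAL statement for the new option, validating the range."""
    if isinstance(seconds, bool) or not isinstance(seconds, int):
        raise TypeError("timeout must be an integer number of seconds")
    if not 0 <= seconds <= MAX_TIMEOUT_SECONDS:
        raise ValueError(f"timeout must be between 0 and {MAX_TIMEOUT_SECONDS}")
    return f"SET GLOBAL group_replication_unreachable_majority_timeout = {seconds}"


print(set_timeout_statement(60))
```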

WARNING: If you have a symmetric group, for example just two nodes (S1, S2), and a partition leaves no side with a majority, then after the configured timeout all members will shut down and enter an error state.
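
The majority arithmetic behind both scenarios can be sketched as follows. This is an illustrative model only, assuming a side keeps running when it holds strictly more than half of the group. The function name members_leaving_after_timeout is hypothetical:

```python
def members_leaving_after_timeout(partitions, timeout_seconds):
    """Return the sorted names of members that would leave the group.

    partitions: list of lists of member names, one list per side of the split.
    A side keeps running only if it holds strictly more than half of the group.
    A timeout of 0 means minority members wait forever, so nobody leaves.
    """
    if timeout_seconds == 0:
        return []
    group_size = sum(len(side) for side in partitions)
    leaving = []
    for side in partitions:
        if len(side) * 2 <= group_size:  # cannot reach more than half of the group
            leaving.extend(side)
    return sorted(leaving)


# 5-node group split 2/3: only the minority side leaves.
print(members_leaving_after_timeout([["S1", "S2"], ["S3", "S4", "S5"]], 60))
# 2-node group split 1/1: neither side has a majority, so both leave.
print(members_leaving_after_timeout([["S1"], ["S2"]], 60))
# Timeout 0 (the default): minority members keep waiting.
print(members_leaving_after_timeout([["S1", "S2"], ["S3", "S4", "S5"]], 0))
```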
[26 Jun 2017 15:59] David Moss
Posted by developer:
 
Thank you for your feedback. This has been fixed in upcoming versions, and the following was added to the 5.7.19 and 8.0.2 change logs:
When there was a network partition and a member was in a minority all queries to that member blocked. To improve this situation, the group_replication_unreachable_majority_timeout variable has been added which enables you to configure how long members in a minority wait to regain contact with a member in the majority before leaving the group.