Bug #84784 Group Replication nodes do not rejoin cluster after network connectivity issues
Submitted: 2 Feb 2017 3:06
Modified: 3 Mar 2017 12:02
Reporter: Kenny Gryp
Status: Verified
Category: MySQL Server: Group Replication
Severity: S2 (Serious)
Version: 5.7.17
OS: Any
Assigned to: Filipe Campos
CPU Architecture: Any

[2 Feb 2017 3:06] Kenny Gryp
Description:
Nodes do not reconnect to the group once they have been disconnected, causing members to drop from the cluster and, in the worst case, the whole cluster to lose availability.

How to repeat:
Have a 3 node cluster:

mysql> select * from replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 72149827-e1cc-11e6-9daf-08002789cd2e | gr-1        |        3306 | ONLINE       |
| group_replication_applier | 740e1fd2-e1cc-11e6-a8ec-08002789cd2e | gr-2        |        3306 | ONLINE       |
| group_replication_applier | 74dc6ab2-e1cc-11e6-92aa-08002789cd2e | gr-3        |        3306 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.00 sec)
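(For reference, the group is brought up by bootstrapping one member and starting the plugin on the others; a minimal sketch, assuming the Group Replication configuration is already in place on all three nodes:)

On gr-1:

mysql> SET GLOBAL group_replication_bootstrap_group = ON;
mysql> START GROUP_REPLICATION;
mysql> SET GLOBAL group_replication_bootstrap_group = OFF;

On gr-2 and gr-3:

mysql> START GROUP_REPLICATION;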

[vagrant@gr-3 ~]$ sudo ifconfig eth1 down

mysql> select * from replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 72149827-e1cc-11e6-9daf-08002789cd2e | gr-1        |        3306 | UNREACHABLE  |
| group_replication_applier | 740e1fd2-e1cc-11e6-a8ec-08002789cd2e | gr-2        |        3306 | UNREACHABLE  |
| group_replication_applier | 74dc6ab2-e1cc-11e6-92aa-08002789cd2e | gr-3        |        3306 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.00 sec)

From gr-3's point of view, the other members become unreachable.

[vagrant@gr-3 ~]$ sudo ifconfig eth1 up

mysql> select * from replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 72149827-e1cc-11e6-9daf-08002789cd2e | gr-1        |        3306 | UNREACHABLE  |
| group_replication_applier | 740e1fd2-e1cc-11e6-a8ec-08002789cd2e | gr-2        |        3306 | UNREACHABLE  |
| group_replication_applier | 74dc6ab2-e1cc-11e6-92aa-08002789cd2e | gr-3        |        3306 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.01 sec)

gr-3 does not rejoin the cluster, even though its network interface is back up.

mysql> show global status like '%group_repl%';
+----------------------------------+--------------------------------------+
| Variable_name                    | Value                                |
+----------------------------------+--------------------------------------+
| Com_group_replication_start      | 1                                    |
| Com_group_replication_stop       | 0                                    |
| group_replication_primary_member | 72149827-e1cc-11e6-9daf-08002789cd2e |
+----------------------------------+--------------------------------------+
3 rows in set (0.00 sec)

(FYI: the primary member is still reported in the status variable, but it can be outdated!)

gr-1 and gr-2 have this:

mysql> select * from replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 72149827-e1cc-11e6-9daf-08002789cd2e | gr-1        |        3306 | ONLINE       |
| group_replication_applier | 740e1fd2-e1cc-11e6-a8ec-08002789cd2e | gr-2        |        3306 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
2 rows in set (0.00 sec)

So gr-3 is now broken; it will not rejoin the cluster automatically.
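As a workaround sketch (not an automatic rejoin), restarting the plugin on gr-3 manually should bring it back once the network is up again:

On gr-3:

mysql> STOP GROUP_REPLICATION;
mysql> START GROUP_REPLICATION;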

Suggested fix:

When there is a small network glitch, one or more nodes might lose their connection to the cluster.

Nodes do not rejoin the group automatically; they should.

You can easily break a cluster just by bringing down the interfaces of 2 of the 3 nodes and waiting a few seconds until Group Replication notices: the remaining master node then gets stuck on new writes.
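If the majority itself is lost this way, the only option today seems to be forcing a new membership on the surviving node so writes can resume; a sketch (gr-1:33061 is an assumed local Group Replication address for this setup):

On the surviving node:

mysql> SET GLOBAL group_replication_force_members = 'gr-1:33061';
mysql> SET GLOBAL group_replication_force_members = '';

(The variable is cleared afterwards so it does not interfere with later membership changes.)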
[3 Mar 2017 12:02] Filipe Campos
Hi Kenny,

Thank you for evaluating Group Replication! Your feedback, and that of the whole
community, is important!

When a member of the group is unreachable by a majority of the members for some
time, it is expelled from the group.
Then, when the expelled member has its network connection restored, it tries to
rejoin the group and fails, so it changes to the ERROR state and writes are
forbidden by super_read_only=1.
If the network connection is restored before the member is expelled, it will be
able to resume operation.
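For illustration, once the expelled member is back on the network you can verify its state and the write block like this (a sketch):

mysql> SELECT MEMBER_ID, MEMBER_STATE
       FROM performance_schema.replication_group_members;
mysql> SHOW GLOBAL VARIABLES LIKE 'super_read_only';

A manual STOP GROUP_REPLICATION; followed by START GROUP_REPLICATION; is then needed to rejoin the group.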

Having said that, we understand your concern and are looking into improving
this situation.
[18 Apr 2017 11:37] Sheraz Ahmed
If a group member is unreachable, after how much time is it going to be expelled from the group, and is this time configurable?

Secondly, if a group member receives an invalid transaction and continuously fails to apply it, after how many attempts and what time interval is it going to go into the ERROR state, and are these settings (interval + attempts) configurable?
[20 Apr 2017 13:11] Filipe Campos
Hello Sheraz Ahmed

First of all, thank you for evaluating Group Replication! Your feedback, and that
of the whole community, is important!

Regarding your questions:

Q: If a group member is unreachable, after how much time is it going to be
expelled from the group, and is this time configurable?

A: After 5 seconds, the other nodes will suspect that the unreachable member has
failed, and one of them will expel it from the group. Currently, this time
period is not configurable.

Q: Secondly, if a group member receives an invalid transaction and continuously
fails to apply it, after how many attempts and what time interval is it going to
go into the ERROR state, and are these settings (interval + attempts) configurable?

A: If it is a permanent applier error, such as a duplicate primary key or a
missing table, the member will move to the ERROR state after the first error;
there is no retry. If it is a temporary error, such as an InnoDB lock wait
timeout, it will retry the transaction a number of times determined by the
value of slave_transaction_retries.
https://dev.mysql.com/doc/refman/5.7/en/replication-options-slave.html#sysvar_slave_transa...
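For illustration, the current retry budget can be checked and adjusted like this (a sketch; 10 is an arbitrary example value):

mysql> SHOW GLOBAL VARIABLES LIKE 'slave_transaction_retries';
mysql> SET GLOBAL slave_transaction_retries = 10;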

Hope these answers are enough to clear up your doubts.

Best regards,

Filipe
[10 May 2017 13:03] Sheraz Ahmed
Filipe Campos,

I appreciate your quick and thorough response. At the same time I apologize for not being able to thank you earlier. 

Thank you :)
[1 Jul 2017 22:49] Jo Goossens
Hi,

After some testing I also ran into the permanent error state. In our opinion this significantly increases the possibility of total failure: if it is not fixed manually soon enough and another node hits the same kind of issue, the whole cluster is down.

Is there any fix planned for this? For example for MySQL 5.7.19?

Thanks a lot for looking further into this!
[15 Sep 2017 16:33] Ramesh Patel
We are seeing the same in 5.7.19. Is this fixed?
[16 Nov 2017 8:12] Frank Ullrich
Situation in 5.7.20 seems to be unchanged!
[23 Jan 1:15] ronnie arangali
Is it fixed in 5.7.21?
[23 Jan 8:10] Jo Goossens
We recently had to restart 2 nodes manually and let Group Replication repair everything. We had a 20-minute outage while this process was happening (we discovered the problem too late).

We have now improved our monitoring to catch it, but why can a simple network error cause a permanent error state so easily? In our opinion this defeats the whole point of a redundant setup.
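For example, a monitoring check along these lines (a sketch) would have alerted us as soon as any member left the ONLINE state:

mysql> SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE
       FROM performance_schema.replication_group_members
       WHERE MEMBER_STATE <> 'ONLINE';

(A member that has been expelled may simply disappear from this table on the other nodes, as shown earlier in this report, so the row count needs to be checked as well.)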