Bug #84784 Group Replication nodes do not rejoin cluster after network connectivity issues
Submitted: 2 Feb 2017 3:06 Modified: 18 Apr 2:07
Reporter: Kenny Gryp
Status: Verified
Category: MySQL Server: Group Replication Severity: S2 (Serious)
Version: 5.7.17, 8.0.4 OS: Any
Assigned to: Filipe Campos CPU Architecture:Any

[2 Feb 2017 3:06] Kenny Gryp
Description:
Nodes do not reconnect to the group after they have been disconnected. As a result, nodes drop out of the cluster, and in the worst case the whole cluster becomes unavailable.

How to repeat:
Start with a 3-node cluster:

mysql> select * from replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 72149827-e1cc-11e6-9daf-08002789cd2e | gr-1        |        3306 | ONLINE       |
| group_replication_applier | 740e1fd2-e1cc-11e6-a8ec-08002789cd2e | gr-2        |        3306 | ONLINE       |
| group_replication_applier | 74dc6ab2-e1cc-11e6-92aa-08002789cd2e | gr-3        |        3306 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.00 sec)

[vagrant@gr-3 ~]$ sudo ifconfig eth1 down

mysql> select * from replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 72149827-e1cc-11e6-9daf-08002789cd2e | gr-1        |        3306 | UNREACHABLE  |
| group_replication_applier | 740e1fd2-e1cc-11e6-a8ec-08002789cd2e | gr-2        |        3306 | UNREACHABLE  |
| group_replication_applier | 74dc6ab2-e1cc-11e6-92aa-08002789cd2e | gr-3        |        3306 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.00 sec)

The node becomes unreachable: from gr-3, the other two members now show as UNREACHABLE.

[vagrant@gr-3 ~]$ sudo ifconfig eth1 up

mysql> select * from replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 72149827-e1cc-11e6-9daf-08002789cd2e | gr-1        |        3306 | UNREACHABLE  |
| group_replication_applier | 740e1fd2-e1cc-11e6-a8ec-08002789cd2e | gr-2        |        3306 | UNREACHABLE  |
| group_replication_applier | 74dc6ab2-e1cc-11e6-92aa-08002789cd2e | gr-3        |        3306 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.01 sec)

gr-3 does not rejoin the cluster.

mysql> show global status like '%group_repl%';
+----------------------------------+--------------------------------------+
| Variable_name                    | Value                                |
+----------------------------------+--------------------------------------+
| Com_group_replication_start      | 1                                    |
| Com_group_replication_stop       | 0                                    |
| group_replication_primary_member | 72149827-e1cc-11e6-9daf-08002789cd2e |
+----------------------------------+--------------------------------------+
3 rows in set (0.00 sec)

(FYI: The primary member is still shown in the status variable, but it can be outdated!)

gr-1 and gr-2 have this:

mysql> select * from replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 72149827-e1cc-11e6-9daf-08002789cd2e | gr-1        |        3306 | ONLINE       |
| group_replication_applier | 740e1fd2-e1cc-11e6-a8ec-08002789cd2e | gr-2        |        3306 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
2 rows in set (0.00 sec)

So gr-3 is now broken. It won't join the cluster again automatically.

Suggested fix:

When there is a small network glitch, a node or multiple nodes might lose connection with the cluster.

Nodes do not rejoin the group replication automatically.

You can easily break a cluster by bringing down the interfaces on 2 of the 3 nodes: wait a few seconds until group replication figures it out, and the master node will get stuck on new writes.
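
To illustrate why taking down 2 of 3 nodes stalls writes, here is a minimal sketch of a strict-majority check. The function name `has_majority` is hypothetical, and the sketch assumes the group only makes progress while more than half of its members can reach each other; it is not taken from the Group Replication source.

```python
def has_majority(reachable_members: int, group_size: int) -> bool:
    """Return True if the reachable members form a strict majority of the group.

    Assumption: progress requires more than half of the configured members.
    """
    return reachable_members > group_size // 2


# One node of three is cut off: the other two still form a majority.
print(has_majority(2, 3))  # True
# Two nodes of three are cut off: the remaining node has no majority,
# which matches the stalled writes described in this report.
print(has_majority(1, 3))  # False
```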
[3 Mar 2017 12:02] Filipe Campos
Hi Kenny,

Thank you for evaluating Group Replication! Your (and all community)
feedback is important!

When a member of the group is unreachable by a majority of members for
some time, it is expelled from the group.
Then, when this expelled member has its network connection restored, it tries
to rejoin the group and fails, hence it changes to the ERROR state and writes
are forbidden by super_read_only=1.
If the network connection is restored before the member is expelled,
it will be able to resume operation.

Having said that, we understand your concern and are looking into improving
this situation.
[18 Apr 2017 11:37] Sheraz Ahmed
If a group member is unreachable, after how much time is it going to be expelled from the group, and is this time configurable?

Secondly, if a group member receives an invalid transaction and continuously fails to apply it, after how many attempts and what time interval is it going to go into the error state, and are these settings (interval + attempts) configurable?
[20 Apr 2017 13:11] Filipe Campos
Hello Sheraz Ahmed

First of all, thank you for evaluating Group Replication! Your (and all
community) feedback is important!

Regarding your questions:

Q: If a group member is unreachable, after how much time is it going to be
expelled from the group, and is this time configurable?

A: After 5 seconds, other nodes will suspect the unreachable group member
has failed and it will be expelled from the group by one of them. Currently,
this time period is not configurable.
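
As a rough illustration of the timing described in this thread, the sketch below models what happens to a member depending on how long it was unreachable. The 5-second figure comes from the answer above; the function name `state_after_outage` and the exact boundary handling are assumptions, and real failure detection is not this precise.

```python
EXPEL_AFTER_SECONDS = 5.0  # figure quoted in this thread for 5.7


def state_after_outage(outage_seconds: float,
                       expel_after: float = EXPEL_AFTER_SECONDS) -> str:
    """Model the state of a member once its network connection is restored.

    Simplified assumption: the member is expelled as soon as the outage
    reaches expel_after seconds; an expelled member fails to rejoin and
    ends up in ERROR, otherwise it resumes as ONLINE.
    """
    if outage_seconds < expel_after:
        return "ONLINE"
    return "ERROR"


print(state_after_outage(2.0))   # short glitch: member resumes operation
print(state_after_outage(30.0))  # long outage: member was expelled
```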

Q: Secondly, if a group member receives an invalid transaction and continuously
fails to apply it, after how many attempts and what time interval is it going to
go into the error state, and are these settings (interval + attempts) configurable?

A: If it is a permanent applier error, like a duplicate primary key or a missing
table, it will move to the ERROR state after the first error; there is no retry.
If it is a temporary error, like an InnoDB lock timeout, it will retry X times,
depending on the value of slave_transaction_retries.
https://dev.mysql.com/doc/refman/5.7/en/replication-options-slave.html#sysvar_slave_transa...
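
The retry behavior described in this answer can be sketched as follows. The exception classes and `apply_transaction` are hypothetical names for illustration only, and the sketch assumes that a temporary error ends in the ERROR state once the retries are used up, and that the first attempt does not count as a retry.

```python
from typing import Callable


class PermanentApplierError(Exception):
    """Hypothetical stand-in for errors such as a duplicate key or missing table."""


class TemporaryApplierError(Exception):
    """Hypothetical stand-in for errors such as an InnoDB lock timeout."""


def apply_transaction(apply: Callable[[], None], transaction_retries: int) -> str:
    """Return the member state after trying to apply one transaction.

    transaction_retries plays the role of slave_transaction_retries.
    """
    attempts_left = 1 + transaction_retries  # first attempt plus the retries
    while attempts_left > 0:
        try:
            apply()
            return "ONLINE"
        except PermanentApplierError:
            return "ERROR"  # permanent errors are never retried
        except TemporaryApplierError:
            attempts_left -= 1  # temporary error: try again if attempts remain
    return "ERROR"  # assumption: retries exhausted leads to ERROR
```

With `transaction_retries=0`, a single temporary error is already enough to reach ERROR in this model.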

Hope these answers are enough to clear up your doubts.

Best regards,

Filipe
[10 May 2017 13:03] Sheraz Ahmed
Filipe Campos,

I appreciate your quick and thorough response. At the same time I apologize for not being able to thank you earlier. 

Thank you :)
[1 Jul 2017 22:49] Jo Goossens
Hi,

After some testing I also noticed the permanent error state. In our opinion this significantly increases the possibility of total failure. If it is not fixed manually soon enough and another node hits the same kind of issue, the whole cluster is down.

Is there any fix planned for this? For example for MySQL 5.7.19?

Thanks a lot for looking further into this!
[15 Sep 2017 16:33] Ramesh Patel
We are seeing the same in 5.7.19. Is this fixed?
[16 Nov 2017 8:12] Frank Ullrich
Situation in 5.7.20 seems to be unchanged!
[23 Jan 1:15] ronnie arangali
Is it fixed in 5.7.21?
[23 Jan 8:10] Jo Goossens
We recently had to restart 2 nodes manually and let group replication repair everything. We had a 20-minute outage while this process was happening (we discovered it too late).

We have now improved our monitoring to catch this, but why can a simple network error cause a permanent error state so easily? In our opinion, this defeats the whole point of this redundant setup.
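
A monitoring check of the kind mentioned here could look roughly like the sketch below. It assumes the (MEMBER_HOST, MEMBER_STATE) pairs have already been fetched from replication_group_members on one node; the function `membership_alerts` and the list of expected hosts are hypothetical, not part of MySQL.

```python
from typing import Iterable, List, Tuple

Row = Tuple[str, str]  # (MEMBER_HOST, MEMBER_STATE)


def membership_alerts(rows: Iterable[Row], expected_hosts: Iterable[str]) -> List[str]:
    """Return alerts for one node's view of the group membership."""
    view = list(rows)
    alerts = []
    seen = {host for host, _ in view}
    # Members that were expelled no longer appear in the table at all.
    for host in sorted(set(expected_hosts) - seen):
        alerts.append(f"{host}: missing from replication_group_members")
    # Members that are listed but not healthy (UNREACHABLE, ERROR, ...).
    for host, state in view:
        if state != "ONLINE":
            alerts.append(f"{host}: state is {state}")
    return alerts


expected = ["gr-1", "gr-2", "gr-3"]
# View from gr-1 after gr-3 was expelled, as in the original report.
print(membership_alerts([("gr-1", "ONLINE"), ("gr-2", "ONLINE")], expected))
```

Running the check on every node matters, because in the report gr-3 still listed itself as ONLINE while the rest of the group had already dropped it.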
[18 Apr 2:06] Kenny Gryp
I really think this bug/feature should be fixed.
Several people commented on this bug already, this is not an isolated case.

It really reduces the practical use of group replication.

Additionally, I would like to add some more information that I figured out during my tests:
When there are only 2 nodes remaining in the cluster, the nodes DO RECONNECT. This seems to be a special case and, as far as I can see, it keeps clusters from going down completely. Is this on purpose?

In any case, I would like to see all nodes reconnect automatically AND block writes instead of stalling them (with group_replication_unreachable_majority_timeout > 0).
[18 Apr 2:07] Kenny Gryp
Also reproducible in version 8.0.4
[18 Apr 8:13] Nuno Carvalho
Hi all,

Thank you for your scenarios, we are looking into improving this.

Best regards,
Nuno Carvalho
[9 May 10:42] Shubhra Prakash Nandi
Hi, I was just wondering whether a scheduled restart of the MySQL cluster instances would be a good idea to get around this issue. The restarts would not be simultaneous, but sufficiently separated in time so that a restarted node can recover before the next node is restarted.

Apart from this, is there any official workaround in place for this issue now?
[21 Aug 22:54] Roger Lee
Could we get an update on this bug? Could we at least make the 5 second timeout configurable, so that it can be set to a long time?

Can you recommend a means to reconnect the lost member, or a way to manually rejoin it?
[24 Sep 11:25] Ojas Desai
Is this fixed in any of the latest versions?
[25 Oct 7:10] MANVITH GOLLA
Is this issue fixed in the current GA release, MySQL 8.0.12?
[31 Oct 12:57] Nuno Carvalho
Thank you for the feedback. While we do not have automatic rejoin after network issues, please check how to extend the network timeout at:

https://dev.mysql.com/doc/refman/8.0/en/group-replication-options.html#sysvar_group_replic...

https://mysqlhighavailability.com/group-replication-coping-with-unreliable-failure-detecti...

Best regards,
Nuno Carvalho