Bug #84784 Group Replication nodes do not rejoin cluster after network connectivity issues
Submitted: 2 Feb 2017 3:06    Modified: 27 Mar 2019 15:27
Reporter: Kenny Gryp
Status: Closed
Category: MySQL Server: Group Replication    Severity: S2 (Serious)
Version: 5.7.17, 8.0.4    OS: Any
Assigned to: Filipe Campos    CPU Architecture: Any

[2 Feb 2017 3:06] Kenny Gryp
Description:
Nodes that get disconnected do not reconnect to the group, causing members to drop from the cluster and, in the worst case, the loss of cluster availability altogether.

How to repeat:
Have a 3 node cluster:

mysql> select * from replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 72149827-e1cc-11e6-9daf-08002789cd2e | gr-1        |        3306 | ONLINE       |
| group_replication_applier | 740e1fd2-e1cc-11e6-a8ec-08002789cd2e | gr-2        |        3306 | ONLINE       |
| group_replication_applier | 74dc6ab2-e1cc-11e6-92aa-08002789cd2e | gr-3        |        3306 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.00 sec)

[vagrant@gr-3 ~]$ sudo ifconfig eth1 down

mysql> select * from replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 72149827-e1cc-11e6-9daf-08002789cd2e | gr-1        |        3306 | UNREACHABLE  |
| group_replication_applier | 740e1fd2-e1cc-11e6-a8ec-08002789cd2e | gr-2        |        3306 | UNREACHABLE  |
| group_replication_applier | 74dc6ab2-e1cc-11e6-92aa-08002789cd2e | gr-3        |        3306 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.00 sec)

From gr-3's point of view, the other nodes become unreachable.

[vagrant@gr-3 ~]$ sudo ifconfig eth1 up

mysql> select * from replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 72149827-e1cc-11e6-9daf-08002789cd2e | gr-1        |        3306 | UNREACHABLE  |
| group_replication_applier | 740e1fd2-e1cc-11e6-a8ec-08002789cd2e | gr-2        |        3306 | UNREACHABLE  |
| group_replication_applier | 74dc6ab2-e1cc-11e6-92aa-08002789cd2e | gr-3        |        3306 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.01 sec)

gr-3 does not rejoin the cluster.

mysql> show global status like '%group_repl%';
+----------------------------------+--------------------------------------+
| Variable_name                    | Value                                |
+----------------------------------+--------------------------------------+
| Com_group_replication_start      | 1                                    |
| Com_group_replication_stop       | 0                                    |
| group_replication_primary_member | 72149827-e1cc-11e6-9daf-08002789cd2e |
+----------------------------------+--------------------------------------+
3 rows in set (0.00 sec)

(FYI: the primary member is still shown in the status variable, but it can be outdated!)

gr-1 and gr-2 have this:

mysql> select * from replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 72149827-e1cc-11e6-9daf-08002789cd2e | gr-1        |        3306 | ONLINE       |
| group_replication_applier | 740e1fd2-e1cc-11e6-a8ec-08002789cd2e | gr-2        |        3306 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
2 rows in set (0.00 sec)

So gr-3 is now broken; it won't join the cluster again automatically.

Suggested fix:

When there is a small network glitch, one or more nodes might lose their connection to the cluster.

Nodes do not rejoin the group automatically.

You can easily break a cluster by just bringing down the interfaces of 2 of the 3 nodes: wait a few seconds until Group Replication notices, and the primary node will get stuck on new writes.
[3 Mar 2017 12:02] Filipe Campos
Hi Kenny,

Thank you for evaluating Group Replication! Your (and all community)
feedback is important!

When a member of the group is unreachable by a majority of members for
some time, it is expelled from the group.
When this expelled member has its network connection restored, it tries
to rejoin the group and fails, so it changes to the ERROR state and writes
are forbidden by super_read_only=1.
If the network connection is restored before the member is expelled,
it will be able to resume operation.

Having said that, we understand your concern and are looking into improving
this situation.
[18 Apr 2017 11:37] Sheraz Ahmed
If a group member is unreachable, after how much time is it going to be expelled from the group, and is this time configurable?

Secondly, if a group member receives an invalid transaction and continuously fails to apply it, after how many attempts and what time interval is it going to go into the ERROR state, and are these settings (interval + attempts) configurable?
[20 Apr 2017 13:11] Filipe Campos
Hello Sheraz Ahmed

First of all, thank you for evaluating Group Replication! Your (and all
community) feedback is important!

Regarding your questions:

Q: If a group member is unreachable, after how much time is it going to be
expelled from the group, and is this time configurable?

A: After 5 seconds, the other nodes will suspect that the unreachable group
member has failed, and it will be expelled from the group by one of them.
Currently, this time period is not configurable.

Q: Secondly, if a group member receives an invalid transaction and continuously
fails to apply it, after how many attempts and what time interval is it going to
go into the ERROR state, and are these settings (interval + attempts) configurable?

A: If it is a permanent applier error, such as a duplicate primary key or a
missing table, it will move to the ERROR state after the first error; there is
no retry. If it is a temporary error, such as an InnoDB lock timeout, it will
retry X times, depending on the value of slave_transaction_retries.
https://dev.mysql.com/doc/refman/5.7/en/replication-options-slave.html#sysvar_slave_transa...
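For reference, the retry count for temporary applier errors can be inspected and adjusted at runtime; the value 20 below is only an illustration, not a recommendation:

```sql
-- Inspect the current retry count for temporary applier errors
SHOW GLOBAL VARIABLES LIKE 'slave_transaction_retries';

-- Raise it (example value only)
SET GLOBAL slave_transaction_retries = 20;
```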

Hope these answers are enough to clear up your doubts.

Best regards,

Filipe
[10 May 2017 13:03] Sheraz Ahmed
Filipe Campos,

I appreciate your quick and thorough response. At the same time I apologize for not being able to thank you earlier. 

Thank you :)
[1 Jul 2017 22:49] Jo Goossens
Hi,

After some testing I noticed the permanent ERROR state as well. In our opinion this significantly increases the possibility of total failure. If it is not fixed manually soon enough and another node has the same kind of issue, the whole cluster is down.

Is there any fix planned for this? For example for MySQL 5.7.19?

Thanks a lot for looking further into this!
[15 Sep 2017 16:33] Ramesh Patel
We are seeing the same in 5.7.19... is this fixed?
[16 Nov 2017 8:12] Frank Ullrich
Situation in 5.7.20 seems to be unchanged!
[23 Jan 2018 1:15] ronnie arangali
Is it fixed in 5.7.21?
[23 Jan 2018 8:10] Jo Goossens
We recently had to restart 2 nodes manually and let Group Replication repair everything. We had a 20-minute outage while this process was happening (we discovered it too late).

We have improved our monitoring to catch it now, but why can a simple network error cause a permanent error state so easily? In our opinion, this defeats the whole point of this redundant setup.
[18 Apr 2018 2:06] Kenny Gryp
I really think this bug/feature should be fixed.
Several people have commented on this bug already; this is not an isolated case.

It really reduces the practical use of Group Replication.

Additionally, I would like to add some more information that I figured out during my tests:
when there are only 2 nodes remaining in the cluster, the nodes DO RECONNECT. This seems to be a special case that avoids clusters going down completely, as far as I can see. Is this on purpose?

In any case, I would like to see all nodes reconnect automatically AND block writes instead of stalling them (with group_replication_unreachable_majority_timeout > 0)
[18 Apr 2018 2:07] Kenny Gryp
Also reproducible in version 8.0.4
[18 Apr 2018 8:13] Nuno Carvalho
Hi all,

Thank you for your scenarios; we are looking into improving this.

Best regards,
Nuno Carvalho
[9 May 2018 10:42] Shubhra Prakash Nandi
Hi, I was just wondering if a scheduled restart of the MySQL cluster instances would be a good idea to get around this issue. The restarts would not be simultaneous, but sufficiently separated in time so that the restarted node can recover before the other node is restarted.

Apart from this, is there any official workaround in place for this issue now?
[21 Aug 2018 22:54] Roger Lee
Could we get an update on this bug? Could we at least make the 5-second timeout configurable to a longer time?

Can you recommend a means to reconnect the lost member, or a way to manually rejoin that lost member?
[24 Sep 2018 11:25] Ojas Desai
Has this been fixed in any of the latest versions?
[25 Oct 2018 7:10] MANVITH GOLLA
Is this issue fixed in the current GA, MySQL 8.0.12?
[31 Oct 2018 12:57] Nuno Carvalho
Thank you for the feedback. While we do not yet have automatic rejoin after network issues, please check how to extend the network timeout at

https://dev.mysql.com/doc/refman/8.0/en/group-replication-options.html#sysvar_group_replic...

https://mysqlhighavailability.com/group-replication-coping-with-unreliable-failure-detecti...
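For example, on MySQL 8.0.13 or later the expel timeout can be raised so that a member cut off by a short network glitch is not expelled right away; the value 60 below is only illustrative:

```sql
-- Allow up to 60 extra seconds of unreachability (on top of the
-- 5-second failure detection period) before a member is expelled.
SET GLOBAL group_replication_member_expel_timeout = 60;
```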

Best regards,
Nuno Carvalho
[4 Mar 2019 21:22] Jo Goossens
That is a great feature! However, there are larger changes in MySQL 8.0 which prevent us from just upgrading our existing 5.7 cluster.

Also, such a large upgrade just for this feature seems a bit much.

Any chance group_replication_member_expel_timeout will be backported to MySQL 5.7? It seems like a smaller patch to do this?
[27 Mar 2019 15:27] Margaret Fisher
Posted by developer:
 
Fixed by WL #11284 Group Replication: auto-rejoin member to group after an expel in MySQL 8.0.16. Changelog entry:

For Group Replication, the new system variable group_replication_autorejoin_tries lets you specify the number of tries that a member makes to automatically rejoin the group if it is expelled, or if it is unable to contact a majority of the group before the group_replication_unreachable_majority_timeout setting is reached. The default setting, 0, means that the member does not try to rejoin, and proceeds to the action specified by the group_replication_exit_state_action system variable. 

Activate auto-rejoin if you can tolerate the possibility of stale reads and want to minimize the need for manual intervention, especially where transient network issues fairly often result in the expulsion of members.

If you specify a number of tries, when the member's expulsion or unreachable majority timeout is reached, it makes an attempt to rejoin (using the same settings as it used previously), then continues to make further auto-rejoin attempts up to the specified number of tries. After an unsuccessful auto-rejoin attempt, the member waits 5 minutes before the next try.

During the auto-rejoin procedure, the member remains in super read only mode and displays an ERROR state to the replication group. The member can be stopped manually at any time by using a STOP GROUP_REPLICATION statement or shutting down the server. If the specified number of tries is exhausted without the member rejoining or being stopped, the member proceeds to the action specified by the group_replication_exit_state_action system variable, which can be either remaining in super read only mode or shutting down.
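As a sketch, on 8.0.16 or later the auto-rejoin behavior described above could be enabled like this (3 tries is just an example value):

```sql
-- Try to rejoin automatically up to 3 times after being expelled;
-- each failed attempt is followed by a 5-minute wait.
SET GLOBAL group_replication_autorejoin_tries = 3;

-- If all tries fail, stay in super-read-only mode instead of shutting down.
SET GLOBAL group_replication_exit_state_action = 'READ_ONLY';
```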
[27 Mar 2019 15:31] Jo Goossens
That sounds fantastic! Any chance of a backport of any of these improvements to 5.7.x? :)
[28 Mar 2019 0:14] Hanjie Wang
+1 for back-porting auto-rejoin after expulsion and auto-rejoin retry to 5.7
[28 Mar 2019 1:43] Bin Hong
+1 for back-porting auto-rejoin after expulsion and auto-rejoin retry to 5.7
[28 Mar 2019 8:44] Yoann La Cancellera
+1 for back-porting auto-rejoin after expulsion and auto-rejoin retry to 5.7
I don't think this bug should be considered closed
[28 Mar 2019 20:44] Stian Halseth
+1 for back-porting auto-rejoin after expulsion and auto-rejoin retry to 5.7
[5 Apr 2019 7:32] Clara Medina GarcĂ­a
+1 for back-porting auto-rejoin after expulsion and auto-rejoin retry to 5.7
[10 May 2019 4:08] Si Hao Ng
+1 for back-porting auto-rejoin after expulsion and auto-rejoin retry to 5.7
[7 Jul 2019 6:39] Hochan Son
+1 for back-porting auto-rejoin after expulsion and auto-rejoin retry to 5.7