MySQL Bugs: #84784: Group Replication nodes do not rejoin cluster after network connectivity issues

Bug #84784	Group Replication nodes do not rejoin cluster after network connectivity issues
Submitted:	2 Feb 2017 3:06	Modified:	27 Mar 2019 15:27
Reporter:	Kenny Gryp	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Group Replication	Severity:	S2 (Serious)
Version:	5.7.17, 8.0.4	OS:	Any
Assigned to:	Filipe Campos	CPU Architecture:	Any

Description:
Nodes do not reconnect back to the group replication once they got disconnected, causing nodes to drop from the cluster up to losing the whole cluster availability

How to repeat:
Have a 3 node cluster:

mysql> select * from replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 72149827-e1cc-11e6-9daf-08002789cd2e | gr-1        |        3306 | ONLINE       |
| group_replication_applier | 740e1fd2-e1cc-11e6-a8ec-08002789cd2e | gr-2        |        3306 | ONLINE       |
| group_replication_applier | 74dc6ab2-e1cc-11e6-92aa-08002789cd2e | gr-3        |        3306 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.00 sec)

[vagrant@gr-3 ~]$ sudo ifconfig eth1 down

mysql> select * from replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 72149827-e1cc-11e6-9daf-08002789cd2e | gr-1        |        3306 | UNREACHABLE  |
| group_replication_applier | 740e1fd2-e1cc-11e6-a8ec-08002789cd2e | gr-2        |        3306 | UNREACHABLE  |
| group_replication_applier | 74dc6ab2-e1cc-11e6-92aa-08002789cd2e | gr-3        |        3306 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.00 sec)

the node becomes unreachable.

[vagrant@gr-3 ~]$ sudo ifconfig eth1 up

mysql> select * from replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 72149827-e1cc-11e6-9daf-08002789cd2e | gr-1        |        3306 | UNREACHABLE  |
| group_replication_applier | 740e1fd2-e1cc-11e6-a8ec-08002789cd2e | gr-2        |        3306 | UNREACHABLE  |
| group_replication_applier | 74dc6ab2-e1cc-11e6-92aa-08002789cd2e | gr-3        |        3306 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.01 sec)

gr-3 doesn't join the cluster anymore.

mysql> show global status like '%group_repl%';
+----------------------------------+--------------------------------------+
| Variable_name                    | Value                                |
+----------------------------------+--------------------------------------+
| Com_group_replication_start      | 1                                    |
| Com_group_replication_stop       | 0                                    |
| group_replication_primary_member | 72149827-e1cc-11e6-9daf-08002789cd2e |
+----------------------------------+--------------------------------------+
3 rows in set (0.00 sec)

(FYI: The primary member is still in the status variable. but can be outdated!)

gr-1 and gr-2 have this:

mysql> select * from replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 72149827-e1cc-11e6-9daf-08002789cd2e | gr-1        |        3306 | ONLINE       |
| group_replication_applier | 740e1fd2-e1cc-11e6-a8ec-08002789cd2e | gr-2        |        3306 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
2 rows in set (0.00 sec)

so gr-3 is now broken. it won't join the cluster again automatically

Suggested fix:

When there is a small network glitch, a node or multiple nodes might lose connection with the cluster.

Nodes do not rejoin the group replication automatically.

You can easily break a cluster by just bringing the interfaces down of 2 of the 3 nodes, wait a few seconds until group replication figures it out and the master node will get stuck on new writes.

Hi Kenny,

Thank you for evaluating Group Replication! Your (and all community)
feedback is important!

When a member of the group is unreachable by a majority of members during
some time, it is expelled from the group.
Then, when this expelled member has its network connection restored, it tries
to rejoin the group and fails, hence it changes to the ERROR state and writes
are forbidden by super_read_only=1.
If the network connection is restored before the member being expelled,
it will be able to resume operation.

Having said that, we understand your concern and are looking into improving
this situation.

If a group member is unreachable, after how much time its going to be expelled from the group and is this time configurable? 

Secondly, If a group member receives invalid transaction and continuously fails to apply it, after how many attempts and time interval its going to go in error state and  are these settings (interval + attempts) configurable?

Hello Sheraz Ahmed

First of all, thank you for evaluating Group Replication! Your (and all
community) feedback is important!

Regarding your questions:

Q: If a group member is unreachable, after how much time its going to be
expelled from the group and is this time configurable?

A: After 5 seconds, other nodes will suspect the unreachable group member
has failed and it will be expelled from the group by one of them. Currently,
this time period is not configurable.

Q: Secondly, If a group member receives invalid transaction and continuously
fails to apply it, after how many attempts and time interval its going to
go in error state and are these settings (interval + attempts) configurable?

A: If it is a permanent applier error, like a duplicate primary key or table
missing, it will move to ERROR state after first error, there is no retry. If
it is a temporary error, like a InnoDB lock timeout, it will retry X times,
depending on the value of slave_transaction_retries.
https://dev.mysql.com/doc/refman/5.7/en/replication-options-slave.html#sysvar_slave_transa...

Hope these answers are enough to clear up your doubts.

Best regards,

Filipe

Filipe Campos,

I appreciate your quick and thorough response. At the same time I apologize for not being able to thank you earlier. 

Thank you :)

Hi,

After some testing I noticed as well the permanent error state. In our opinion this increases the possibility of total failure significantly. If not fixed manually soon enough and another node has the same kind of issue, the whole cluster is down.

Is there any fix planned for this? For example for MySQL 5.7.19?

Thanks a lot for looking further into this!

we are seeing the same in 5.7.19...is this fixed?

Situation in 5.7.20 seems to be unchanged!

is it fix on 5.7.21?

We recently had to restart 2 nodes manually and let everything repair by the group replication. We had 20 minutes outage while this process was happening (we discovered too late).

We improved our monitoring now to catch it, but why can a simple network error cause a permanent error state so easily? This bypasses the whole point of this redundant setup in our opinion.

I really think this bug/feature should be fixed.
Several people commented on this bug already, this is not an isolated case.

I really reduces the practical use of group replication.

Additionally, I would like to add some more information that I figured out during my tests:
when there are only 2 nodes remaining in the cluster, the nodes DO RECONNECT. This seems to be a special case and avoids clusters going down completely as far as I can see. On purpose?

In any case, I would like to see  all nodes reconnect automatically AND block writes instead of stall them (with group_replication_unreachable_majority_timeout > 0)

Also reproducible in version 8.0.4

Hi all,

Thank you for your scenarios, we are looking into improve this.

Best regards,
Nuno Carvalho

Hi, I was just wondering if a scheduled restart of mysql cluster instances would be good idea to get around this issue. They will not be simultaneous but sufficiently separated in time so that the restarted node can recover before other node is restarted.

Apart from this, is there any official workaround in place for this issue now?

Could we get an update on this bug?  Could we at least make the 5 second timeout at least configurable to a long time?

Can we recommend a means to reconnect the lost member or a way to manually rejoin that lost member?

Fixed in any of the Latest versions ?

Is this issue is fixed in current GA MY-8.0.12?

Thank you for the feedback, while we do not have automatic rejoin after network issues please check how to extend the network timeout at

https://dev.mysql.com/doc/refman/8.0/en/group-replication-options.html#sysvar_group_replic...

https://mysqlhighavailability.com/group-replication-coping-with-unreliable-failure-detecti...

Best regards,
Nuno Carvalho

That is a great feature! However there are larger changes in MySQL 8.0 which prevent us from just upgrading our existing 5.7 cluster. 

Also such larger upgrade just for this features sounds a bit much.

Any chance  group_replication_member_expel_timeout  will be backported to MySQL 5.7? Seems like a smaller patch to do this?

Posted by developer:
 
Fixed by WL #11284 Group Replication: auto-rejoin member to group after an expel in MySQL 8.0.16. Changelog entry:

For Group Replication, the new system variable group_replication_autorejoin_tries lets you specify the number of tries that a member makes to automatically rejoin the group if it is expelled, or if it is unable to contact a majority of the group before the group_replication_unreachable_majority_timeout setting is reached. The default setting, 0, means that the member does not try to rejoin, and proceeds to the action specified by the group_replication_exit_state_action system variable. 

Activate auto-rejoin if you can tolerate the possibility of stale reads and want to minimize the need for manual intervention, especially where transient network issues fairly often result in the expulsion of members.  If you specify a number of tries, when the member's expulsion or unreachable majority timeout is reached, it makes an attempt to rejoin (using the same settings as it used previously), then continues to make further auto-rejoin attempts up to the specified number of tries. After an unsuccessful auto-rejoin attempt, the member waits 5 minutes before the next try. During the auto-rejoin procedure, the member remains in super read only mode and displays an ERROR state to the replication group. The member can be stopped manually at any time by using a STOP GROUP_REPLICATION statement or shutting down the server. If the specified number of tries is exhausted without the member rejoining or being stopped, the member proceeds to the action specified by the group_replication_exit_state_action system variable, which can be either remaining in super read only mode or shutting down.

That sounds fantastic, any chance about backport of any of these improvments for 5.7.x ? :)

+1 for back-porting auto-rejoin after expulsion and auto-rejoin retry to 5.7

+1 for back-porting auto-rejoin after expulsion and auto-rejoin retry to 5.7

+1 for back-porting auto-rejoin after expulsion and auto-rejoin retry to 5.7
I don't think this bug should be considered closed

+1 for back-porting auto-rejoin after expulsion and auto-rejoin retry to 5.7

+1 for back-porting auto-rejoin after expulsion and auto-rejoin retry to 5.7

+1 for back-porting auto-rejoin after expulsion and auto-rejoin retry to 5.7

+1 for back-porting auto-rejoin after expulsion and auto-rejoin retry to 5.7