Bug #99659 Group Replication going to ERROR after Regular operation is restored
Submitted: 21 May 2020 14:49
Modified: 28 May 2020 15:15
Reporter: Claude Steiner
Status: Not a Bug
Category: MySQL Server: Group Replication
Severity: S2 (Serious)
Version: mysql80-server-8.0.19_2
OS: FreeBSD (FreeBSD 12.1-RELEASE r354233 GENERIC amd64)
Assigned to: MySQL Verification Team
CPU Architecture: x86
Tags: Group Replication Error

[21 May 2020 14:49] Claude Steiner
Description:
I'm running MySQL Group Replication with three MySQL servers on FreeBSD 12.1 (under ESXi 6.5).

I noticed a very strange behaviour that happens every 10-14 days on one of the three MySQL servers.
The server obviously has trouble reaching the other servers in the replication group, which is, I guess, not a problem of MySQL Server itself, but probably an issue with FreeBSD 12.1 and ESXi 6.5 (I'm still trying to figure that out and fix it).
So the network seems to stop working correctly for a second, then recovers...

I get the notice: 'The member has resumed contact with a majority of the members in the group. Regular operation is restored and transactions are unblocked.'

Three seconds later, this is followed by the error 'Member was expelled from the group due to network failures, changing member status to ERROR.'

I guess this is an error in MySQL Server, as it states that regular operation is restored, but then goes into ERROR state anyway.

LOG:
2020-05-21T12:37:23.114565Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.0.1.62:3306 has become unreachable.'
2020-05-21T12:37:23.115638Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.0.1.63:3306 has become unreachable.'
2020-05-21T12:37:23.115687Z 0 [ERROR] [MY-011495] [Repl] Plugin group_replication reported: 'This server is not able to reach a majority of members in the group. This server will now block all updates. The server will remain blocked until contact with the majority is restored. It is possible to use group_replication_force_members to force a new group membership.'
2020-05-21T12:37:24.119704Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.0.1.62:3306 is reachable again.'
2020-05-21T12:37:24.120123Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.0.1.63:3306 is reachable again.'
2020-05-21T12:37:24.120359Z 0 [Warning] [MY-011498] [Repl] Plugin group_replication reported: 'The member has resumed contact with a majority of the members in the group. Regular operation is restored and transactions are unblocked.'
2020-05-21T12:37:27.351154Z 0 [ERROR] [MY-011505] [Repl] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
2020-05-21T12:37:27.352920Z 0 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
(END)
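
The state reported in the last two log lines can also be checked from SQL. A minimal sketch, assuming it is run on the affected member while the server is still up; the query only reads status and changes nothing:

```sql
-- List the group members as seen by this server. After the expulsion
-- logged above, the local member is expected to show MEMBER_STATE = 'ERROR'.
SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE
FROM performance_schema.replication_group_members;

-- Check whether the server has been switched to read-only mode,
-- as the MY-011712 message says.
SELECT @@global.super_read_only;
```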

I have many VMs (running FreeBSD 11.1 and 12.1) on that ESXi host with many different services on them (DNS, HAProxy, Apache, etc.) and haven't noticed any other network issue there. Any hint about this underlying network issue is appreciated.

How to repeat:
Hard to say how this can be repeated, maybe by intentionally bringing the network interface down for just a second (not tested).

Suggested fix:
Make sure that, once regular operation has been restored, the member cannot still end up in ERROR state.

[28 May 2020 15:15] MySQL Verification Team
Hi,

This is not a bug. You have network issues. Since 8.0.13 there are more timeout settings you can tweak to deal with this type of issue. Start with group_replication_member_expel_timeout:

https://dev.mysql.com/doc/refman/8.0/en/group-replication-options.html#sysvar_group_replic...
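
As a rough sketch of what that could look like: the value 30 below is only an illustration, not a recommendation, so pick one that matches how long your network glitches actually last.

```sql
-- Inspect the current setting on this member.
SHOW GLOBAL VARIABLES LIKE 'group_replication_member_expel_timeout';

-- Illustrative value only: wait 30 more seconds after a member becomes
-- suspected before the group expels it.
-- SET PERSIST also keeps the value across server restarts.
SET PERSIST group_replication_member_expel_timeout = 30;
```

The setting should be the same on all members of the group, so it would need to be applied on each of the three servers.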

kind regards
Bogdan