Description:
I'm running MySQL Group Replication with three MySQL servers on FreeBSD 12.1 (under ESXi 6.5).
I noticed very strange behaviour that happens every 10-14 days on one of the three MySQL servers.
The server apparently has network trouble reaching the other members of the replication group. I guess that is not a problem of MySQL Server itself, but probably an issue with FreeBSD 12.1 and ESXi 6.5 (I'm still trying to figure that out and fix it).
So the network seems to stop working correctly for about a second, then recovers.
I then get the notice: 'The member has resumed contact with a majority of the members in the group. Regular operation is restored and transactions are unblocked.'
About three seconds later this is followed by the error: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
I guess this is a bug in MySQL Server, as it states that regular operation is restored but then goes into ERROR state anyway.
LOG:
2020-05-21T12:37:23.114565Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.0.1.62:3306 has become unreachable.'
2020-05-21T12:37:23.115638Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.0.1.63:3306 has become unreachable.'
2020-05-21T12:37:23.115687Z 0 [ERROR] [MY-011495] [Repl] Plugin group_replication reported: 'This server is not able to reach a majority of members in the group. This server will now block all updates. The server will remain blocked until contact with the majority is restored. It is possible to use group_replication_force_members to force a new group membership.'
2020-05-21T12:37:24.119704Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.0.1.62:3306 is reachable again.'
2020-05-21T12:37:24.120123Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.0.1.63:3306 is reachable again.'
2020-05-21T12:37:24.120359Z 0 [Warning] [MY-011498] [Repl] Plugin group_replication reported: 'The member has resumed contact with a majority of the members in the group. Regular operation is restored and transactions are unblocked.'
2020-05-21T12:37:27.351154Z 0 [ERROR] [MY-011505] [Repl] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
2020-05-21T12:37:27.352920Z 0 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
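To make the timing concrete, here is a minimal Python sketch that measures the gap between the "regular operation is restored" message (MY-011498) and the expulsion (MY-011505). The helper name `event_times` is hypothetical, and the sample lines reuse only the timestamps and error codes from the log above, with the message text shortened.

```python
from datetime import datetime

def event_times(log_text):
    """Map each MY-NNNNNN code to the timestamp of its last occurrence."""
    times = {}
    for line in log_text.splitlines():
        fields = line.split()
        # Expected layout: timestamp, thread id, [level], [MY-code], ...
        if len(fields) < 4 or not fields[3].startswith("[MY-"):
            continue
        stamp = datetime.strptime(fields[0], "%Y-%m-%dT%H:%M:%S.%fZ")
        times[fields[3].strip("[]")] = stamp
    return times

# Timestamps and codes taken from the log above; messages abbreviated.
sample = (
    "2020-05-21T12:37:24.120359Z 0 [Warning] [MY-011498] [Repl] resumed contact\n"
    "2020-05-21T12:37:27.351154Z 0 [ERROR] [MY-011505] [Repl] expelled\n"
)

times = event_times(sample)
gap = times["MY-011505"] - times["MY-011498"]
print(f"expelled {gap.total_seconds():.3f} s after operation was reported restored")
```

For the log above this gives a gap of roughly 3.2 seconds.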
I have many VMs (running FreeBSD 11.1 and 12.1) on that ESXi host with many different services on them (DNS, HAProxy, Apache, etc.) and haven't noticed any other network issue there. Any hint about the underlying network issue is appreciated.
How to repeat:
Hard to say how this can be repeated; maybe by intentionally bringing the network interface down for just a second (not tested).
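A rough, untested sketch of that idea in Python follows. The function name `flap_interface` and the interface name `vmx0` are placeholders, not taken from the affected system, and whether a one-second outage is enough to trigger the expulsion is an assumption. It defaults to a dry run that only returns the `ifconfig` commands; running it for real requires root on the group member.

```python
import subprocess
import time

def flap_interface(iface, seconds=1.0, dry_run=True):
    """Take `iface` down for `seconds`, then bring it back up.

    Returns the two commands; with dry_run=True nothing is executed.
    """
    down = ["ifconfig", iface, "down"]
    up = ["ifconfig", iface, "up"]
    if not dry_run:
        subprocess.run(down, check=True)
        try:
            time.sleep(seconds)
        finally:
            # Always try to restore the interface, even if interrupted.
            subprocess.run(up, check=True)
    return [down, up]

# "vmx0" is a placeholder; substitute the real interface name.
commands = flap_interface("vmx0")
print(commands)
```

Calling `flap_interface("vmx0", dry_run=False)` on one member while watching its error log should show whether the same message sequence appears.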
Suggested fix:
Make sure that once the member reports that regular operation is restored, it cannot still be put into ERROR state.