Bug #93578 group_replication fatal error, mysqld dies
Submitted: 12 Dec 2018 16:15 Modified: 18 Jan 15:28
Reporter: Eric Goldsmith
Status: Open
Category:MySQL Server: Group Replication Severity:S1 (Critical)
Version:8.0.13 OS:Microsoft Windows (Windows Server 2008 R2 Std)
Assigned to: CPU Architecture:x86 (64-bit AMD Opteron 6380)
Tags: dies, exception, Fatal, replication, server

[12 Dec 2018 16:15] Eric Goldsmith
Description:
I have three servers in an InnoDB Cluster / Group Replication. One died and did not recover.

Before the server died, there was a network failure. The group_replication plugin recognized that the other nodes in the cluster were unreachable and waited for them to reappear. Later, it recognized that the other nodes were reachable again and logged "Regular operation is restored and transactions are unblocked". About 2 seconds after that, it logged "Member was expelled from the group due to network failures, changing member status to ERROR". An exception was then thrown and the service died.

2018-12-12T00:32:09.432562Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.192.x.y:3307 has become unreachable.'
2018-12-12T00:32:09.448163Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.64.z.w:3307 has become unreachable.'
2018-12-12T00:32:09.448163Z 0 [ERROR] [MY-011495] [Repl] Plugin group_replication reported: 'This server is not able to reach a majority of members in the group. This server will now block all updates. The server will remain blocked until contact with the majority is restored. It is possible to use group_replication_force_members to force a new group membership.'
2018-12-12T00:37:37.973903Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.192.x.y:3307 is reachable again.'
2018-12-12T00:37:37.973903Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.64.z.w:3307 is reachable again.'
2018-12-12T00:37:37.973903Z 0 [Warning] [MY-011498] [Repl] Plugin group_replication reported: 'The member has resumed contact with a majority of the members in the group. Regular operation is restored and transactions are unblocked.'
2018-12-12T00:37:40.017529Z 0 [ERROR] [MY-011505] [Repl] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
2018-12-12T00:37:40.064330Z 0 [ERROR] [MY-013173] [Repl] Plugin group_replication reported: 'The plugin encountered a critical error and will abort: Fatal error during execution of Group Replication'
00:37:40 UTC - mysqld got exception 0x80000003 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=7
max_threads=151
thread_count=18
connection_count=6
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 67684 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
1402b66e2    mysqld.exe!?my_errno@@YAHXZ()
7fefc0adc17    ucrtbase.DLL!raise()
7fefc0aeaa1    ucrtbase.DLL!abort()
7fee894838e    group_replication.dll!???
7fee8904f92    group_replication.dll!???
7fee891f2bc    group_replication.dll!???
7fee891d0c2    group_replication.dll!???
7fee89b3681    group_replication.dll!???
7fee89b301d    group_replication.dll!???
7fee89b6695    group_replication.dll!???
7fee89b4c77    group_replication.dll!???
7fee89b6f0f    group_replication.dll!???
1406a8937    mysqld.exe!??1?$lock_guard@Vmutex@std@@@std@@QEAA@XZ()
1402b667c    mysqld.exe!?my_thread_join@@YAHPEAUmy_thread_handle@@PEAPEAX@Z()
7fefc05cd70    ucrtbase.DLL!_o__realloc_base()
771a59cd    kernel32.dll!BaseThreadInitThunk()
7740385d    ntdll.dll!RtlUserThreadStart()
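
To count how often this sequence appears in an error log, a minimal Python sketch such as the following could be used. It looks for a MY-011498 ("resumed contact with a majority") entry followed shortly by a MY-011505 ("expelled") entry. The error codes and the line layout are taken from the excerpt above; the function names and the 10-second window are assumptions made for illustration, not anything defined by MySQL.

```python
import re
from datetime import datetime, timedelta

# Error codes copied from the log excerpt in this report.
RESUMED = "MY-011498"   # contact with a majority of members restored
EXPELLED = "MY-011505"  # member expelled from the group

# Matches e.g. "2018-12-12T00:37:40.017529Z 0 [ERROR] [MY-011505] ..."
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)Z \d+ \[\w+\] \[(MY-\d+)\]"
)


def parse(line):
    """Return (timestamp, code) for an error-log line, or None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f"), m.group(2)


def find_expel_after_resume(lines, window=timedelta(seconds=10)):
    """Yield (resumed_at, expelled_at) pairs where the expel follows the resume within `window`.

    The 10-second default is an arbitrary assumption; adjust it as needed.
    """
    last_resumed = None
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue  # crash dump text, backtrace lines, and so on
        ts, code = parsed
        if code == RESUMED:
            last_resumed = ts
        elif code == EXPELLED and last_resumed is not None:
            if ts - last_resumed <= window:
                yield last_resumed, ts
            last_resumed = None
```

Passing an open error-log file (any iterable of lines) to `find_expel_after_resume` and wrapping the result in `list(...)` gives one timestamp pair per occurrence.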

How to repeat:
Break the network at one of the servers in a cluster.

Note: I was not able to test repeatability since I'm not allowed to break the network. I have, however, seen this failure occur on 8.0.12 in the same way. The log was essentially the same as the one reported above.

2018-11-30T15:19:16.719205Z 0 [Warning] [MY-011498] [Repl] Plugin group_replication reported: 'The member has resumed contact with a majority of the members in the group. Regular operation is restored and transactions are unblocked.'
2018-11-30T15:19:17.013235Z 0 [ERROR] [MY-011505] [Repl] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
2018-11-30T15:19:17.036237Z 0 [ERROR] [MY-013173] [Repl] Plugin group_replication reported: 'The plugin encountered a critical error and will abort: Fatal error during execution of Group Replication'
15:19:17 UTC - mysqld got exception 0x80000003 ;

Suggested fix:
Handle the exception
[13 Dec 2018 16:53] Eric Goldsmith
Additionally, when an exception like this occurs, mysqld could return a non-zero exit code so that Windows restarts the service (see the 'Recovery' tab of the MySQL service properties). MySQL Server 8.0.13 does not appear to do this: I can't get Windows to restart the service after it fails.
[11 Jan 16:52] Eric Goldsmith
I have observed 20 failures in the last 4 days, and each of the 3 servers in the cluster has exhibited this problem.

Even though the error log states "Member was expelled from the group due to network failures", this does not appear to be so. Persistent connections to non-cluster MySQL services (on the same servers) have not died.
[16 Jan 15:40] Mario Staykov
I have observed the same bug, which I described in https://dba.stackexchange.com/questions/227199/group-replication-plugin-crashes-mysql-8-0/... before I was certain it was a bug.

I encountered it in a slightly different context: even just attempting to INSTALL PLUGIN caused the crash. The workaround that was found is to set the following in /etc/mysql/my.cnf:
    loose-group_replication_exit_state_action = READ_ONLY
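
To confirm that an option file actually carries this setting, a small Python sketch like the one below may help. It relies on two option-file conventions: the `loose-` prefix is optional, and dashes and underscores in option names are interchangeable. The function name `exit_state_action` is hypothetical, and the parser is deliberately simplified: it does not follow `!include` or `!includedir` directives and only inspects the groups it is given.

```python
def exit_state_action(option_text, groups=("mysqld",)):
    """Return the last group_replication_exit_state_action value set in `groups`, or None.

    Simplified sketch: ignores !include/!includedir and does not handle quoting rules fully.
    """
    target = "group_replication_exit_state_action"
    current_group = None
    value = None
    for raw in option_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;!":
            continue  # blank line, comment, or include directive
        if line.startswith("[") and line.endswith("]"):
            current_group = line[1:-1].strip().lower()
            continue
        if current_group not in groups or "=" not in line:
            continue
        name, _, val = line.partition("=")
        # Dashes and underscores are equivalent in option names.
        name = name.strip().lower().replace("-", "_")
        if name.startswith("loose_"):
            name = name[len("loose_"):]
        if name == target:
            # Drop a trailing "# comment" and surrounding quotes, if any.
            value = val.split("#", 1)[0].strip().strip("'\"")
    return value
```

For example, calling it on the contents of /etc/mysql/my.cnf should return `READ_ONLY` once the workaround line is present in the `[mysqld]` group.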

Obviously, merely triggering this behaviour, which has been the default since 8.0.12 (https://dev.mysql.com/doc/refman/8.0/en/group-replication-options.html#sysvar_group_replic...), shouldn't crash MySQL. It should instead be handled as an exception that indicates why MySQL is stopping.
[18 Jan 15:28] Eric Goldsmith
Thanks Mario!
I'll give that a try.