Bug #93578 group_replication fatal error, mysqld dies
Submitted: 12 Dec 2018 16:15  Modified: 5 Feb 2019 10:10
Reporter: Eric Goldsmith
Status: Closed
Category: MySQL Server: Group Replication  Severity: S3 (Non-critical)
Version: 8.0.13  OS: Windows (Windows Server 2008 R2 Std)
Assigned to:  CPU Architecture: x86 (64-bit AMD Opteron 6380)
Tags: dies, exception, Fatal, replication, server

[12 Dec 2018 16:15] Eric Goldsmith
Description:
I have three servers in an InnoDB Cluster / Group Replication. One died and did not recover.

Prior to dying, there was a network failure. The group_replication plugin recognized that the other nodes in the cluster were unreachable and waited for them to reappear. Later, it recognized that the other nodes were reachable again and logged "Regular operation is restored and transactions are unblocked", but 3 seconds later it logged "Member was expelled from the group due to network failures, changing member status to ERROR", then an exception was thrown and the service died.

2018-12-12T00:32:09.432562Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.192.x.y:3307 has become unreachable.'
2018-12-12T00:32:09.448163Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.64.z.w:3307 has become unreachable.'
2018-12-12T00:32:09.448163Z 0 [ERROR] [MY-011495] [Repl] Plugin group_replication reported: 'This server is not able to reach a majority of members in the group. This server will now block all updates. The server will remain blocked until contact with the majority is restored. It is possible to use group_replication_force_members to force a new group membership.'
2018-12-12T00:37:37.973903Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.192.x.y:3307 is reachable again.'
2018-12-12T00:37:37.973903Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.64.z.w:3307 is reachable again.'
2018-12-12T00:37:37.973903Z 0 [Warning] [MY-011498] [Repl] Plugin group_replication reported: 'The member has resumed contact with a majority of the members in the group. Regular operation is restored and transactions are unblocked.'
2018-12-12T00:37:40.017529Z 0 [ERROR] [MY-011505] [Repl] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
2018-12-12T00:37:40.064330Z 0 [ERROR] [MY-013173] [Repl] Plugin group_replication reported: 'The plugin encountered a critical error and will abort: Fatal error during execution of Group Replication'
00:37:40 UTC - mysqld got exception 0x80000003 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=7
max_threads=151
thread_count=18
connection_count=6
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 67684 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
1402b66e2    mysqld.exe!?my_errno@@YAHXZ()
7fefc0adc17    ucrtbase.DLL!raise()
7fefc0aeaa1    ucrtbase.DLL!abort()
7fee894838e    group_replication.dll!???
7fee8904f92    group_replication.dll!???
7fee891f2bc    group_replication.dll!???
7fee891d0c2    group_replication.dll!???
7fee89b3681    group_replication.dll!???
7fee89b301d    group_replication.dll!???
7fee89b6695    group_replication.dll!???
7fee89b4c77    group_replication.dll!???
7fee89b6f0f    group_replication.dll!???
1406a8937    mysqld.exe!??1?$lock_guard@Vmutex@std@@@std@@QEAA@XZ()
1402b667c    mysqld.exe!?my_thread_join@@YAHPEAUmy_thread_handle@@PEAPEAX@Z()
7fefc05cd70    ucrtbase.DLL!_o__realloc_base()
771a59cd    kernel32.dll!BaseThreadInitThunk()
7740385d    ntdll.dll!RtlUserThreadStart()

How to repeat:
Break the network at one of the servers in a cluster.

Note: I was not able to test repeatability since I'm not allowed to break the network. I have, however, seen this failure occur on 8.0.12 in the same way. The log was essentially the same as the one reported above.

2018-11-30T15:19:16.719205Z 0 [Warning] [MY-011498] [Repl] Plugin group_replication reported: 'The member has resumed contact with a majority of the members in the group. Regular operation is restored and transactions are unblocked.'
2018-11-30T15:19:17.013235Z 0 [ERROR] [MY-011505] [Repl] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
2018-11-30T15:19:17.036237Z 0 [ERROR] [MY-013173] [Repl] Plugin group_replication reported: 'The plugin encountered a critical error and will abort: Fatal error during execution of Group Replication'
15:19:17 UTC - mysqld got exception 0x80000003 ;

Suggested fix:
Handle the exception gracefully so the server does not crash.
[13 Dec 2018 16:53] Eric Goldsmith
Additionally, when an exception like this occurs, exiting with a non-zero error code would allow Windows to restart the service automatically (see the 'Recovery' tab in the MySQL service properties). MySQL Server 8.0.13 does not appear to do this, as I cannot get Windows to restart the service after it fails.
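For reference, the service recovery behaviour described above can also be configured from an elevated command prompt rather than the GUI; a sketch, assuming the service is named MySQL80 (the name on your system may differ):

```shell
:: Restart the service 60 seconds after each failure;
:: reset the failure counter after 1 day (86400 s).
:: Note: the space after each '=' is required by sc.
sc failure MySQL80 reset= 86400 actions= restart/60000/restart/60000/restart/60000
```

This only takes effect when mysqld actually terminates with a failure status, which is the behaviour being requested here.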
[11 Jan 2019 16:52] Eric Goldsmith
We have observed 20 failures in the last 4 days, and each of the 3 servers in the cluster has exhibited this problem.

Even though the error log states "Member was expelled from the group due to network failures", this does not appear to be the case: persistent connections to non-cluster MySQL services on the same servers have not died.
[16 Jan 2019 15:40] Mario Staykov
I have observed the same bug, which I described in https://dba.stackexchange.com/questions/227199/group-replication-plugin-crashes-mysql-8-0/... before I was certain it was a bug.

The context I encountered it in was slightly different: even just attempting INSTALL PLUGIN caused the crash. The workaround was to specify the following in /etc/mysql/my.cnf:
    loose-group_replication_exit_state_action = READ_ONLY

Obviously, merely triggering this behaviour, which has been the default since 8.0.12 (https://dev.mysql.com/doc/refman/8.0/en/group-replication-options.html#sysvar_group_replic...), should not crash MySQL; it should be handled as an error that clearly indicates why MySQL is stopping.
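For anyone applying the workaround above, a minimal sketch of the relevant my.cnf section; the loose- prefix is what makes it safe to set the option before the plugin is installed:

```ini
[mysqld]
# The loose- prefix prevents an unknown-variable startup error
# when the group_replication plugin is not yet installed.
loose-group_replication_exit_state_action = READ_ONLY
```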
[18 Jan 2019 15:28] Eric Goldsmith
Thanks Mario!
I'll give that a try.
[5 Feb 2019 2:00] MySQL Verification Team
Hi,

Thanks for your report. The bug is verified, but I dropped the severity to S3 since there is a workaround.

kind regards
Bogdan
[5 Feb 2019 10:10] Nuno Carvalho
Hi Eric,

Let's split this into two parts.

First, in 8.0.13 the default value of group_replication_exit_state_action is ABORT_SERVER, which means that when an error forces the server to abandon the group, such as a network partition, it aborts.
Abort here literally means abort, as your stack trace shows. In 8.0.14 we improved that behaviour by performing a clean server shutdown instead; please see Bug#91793.
You can either upgrade to 8.0.14 or change group_replication_exit_state_action to READ_ONLY.
https://dev.mysql.com/doc/refman/8.0/en/group-replication-options.html#sysvar_group_replic...
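The variable can also be changed at runtime, without editing my.cnf or restarting; a sketch:

```sql
-- Change the exit-state action and persist it across restarts (MySQL 8.0);
-- ABORT_SERVER is the 8.0.13 default being overridden here.
SET PERSIST group_replication_exit_state_action = 'READ_ONLY';

-- Verify the new value:
SELECT @@group_replication_exit_state_action;
```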

Second, when a server faces a network partition, reconnecting may not be enough: during the period it was disconnected, too much data may have gone through the communication layer, making it impossible for the disconnected member to catch up on all that traffic. In those cases the server reconnects, realizes that it cannot be brought up to date, and leaves the group. That is the scenario in your logs.
You can increase the tolerated period by adjusting
https://dev.mysql.com/doc/refman/8.0/en/group-replication-options.html#sysvar_group_replic...
In future releases we will introduce a new approach to tackle this.
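A sketch of the kind of adjustment described above, assuming (since the link is truncated) that the variable in question is group_replication_member_expel_timeout, introduced in 8.0.13, together with a query to check member state afterwards:

```sql
-- Assumption: the truncated URL refers to the member expel timeout.
-- Give a suspected-unreachable member an extra 60 seconds before expulsion.
SET GLOBAL group_replication_member_expel_timeout = 60;

-- Check each member's state (ONLINE, UNREACHABLE, ERROR, ...):
SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE
  FROM performance_schema.replication_group_members;
```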

Since I have given you two solutions for your situation, I'm closing this bug. If you have any questions, please reopen it and ask them.

Best regards,
Nuno Carvalho