Bug #112967 Crash when a member is immediately reconnected after disconnection
Submitted: 6 Nov 2023 1:07
Modified: 6 Dec 2023 16:19
Reporter: mr xiao
Status: No Feedback
Category: MySQL Server: Group Replication
Severity: S2 (Serious)
Version: 8.0.28
OS: Debian (Debian GNU/Linux 10 (buster))
Assigned to: MySQL Verification Team
CPU Architecture: x86
Tags: crash

[6 Nov 2023 1:07] mr xiao
Description:
Hi guys,

I have three servers in an InnoDB Cluster / Group Replication setup deployed in Kubernetes. When one of them becomes unreachable because of a network problem and then recovers immediately, the other two crash.

Our investigation of the network problem found that the node hosting the service at that time had lost its associated network cards.

As far as I can tell, this is now the second time this happened.

Below are the crash logs and some related configuration:

```
2023-10-31T17:27:35.726203-00:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address mgr-1.mgr.mysql.svc:3306 has become unreachable.'
2023-10-31T17:27:36.522998-00:00 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Updating physical connections to other servers'
2023-10-31T17:27:36.523034-00:00 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Using existing server node 0 host mgr-2.mgr.mysql.svc:33061'
2023-10-31T17:27:36.523043-00:00 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Using existing server node 1 host mgr-1.mgr.mysql.svc:33061'
2023-10-31T17:27:36.523049-00:00 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Using existing server node 2 host mgr-0.mgr.mysql.svc:33061'
2023-10-31T17:27:36.523057-00:00 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Sucessfully installed new site definition. Start synode for this configuration is {4317e324 1801217 0}, boot key synode is {4317e324 1801206 0}, configured event horizon=10, my node identifier is 2'
2023-10-31T17:27:36.531824-00:00 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Updating physical connections to other servers'
2023-10-31T17:27:36.531855-00:00 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Using existing server node 0 host mgr-2.mgr.mysql.svc:33061'
2023-10-31T17:27:36.531863-00:00 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Using existing server node 1 host mgr-0.mgr.mysql.svc:33061'
2023-10-31T17:27:36.531871-00:00 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Sucessfully installed new site definition. Start synode for this configuration is {4317e324 1801218 0}, boot key synode is {4317e324 1801207 0}, configured event horizon=10, my node identifier is 1'
2023-10-31T17:27:36.539301-00:00 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Updating physical connections to other servers'
2023-10-31T17:27:36.539339-00:00 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Using existing server node 0 host mgr-2.mgr.mysql.svc:33061'
2023-10-31T17:27:36.539351-00:00 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Using existing server node 1 host mgr-0.mgr.mysql.svc:33061'
2023-10-31T17:27:36.539359-00:00 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Sucessfully installed new site definition. Start synode for this configuration is {4317e324 1801219 0}, boot key synode is {4317e324 1801208 0}, configured event horizon=10, my node identifier is 1'
2023-10-31T17:27:37.378395-00:00 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Group is able to support up to communication protocol version 8.0.27'
2023-10-31T17:27:37.378456-00:00 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: mgr-1.mgr.mysql.svc:3306'
2023-10-31T17:27:37.378509-00:00 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to mgr-2.mgr.mysql.svc:3306, mgr-0.mgr.mysql.svc:3306 on view 16986378538917379:4.'
2023-10-31T17:27:37.582629-00:00 0 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Shutting down an outgoing connection. This happens because something might be wrong on a bi-directional connection to node mgr-1.mgr.mysql.svc:33061. Please check the connection status to this member'
2023-10-31T17:27:37.582758-00:00 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Failure reading from fd=-1 n=18446744073709551615 from mgr-1.mgr.mysql.svc:33061'
2023-10-31T17:27:38.208665-00:00 227915 [Note] [MY-010914] [Server] Got an error reading communication packets
2023-10-31T17:27:38.306312-00:00 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Not able to decrement number of packets in transit. Non-existing node from incoming packet.'
17:27:38 UTC - mysqld got signal 11 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x100000
/usr/sbin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x3d) [0x55e64018f57d]
/usr/sbin/mysqld(print_fatal_signal(int)+0x303) [0x55e63f20c453]
/usr/sbin/mysqld(handle_fatal_signal+0x65) [0x55e63f20c4c5]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12730) [0x7f56840c3730]
/usr/lib/mysql/plugin/group_replication.so(Gcs_xcom_communication_protocol_changer::decrement_nr_packets_in_transit(Gcs_packet const&, Gcs_xcom_nodes const&)+0x84) [0x7f566d07bba4]
/usr/lib/mysql/plugin/group_replication.so(Gcs_xcom_communication::process_user_data_packet(Gcs_packet&&, std::unique_ptr<Gcs_xcom_nodes, std::default_delete<Gcs_xcom_nodes> >&&)+0x25) [0x7f566d043085]
/usr/lib/mysql/plugin/group_replication.so(do_cb_xcom_receive_data(synode_no, synode_no, Gcs_xcom_nodes*, synode_no, unsigned int, char*)+0x9d7) [0x7f566cff9307]
/usr/lib/mysql/plugin/group_replication.so(Data_notification::do_execute()+0x34) [0x7f566cffb3e4]
/usr/lib/mysql/plugin/group_replication.so(Parameterized_notification<false>::operator()()+0xa) [0x7f566cffb50a]
/usr/lib/mysql/plugin/group_replication.so(Gcs_xcom_engine::process()+0xa6) [0x7f566cffba76]
/usr/lib/mysql/plugin/group_replication.so(process_notification_thread(void*)+0x9) [0x7f566cffbcc9]
/usr/sbin/mysqld(+0x2688f34) [0x55e6406f2f34]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f56840b8fa3]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f5683895eff]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
```
Maybe relevant, but conservatively boring config:
```
[mysqld]
# Group Replication
group_replication_start_on_boot: "off"  
group_replication_bootstrap_group: "off"
group_replication_exit_state_action: "READ_ONLY"
group_replication_flow_control_mode: "DISABLED"
group_replication_single_primary_mode: "ON"
group_replication_consistency: "BEFORE_ON_PRIMARY_FAILOVER"
group_replication_transaction_size_limit: "150000000"
group_replication_autorejoin_tries: 2000
group_replication_member_expel_timeout: 0
group_replication_unreachable_majority_timeout: 0
group_replication_clone_threshold: 9223372036854775807
group_replication_communication_max_message_size: 10485760
group_replication_enforce_update_everywhere_checks: 0 
group_replication_message_cache_size: 536870912
group_replication_gtid_assignment_block_size: 1000000
group_replication_recovery_get_public_key: 1
group_replication_recovery_reconnect_interval: 1
group_replication_recovery_retry_count: 2000
group_replication_poll_spin_loops: 2000
group_replication_recovery_complete_at: TRANSACTIONS_CERTIFIED
group_replication_member_weight: 80
group_replication_paxos_single_leader: 1
```

How to repeat:
I haven't reproduced it yet, but this has happened twice in our production environment.

It may be reproducible by quickly replacing the network adapter, or by otherwise causing a brief network failure, while Group Replication is running normally.

Suggested fix:
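Handle the case reported in the log just before the crash ("Non-existing node from incoming packet") so that a packet from a node that is no longer in the group does not crash mysqld.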
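
To illustrate the kind of guard meant here, below is a minimal sketch. It is not the MySQL source: `Membership`, `Member_id`, and `In_transit_counter` are hypothetical stand-ins for whatever `decrement_nr_packets_in_transit` actually works with, and it assumes (unconfirmed) that the signal 11 comes from using the result of a failed origin-node lookup. The idea is only that a packet whose origin is not in the current view is skipped rather than dereferenced.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a group member; not the real GCS type.
struct Member_id {
  std::string address;
};

// Hypothetical stand-in for the set of nodes in the current view.
class Membership {
 public:
  void add(std::uint32_t node_no, std::string address) {
    nodes_[node_no] = Member_id{std::move(address)};
  }
  void remove(std::uint32_t node_no) { nodes_.erase(node_no); }

  // Returns nullptr when the node is not part of this view.
  const Member_id *find(std::uint32_t node_no) const {
    auto it = nodes_.find(node_no);
    return it == nodes_.end() ? nullptr : &it->second;
  }

 private:
  std::unordered_map<std::uint32_t, Member_id> nodes_;
};

// Hypothetical counter of packets this member has sent and not yet seen back.
class In_transit_counter {
 public:
  explicit In_transit_counter(std::string my_address)
      : my_address_(std::move(my_address)) {}

  void increment() { ++in_transit_; }
  std::uint64_t in_transit() const { return in_transit_; }

  // Returns false, leaving the counter untouched, when the packet's origin
  // is not in the view. The caller can log a warning and carry on.
  bool decrement_for(std::uint32_t origin_node_no, const Membership &view) {
    const Member_id *origin = view.find(origin_node_no);
    if (origin == nullptr) return false;  // unknown origin: do not dereference

    // Only packets that this member sent count against its own total.
    if (origin->address == my_address_ && in_transit_ > 0) --in_transit_;
    return true;
  }

 private:
  std::string my_address_;
  std::uint64_t in_transit_ = 0;
};
```

With a guard like this, a packet arriving from a member that was just removed from the view would produce a warning and be ignored for counting purposes, rather than taking the server down.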
[6 Nov 2023 16:19] MySQL Verification Team
Hi,

Have you had any incidents with 8.0.34? I tried multiple times on 8.0.34, removing the NIC from the server (it is a VM, so enabling and disabling the NIC is easy), and I cannot reproduce the problem.
[7 Dec 2023 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".