Bug #115045 Async connection failover crash on group_replication member
Submitted: 17 May 6:40 Modified: 22 May 18:06
Reporter: phoenix Zhang (OCA)
Status: Verified
Category: MySQL Server: Group Replication    Severity: S6 (Debug Builds)
Version: 8.0.32    OS: Any
Assigned to:    CPU Architecture: Any
Tags: debug

[17 May 6:40] phoenix Zhang
Description:
This happens on DEBUG builds.

For asynchronous connection failover, we have mgr-A as the source cluster and mgr-B as the replica cluster. We inject kill -19 on the primary node of mgr-B and, after a while, send kill -18 to that node. The node may then crash.

How to repeat:
First, apply the following diff to the test configuration:
diff --git a/mysql-test/suite/group_replication/t/gr_single_primary_stop.cnf b/mysql-test/suite/group_replication/t/gr_single_primary_stop.cnf
index fdfa2d0b7ba..455d89be527 100644
--- a/mysql-test/suite/group_replication/t/gr_single_primary_stop.cnf
+++ b/mysql-test/suite/group_replication/t/gr_single_primary_stop.cnf
@@ -1,12 +1,60 @@
 !include ../my.cnf
 
 [mysqld.1]
+group_replication_group_name='aabbccdd-aabb-aabb-aabb-aabbccddeeff'
+group_replication_group_seeds='127.0.0.1:33061,127.0.0.1:33062,127.0.0.1:33063'
+group_replication_local_address='127.0.0.1:33061'
+group_replication_start_on_boot=OFF
+group_replication_enforce_update_everywhere_checks=OFF
+group_replication_single_primary_mode=ON
 
 [mysqld.2]
+group_replication_group_name='aabbccdd-aabb-aabb-aabb-aabbccddeeff'
+group_replication_group_seeds='127.0.0.1:33061,127.0.0.1:33062,127.0.0.1:33063'
+group_replication_local_address='127.0.0.1:33062'
+group_replication_start_on_boot=OFF
+group_replication_enforce_update_everywhere_checks=OFF
+group_replication_single_primary_mode=ON
 
 [mysqld.3]
+group_replication_group_name='aabbccdd-aabb-aabb-aabb-aabbccddeeff'
+group_replication_group_seeds='127.0.0.1:33061,127.0.0.1:33062,127.0.0.1:33063'
+group_replication_local_address='127.0.0.1:33063'
+group_replication_start_on_boot=OFF
+group_replication_enforce_update_everywhere_checks=OFF
+group_replication_single_primary_mode=ON
+
+[mysqld.4]
+group_replication_group_name='aabbccdd-aabb-aabb-aabb-aabbccddee11'
+group_replication_group_seeds='127.0.0.1:33064,127.0.0.1:33065,127.0.0.1:33066'
+group_replication_local_address='127.0.0.1:33064'
+group_replication_start_on_boot=OFF
+group_replication_enforce_update_everywhere_checks=OFF
+group_replication_single_primary_mode=ON
+
+[mysqld.5]
+group_replication_group_name='aabbccdd-aabb-aabb-aabb-aabbccddee11'
+group_replication_group_seeds='127.0.0.1:33064,127.0.0.1:33065,127.0.0.1:33066'
+group_replication_local_address='127.0.0.1:33065'
+group_replication_start_on_boot=OFF
+group_replication_enforce_update_everywhere_checks=OFF
+group_replication_single_primary_mode=ON
+
+[mysqld.6]
+group_replication_group_name='aabbccdd-aabb-aabb-aabb-aabbccddee11'
+group_replication_group_seeds='127.0.0.1:33064,127.0.0.1:33065,127.0.0.1:33066'
+group_replication_local_address='127.0.0.1:33066'
+group_replication_start_on_boot=OFF
+group_replication_enforce_update_everywhere_checks=OFF
+group_replication_single_primary_mode=ON
 
 [ENV]
 SERVER_MYPORT_3=               @mysqld.3.port
 SERVER_MYSOCK_3=               @mysqld.3.socket
+SERVER_MYPORT_4=               @mysqld.4.port
+SERVER_MYSOCK_4=               @mysqld.4.socket
+SERVER_MYPORT_5=               @mysqld.5.port
+SERVER_MYSOCK_5=               @mysqld.5.socket
+SERVER_MYPORT_6=               @mysqld.6.port
+SERVER_MYSOCK_6=               @mysqld.6.socket

Then, use mtr to start the six servers:
$ mysql-test/mtr gr_single_primary_stop --start

Then, use the script below to initialize asynchronous connection failover:
$ cat init.sh 
#!/bin/sh

# init mgr-A
mysql -uroot -P13000 -h127.0.0.1 -e "set global group_replication_bootstrap_group=on;CHANGE MASTER TO MASTER_USER='root'  FOR CHANNEL 'group_replication_recovery';start group_replication;set global group_replication_bootstrap_group=off;"
sleep 5
mysql -uroot -P13002 -h127.0.0.1 -e "CHANGE MASTER TO MASTER_USER='root'  FOR CHANNEL 'group_replication_recovery';start group_replication;"
mysql -uroot -P13004 -h127.0.0.1 -e "CHANGE MASTER TO MASTER_USER='root'  FOR CHANNEL 'group_replication_recovery';start group_replication;"
sleep 5;
mysql -uroot -P13000 -h127.0.0.1 -e "SELECT * FROM performance_schema.replication_group_members;"

# init mgr-B
mysql -uroot -P13006 -h127.0.0.1 -e "set global group_replication_bootstrap_group=on;CHANGE MASTER TO MASTER_USER='root'  FOR CHANNEL 'group_replication_recovery';start group_replication;set global group_replication_bootstrap_group=off;"
sleep 5
mysql -uroot -P13008 -h127.0.0.1 -e "CHANGE MASTER TO MASTER_USER='root'  FOR CHANNEL 'group_replication_recovery';start group_replication;"
mysql -uroot -P13010 -h127.0.0.1 -e "CHANGE MASTER TO MASTER_USER='root'  FOR CHANNEL 'group_replication_recovery';start group_replication;"
sleep 5;
mysql -uroot -P13006 -h127.0.0.1 -e "SELECT * FROM performance_schema.replication_group_members;"

# init master-slave channel
mysql -uroot -P13006 -h127.0.0.1 -e "CHANGE MASTER TO MASTER_USER='root', MASTER_HOST='127.0.0.1', MASTER_PORT=13000, MASTER_RETRY_COUNT=2, MASTER_AUTO_POSITION=1 FOR CHANNEL 'ch1';"
mysql -uroot -P13008 -h127.0.0.1 -e "CHANGE MASTER TO MASTER_USER='root', MASTER_HOST='127.0.0.1', MASTER_PORT=13000, MASTER_RETRY_COUNT=2, MASTER_AUTO_POSITION=1 FOR CHANNEL 'ch1';"
mysql -uroot -P13010 -h127.0.0.1 -e "CHANGE MASTER TO MASTER_USER='root', MASTER_HOST='127.0.0.1', MASTER_PORT=13000, MASTER_RETRY_COUNT=2, MASTER_AUTO_POSITION=1 FOR CHANNEL 'ch1';"
mysql -uroot -P13006 -h127.0.0.1 -e "CHANGE MASTER TO SOURCE_CONNECTION_AUTO_FAILOVER=1 FOR CHANNEL 'ch1';"
mysql -uroot -P13006 -h127.0.0.1 -e "SELECT asynchronous_connection_failover_add_managed('ch1', 'GroupReplication', 'aabbccdd-aabb-aabb-aabb-aabbccddeeff', '127.0.0.1', 13000, '', 80, 60);"
mysql -uroot -P13006 -h127.0.0.1 -e "START SLAVE;"

Run the script:
$ bash init.sh 
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+----------------------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION | MEMBER_COMMUNICATION_STACK |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+----------------------------+
| group_replication_applier | 0e3b8e8f-1416-11ef-a24e-d08e7908bddb | 127.0.0.1   |       13000 | ONLINE       | PRIMARY     | 8.0.32         | XCom                       |
| group_replication_applier | 0e41c49e-1416-11ef-8df8-d08e7908bddb | 127.0.0.1   |       13002 | ONLINE       | SECONDARY   | 8.0.32         | XCom                       |
| group_replication_applier | 0e4a272e-1416-11ef-87c3-d08e7908bddb | 127.0.0.1   |       13004 | ONLINE       | SECONDARY   | 8.0.32         | XCom                       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+----------------------------+
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+----------------------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION | MEMBER_COMMUNICATION_STACK |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+----------------------------+
| group_replication_applier | 0e534cb1-1416-11ef-bebd-d08e7908bddb | 127.0.0.1   |       13006 | ONLINE       | PRIMARY     | 8.0.32         | XCom                       |
| group_replication_applier | 0e598fe8-1416-11ef-a00a-d08e7908bddb | 127.0.0.1   |       13008 | ONLINE       | SECONDARY   | 8.0.32         | XCom                       |
| group_replication_applier | 0e6cbd66-1416-11ef-9667-d08e7908bddb | 127.0.0.1   |       13010 | ONLINE       | SECONDARY   | 8.0.32         | XCom                       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+----------------------------+
+-------------------------------------------------------------------------------------------------------------------------------------------------+
| asynchronous_connection_failover_add_managed('ch1', 'GroupReplication', 'aabbccdd-aabb-aabb-aabb-aabbccddeeff', '127.0.0.1', 13000, '', 80, 60) |
+-------------------------------------------------------------------------------------------------------------------------------------------------+
| The UDF asynchronous_connection_failover_add_managed() executed successfully.                                                                   |
+-------------------------------------------------------------------------------------------------------------------------------------------------+

Now check 13006: it has started the channel ch1.

Separately, use sysbench against the primary node of mgr-A to prepare the tables:
$ sysbench ./share/sysbench/oltp_write_only.lua --mysql-db=test --mysql-host=127.0.0.1 --mysql-port=13000 --mysql-user=root --mysql_storage_engine=innodb --tables=10 --table-size=10000 --report-interval=2 --threads=10 --time=0 --db-driver=mysql --skip_trx=false prepare

After 13006 has caught up, start the workload:
$ sysbench ./share/sysbench/oltp_write_only.lua --mysql-db=test --mysql-host=127.0.0.1 --mysql-port=13000 --mysql-user=root --mysql_storage_engine=innodb --tables=10 --table-size=10000 --report-interval=2 --threads=10 --time=0 --db-driver=mysql --skip_trx=false run

A little later, inject the signals on 13006's mysqld process:
$ kill -19 1293527
$ sleep 60
$ kill -18 1293527

After a while, 13006 crashes; the error log shows the following stack:
/home/zwf/gitlab/percona-server/DEBUG/runtime_output_directory/mysqld: debugger aborting because missing DBUG_RETURN or DBUG_VOID_RETURN macro in function "?func"

2024-05-17T06:29:02Z UTC - mysqld got signal 6 ; 
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
BuildID[sha1]=050458640a057aedd398289b3238493f6e2e287f
Thread pointer: 0x0 
Attempting backtrace. You can use the following information to find out 
where mysqld died. If you see no messages after this, something went
terribly wrong...
2024-05-17T06:29:02.382207Z 129 [System] [MY-013373] [Repl] Plugin group_replication reported: 'Started auto-rejoin procedure attempt 1 of 3'
stack_bottom = 0 thread_stack 0x100000
/home/zwf/gitlab/percona-server/DEBUG/runtime_output_directory/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x59) [0x55ac7b0b0a31]
/home/zwf/gitlab/percona-server/DEBUG/runtime_output_directory/mysqld(print_fatal_signal(int)+0x394) [0x55ac79af9c5b]
/home/zwf/gitlab/percona-server/DEBUG/runtime_output_directory/mysqld(my_server_abort()+0x78) [0x55ac79af9f11]
/home/zwf/gitlab/percona-server/DEBUG/runtime_output_directory/mysqld(my_abort()+0x11) [0x55ac7b0a6900]
/home/zwf/gitlab/percona-server/DEBUG/runtime_output_directory/mysqld(+0x4f1ad92) [0x55ac7b084d92]
/home/zwf/gitlab/percona-server/DEBUG/runtime_output_directory/mysqld(_db_return_(unsigned int, _db_stack_frame_*)+0xbd) [0x55ac7b0835c0]
/home/zwf/gitlab/percona-server/DEBUG/runtime_output_directory/mysqld(AutoDebugTrace::~AutoDebugTrace()+0x21) [0x55ac796d37fd]
/home/zwf/gitlab/percona-server/DEBUG/plugin_output_directory/group_replication.so(leave_group_on_failure::leave(std::bitset<7ul> const&, long long, Notification_context*, char const*)+0xb7e) [0x7f4dc84e9304]
/home/zwf/gitlab/percona-server/DEBUG/plugin_output_directory/group_replication.so(Plugin_gcs_events_handler::was_member_expelled_from_group(Gcs_view const&) const+0x109) [0x7f4dc84b4f5f]
/home/zwf/gitlab/percona-server/DEBUG/plugin_output_directory/group_replication.so(Plugin_gcs_events_handler::on_view_changed(Gcs_view const&, std::vector<std::pair<Gcs_member_identifier*, Gcs_message_data*>, std::allocator<std::pair<Gcs_member_identifier*, Gcs_message_data*> > > const&) const+0x181) [0x7f4dc84b42c3]
/home/zwf/gitlab/percona-server/DEBUG/plugin_output_directory/group_replication.so(Gcs_xcom_control::install_view(Gcs_xcom_view_identifier*, Gcs_group_identifier const&, std::map<Gcs_member_identifier, Xcom_member_state*, std::less<Gcs_member_identifier>, std::allocator<std::pair<Gcs_member_identifier const, Xcom_member_state*> > >*, std::set<Gcs_member_identifier*, std::less<Gcs_member_identifier*>, std::allocator<Gcs_member_identifier*> >*, std::set<Gcs_member_identifier*, std::less<Gcs_member_identifier
/home/zwf/gitlab/percona-server/DEBUG/plugin_output_directory/group_replication.so(Gcs_xcom_control::install_leave_view(Gcs_view::Gcs_view_error_code)+0x31b) [0x7f4dc86520c3]
/home/zwf/gitlab/percona-server/DEBUG/plugin_output_directory/group_replication.so(Gcs_xcom_control::do_leave_view()+0xc7) [0x7f4dc864f61b]
/home/zwf/gitlab/percona-server/DEBUG/plugin_output_directory/group_replication.so(Gcs_xcom_interface::make_gcs_leave_group_on_error()+0xc9) [0x7f4dc85be5af]
/home/zwf/gitlab/percona-server/DEBUG/plugin_output_directory/group_replication.so(do_cb_xcom_expel()+0x28) [0x7f4dc85c433a]
/home/zwf/gitlab/percona-server/DEBUG/plugin_output_directory/group_replication.so(Expel_notification::do_execute()+0x1a) [0x7f4dc85d3480]
/home/zwf/gitlab/percona-server/DEBUG/plugin_output_directory/group_replication.so(Parameterized_notification<false>::operator()()+0x27) [0x7f4dc85d55d1]
/home/zwf/gitlab/percona-server/DEBUG/plugin_output_directory/group_replication.so(Gcs_xcom_engine::process()+0xe7) [0x7f4dc85d3b47]
/home/zwf/gitlab/percona-server/DEBUG/plugin_output_directory/group_replication.so(process_notification_thread(void*)+0x24) [0x7f4dc85d367b]
/home/zwf/gitlab/percona-server/DEBUG/runtime_output_directory/mysqld(+0x5c604c2) [0x55ac7bdca4c2]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f4dda2d7609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f4dda1fc353]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Writing a core file
safe_process[1293526]: Child process: 1293527, killed by signal: 6
[17 May 6:44] phoenix Zhang
The reason is that when 13006 is woken by kill -18, it finds itself expelled from mgr-B, so it tries to leave the group through leave_group_on_failure::leave.
Since channel ch1 still exists on 13006, this also terminates the replica channel, through Replication_thread_api::rpl_channel_stop_all -> channel_stop.
Inside channel_stop, thread init/end operations are performed, which may rewrite the per-thread DBUG trace state.
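The reported mechanism can be illustrated with a hypothetical miniature of the DBUG frame stack (an illustration of the pairing rule only, not the real dbug.c): DBUG_ENTER pushes a per-thread frame and DBUG_RETURN pops it after checking it is still the top frame, so a nested reset of the trace state while frames are live is detected as a "missing DBUG_RETURN":

```c
#include <stddef.h>

/* Hypothetical miniature of the DBUG per-thread frame stack. If the
 * per-thread trace state is re-initialized (e.g. by a nested thread
 * init/end during channel_stop) while frames are still open, the next
 * pop no longer matches, and a debug build aborts with
 * "missing DBUG_RETURN or DBUG_VOID_RETURN macro in function \"?func\"". */

struct dbug_frame { const char *func; struct dbug_frame *prev; };

static struct dbug_frame *dbug_top;  /* simplified: one thread's stack top */

static void dbug_enter(struct dbug_frame *f, const char *func) {
  f->func = func;
  f->prev = dbug_top;
  dbug_top = f;
}

/* 0 on a matched pop; -1 when the stack was clobbered (debug would abort) */
static int dbug_return(struct dbug_frame *f) {
  if (dbug_top != f) return -1;
  dbug_top = f->prev;
  return 0;
}

/* models a thread-end style reset of the per-thread trace state */
static void dbug_thread_reset(void) { dbug_top = NULL; }

/* replays the failure: a nested reset while two frames are still open */
int clobber_demo(void) {
  struct dbug_frame outer, inner;
  dbug_enter(&outer, "leave_group_on_failure::leave");
  dbug_enter(&inner, "channel_stop");
  dbug_thread_reset();            /* thread init/end inside channel_stop */
  return dbug_return(&inner);     /* mismatch: the "missing DBUG_RETURN" case */
}
```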
[22 May 18:06] MySQL Verification Team
Hi,
Thanks for the test case. I managed to reproduce the problem. Verified.