Description:
In a 3 nodes cluster A, B, C there is a node "C" that left the cluster, and when rejoining it fails, and after that attempts to STOP GROUP_REPLICATION or START GROUP_REPLICATION will fail with a message like:
[ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member is already leaving or joining a group.'
A pmp shows there are a few XCom threads dangling on the joiner node "C" (in particular the one with ssl23_connect):
1 read(libpthread.so.0),sock_read(libcrypto.so),BIO_read(libcrypto.so),ssl23_read_bytes(libssl.so),ssl23_connect(libssl.so),Xcom_network_provider::open_connection(xcom_network_provider.cc:312),Network_provider_manager::open_xcom_connection(new_allocator.h:82),dial(xcom_transport.cc:797),sender_task(xcom_transport.cc:797),task_loop(task.cc:1158),xcom_taskmain2(xcom_base.cc:1536),xcom_taskmain_startup(xcom_base.cc:1536),pfs_spawn_thread(pfs.cc:2947),start_thread(libpthread.so.0),clone(libc.so.6)
1 pthread_join(libpthread.so.0),Gcs_xcom_interface::finalize(gcs_xcom_interface.cc:663),Gcs_operations::finalize(gcs_operations.cc:110),leave_group(plugin.cc:1167),leave_group_and_terminate_plugin_modules(plugin.cc:1096),initialize_plugin_and_join(plugin.cc:844),plugin_group_replication_start(plugin.cc:664),group_replication_start(rpl_group_replication.cc:199),mysql_execute_command(sql_parse.cc:3520),dispatch_sql_command(sql_parse.cc:5427),dispatch_command(sql_parse.cc:2015),do_command(sql_parse.cc:1405),handle_connection(connection_handler_per_thread.cc:308),pfs_spawn_thread(pfs.cc:2947),start_thread(libpthread.so.0),clone(libc.so.6)
1 pthread_join(libpthread.so.0),Gcs_xcom_control::do_leave(gcs_xcom_control_interface.cc:678),Gcs_xcom_interface::finalize_xcom(gcs_xcom_interface.cc:628),cleanup_xcom(gcs_xcom_interface.cc:605),do_execute(gcs_xcom_notification.cc:51),Parameterized_notification<true>::operator()(gcs_xcom_notification.cc:51),Gcs_xcom_engine::process(gcs_xcom_notification.cc:221),process_notification_thread(gcs_xcom_notification.cc:158),pfs_spawn_thread(pfs.cc:2947),start_thread(libpthread.so.0),clone(libc.so.6)
1 pthread_cond_timedwait,Gcs_suspicions_manager::process_suspicions(gcs_xcom_control_interface.cc:2086),suspicions_processing_thread(gcs_xcom_control_interface.cc:72),pfs_spawn_thread(pfs.cc:2947),start_thread(libpthread.so.0),clone(libc.so.6)
1 accept(libpthread.so.0),xcom_tcp_server_startup(xcom_network_provider.cc:65),execute_native_thread_routine,start_thread(libpthread.so.0),clone(libc.so.6)
And we see similar behavior on only one of the other nodes ("B") which is a SECONDARY at the time (and here we see the counterpart of ssl23_connect: ssl23_accept):
1 read(libpthread.so.0),sock_read(libcrypto.so),BIO_read(libcrypto.so),ssl23_read_bytes(libssl.so),ssl23_get_client_hello(libssl.so),ssl23_accept(libssl.so),xcom_tcp_server_startup(xcom_network_provider.cc:114),execute_native_thread_routine,start_thread(libpthread.so.0),clone(libc.so.6)
1 pthread_cond_wait,Gcs_xcom_engine::process(gcs_xcom_notification.cc:212),process_notification_thread(gcs_xcom_notification.cc:158),pfs_spawn_thread(pfs.cc:2947),start_thread(libpthread.so.0),clone(libc.so.6)
1 pthread_cond_timedwait,Gcs_suspicions_manager::process_suspicions(gcs_xcom_control_interface.cc:2086),suspicions_processing_thread(gcs_xcom_control_interface.cc:72),pfs_spawn_thread(pfs.cc:2947),start_thread(libpthread.so.0),clone(libc.so.6)
1 poll(libc.so.6),poll(poll2.h:46),poll_wait(poll2.h:46),task_loop(task.cc:1196),xcom_taskmain2(xcom_base.cc:1536),xcom_taskmain_startup(xcom_base.cc:1536),pfs_spawn_thread(pfs.cc:2947),start_thread(libpthread.so.0),clone(libc.so.6)
Node "A" does not show the above behavior, only "B".
We assume these (xcom_tcp_server_startup and Xcom_network_provider::open_connection) relied on timeouts down the call chain (i.e. in libss/libcrypto in this case), but as seen in the stack traces these are not guaranteed and MySQL should clean up on it's own to guarantee consistent behavior.
How to repeat:
The issue does not repeat every time a node re-joins the cluster, and conditions for it to repeat are not clear, but what is certain and has repeated multiple times in at least 6 different clusters, and the threads remain hanging there forever.
Suggested fix:
Add timeouts to XCom side of the connection stack; I would assume at/or near xcom_tcp_server_startup and Xcom_network_provider::open_connection.
Those two should give up and clear everything after X seconds, so the connection can be retried later).