Bug #109668 XCom connection stalled forever in read() syscall over network
Submitted: 17 Jan 21:56 Modified: 24 Feb 10:44
Reporter: Marcos Albe (OCA) Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Server: Group Replication Severity:S2 (Serious)
Version:8.0.27 OS:CentOS
Assigned to: MySQL Verification Team CPU Architecture:x86

[17 Jan 21:56] Marcos Albe
Description:
In a 3 nodes cluster A, B, C  there is a node "C" that left the cluster, and when rejoining it fails, and after that attempts to STOP GROUP_REPLICATION or START GROUP_REPLICATION will fail with a message like:

[ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member is already leaving or joining a group.'

A pmp shows there are a few XCom threads dangling on the joiner node "C" (in particular the one with ssl23_connect):

      1 read(libpthread.so.0),sock_read(libcrypto.so),BIO_read(libcrypto.so),ssl23_read_bytes(libssl.so),ssl23_connect(libssl.so),Xcom_network_provider::open_connection(xcom_network_provider.cc:312),Network_provider_manager::open_xcom_connection(new_allocator.h:82),dial(xcom_transport.cc:797),sender_task(xcom_transport.cc:797),task_loop(task.cc:1158),xcom_taskmain2(xcom_base.cc:1536),xcom_taskmain_startup(xcom_base.cc:1536),pfs_spawn_thread(pfs.cc:2947),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_join(libpthread.so.0),Gcs_xcom_interface::finalize(gcs_xcom_interface.cc:663),Gcs_operations::finalize(gcs_operations.cc:110),leave_group(plugin.cc:1167),leave_group_and_terminate_plugin_modules(plugin.cc:1096),initialize_plugin_and_join(plugin.cc:844),plugin_group_replication_start(plugin.cc:664),group_replication_start(rpl_group_replication.cc:199),mysql_execute_command(sql_parse.cc:3520),dispatch_sql_command(sql_parse.cc:5427),dispatch_command(sql_parse.cc:2015),do_command(sql_parse.cc:1405),handle_connection(connection_handler_per_thread.cc:308),pfs_spawn_thread(pfs.cc:2947),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_join(libpthread.so.0),Gcs_xcom_control::do_leave(gcs_xcom_control_interface.cc:678),Gcs_xcom_interface::finalize_xcom(gcs_xcom_interface.cc:628),cleanup_xcom(gcs_xcom_interface.cc:605),do_execute(gcs_xcom_notification.cc:51),Parameterized_notification<true>::operator()(gcs_xcom_notification.cc:51),Gcs_xcom_engine::process(gcs_xcom_notification.cc:221),process_notification_thread(gcs_xcom_notification.cc:158),pfs_spawn_thread(pfs.cc:2947),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,Gcs_suspicions_manager::process_suspicions(gcs_xcom_control_interface.cc:2086),suspicions_processing_thread(gcs_xcom_control_interface.cc:72),pfs_spawn_thread(pfs.cc:2947),start_thread(libpthread.so.0),clone(libc.so.6)
      1 accept(libpthread.so.0),xcom_tcp_server_startup(xcom_network_provider.cc:65),execute_native_thread_routine,start_thread(libpthread.so.0),clone(libc.so.6)

And we see similar behavior on only one of the other nodes ("B") which is a SECONDARY at the time (and here we see the counterpart of ssl23_connect: ssl23_accept):

      1 read(libpthread.so.0),sock_read(libcrypto.so),BIO_read(libcrypto.so),ssl23_read_bytes(libssl.so),ssl23_get_client_hello(libssl.so),ssl23_accept(libssl.so),xcom_tcp_server_startup(xcom_network_provider.cc:114),execute_native_thread_routine,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,Gcs_xcom_engine::process(gcs_xcom_notification.cc:212),process_notification_thread(gcs_xcom_notification.cc:158),pfs_spawn_thread(pfs.cc:2947),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,Gcs_suspicions_manager::process_suspicions(gcs_xcom_control_interface.cc:2086),suspicions_processing_thread(gcs_xcom_control_interface.cc:72),pfs_spawn_thread(pfs.cc:2947),start_thread(libpthread.so.0),clone(libc.so.6)
      1 poll(libc.so.6),poll(poll2.h:46),poll_wait(poll2.h:46),task_loop(task.cc:1196),xcom_taskmain2(xcom_base.cc:1536),xcom_taskmain_startup(xcom_base.cc:1536),pfs_spawn_thread(pfs.cc:2947),start_thread(libpthread.so.0),clone(libc.so.6)

Node "A" does not show the above behavior, only "B".

We assume these (xcom_tcp_server_startup and  Xcom_network_provider::open_connection) relied on timeouts down the call chain (i.e. in libss/libcrypto in this case), but as seen in the stack traces these are not guaranteed and MySQL should clean up on it's own to guarantee consistent behavior.

How to repeat:

The issue does not repeat every time a node re-joins the cluster, and conditions for it to repeat are not clear, but what is certain and has repeated multiple times in at least 6 different clusters, and the threads remain hanging there forever.

Suggested fix:
Add timeouts to XCom side of the connection stack; I would assume at/or near xcom_tcp_server_startup and  Xcom_network_provider::open_connection.

Those two  should give up and clear everything after X seconds, so the connection can be retried later).
[17 Jan 22:47] Marcos Albe
PMP from joiner (node "C")

Attachment: joiner-pmp.out (application/octet-stream, text), 20.89 KiB.

[18 Jan 2:28] Marcos Albe
PMP from secondary (node "B")

Attachment: secondary-pmp.out (application/octet-stream, text), 20.15 KiB.

[23 Jan 14:03] MySQL Verification Team
Hi,

Can you explain the "there is a node "C" that left the cluster" in more details?

What is "left the cluster" here? Network issue? Crash? Shutdown of the mysqld process? Kill -9 of mysqld process? poweroff of the server?

Any logs you can share when this happens?
[25 Feb 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".