MySQL Bugs: #109668: XCom connection stalled forever in read() syscall over network

Bug #109668	XCom connection stalled forever in read() syscall over network
Submitted:	17 Jan 2023 21:56	Modified:	12 Jun 2023 17:38
Reporter:	Marcos Albe (OCA)	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Group Replication	Severity:	S2 (Serious)
Version:	8.0.27, 8.0.33	OS:	CentOS
Assigned to:		CPU Architecture:	x86
Tags:	Contribution

Description:
In a 3 nodes cluster A, B, C  there is a node "C" that left the cluster, and when rejoining it fails, and after that attempts to STOP GROUP_REPLICATION or START GROUP_REPLICATION will fail with a message like:

[ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member is already leaving or joining a group.'

A pmp shows there are a few XCom threads dangling on the joiner node "C" (in particular the one with ssl23_connect):

      1 read(libpthread.so.0),sock_read(libcrypto.so),BIO_read(libcrypto.so),ssl23_read_bytes(libssl.so),ssl23_connect(libssl.so),Xcom_network_provider::open_connection(xcom_network_provider.cc:312),Network_provider_manager::open_xcom_connection(new_allocator.h:82),dial(xcom_transport.cc:797),sender_task(xcom_transport.cc:797),task_loop(task.cc:1158),xcom_taskmain2(xcom_base.cc:1536),xcom_taskmain_startup(xcom_base.cc:1536),pfs_spawn_thread(pfs.cc:2947),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_join(libpthread.so.0),Gcs_xcom_interface::finalize(gcs_xcom_interface.cc:663),Gcs_operations::finalize(gcs_operations.cc:110),leave_group(plugin.cc:1167),leave_group_and_terminate_plugin_modules(plugin.cc:1096),initialize_plugin_and_join(plugin.cc:844),plugin_group_replication_start(plugin.cc:664),group_replication_start(rpl_group_replication.cc:199),mysql_execute_command(sql_parse.cc:3520),dispatch_sql_command(sql_parse.cc:5427),dispatch_command(sql_parse.cc:2015),do_command(sql_parse.cc:1405),handle_connection(connection_handler_per_thread.cc:308),pfs_spawn_thread(pfs.cc:2947),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_join(libpthread.so.0),Gcs_xcom_control::do_leave(gcs_xcom_control_interface.cc:678),Gcs_xcom_interface::finalize_xcom(gcs_xcom_interface.cc:628),cleanup_xcom(gcs_xcom_interface.cc:605),do_execute(gcs_xcom_notification.cc:51),Parameterized_notification<true>::operator()(gcs_xcom_notification.cc:51),Gcs_xcom_engine::process(gcs_xcom_notification.cc:221),process_notification_thread(gcs_xcom_notification.cc:158),pfs_spawn_thread(pfs.cc:2947),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,Gcs_suspicions_manager::process_suspicions(gcs_xcom_control_interface.cc:2086),suspicions_processing_thread(gcs_xcom_control_interface.cc:72),pfs_spawn_thread(pfs.cc:2947),start_thread(libpthread.so.0),clone(libc.so.6)
      1 accept(libpthread.so.0),xcom_tcp_server_startup(xcom_network_provider.cc:65),execute_native_thread_routine,start_thread(libpthread.so.0),clone(libc.so.6)

And we see similar behavior on only one of the other nodes ("B") which is a SECONDARY at the time (and here we see the counterpart of ssl23_connect: ssl23_accept):

      1 read(libpthread.so.0),sock_read(libcrypto.so),BIO_read(libcrypto.so),ssl23_read_bytes(libssl.so),ssl23_get_client_hello(libssl.so),ssl23_accept(libssl.so),xcom_tcp_server_startup(xcom_network_provider.cc:114),execute_native_thread_routine,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,Gcs_xcom_engine::process(gcs_xcom_notification.cc:212),process_notification_thread(gcs_xcom_notification.cc:158),pfs_spawn_thread(pfs.cc:2947),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,Gcs_suspicions_manager::process_suspicions(gcs_xcom_control_interface.cc:2086),suspicions_processing_thread(gcs_xcom_control_interface.cc:72),pfs_spawn_thread(pfs.cc:2947),start_thread(libpthread.so.0),clone(libc.so.6)
      1 poll(libc.so.6),poll(poll2.h:46),poll_wait(poll2.h:46),task_loop(task.cc:1196),xcom_taskmain2(xcom_base.cc:1536),xcom_taskmain_startup(xcom_base.cc:1536),pfs_spawn_thread(pfs.cc:2947),start_thread(libpthread.so.0),clone(libc.so.6)

Node "A" does not show the above behavior, only "B".

We assume these (xcom_tcp_server_startup and  Xcom_network_provider::open_connection) relied on timeouts down the call chain (i.e. in libss/libcrypto in this case), but as seen in the stack traces these are not guaranteed and MySQL should clean up on it's own to guarantee consistent behavior.

How to repeat:

The issue does not repeat every time a node re-joins the cluster, and conditions for it to repeat are not clear, but what is certain and has repeated multiple times in at least 6 different clusters, and the threads remain hanging there forever.

Suggested fix:
Add timeouts to XCom side of the connection stack; I would assume at/or near xcom_tcp_server_startup and  Xcom_network_provider::open_connection.

Those two  should give up and clear everything after X seconds, so the connection can be retried later).

PMP from joiner (node "C")

Attachment: joiner-pmp.out (application/octet-stream, text), 20.89 KiB.

PMP from secondary (node "B")

Attachment: secondary-pmp.out (application/octet-stream, text), 20.15 KiB.

Hi,

Can you explain the "there is a node "C" that left the cluster" in more details?

What is "left the cluster" here? Network issue? Crash? Shutdown of the mysqld process? Kill -9 of mysqld process? poweroff of the server?

Any logs you can share when this happens?

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

Hi Team,

Could you please confirm if this bug #32873315 is related to this issue?

Hello! After some time the problem repeated itself on an environment where it was possible to take some time and collect data; We couldn't get GCS debug trace from the PRIMARY unfortunately, but we now seem to have much better understanding of the issue:

On the JOINER:
==============
A node (node1, long_node_name00007, a.k.a. node_name1) left due to random error (error 13 in this case):

2023-05-25T23:59:01.504869-07:00 3058257 [ERROR] [MY-010584] [Repl] Slave SQL for channel 'group_replication_applier': Worker 1 failed executing transaction '899e3c00-9110-11ed-9833-005056995ec5:28014'; Error 'File '/logs/mysql-slow.log' not found (OS errno 13 - Permission denied)' on query. Default database: ''. Query: 'flush slow logs', Error_code: MY-0000292023-05-25T23:59:01.505411-07:00 3058256 [Warning] [MY-010584] [Repl] Slave SQL for channel 'group_replication_applier': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: MY-001756
...
2023-05-26T00:15:21.032680-07:00 9887444 [ERROR] [MY-011640] [Repl] Plugin group_replication reported: 'Timeout on wait for view after joining group'
2023-05-26T00:15:21.032761-07:00 9887444 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member is already leaving or joining a group.'
 

As seen above, after the node leaves the cluster it's unable to join back. In the GCS debug output of this node we see it correctly inits SSL:

[MYSQL_GCS_DEBUG] [GCS] xcom_id 11ea4c93 Started executing during regular phase: 0x7f988d5b82d0
...
[MYSQL_GCS_DEBUG] [GCS] TLS configuration is version: 'TLSv1,TLSv1.1,TLSv1.2', ciphersuites: 'NULL'
[MYSQL_GCS_DEBUG] [GCS] Success initializing SSL
 

Then connects to itself: 

[MYSQL_GCS_DEBUG] [GCS] client connected to node_name1.somehost.somedoamain.org 33061 fd 66
[MYSQL_GCS_DEBUG] [GCS] Trying to connect using SSL.
[MYSQL_GCS_DEBUG] [GCS] Accepting socket funerr=0 shutdown_tcp_server=0
[MYSQL_GCS_DEBUG] [GCS] Verifying server certificate and expected host name: node_name1.somehost.somedoamain.org
[MYSQL_GCS_DEBUG] [GCS] Success connecting using SSL.
[MYSQL_GCS_DEBUG] [GCS] ::join():: I am NOT the boot node.
 

Then tries (and fails) to connect to the PRIMARY, which triggers Gcs_xcom_engine::finalize

[MYSQL_GCS_DEBUG] [GCS] Client local port 33061 xcom_client_open_connection to node_name3.somehost.somedoamain.org:33061
[MYSQL_GCS_DEBUG] [GCS] connecting to node_name3.somehost.somedoamain.org 33061
[MYSQL_GCS_DEBUG] [GCS] client connected to node_name3.somehost.somedoamain.org 33061 fd 66
[MYSQL_GCS_DEBUG] [GCS] Trying to connect using SSL.
[MYSQL_GCS_DEBUG] [GCS] The member is leaving the group.
[MYSQL_GCS_DEBUG] [GCS] Gcs_xcom_engine::finalize invoked!

In the PRIMARY:
================
On the node acting as PRIMARY (node3, a.k.a. long_node_name0000a, a.k.a. node_name3) we see this thread:

 

Thread 47 (Thread 0x7f775b3ff700 (LWP 12514)):
#0  0x00007f78b8d3f75d in read () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007f78b82bf86b in sock_read () from /apps/mysql/lib/libcrypto.so
No symbol table info available.
#2  0x00007f78b82bd8ab in BIO_read () from /apps/mysql/lib/libcrypto.so
No symbol table info available.
#3  0x00007f78b86688ab in ssl23_read_bytes () from /apps/mysql/lib/libssl.so
No symbol table info available.
#4  0x00007f78b8666f2a in ssl23_get_client_hello () from /apps/mysql/lib/libssl.so
No symbol table info available.
#5  0x00007f78b8667598 in ssl23_accept () from /apps/mysql/lib/libssl.so
No symbol table info available.
#6  0x00007f78a127c8ae in xcom_tcp_server_startup(Xcom_network_provider*) () at ./8.0.27-18/plugin/group_replication/libmysqlgcs/src/bindings/xcom/xcom/network/xcom_network_provider.cc:114
#7  0x0000000002811aa0 in execute_native_thread_routine ()
No symbol table info available.
#8  0x00007f78b8d38ea5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#9  0x00007f78b70a0b0d in clone () from /lib64/libc.so.6
No symbol table info available.
 

And in strace we see that 12514 is waiting on a TCP read:

12514 1685478757.865163 read(170<TCPv6:[[::ffff:10.146.18.162]:33061->[::ffff:10.13.40.111]:57876]>,  <unfinished ...>
 

This address, 10.13.40.111 does NOT belong to a GR node; Rather it's an application node, which makes us suspect there's a port scanner.

Another hint that there's a port scanner is that we see a connection to 33601 on the PRIMARY, which has bytes in the queue and never gets processed. This connection is also from an application server:

 

Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
...
tcp6     612      0 10.146.18.162:33061     10.13.40.89:44894       ESTABLISHED -  
 

As a side note, colleague Nickolay Ihalainen commented on a similar case:

MySQL sets timeout with SSL_SESSION_set_timeout(SSL_get_session(ssl), timeout); and calling SSL_accept.
set_timeout function is claimed to be not effective for SSL_accept. Could be related to: https://access.redhat.com/security/cve/cve-2016-8610 

In any case: the problem appears to be related to unexpected garbage getting to port 33061 from a port probe/scanner, which of course should be avoided and we indeed recommended setting a strict allowlist to filter undesired traffic, but still looks like MySQL could do a better job at closing dangling connections like this one.

Hope this helps!

We have the following data which needs to be sanitized, let us know what is useful and we'll clean it up for sharing:

|____node1
| |____GCS_DEBUG_TRACE.1 
| |____GCS_DEBUG_TRACE.2 
| |____gdb.bt.full.1 
| |____gdb.bt.full.2
| |____ip_addr.out
| |____lsof.out
| |____mysql-variables.out
| |____mysql.log
| |____mysql_error.1.log
| |____mysql_p_s.threads.out
| |____netstat_punta.1.out
| |____netstat_punta.2.out
| |____pem_file_md5.hashes
| |____processlist.2
| |____rpm_qa.out
| |____ss_taeioZ.out
| |____strace.mysqld.out

|____node2
| |____GCS_DEBUG_TRACE.1
| |____GCS_DEBUG_TRACE.2
| |____gdb.bt.full.1
| |____gdb.bt.full.2
| |____ip_addr.out
| |____lsof.out
| |____mysql-variables.1
| |____mysql.log
| |____mysql_error.1.log
| |____mysql_p_s.threads.out
| |____netstat_punta.1.out
| |____nohup.out
| |____pem_files_md5.hashes
| |____rpm_qa.out
| |____ss_taeioZ.out
| |____strace.mysqld.out

|____node3 (PRIMARY)
| |____dmesg.out
| |____gdb.bt.full.1
| |____ip_addr.out
| |____lsof.out
| |____mysql-variables.out
| |____mysql.log
| |____mysql_p_s.threads--os_thread_12513.out
| |____mysql_p_s.threads--thread_id_65.out
| |____mysql_p_s.threads.out
| |____netstat_punta.1.out
| |____nohup.out
| |____pem_files_md5.hashes
| |____poll.strace
| |____readlink_f__mysqld_fd_64
| |____readlink_f__mysqld_fd_66
| |____rpm_qa.out
| |____ss_taeioZ.out
| |____strace.12514.out

Files named .1 and .2 are before/after re-joining attempt. GC

Hello Marcos Albe,

Thank you, Marcos Albe for the additional details.
IMHO if possible please attach all the data(captured when issue was noticed) which would be useful since we don't have a reliable way to reproduce this issue. 

PS - If the data you need to attach is more than 3MB, you should create a compressed archive of the data and a README file that describes the data with a filename that includes the bug number (recommended filename: mysql-bug-data-109668.zip) and upload one to sftp.oracle.com. More details in the "Files" section of this bug report.

Sincerely,
Umesh

Hello Umesh, 

Gladly there's no further need for diagnostic, you can reproduce this by setting up GR with SSL:

~$ dbdeployer deploy replication --topology=group /opt/mysql/8.0.32;

And make sure each my.cnf has:

group_replication_ssl_mode = REQUIRED
ssl-ca    = ca.pem
ssl-cert  = server-cert.pem
ssl-key   = server-key.pem

Then stop one node from your group:

~/sandboxes/group_msb_8_0_32/node3/stop;

Then connect to group replication port (33061 by default; in the sandbox env the port was something else like 23358) on node1 with a simple mysql client:

mysql -h127.0.0.1 -P23358;  

Finally, in a separate terminal, try bringing back node3 to the group, and you will see it's not possible. 

~/sandboxes/group_msb_8_0_32/node3/start;
~/sandboxes/group_msb_8_0_32/node3/use -e "START GROUP_REPLICATION"; -- will hang

You can then check this other output:

pstack $(cat ~/sandboxes/group_msb_8_0_32/node1/data/*pid) |grep -iB3 -A7 BIO_read

And you will see the problem thread that blocks all GR membership activity:

[COR-SER] [marcos.albe@tp-support03 node1]$ pstack $(cat ~/sandboxes/group_msb_8_0_32/node1/data/*pid) |grep -iB3 -A7 BIO_read
Thread 55 (Thread 0x7f7bb77ff700 (LWP 2198598)):
#0  0x00007f7c050beab4 in read () from /lib64/libpthread.so.0
#1  0x00007f7c0461c86b in sock_read () from /home/opt/percona_server/8.0.32/bin/../lib/private/libcrypto.so
#2  0x00007f7c0461a8ab in BIO_read () from /home/opt/percona_server/8.0.32/bin/../lib/private/libcrypto.so
#3  0x00007f7c049c58ab in ssl23_read_bytes () from /home/opt/percona_server/8.0.32/bin/../lib/private/libssl.so
#4  0x00007f7c049c3f2a in ssl23_get_client_hello () from /home/opt/percona_server/8.0.32/bin/../lib/private/libssl.so
#5  0x00007f7c049c4598 in ssl23_accept () from /home/opt/percona_server/8.0.32/bin/../lib/private/libssl.so
#6  0x00007f7bebe6e55e in xcom_tcp_server_startup(Xcom_network_provider*) () from /opt/percona_server/8.0.32/lib/plugin/group_replication.so
#7  0x0000000002b367d4 in ?? ()
#8  0x00007f7c050b51ca in start_thread () from /lib64/libpthread.so.0
#9  0x00007f7c03231e73 in clone () from /lib64/libc.so.6

I understand that connecting to port 33061 is not expected or a good idea, but security probes/network scanners are more and more frequently hitting DB ports, and it seems to me that a timeout for that read() that lingers there forever is in order and should provide a more pleasant experience for the unwary user :)

Thanks!
Marcos

FWIW: Above test was done with Percona Server, but we verified the same with Community MySQL 8.0.32 as well.

Thank you Marcos for the details.
I'll try to reproduce at my end on latest builds and get back to you if anything further needed. Thank you once again.

Sincerely,
Umesh

Hello Marcos,

Thank you for the details, however only once I experienced the "hang" and in rest 7-8 attempts not seeing any issues. Tried with 8.0.32(observed only once) and 8.0.33(not observed any issues in 2 attempts). Am I missing anything here? Joining the activity log shortly for your reference. Please advise.

regards,
Umesh

Test results

Attachment: 109668_8.0.32.results (application/octet-stream, text), 10.32 KiB.

Hello Umesh,
There's a small confusion; In your test you do:

bin/mysql -h127.0.0.1 -uroot -P23233 -pmsandbox

Which connects to the MySQL port (default would have been 3306).

The problem happens when connecting to GR communications port, which by default is 33061. You can run node1/use -e "SELECT @@global.group_replication_local_address"; to find out the GR port in the sandbox.  Please try connecting to that port and see if that actually does the trick.
Thanks for looking into it!
Best,
Marcos

Thank you Marcos.
I'll give it a try over the weekend and process the bug.

regards,
Umesh

Hello Marcos,

I am so thankful for your patience.
Finally, I'm able to reproduce the issue after your valuable inputs i.e. attempting to connect GR communications port. 
Verified as described.

Sincerely,
Umesh

Umesh,

Thanks to you for your efforts to validate!! :-)

Best,
Marcos

Contribution patch to fix the hang (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: 0001-Bug-109668-XCom-connection-stalled-forever-in-read-s.patch (text/x-patch), 44.90 KiB.

Thank you, Venkatesh for the Contribution.

Contribution submitted via Github - Bug #109668: XCom connection stalled forever in read() syscall over network (*) Contribution by Venkatesh Prasad Venugopal (Github venkatesh-prasad-v, mysql-server/pull/488#issuecomment-1663478873): I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: git_patch_1460947951.txt (text/plain), 44.93 KiB.

Hi Umesh!

Any updates on this?

--
Venkatesh Prasad (Venu)

If you're interested, you can try my modified version of Group Replication: https://github.com/advancedmysql/mysqlplus. See if this version addresses your issue, as it includes fixes for several related bugs.