Bug #116580 | Forever hang after connection drop | ||
---|---|---|---|
Submitted: | 7 Nov 2024 9:19 | Modified: | 13 Dec 2024 8:45 |
Reporter: | Val Doroshchuk | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | Connector / C++ | Severity: | S3 (Non-critical) |
Version: | 8.0 | OS: | Any |
Assigned to: | CPU Architecture: | Any |
[7 Nov 2024 9:19]
Val Doroshchuk
[7 Nov 2024 9:19]
Val Doroshchuk
#0 0x00007aa463118bcf in __GI___poll (fds=0x7fff82106030, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29 #1 0x00007aa463d3fdff in cdk::foundation::connection::detail::poll_one (socket=3, mode=cdk::foundation::connection::detail::POLL_MODE_READ, wait=true, timeout_usec=0) at ./cdk/foundation/socket_detail.cc:881 #2 0x00007aa463d3fff0 in cdk::foundation::connection::detail::recv_some (socket=3, buffer=0x55a075bdfdb0 "\n\002\340Y\nA96c5f63935aff2e9887df4e3307aa342ce43fe8a750000daac86c933eee2ba99", buffer_size=5, wait=true) at ./cdk/foundation/socket_detail.cc:950 #3 0x00007aa463d3fee6 in cdk::foundation::connection::detail::recv (socket=3, buffer=0x55a075bdfdb0 "\n\002\340Y\nA96c5f63935aff2e9887df4e3307aa342ce43fe8a750000daac86c933eee2ba99", buffer_size=5) at ./cdk/foundation/socket_detail.cc:920 #4 0x00007aa463d3b2f0 in cdk::foundation::connection::Socket_base::Read_op::do_wait (this=0x55a075b91530) at ./cdk/foundation/connection_tcpip.cc:282 #5 0x00007aa463959208 in cdk::foundation::api::Async_op_base::wait (this=0x55a075b91530) at ./cdk/include/mysql/cdk/protocol/../foundation/async.h:78 #6 0x00007aa463a8ac88 in cdk::protocol::mysqlx::Protocol_impl::rd_wait (this=0x55a075b90ba0) at ./cdk/protocol/mysqlx/protocol.cc:595 #7 0x00007aa463a8b1b3 in cdk::protocol::mysqlx::Op_rcv::do_read_msg (this=0x55a075b8c400, async=false) at ./cdk/protocol/mysqlx/protocol.cc:756 #8 0x00007aa463a8b583 in cdk::protocol::mysqlx::Op_rcv::do_wait (this=0x55a075b8c400) at ./cdk/protocol/mysqlx/protocol.cc:903 #9 0x00007aa463959208 in cdk::foundation::api::Async_op_base::wait (this=0x55a075b8c400) at ./cdk/include/mysql/cdk/foundation/async.h:78 #10 0x00007aa463a849eb in cdk::mysqlx::Stmt_op::do_wait (this=0x55a075b8a5a0) at ./cdk/mysqlx/result.cc:236 #11 0x00007aa463959208 in cdk::foundation::api::Async_op_base::wait (this=0x55a075b8a5a0) at ./cdk/include/mysql/cdk/foundation/async.h:78 #12 0x00007aa463a84b78 in cdk::mysqlx::Stmt_op::discard_result (this=0x55a075b8a5a0) at ./cdk/mysqlx/result.cc:277 #13 0x00007aa463a717cf in cdk::mysqlx::Stmt_op::discard (this=0x55a075b8a5a0) at ./cdk/include/mysql/cdk/mysqlx/result.h:633 #14 0x00007aa463a7655b in cdk::mysqlx::Expectation<cdk::mysqlx::Query_stmt, (cdk::protocol::mysqlx::api::Protocol_fields::value)4>::~Expectation (this=0x55a075b8a5a0, __in_chrg=<optimized out>) at ./cdk/mysqlx/stmt.h:130 #15 0x00007aa463a72df7 in cdk::mysqlx::Prepared<cdk::mysqlx::Query_stmt>::~Prepared (this=0x55a075b8a5a0, __in_chrg=<optimized out>) at ./cdk/mysqlx/stmt.h:403 #16 0x00007aa463a7e191 in cdk::mysqlx::Cmd_StmtExecute::~Cmd_StmtExecute (this=0x55a075b8a5a0, __in_chrg=<optimized out>) at ./cdk/mysqlx/stmt.h:604 #17 0x00007aa463a7e1ee in cdk::mysqlx::Cmd_StmtExecute::~Cmd_StmtExecute (this=0x55a075b8a5a0, __in_chrg=<optimized out>) at ./cdk/mysqlx/stmt.h:604 #18 0x00007aa46398fe4c in std::_Sp_counted_ptr<cdk::mysqlx::Stmt_op*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x55a075b91090) at /usr/include/c++/10/bits/shared_ptr_base.h:380 #19 0x000055a075043831 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x55a075b91090) at /usr/include/c++/10/bits/shared_ptr_base.h:158 #20 0x000055a075035221 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x55a075bde0c8, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:736 #21 0x00007aa46395a778 in std::__shared_ptr<cdk::mysqlx::Stmt_op, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x55a075bde0c0, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:1188 #22 0x00007aa46395a798 in std::shared_ptr<cdk::mysqlx::Stmt_op>::~shared_ptr (this=0x55a075bde0c0, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr.h:121 #23 0x00007aa4639755cb in cdk::Reply::~Reply (this=0x55a075bde0b0, __in_chrg=<optimized out>) at ./cdk/include/mysql/cdk/reply.h:44 #24 0x00007aa463975600 in cdk::Reply::~Reply (this=0x55a075bde0b0, __in_chrg=<optimized out>) at ./cdk/include/mysql/cdk/reply.h:44 #25 0x00007aa463a42d34 in mysqlx::abi2::r0::common::Result_impl::~Result_impl (this=0x55a075b74ef0, __in_chrg=<optimized out>) at ./common/result.cc:193 #26 0x00007aa463a42dda in mysqlx::abi2::r0::common::Result_impl::~Result_impl (this=0x55a075b74ef0, __in_chrg=<optimized out>) at ./common/result.cc:194 #27 0x00007aa4639f17ee in mysqlx::abi2::r0::internal::Result_detail::~Result_detail (this=0x7fff82106770, __in_chrg=<optimized out>) at ./devapi/result.cc:495 #28 0x000055a07502a6f2 in mysqlx::abi2::r0::internal::Row_result_detail<mysqlx::abi2::r0::Columns>::~Row_result_detail (this=0x7fff82106770, __in_chrg=<optimized out>) at ./include/mysqlx/devapi/detail/result.h:314 #29 0x000055a07502a750 in mysqlx::abi2::r0::internal::Result_common<mysqlx::abi2::r0::internal::Row_result_detail<mysqlx::abi2::r0::Columns> >::~Result_common (this=0x7fff82106770, __in_chrg=<optimized out>) at ./include/mysqlx/devapi/result.h:71 #30 0x000055a07502ade6 in mysqlx::abi2::r0::RowResult::~RowResult (this=0x7fff82106770, __in_chrg=<optimized out>) at ./include/mysqlx/devapi/result.h:585 #31 0x000055a07512a340 in mysqlx::abi2::r0::SqlResult::~SqlResult (this=0x7fff82106770, __in_chrg=<optimized out>) at ./include/mysqlx/devapi/result.h:730 #32 0x000055a075236564 in Sess_connection_drop_Test::TestBody (this=0x55a075b77150) at ./devapi/tests/session-t.cc:4380 #33 0x000055a0753080bf in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) () #34 0x000055a075300be3 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) () #35 0x000055a0752dc3f0 in testing::Test::Run() () #36 0x000055a0752dce34 in testing::TestInfo::Run() () #37 0x000055a0752dd7c4 in testing::TestSuite::Run() () #38 0x000055a0752edafc in testing::internal::UnitTestImpl::RunAllTests() () #39 0x000055a07530964c in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) () #40 0x000055a075301e85 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) () #41 0x000055a0752ec111 in testing::UnitTest::Run() () #42 0x000055a074f603e9 in RUN_ALL_TESTS () at /opt/googletest-1.15.2/include/gtest/gtest.h:2334 #43 0x000055a074f60309 in main (argc=1, argv=0x7fff82107028) at ./build/tests_main.cc:66
[7 Nov 2024 9:19]
Val Doroshchuk
Thread 1 (Thread 0x736cbcf5d5c0 (LWP 61730) "run_unit_tests"): #0 __GI___libc_read (nbytes=541, buf=0x63e4c17cfb4c, fd=3) at ../sysdeps/unix/sysv/linux/read.c:26 #1 __GI___libc_read (fd=3, buf=0x63e4c17cfb4c, nbytes=541) at ../sysdeps/unix/sysv/linux/read.c:24 #2 0x0000736cbbae1e3d in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3 #3 0x0000736cbbad234b in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3 #4 0x0000736cbbad5345 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3 #5 0x0000736cbbad54c7 in BIO_read () from /lib/x86_64-linux-gnu/libcrypto.so.3 #6 0x0000736cbc6c53cc in ?? () from /lib/x86_64-linux-gnu/libssl.so.3 #7 0x0000736cbc6c7332 in ?? () from /lib/x86_64-linux-gnu/libssl.so.3 #8 0x0000736cbc6a13fc in ?? () from /lib/x86_64-linux-gnu/libssl.so.3 #9 0x0000736cbc6a7cb7 in SSL_read () from /lib/x86_64-linux-gnu/libssl.so.3 #10 0x0000736cbcd43d77 in cdk::foundation::connection::TLS::Read_op::common_read (this=0x63e4c17d71d0) at ./cdk/foundation/connection_openssl.cc:931 #11 0x0000736cbcd43c54 in cdk::foundation::connection::TLS::Read_op::do_cont (this=0x63e4c17d71d0) at ./cdk/foundation/connection_openssl.cc:909 #12 0x0000736cbc9591c3 in cdk::foundation::api::Async_op_base::cont (this=0x63e4c17d71d0) at ./cdk/include/mysql/cdk/protocol/../foundation/async.h:72 #13 0x0000736cbca8abbe in cdk::protocol::mysqlx::Protocol_impl::rd_cont (this=0x63e4c1724970) at ./cdk/protocol/mysqlx/protocol.cc:575 #14 0x0000736cbca8ac32 in cdk::protocol::mysqlx::Protocol_impl::rd_wait (this=0x63e4c1724970) at ./cdk/protocol/mysqlx/protocol.cc:591 #15 0x0000736cbca8b1b3 in cdk::protocol::mysqlx::Op_rcv::do_read_msg (this=0x63e4c17d7680, async=false) at ./cdk/protocol/mysqlx/protocol.cc:756 #16 0x0000736cbca8b583 in cdk::protocol::mysqlx::Op_rcv::do_wait (this=0x63e4c17d7680) at ./cdk/protocol/mysqlx/protocol.cc:903 #17 0x0000736cbc959208 in cdk::foundation::api::Async_op_base::wait (this=0x63e4c17d7680) at ./cdk/include/mysql/cdk/foundation/async.h:78 #18 0x0000736cbca849eb in cdk::mysqlx::Stmt_op::do_wait (this=0x63e4c171e5a0) at ./cdk/mysqlx/result.cc:236 #19 0x0000736cbc959208 in cdk::foundation::api::Async_op_base::wait (this=0x63e4c171e5a0) at ./cdk/include/mysql/cdk/foundation/async.h:78 #20 0x0000736cbca84b78 in cdk::mysqlx::Stmt_op::discard_result (this=0x63e4c171e5a0) at ./cdk/mysqlx/result.cc:277 #21 0x0000736cbca717cf in cdk::mysqlx::Stmt_op::discard (this=0x63e4c171e5a0) at ./cdk/include/mysql/cdk/mysqlx/result.h:633 #22 0x0000736cbca7655b in cdk::mysqlx::Expectation<cdk::mysqlx::Query_stmt, (cdk::protocol::mysqlx::api::Protocol_fields::value)4>::~Expectation (this=0x63e4c171e5a0, __in_chrg=<optimized out>) at ./cdk/mysqlx/stmt.h:130 #23 0x0000736cbca72df7 in cdk::mysqlx::Prepared<cdk::mysqlx::Query_stmt>::~Prepared (this=0x63e4c171e5a0, __in_chrg=<optimized out>) at ./cdk/mysqlx/stmt.h:403 #24 0x0000736cbca7e191 in cdk::mysqlx::Cmd_StmtExecute::~Cmd_StmtExecute (this=0x63e4c171e5a0, __in_chrg=<optimized out>) at ./cdk/mysqlx/stmt.h:604 #25 0x0000736cbca7e1ee in cdk::mysqlx::Cmd_StmtExecute::~Cmd_StmtExecute (this=0x63e4c171e5a0, __in_chrg=<optimized out>) at ./cdk/mysqlx/stmt.h:604 #26 0x0000736cbc98fe4c in std::_Sp_counted_ptr<cdk::mysqlx::Stmt_op*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x63e4c17d7240) at /usr/include/c++/10/bits/shared_ptr_base.h:380 #27 0x000063e4c00c3823 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x63e4c17d7240) at /usr/include/c++/10/bits/shared_ptr_base.h:158 #28 0x000063e4c00b521d in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x63e4c171fff8, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:736 #29 0x0000736cbc95a778 in std::__shared_ptr<cdk::mysqlx::Stmt_op, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x63e4c171fff0, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:1188 #30 0x0000736cbc95a798 in std::shared_ptr<cdk::mysqlx::Stmt_op>::~shared_ptr (this=0x63e4c171fff0, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr.h:121 #31 0x0000736cbc9755cb in cdk::Reply::~Reply (this=0x63e4c171ffe0, __in_chrg=<optimized out>) at ./cdk/include/mysql/cdk/reply.h:44 #32 0x0000736cbc975600 in cdk::Reply::~Reply (this=0x63e4c171ffe0, __in_chrg=<optimized out>) at ./cdk/include/mysql/cdk/reply.h:44 #33 0x0000736cbca42d34 in mysqlx::abi2::r0::common::Result_impl::~Result_impl (this=0x63e4c17d8b70, __in_chrg=<optimized out>) at ./common/result.cc:193 #34 0x0000736cbca42dda in mysqlx::abi2::r0::common::Result_impl::~Result_impl (this=0x63e4c17d8b70, __in_chrg=<optimized out>) at ./common/result.cc:194 #35 0x0000736cbc9f17ee in mysqlx::abi2::r0::internal::Result_detail::~Result_detail (this=0x7ffca2ef96b0, __in_chrg=<optimized out>) at ./devapi/result.cc:495 #36 0x000063e4c00aa6f2 in mysqlx::abi2::r0::internal::Row_result_detail<mysqlx::abi2::r0::Columns>::~Row_result_detail (this=0x7ffca2ef96b0, __in_chrg=<optimized out>) at ./include/mysqlx/devapi/detail/result.h:314 #37 0x000063e4c00aa750 in mysqlx::abi2::r0::internal::Result_common<mysqlx::abi2::r0::internal::Row_result_detail<mysqlx::abi2::r0::Columns> >::~Result_common (this=0x7ffca2ef96b0, __in_chrg=<optimized out>) at ./include/mysqlx/devapi/result.h:71 #38 0x000063e4c00aade6 in mysqlx::abi2::r0::RowResult::~RowResult (this=0x7ffca2ef96b0, __in_chrg=<optimized out>) at ./include/mysqlx/devapi/result.h:585 #39 0x000063e4c01a9fda in mysqlx::abi2::r0::SqlResult::~SqlResult (this=0x7ffca2ef96b0, __in_chrg=<optimized out>) at ./include/mysqlx/devapi/result.h:730 #40 0x000063e4c02b61fe in Sess_connection_drop_Test::TestBody (this=0x63e4c170b150) at ./devapi/tests/session-t.cc:4380 #41 0x000063e4c0387ceb in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) () #42 0x000063e4c038080f in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) () #43 0x000063e4c035c01c in testing::Test::Run() () #44 0x000063e4c035ca60 in testing::TestInfo::Run() () #45 0x000063e4c035d3f0 in testing::TestSuite::Run() () #46 0x000063e4c036d728 in testing::internal::UnitTestImpl::RunAllTests() () #47 0x000063e4c0389278 in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) () --Type <RET> for more, q to quit, c to continue without paging-- #48 0x000063e4c0381ab1 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) () #49 0x000063e4c036bd3d in testing::UnitTest::Run() () #50 0x000063e4bffe03e9 in RUN_ALL_TESTS () at /opt/googletest-1.15.2/include/gtest/gtest.h:2334 #51 0x000063e4bffe0309 in main (argc=1, argv=0x7ffca2ef9f68) at ./build/tests_main.cc:66
[7 Nov 2024 10:50]
Bogdan Degtyariov
Hi Val, We will analyze this bug together with Bug# 116569 and update both bugs statuses once it is done.
[13 Nov 2024 10:45]
MySQL Verification Team
Thank you Mr. Doroshchuk, This is now a fully verified bug for the version 8.0 and higher.
[15 Nov 2024 8:56]
Rafal Somla
Posted by developer: There is old bug#92325 which reports similar issue and explains which part of the code could be responsible for that. Since then the code was fixed to remove the issue. Can you see if the issue can be reproduced with the current code (version 9.1.0)?
[15 Nov 2024 10:54]
MySQL Verification Team
Thank you, Rafal.
[15 Nov 2024 11:01]
Val Doroshchuk
Thank you, Rafal! > Since then the code was fixed to remove the issue. Can you see if the issue can be reproduced with the current code (version 9.1.0)? It is reproducible in 9.1.0 and trunk too. By test provided. Idea that MySQL closes the connection after `mysqlx_write_timeout`, and if the mysqlx::Session is throttled for too long, something bad happens and poll() or SSL_read() never wakes up. Please look at recv_some(): int select_result = poll_one(socket, POLL_MODE_READ, wait); << timeout_usec is -1 means infinite. Even if you add the timeout here, f.e. int select_result = poll_one(socket, POLL_MODE_READ, wait, 100000); it won't help since you would need to fix also void recv(Socket socket, byte *buffer, size_t buffer_size) { while (bytes_received != buffer_size) << it should exit the loop when no bytes received bytes_received += recv_some(socket, buffer + bytes_received, buffer_size - bytes_received, true);
[15 Nov 2024 16:08]
Rafal Somla
Posted by developer: Hmm, you are right -- the problem is in `connection::detail::recv()` function which assumes that a call to `recv_some()` will return a non-zero amount of received bytes (or throw error). But `recv_some()` can also return 0 in some cases... We need to look at this closer. Thanks for good analysis!
[15 Nov 2024 16:13]
MySQL Verification Team
Thank you, Rafal ......
[6 Dec 2024 15:34]
Rafal Somla
Posted by developer: Just an update as I had some time to work on this issue. However, I could not reproduce it locally with the following test case: ``` TEST_F(Bugs, bug37278716_connection_hang) { SKIP_IF_NO_XPLUGIN; auto res = sql("SELECT @@version"); LOG() << "talking to server version: " << res.fetchOne()[0]; sql("DROP TABLE IF EXISTS test.big"); sql("CREATE TABLE test.big(a CHAR(32))"); LOG() << "Table created, generating data..."; auto data_start = hr_clock::now(); sql("INSERT INTO test.big(a) VALUES ('a'),('b')"); for (unsigned i = 0; i < 4; ++i) { sql("INSERT INTO test.big(a) SELECT concat(t1.a,t2.a) FROM test.big as t1, test.big as t2"); } auto data_stop = hr_clock::now(); res = sql("SELECT count(a) FROM test.big"); LOG() << "Data generated: " << res.fetchOne()[0] << " rows" << " (" << duration_sec(data_stop - data_start) << "s)"; LOG() << "Query with timeout"; sql("SET SESSION mysqlx_write_timeout = 1"); auto select_start = hr_clock::now(); { auto res = sql("SELECT * FROM test.big"); // res.fetchAll(); } auto select_stop = hr_clock::now(); LOG() << "DONE, spent " << duration_sec(select_stop - select_start) << "s fetching data"; } ``` I got this output ``` [ RUN ] Bugs.bug37278716_connection_hang talking to server version: 9.2.0-commercial Table created, generating data... Data generated: 3263442 rows (21.2463s) Query with timeout DONE, spent 5.632s fetching data [ OK ] Bugs.bug37278716_connection_hang (28491 ms) ``` As one can see I did not manipulate the TCP/IP stack in my test because I thought that setting write timeout should be enough to get server to drop the connection. Apparently this is not the case. Using `tc` to cause 100% packet loos is more problematic in our test environment so it would be better if issue can be reproduced without it if possible. But no success so far. Note: Other required declarations for the test case above ``` #include <chrono> using hr_clock = std::chrono::high_resolution_clock; template <typename D> double duration_sec(D d) { return std::chrono::duration_cast<std::chrono::duration<double>>(d).count(); } ```
[6 Dec 2024 15:45]
Val Doroshchuk
Thanks, have you ever reproduced that hang? > Using `tc` to cause 100% packet loos is more problematic in our test environment so it would be better if issue can be reproduced without it if possible. I used `tc just to trigger closing connection after `mysqlx_write_timeout`. Never reproduced locally without, but on prod env it is suddenly reproducible when not needed =)
[6 Dec 2024 16:07]
Rafal Somla
Yeah, mysterious things can happen in a production environment... This was my first attempt at reproducing it and as I wrote was not successful. I will continue trying, possibly using your trick with `tc` etc. However, it could help a bit if you try to run my test case and tell if it works for you or hangs in your environment.
[6 Dec 2024 16:27]
MySQL Verification Team
Thank you, Rafal ..... You can try setting all client <-> server timeouts to the low values .... I agree that disrupting TCP packets is not great idea ....
[11 Dec 2024 8:24]
Rafal Somla
Posted by developer: Finally I was able to reproduce the issue. The only way that worked was blocking the traffic from server (using server timeouts did not cause any issues). In the situation where connector is reading data from server and packets from server are not reaching the client, the code "hangs" either in OpenSSL BIO_read() function or in a call to poll() (if TLS is not used). I think that in both cases this is the expected behavior, because we do not have any read-timeouts implemented. Without a timeout a read operation will wait indefinitely. So, to my understanding, the connector behaves as it should in such situation and a possible issue (rather a feature request) would be to add read/write timeout options to the connector. Still, there is this possibility for infinite loop in the internal data reading logic in function recv() (910@socket_detail.cc). This however could be an issue only if we have timeouts. Without a timeout the call to recv_some() either returns some data or throws an error which is the assumption here. I think that if connection is broken then this would lead to error from poll() and then recv_some() would throw error. Only if connection is still alive but packets do not reach their target we have an infinite wait here. However, I have not generated this situation in my tests to be 100% sure. Also, not sure if this explanation fits with the behavior you observed in the production environment -- if my theory is true reading query results should be interrupted by throwing errors when connections are broken, it should not hang waiting forever -- but I don't know what you observed exactly. If you think that adding read/write-timeout options to Connector/C++ is vital, please report a separate feature request for that.
[11 Dec 2024 11:24]
Val Doroshchuk
Thanks and very glad you reproduced it! > I think that in both cases this is the expected behavior, because we do not have any read-timeouts implemented. For me it sounds like a bug, since missing timeouts leads to infinite hand =-) Btw just adding timeouts might not work, (need to handle empty result cases), also not obvious how to introduce timeouts for TLS. > not sure if this explanation fits with the behavior you observed in the production environment -- if my theory is true reading query results should be interrupted by throwing errors when connections are broken, it should not hang waiting forever -- but I don't know what you observed exactly. Looks we have been observing the same case, the client is just stuck forever and not possible to recover unless reboot. > If you think that adding read/write-timeout options to Connector/C++ is vital, please report a separate feature request for that. If you still think it is not a bug, we will try to rise feature request to introduce timeouts.
[11 Dec 2024 12:09]
Rafal Somla
Posted by developer: I see this situation as similar to when you would call read() on fd that has no data available. Without setting any timeouts such read() would just hang indefinitely and I don't think anyone considers it a bug -- it is the expected behavior in such situation. It turns out I was not correct thinking that if connection is terminated on the server side a poll() call will return with error. I figured out that I can break on-going connection by simply killing the server when client is waiting for data. I checked with strace that what happens in this situation is that poll() continues to wait even after server process has terminated and definitely closed all its connections (that was a surprise for me). Eventually, after some long waiting, it does terminate with error -- I guess there is some built-in timeout in the kernel (*). That would be consistent with what you observed in the production -- when connections are dropped in the middle of receiving server reply the connector just hangs on the poll() call. Definitely having an option to control read timeout would help to handle that situation better. Still please report a feature request for this -- that would help us to manage the change. Note (*): Another possibility that I have not checked. Maybe poll() is hanging after I kill the server only because I blocked all tcp/ip traffic from that server. If traffic was not blocked then perhaps some low-level tcp/ip packet will be sent to inform the client that the connection is broken and poll() will then return with error shortly. Even without timeout options we are left with the issue of errors thrown in destructors which terminate process without a possibility to catch them -- this will be dealt with in the other bug#116569.
[11 Dec 2024 12:14]
MySQL Verification Team
Thank you, Rafal ......
[13 Dec 2024 8:45]
Val Doroshchuk
> poll() continues to wait even after server process has terminated and definitely closed all its connections (that was a surprise for me) Right, it looks what we were experiencing. `SHOW PROCESSLIST` does not show anything, or even killed but connector is still polling in poll(). Tried even to find a way how to close or cancel the socket, but all tries led to crash. So did not find any ways how to recover from this scenario.