Bug #102190 mysql_real_connect does not respect MYSQL_OPT_CONNECT_TIMEOUT
Submitted: 7 Jan 2021 22:44 Modified: 22 Jan 2021 17:12
Reporter: Bruno Junqueira Adami Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Server: C API (client library) Severity:S2 (Serious)
Version:21 OS:Ubuntu
Assigned to: Georgi Kodinov CPU Architecture:Any

[7 Jan 2021 22:44] Bruno Junqueira Adami
Description:
Hi,

I have the following problem using the C client library: even using MYSQL_OPT_CONNECT_TIMEOUT (set to 4), whenever there is a server network instability event, the connect function hangs indefinitely. Here is a core dump of the process that called the mysql library, which was stuck for more than 20 minutes before timing out finally:

```
#0  __libc_read (nbytes=5, buf=0x7fd88c0107d3, fd=43) at ../sysdeps/unix/sysv/linux/read.c:26
#1  __libc_read (fd=43, buf=0x7fd88c0107d3, nbytes=5) at ../sysdeps/unix/sysv/linux/read.c:24
#2  0x00007fd8c4fbe389 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.1
#3  0x00007fd8c4fb962e in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.1
#4  0x00007fd8c4fb8484 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.1
#5  0x00007fd8c4fb8a57 in BIO_read () from /lib/x86_64-linux-gnu/libcrypto.so.1.1
#6  0x00007fd8c3e64b91 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.1
#7  0x00007fd8c3e68e1e in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.1
#8  0x00007fd8c3e666d0 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.1
#9  0x00007fd8c3e99901 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.1
#10 0x00007fd8c3e8f525 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.1
#11 0x00007fd8c3e7ad58 in SSL_do_handshake () from /lib/x86_64-linux-gnu/libssl.so.1.1
#12 0x00007fd8c4064a60 in ?? () from /lib/x86_64-linux-gnu/libmysqlclient.so.21
#13 0x00007fd8c4014195 in ?? () from /lib/x86_64-linux-gnu/libmysqlclient.so.21
#14 0x00007fd8c401448e in ?? () from /lib/x86_64-linux-gnu/libmysqlclient.so.21
#15 0x00007fd8c4017b99 in mysql_real_connect () from /lib/x86_64-linux-gnu/libmysqlclient.so.21
```

I am not sure how the timeouts I set in the mysql library affects the SSL library timeouts (SSL_do_handshake?), maybe it is related?

Thanks in advance.

How to repeat:
1) On Ubuntu (or any linux), I have wrote the following code:

#include <mysql.h>
#include <iostream>

int main () {
  while (true) {
    MYSQL* con = mysql_init(NULL);
    system("date");

    if (con == NULL) {
      std::cout << mysql_error(con) << std::endl;
      exit(1);
    }

    unsigned int mysql_ct;
    mysql_ct = 4;

    mysql_options(con, MYSQL_OPT_CONNECT_TIMEOUT, &mysql_ct);

    if (mysql_real_connect(con, "database ip goes here ...", "root", "root", NULL, 0, NULL, 0) == NULL) {
      std::cout << "NO " << mysql_error(con) << std::endl;
      mysql_close(con);
      continue;
    }

    std::cout << "YES" << std::endl;
    mysql_close(con);
  }
  return 0;
}

(Please replace "database ip goes here ..." with the database IP, and also any other user params)

2) Compile with: g++ test.cpp -I /usr/include/mysql/ -lmysqlclient (may be different depending on your linux distro).

3) Run it on gdb with: gdb a.out -ex 'r'   (so you can inspect where the execution is at, by interrupting the process with ctrl+c and then backtrace).

4) I have simulated a 'network down' in two ways that were able to reproduce this (while the code above was executing):

a) By disabling my network interface.
b) By blocking the database IP in my router firewall (TPLink).

And even by turning the network up again it doesn't resume.

5) About 50% of the times, the timeout of 4 seconds is respected. But, the other times you need to wait 20+ minutes, and it finally times out..

Notice that this was in a controlled environment. This is also happening on our production servers, when the server provider network hangs and comes back (even if it is for <1 minute).
[9 Jan 2021 19:30] MySQL Verification Team
Thank you for the bug report. and test case.
[14 Jan 2021 15:21] Georgi Kodinov
mysql_real_connect() does much more than just connecting. It does read a number of packets (the authentication exchange etc). And, as it is evidently in your case, it reads the packets of the TLS exchange. 

These reads are subject to the read timeout that is by default 10 mins. So, unless you're using the async connect APIs mysql_real_connect() is subject to the read timeout as well. 
Looking at the code the only places where the connect timeout is used are (on a regular blocking connection):
- waiting for the server greeting packet to arrive 
- passed down to TLS_set_timeout()

The latter is a bit dubious, so I'll skip that. But the former is very much relevant. Note that this is just one place where data are read of many.

The rest of the reads done by mysql_real_connect() are subject to the MYSQL_OPT_READ_TIMEOUT timeout (and of course writes to MYSQL_OPT_WRITE_TIMEOUT ). 

So here's my request (the reproduction case being a bit non-deterministic and the callstacks lacking debug info): in your test program can you also try setting read and write timeouts to something less than 10 mins and try like this ? 

I hope it will work for you. If that's the case and you're content with the result you can just do nothing or close the bug with a comment so that other readers would know as well.

In case the above works and you want to contribute to the documentation please re-open the bug as a docs bug asking for docs update.

And finally, if the above doesn't work, please make sure you're using the debug libmysql library (and the debug openssl library if available on your system) and reopen the bug and send in the callstack(s) of your test app hanging.

Hope this help!
[14 Jan 2021 16:13] Bruno Junqueira Adami
Hi Georgi,

Thanks for your reply! I have added the read and write timeout, but nothing has changed unfortunately (it still hangs for 20+ minutes):

    unsigned int mysql_ct;
    mysql_ct = 4;

    mysql_options(con, MYSQL_OPT_CONNECT_TIMEOUT, &mysql_ct);
    mysql_options(con, MYSQL_OPT_READ_TIMEOUT, &mysql_ct);
    mysql_options(con, MYSQL_OPT_WRITE_TIMEOUT, &mysql_ct);

I have installed the libs with debug symbols, and this is the full stacktrace:

#0  0x00007ffff75e2142 in __GI___libc_read (fd=3, buf=buf@entry=0x5555555a84c3, nbytes=nbytes@entry=5)
    at ../sysdeps/unix/sysv/linux/read.c:26
#1  0x00007ffff7210389 in read (__nbytes=5, __buf=0x5555555a84c3, __fd=<optimized out>)
    at /usr/include/x86_64-linux-gnu/bits/unistd.h:44
#2  sock_read (b=0x5555555b4690, out=0x5555555a84c3 "\367\377\177", outl=5) at ../crypto/bio/bss_sock.c:99
#3  0x00007ffff720b62e in bread_conv (bio=<optimized out>, data=<optimized out>, datal=<optimized out>, readbytes=0x7fffffffd580)
    at ../crypto/bio/bio_meth.c:123
#4  0x00007ffff720a484 in bio_read_intern (b=0x5555555b4690, data=data@entry=0x5555555a84c3, dlen=dlen@entry=5, 
    readbytes=readbytes@entry=0x7fffffffd580) at ../crypto/bio/bio_lib.c:270
#5  0x00007ffff720aa57 in BIO_read (b=<optimized out>, data=data@entry=0x5555555a84c3, dlen=dlen@entry=5)
    at ../crypto/bio/bio_lib.c:296
#6  0x00007ffff745cb91 in ssl3_read_n (s=s@entry=0x5555555a11d0, n=n@entry=5, max=5, extend=extend@entry=0, 
    clearold=clearold@entry=1, readbytes=readbytes@entry=0x7fffffffd648) at ../ssl/record/rec_layer_s3.c:296
#7  0x00007ffff7460e1e in ssl3_get_record (s=s@entry=0x5555555a11d0) at ../ssl/record/ssl3_record.c:207
#8  0x00007ffff745e6d0 in ssl3_read_bytes (s=0x5555555a11d0, type=22, recvd_type=0x7fffffffd7ec, buf=0x5555555a2f60 "\001", 
    len=4, peek=0, readbytes=0x7fffffffd7f0) at ../ssl/record/rec_layer_s3.c:1323
#9  0x00007ffff7491901 in tls_get_message_header (s=s@entry=0x5555555a11d0, mt=mt@entry=0x7fffffffd854)
    at ../ssl/statem/statem_lib.c:1160
#10 0x00007ffff7487525 in read_state_machine (s=0x5555555a11d0) at ../ssl/statem/statem.c:579
#11 state_machine (s=0x5555555a11d0, server=0) at ../ssl/statem/statem.c:434
#12 0x00007ffff7472d58 in SSL_do_handshake (s=0x5555555a11d0) at ../ssl/ssl_lib.c:3661
#13 0x00007ffff792ca00 in ssl_handshake_loop (ssl_errno_holder=0x7fffffffd9a8, func=0x7ffff74748d0 <SSL_connect>, 
    ssl=<optimized out>, vio=0x5555555bd6a0) at /var/lib/pb2/sb_1-410081-1600883484.86/release/mysql-8.0.22/vio/viossl.cc:431
#14 ssl_do (vio=0x5555555bd6a0, timeout=<optimized out>, func=0x7ffff74748d0 <SSL_connect>, 
    ssl_errno_holder=ssl_errno_holder@entry=0x7fffffffd9a8, sslptr=<optimized out>, sslptr@entry=0x0, ptr=<optimized out>)
    at /var/lib/pb2/sb_1-410081-1600883484.86/release/mysql-8.0.22/vio/viossl.cc:667
#15 0x00007ffff792ce46 in sslconnect (ptr=ptr@entry=0x5555555b6b20, vio=<optimized out>, timeout=<optimized out>, 
    ssl_errno_holder=ssl_errno_holder@entry=0x7fffffffd9a8, ssl=ssl@entry=0x0)
    at /var/lib/pb2/sb_1-410081-1600883484.86/release/mysql-8.0.22/vio/viossl.cc:727
#16 0x00007ffff78dc135 in cli_establish_ssl (mysql=mysql@entry=0x55555558e5c0)
    at /var/lib/pb2/sb_1-410081-1600883484.86/release/mysql-8.0.22/sql-common/client.cc:4175
#17 0x00007ffff78dc42e in csm_establish_ssl (ctx=0x7fffffffdcb0)
    at /var/lib/pb2/sb_1-410081-1600883484.86/release/mysql-8.0.22/sql-common/client.cc:6374
#18 0x00007ffff78dfb39 in mysql_real_connect (mysql=0x55555558e5c0, host=<optimized out>, user=<optimized out>, 
    passwd=<optimized out>, db=<optimized out>, port=<optimized out>, unix_socket=0x0, client_flag=0)
    at /var/lib/pb2/sb_1-410081-1600883484.86/release/mysql-8.0.22/sql-common/client.cc:5580

It does look like it is stuck at the TLS reads, and even with the MySql timeouts for connect, read and write set, it doesn't seem to affect/propagate to the TLS timeout. I hope this stacktrace helps you to pinpoint the problem!

Best regards,
Bruno
[21 Jan 2021 9:16] MySQL Verification Team
Thank you Bruno for the details.
Please note that I have discussed internally with Joro and even we(including our Miguel) tried on various instances but not seeing any issues. Moreover there's no TLS timeout(TLS relies on the underlying socket) etc and we believe it must be environment issue which is causing this.

If you can provide more information, feel free to add it to this bug and change the status back to 'Open'.  

Thank you for your interest in MySQL.

regards,
Umesh
[22 Jan 2021 3:59] Bruno Junqueira Adami
Hi,

- Moreover there's no TLS timeout(TLS relies on the underlying socket)

That is precisely the root cause of this bug that I was hoping that the MySql team to fix. I am not very familiar with the underlying implementation, but can you not set a timeout in the TLS library?

- and we believe it must be environment issue which is causing this

That is when the connection is lost due to some network problem. The TLS/SSL read will timeout, and in this case it seems to be using the socket default that isn't being changed by the MySql library (which is too long!).

- If you can provide more information, feel free to add it to this bug and change the status back to 'Open'.

It seems that I cannot reopen the bug! Please can you let me know precisely the steps you tried to reproduce this bug?
[22 Jan 2021 7:21] Georgi Kodinov
We're using the standard openssl library packaged by your Linux distro. If that's not providing a timeout and instead is relying on the underlying socket's timeout there's very little we, the MySQL team, can do to fix that.
Note that openssl is also an open source product so one can actually check the network handling code in it.
[22 Jan 2021 17:12] Bruno Junqueira Adami
Thanks Georgi and team,

Can you please point out to the MySql implementation so that I can take a look?

I understand that the issue is with the SLL library, but I was wondering if the MySql team could implement a workaround. For instance, these ideas (not necessarily the best ideas) come to mind:

a) If I am not mistaken, it should be using linux sockets and it uses the blocking API. If that is the case, you can set the read/write timeouts: https://www.man7.org/linux/man-pages/man7/socket.7.html (SO_RCVTIMEO, SO_SNDTIMEO)
b) Switch to use a non-blocking BIO, and use the select function with a timeout while trying to complete the handshake: https://www.openssl.org/docs/manmaster/man3/SSL_do_handshake.html (BIO non blocking section)
c) Implement a separate timer that if it times out, it will close the underlying socket, causing the SSL handshake function to return error (because the socket will be closed).
d) Expose something to the client that they can call to close the underlying socket (the user then can implement their timeout)

Please let me know your thoughts, thanks in advance