| Bug #98624 | Can't connect to MySQL after establishing around 65536 connections | ||
|---|---|---|---|
| Submitted: | 16 Feb 2020 16:02 | Modified: | 12 Mar 2020 1:38 |
| Reporter: | Fungo Wang (OCA) | Email Updates: | |
| Status: | Verified | Impact on me: | |
| Category: | MySQL Server: Connection Handling | Severity: | S3 (Non-critical) |
| Version: | 8.0.19 | OS: | Any |
| Assigned to: | CPU Architecture: | Any | |
| Tags: | 1130, 65536, Connections, lf_hash, mdl | ||
[16 Feb 2020 16:08]
Fungo Wang
65536 is a huge number, it takes time to setup so many connections, and maybe you also need to increase your OS limits config to support creating so many connections (when thread pool is not used). To make life easier, you can apply below diff to 8.0.19 code, and rebuild. After changing the LF_PINBOX_MAX_PINS to 128, in my test, 124 connections is enougth to repro. diff --git a/mysys/lf_alloc-pin.cc b/mysys/lf_alloc-pin.cc index 02d6c36..5532ff2 100644 --- a/mysys/lf_alloc-pin.cc +++ b/mysys/lf_alloc-pin.cc @@ -127,7 +127,7 @@ static_assert(sizeof(std::atomic<void *>) == sizeof(void *), #include "mysql/service_mysql_alloc.h" #include "mysys/mysys_priv.h" /* key_memory_lf_node */ -#define LF_PINBOX_MAX_PINS 65536 +#define LF_PINBOX_MAX_PINS 128 static void lf_pinbox_real_free(LF_PINS *pins); That's why in the previous post's gdb stack, pins == 137 will cause failure. ``` (gdb) p pins $2 = 137 // here LF_PINBOX_MAX_PINS = 128 ```
[17 Feb 2020 13:41]
MySQL Verification Team
Hi Mr. Wang, Thank you for your bug report. First of all, this could be an error that is specific on the OS, which you have not specified. It could be a bug about the error that is reported. Simply, it could be a problem that you have run out of the file handles. So, try increasing those and try again, please ...... Thanks in advance.
[17 Feb 2020 13:56]
Fungo Wang
Hi Sinisa, Have you tried the repro steps, or checked my analysis? I'm pretty sure that the limitations of my OS resources are not reached, cause I have enlarged them to big enough values. And actually the 1130 error is not related to OS resource, if you analyze the mysql code, only one place can report 1130, that is in `check_connection()` caused by failing `acl_check_host()`. Please try the repro, thanks very much!
[17 Feb 2020 16:53]
MySQL Verification Team
Hi, I did not manage to repeat the behaviour with a shell script, due to the shell limits. However, I managed to repeat the behaviour with a program written in C. I also found a culprit of this limit of the total number of connections. It is this line: #define LF_PINBOX_MAX_PINS 65536 However, I can't tell whether this limit can be raised or not. Development shall have to decide upon that. Verified as reported.
[18 Feb 2020 0:33]
Fungo Wang
Hi Sinisa, Thanks for your verification. Just want to make it more clear. This is a bug in the MDL subsystem, the scalability of MDL is limited to 65536 users, due to the pins number limitation (LF_PINBOX_MAX_PINS = 65536) of the LF_HASH it employs. After 64K users, MDL subsystem will fail to work. The exposed behavior in 8.0 is 1130 ERROR, a connection failure, because `acl_check_host()` make the very first request to MDL subsytem for each THD, this because worklog "WL#9045: Make user management DDLs atomic", in which the acl cache lock implementation is change from OS mutex to a MDL LOCK, whose key is ACL_CACHE. As long as lf_hash is used by MDL, this limitations exists, it behavior differently in 5.7 (I'll report a separate bug later), but the root cause is same. Thanks.
[18 Feb 2020 13:27]
MySQL Verification Team
HI Mr. Wang, You are welcome and thanks for the additional comments.
[21 Feb 2020 8:44]
Ståle Deraas
Posted by developer: There is also a TCP restriction around 65K on the number of open tcp connections to single port.
[12 Mar 2020 1:38]
Fungo Wang
Hi Ståle Deraas, > There is also a TCP restriction around 65K on the number of open tcp connections to single port. There is no such 64K socket limitation that a linux tcp port can handle, you could easily set up more than 64K socket to a mysql-sever:3306 using several client boxes. Pls also check my reply under my new bug #98911. Thanks.

Description: Recently, one of our customer was bitten by this issue in our prod env. After establishing around 65536 connections to mysqld, no new connection can be setup unless disconnecting some old ones. The error is like bellow: ERROR 1130 (HY000): Host 'xx.xxx.xxx.xxx' is not allowed to connect to this MySQL server How to repeat: The reproducing steps are simple, just keep creating new connection and don't close it (by some tools). Then watch the `Threads_connected` global status. After passing around 65536 connections, the new connection will report error. I could repro this with the latest 8.0.19, and monitor output like this: ``` Threads_connected 65515 Threads_connected 65518 Threads_connected 65521 Threads_connected 65525 Threads_connected 65530 Threads_connected 65529 Threads_connected 65535 Threads_connected 65540 Threads_connected 65540 Threads_connected 65544 Threads_connected 65543 Threads_connected 65547 Threads_connected 65544 ERROR 1130 (HY000): Host 'xx.xxx.xxx.xxx' is not allowed to connect to this MySQL server ERROR 1130 (HY000): Host 'xx.xxx.xxx.xxx' is not allowed to connect to this MySQL server ERROR 1130 (HY000): Host 'xx.xxx.xxx.xxx' is not allowed to connect to this MySQL server ERROR 1130 (HY000): Host 'xx.xxx.xxx.xxx' is not allowed to connect to this MySQL server ``` Suggested fix: Errorno 1130 was reported by `acl_check_host()` failure, in `sql/sql_connect.cc:611` ``` 608 if (acl_check_host(thd, thd->m_main_security_ctx.host().str, 609 main_sctx_ip.str)) { 610 /* HOST_CACHE stats updated by acl_check_host(). */ 611 my_error(ER_HOST_NOT_PRIVILEGED, MYF(0), 612 thd->m_main_security_ctx.host_or_ip().str); 613 return 1; 614 } ``` After digging the code and run testing with my colleagues, we found that the root cause is a hard-coded limitation of of hash pin, in `msys/_alloc-pin.cc:380`. ``` #define LF_PINBOX_MAX_PINS 65536 ``` The reason why acl_check_host() failure is showed in bellow stack, and gdb printed var. ``` #0 lf_pinbox_get_pins (pinbox=pinbox@entry=0x3adcf28 <mdl_locks+40>) at /home/fungo/mysql-server-8019/mysys/lf_alloc-pin.cc:181 #1 0x000000000119cef4 in lf_hash_get_pins (hash=0x3adcf00 <mdl_locks>) at /home/fungo/mysql-server-8019/include/lf.h:219 #2 MDL_map::get_pins (this=0x3adcf00 <mdl_locks>) at /home/fungo/mysql-server-8019/sql/mdl.cc:229 #3 MDL_context::fix_pins (this=0x7f96500d4170) at /home/fungo/mysql-server-8019/sql/mdl.cc:1461 #4 MDL_context::try_acquire_lock_impl (this=this@entry=0x7f96500d4170, mdl_request=mdl_request@entry=0x7f96c009d4a0, out_ticket=out_ticket@entry=0x7f96c009d3c8) at /home/fungo/mysql-server-8019/sql/mdl.cc:2850 #5 0x000000000119d3df in MDL_context::acquire_lock (this=0x7f96500d4170, mdl_request=mdl_request@entry=0x7f96c009d4a0, lock_wait_timeout=lock_wait_timeout@entry=3600) at /home/fungo/mysql-server-8019/sql/mdl.cc:3376 #6 0x0000000000f594d8 in Acl_cache_lock_guard::lock (this=this@entry=0x7f96c009d6c0, raise_error=<optimized out>) at /home/fungo/mysql-server-8019/sql/auth/sql_auth_cache.cc:3397 #7 0x0000000000f5cee4 in Acl_cache_lock_guard::lock (this=this@entry=0x7f96c009d6c0, raise_error=raise_error@entry=false) at /home/fungo/mysql-server-8019/sql/auth/sql_auth_cache.cc:3404 #8 0x0000000000f5037f in acl_check_host (thd=thd@entry=0x7f96500d40e0, host=0x25af470 "", ip=ip@entry=0x7f9650050790 "127.0.0.1") at /home/fungo/mysql-server-8019/sql/auth/sql_authentication.cc:1681 #9 0x0000000000d83b91 in check_connection (thd=thd@entry=0x7f96500d40e0) at /home/fungo/mysql-server-8019/sql/sql_connect.cc:608 #10 0x0000000000d84f36 in login_connection (thd=0x7f96500d40e0) at /home/fungo/mysql-server-8019/sql/sql_connect.cc:704 #11 thd_prepare_connection (thd=thd@entry=0x7f96500d40e0) at /home/fungo/mysql-server-8019/sql/sql_connect.cc:877 #12 0x0000000000effdbf in handle_connection (arg=arg@entry=0x5e316c0) at /home/fungo/mysql-server-8019/sql/conn_handler/connection_handler_per_thread.cc:298 #13 0x0000000002335ad9 in pfs_spawn_thread (arg=0x5e43630) at /home/fungo/mysql-server-8019/storage/perfschema/pfs.cc:2854 #14 0x00007f97d402ce25 in start_thread () from /lib64/libpthread.so.0 #15 0x00007f97d23bdf1d in clone () from /lib64/libc.so.6 (gdb) l 181 176 top_ver = pinbox->pinstack_top_ver; 177 do { 178 if (!(pins = top_ver % LF_PINBOX_MAX_PINS)) { 179 /* the stack of free elements is empty */ 180 pins = pinbox->pins_in_array.fetch_add(1) + 1; 181 if (unlikely(pins >= LF_PINBOX_MAX_PINS)) { 182 return 0; 183 } 184 /* 185 note that the first allocated element has index 1 (pins==1). (gdb) p pins $2 = 137 // here LF_PINBOX_MAX_PINS = 128 ``` In worklog "WL#9045: Make user management DDLs atomic", the acl cache lock implementation is change from OS mutex to a MDL LOCK, whose key is ACL_CACHE. And the MDL system is based on lf_hash. Every connection (THD) will fetch a lf_hash pin and keep it in it's context, upon the very first request (`MDL_context::acquire_lock()`) to the MDL system. The pin is release back to MDL system at connection close (thd cleanup). If the MDL hash pins are exhausted, new MDL request (from new connection) will fail. `Acl_cache_lock_guard::lock()` make the very first call to `MDL_context::acquire_lock()` for each new connection, so "ERROR 1130" is reported as connection failure, when no nore MDL hash pin is avalible. I think this issue has existed for a long term after MDL adapted to lf_hash, and WL#9045 just make it easier to be triggered. This issue will also put a implicit limit (about 65535) on MySQL max_connections. https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_max_connection...