Bug #98624 Can't connect to MySQL after establishing around 65536 connections
Submitted: 16 Feb 2020 16:02 Modified: 12 Mar 2020 1:38
Reporter: Fungo Wang (OCA) Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Connection Handling Severity:S3 (Non-critical)
Version:8.0.19 OS:Any
Assigned to: CPU Architecture:Any
Tags: 1130, 65536, Connections, lf_hash, mdl

[16 Feb 2020 16:02] Fungo Wang
Description:
Recently, one of our customer was bitten by this issue in our prod env. After establishing around 65536 connections to mysqld, no new connection can be setup unless disconnecting some old ones.

The error is like bellow:
ERROR 1130 (HY000): Host 'xx.xxx.xxx.xxx' is not allowed to connect to this MySQL server

How to repeat:
The reproducing steps are simple, just keep creating new connection and don't close it (by some tools). Then watch the `Threads_connected` global status.

After passing around 65536 connections, the new connection will report error.

I could repro this with the latest 8.0.19, and monitor output like this:

```
Threads_connected       65515
Threads_connected       65518
Threads_connected       65521
Threads_connected       65525
Threads_connected       65530
Threads_connected       65529
Threads_connected       65535
Threads_connected       65540
Threads_connected       65540
Threads_connected       65544
Threads_connected       65543
Threads_connected       65547
Threads_connected       65544
ERROR 1130 (HY000): Host 'xx.xxx.xxx.xxx' is not allowed to connect to this MySQL server
ERROR 1130 (HY000): Host 'xx.xxx.xxx.xxx' is not allowed to connect to this MySQL server
ERROR 1130 (HY000): Host 'xx.xxx.xxx.xxx' is not allowed to connect to this MySQL server
ERROR 1130 (HY000): Host 'xx.xxx.xxx.xxx' is not allowed to connect to this MySQL server
```

Suggested fix:
Errorno 1130 was reported by `acl_check_host()` failure, in `sql/sql_connect.cc:611`

```
608    if (acl_check_host(thd, thd->m_main_security_ctx.host().str,
609                       main_sctx_ip.str)) {
610      /* HOST_CACHE stats updated by acl_check_host(). */
611      my_error(ER_HOST_NOT_PRIVILEGED, MYF(0),
612               thd->m_main_security_ctx.host_or_ip().str);
613      return 1;
614    }

```

After digging the code and run testing with my colleagues, we found that the root cause is a hard-coded limitation of of hash pin, in `msys/_alloc-pin.cc:380`.

```
#define LF_PINBOX_MAX_PINS 65536
```

The reason why acl_check_host() failure is showed in bellow stack, and gdb printed var.

```
#0  lf_pinbox_get_pins (pinbox=pinbox@entry=0x3adcf28 <mdl_locks+40>) at /home/fungo/mysql-server-8019/mysys/lf_alloc-pin.cc:181
#1  0x000000000119cef4 in lf_hash_get_pins (hash=0x3adcf00 <mdl_locks>) at /home/fungo/mysql-server-8019/include/lf.h:219
#2  MDL_map::get_pins (this=0x3adcf00 <mdl_locks>) at /home/fungo/mysql-server-8019/sql/mdl.cc:229
#3  MDL_context::fix_pins (this=0x7f96500d4170) at /home/fungo/mysql-server-8019/sql/mdl.cc:1461
#4  MDL_context::try_acquire_lock_impl (this=this@entry=0x7f96500d4170, mdl_request=mdl_request@entry=0x7f96c009d4a0, out_ticket=out_ticket@entry=0x7f96c009d3c8)
    at /home/fungo/mysql-server-8019/sql/mdl.cc:2850
#5  0x000000000119d3df in MDL_context::acquire_lock (this=0x7f96500d4170, mdl_request=mdl_request@entry=0x7f96c009d4a0, lock_wait_timeout=lock_wait_timeout@entry=3600)
    at /home/fungo/mysql-server-8019/sql/mdl.cc:3376
#6  0x0000000000f594d8 in Acl_cache_lock_guard::lock (this=this@entry=0x7f96c009d6c0, raise_error=<optimized out>)
    at /home/fungo/mysql-server-8019/sql/auth/sql_auth_cache.cc:3397
#7  0x0000000000f5cee4 in Acl_cache_lock_guard::lock (this=this@entry=0x7f96c009d6c0, raise_error=raise_error@entry=false)
    at /home/fungo/mysql-server-8019/sql/auth/sql_auth_cache.cc:3404
#8  0x0000000000f5037f in acl_check_host (thd=thd@entry=0x7f96500d40e0, host=0x25af470 "", ip=ip@entry=0x7f9650050790 "127.0.0.1")
    at /home/fungo/mysql-server-8019/sql/auth/sql_authentication.cc:1681
#9  0x0000000000d83b91 in check_connection (thd=thd@entry=0x7f96500d40e0) at /home/fungo/mysql-server-8019/sql/sql_connect.cc:608
#10 0x0000000000d84f36 in login_connection (thd=0x7f96500d40e0) at /home/fungo/mysql-server-8019/sql/sql_connect.cc:704
#11 thd_prepare_connection (thd=thd@entry=0x7f96500d40e0) at /home/fungo/mysql-server-8019/sql/sql_connect.cc:877
#12 0x0000000000effdbf in handle_connection (arg=arg@entry=0x5e316c0) at /home/fungo/mysql-server-8019/sql/conn_handler/connection_handler_per_thread.cc:298
#13 0x0000000002335ad9 in pfs_spawn_thread (arg=0x5e43630) at /home/fungo/mysql-server-8019/storage/perfschema/pfs.cc:2854
#14 0x00007f97d402ce25 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f97d23bdf1d in clone () from /lib64/libc.so.6

(gdb) l 181
176       top_ver = pinbox->pinstack_top_ver;
177       do {
178         if (!(pins = top_ver % LF_PINBOX_MAX_PINS)) {
179           /* the stack of free elements is empty */
180           pins = pinbox->pins_in_array.fetch_add(1) + 1;
181           if (unlikely(pins >= LF_PINBOX_MAX_PINS)) {
182             return 0;
183           }
184           /*
185             note that the first allocated element has index 1 (pins==1).
(gdb) p pins 
$2 = 137 // here LF_PINBOX_MAX_PINS = 128 
```
In worklog "WL#9045: Make user management DDLs atomic", the acl cache lock implementation is change from OS mutex to a MDL LOCK, whose key is ACL_CACHE. And the MDL system is based on lf_hash.

Every connection (THD) will fetch a lf_hash pin and keep it in it's context, upon the very first request (`MDL_context::acquire_lock()`) to the MDL system. The pin is release back to MDL system at connection close (thd cleanup).

If the MDL hash pins are exhausted, new MDL request (from new connection) will fail. 

`Acl_cache_lock_guard::lock()` make the very first call to `MDL_context::acquire_lock()` for each new connection, so "ERROR 1130" is reported as connection failure, when no nore MDL hash pin is avalible.

I think this issue has existed for a long term after MDL adapted to lf_hash, and WL#9045 just make it easier to be triggered.

This issue will also put a implicit limit (about 65535) on MySQL max_connections. https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_max_connection...
[16 Feb 2020 16:08] Fungo Wang
65536 is a huge number, it takes time to setup so many connections, and maybe you also need to increase your OS limits config to support creating so many connections (when thread pool is not used).

To make life easier, you can apply below diff to 8.0.19 code, and rebuild.

After changing the LF_PINBOX_MAX_PINS to 128, in my test, 124 connections is enougth to repro.

diff --git a/mysys/lf_alloc-pin.cc b/mysys/lf_alloc-pin.cc
index 02d6c36..5532ff2 100644
--- a/mysys/lf_alloc-pin.cc
+++ b/mysys/lf_alloc-pin.cc
@@ -127,7 +127,7 @@ static_assert(sizeof(std::atomic<void *>) == sizeof(void *),
 #include "mysql/service_mysql_alloc.h"
 #include "mysys/mysys_priv.h" /* key_memory_lf_node */

-#define LF_PINBOX_MAX_PINS 65536
+#define LF_PINBOX_MAX_PINS 128

 static void lf_pinbox_real_free(LF_PINS *pins);

That's why in the previous post's gdb stack, pins == 137 will cause failure.

```
(gdb) p pins 
$2 = 137 // here LF_PINBOX_MAX_PINS = 128 
```
[17 Feb 2020 13:41] MySQL Verification Team
Hi Mr. Wang,

Thank you for your bug report.

First of all, this could be an error that is specific on the OS, which you have not specified.

It could be a bug about the error that is reported. Simply, it could be a problem that you have run out of the file handles. So, try increasing those and try again, please ......

Thanks in advance.
[17 Feb 2020 13:56] Fungo Wang
Hi Sinisa,

Have you tried the repro steps, or checked my analysis?

I'm pretty sure that the limitations of my OS resources are not reached, cause I have enlarged  them to big enough values. 

And actually the 1130 error is not related to OS resource, if you analyze the mysql code, only one place can report 1130, that is in `check_connection()` caused by failing `acl_check_host()`.

Please try the repro,  thanks very much!
[17 Feb 2020 16:53] MySQL Verification Team
Hi,

I did not manage to repeat the behaviour with a shell script, due to the shell limits. However, I managed to repeat the behaviour with a program written in C.

I also found a culprit of this limit of the total number of connections.

It is this line:

#define LF_PINBOX_MAX_PINS 65536

However, I can't tell whether this limit can be raised or not. Development shall have to decide upon that.

Verified as reported.
[18 Feb 2020 0:33] Fungo Wang
Hi Sinisa,

Thanks for your verification.

Just want to make it more clear.

This is a bug in the MDL subsystem, the scalability of MDL is limited to 65536 users, due to the pins number limitation (LF_PINBOX_MAX_PINS = 65536) of the LF_HASH it employs.

After 64K users, MDL subsystem will fail to work. 

The exposed behavior in 8.0 is 1130 ERROR, a connection failure, because `acl_check_host()` make the very first request to MDL subsytem for each THD, this because worklog "WL#9045: Make user management DDLs atomic", in which the acl cache lock implementation is change from OS mutex to a MDL LOCK, whose key is ACL_CACHE.

As long as lf_hash is used by MDL, this limitations exists, it behavior differently in 5.7 (I'll report a separate bug later), but the root cause is same.

Thanks.
[18 Feb 2020 13:27] MySQL Verification Team
HI Mr. Wang,

You are welcome and thanks for the additional comments.
[21 Feb 2020 8:44] Ståle Deraas
Posted by developer:
 
There is also a TCP restriction around 65K on the number of open tcp connections to single port.
[12 Mar 2020 1:38] Fungo Wang
Hi Ståle Deraas,

> There is also a TCP restriction around 65K on the number of open tcp connections to single port.

There is no such 64K socket limitation that a linux tcp port can handle, you could easily set up more than 64K socket to a mysql-sever:3306 using several client boxes.

Pls also check my reply under my new bug #98911.

Thanks.