Bug #98911 MDL system become insane after 64K connections
Submitted: 11 Mar 2020 14:44 Modified: 9 Apr 2020 0:56
Reporter: Fungo Wang (OCA) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Connection Handling Severity:S1 (Critical)
Version:5.7.29 OS:Any
Assigned to: CPU Architecture:Any
Tags: 64K, hash pins, lf_hash, mdl

[11 Mar 2020 14:44] Fungo Wang
Description:
This bug is a sister bug of bug #98624, but reveal more detail about the root cause. The MDL system can not work because the limitation on the number fo lf_hash pins.

This bug is for 5.7, #98624 is for 8.0.

How to repeat:
Like bug #98624, you need to set up about 64K connections to repro this, or change the LF_PINBOX_MAX_PINS macro to a relative small number, to make it more easier to repro.

```
diff --git a/mysys/lf_alloc-pin.c b/mysys/lf_alloc-pin.c
index 34eea5a..bab7a0f 100644
--- a/mysys/lf_alloc-pin.c
+++ b/mysys/lf_alloc-pin.c
@@ -110,7 +110,7 @@
 #include "lf.h"
 #include "mysys_priv.h" /* key_memory_lf_node */

-#define LF_PINBOX_MAX_PINS 65536
+#define LF_PINBOX_MAX_PINS 128

 static void lf_pinbox_real_free(LF_PINS *pins);
 ```

1. Prepare a table and fill some records
mysql 1> create table t1 (id int, name varchar(30), primary key(id)) engine=innodb;
mysql 1> insert into t1 values (1, 'hello'), (2, 'fungo');

Make sure the table is filled with data as expected.

mysql 1> select * from t1; 
+----+-------+
| id | name  |
+----+-------+
|  1 | hello |
|  2 | fungo |
+----+-------+
2 rows in set (0.00 sec)

Keep this mysql 1> session connected.

2. Set up LF_PINBOX_MAX_PINS (64K or whatever you may change to) number of connections, each connection need issue at least one table query (which will request a MDL lock and thus hold a MDL hash pins), after the query, the connection can keep quiet, but keep connected.

3. Set up a new connection mysql 2> , and query the records, 

mysql 2> select * from t1;
Query OK, 0 rows affected (0.00 sec)

mysql 2> select * from t1;
Query OK, 0 rows affected (0.00 sec)

mysql 2> select * from t1;
Query OK, 0 rows affected (0.00 sec)

The records become disappeared? Yes for mysql 2> and No for mysql 1>

mysql 1> select * from t1;
+----+-------+
| id | name  |
+----+-------+
|  1 | hello |
|  2 | fungo |
+----+-------+
2 rows in set (0.00 sec)

*Note:* there is no transaction_isolation concern, I use auto commit.

4. If disconnect some connections established in step 2, the records come back to session mysql 2> :

mysql 2> select * from t1;
+----+-------+
| id | name  |
+----+-------+
|  1 | hello |
|  2 | fungo |
+----+-------+
2 rows in set (0.00 sec)

Looks like magic, disappeared data come back?

Note: all tables are affected, if you `select * from mysql.user`, behave exactly same a t1.

Suggested fix:
The reason is same as bug #98624, the stack is bellow:

```
(gdb) bt
#0  lf_pinbox_get_pins (pinbox=pinbox@entry=0x1bc3ca8 <mdl_locks+40>) at /home/fungo/Projects/mysql-server-57/mysys/lf_alloc-pin.c:170
#1  0x0000000000b896bc in lf_hash_get_pins (hash=0x1bc3c80 <mdl_locks>) at /home/fungo/Projects/mysql-server-57/include/lf.h:199
#2  MDL_map::get_pins (this=0x1bc3c80 <mdl_locks>) at /home/fungo/Projects/mysql-server-57/sql/mdl.cc:220
#3  MDL_context::fix_pins (this=0x7f12fc013ef8) at /home/fungo/Projects/mysql-server-57/sql/mdl.cc:1509
#4  MDL_context::try_acquire_lock_impl (this=this@entry=0x7f12fc013ef8, mdl_request=mdl_request@entry=0x7f12fc0028e0, out_ticket=out_ticket@entry=0x7f13df253238)
    at /home/fungo/Projects/mysql-server-57/sql/mdl.cc:3023
#5  0x0000000000b8a18c in MDL_context::acquire_lock (this=this@entry=0x7f12fc013ef8, mdl_request=mdl_request@entry=0x7f12fc0028e0, lock_wait_timeout=<optimized out>)
    at /home/fungo/Projects/mysql-server-57/sql/mdl.cc:3597
#6  0x0000000000bdbd55 in open_table_get_mdl_lock (mdl_ticket=<synthetic pointer>, flags=0, table_list=0x7f12fc002548, ot_ctx=0x7f13df253690, thd=0x7f12fc013e60)
    at /home/fungo/Projects/mysql-server-57/sql/sql_base.cc:2918
#7  open_table (thd=thd@entry=0x7f12fc013e60, table_list=table_list@entry=0x7f12fc002548, ot_ctx=ot_ctx@entry=0x7f13df253690)
    at /home/fungo/Projects/mysql-server-57/sql/sql_base.cc:3258
#8  0x0000000000be3012 in open_and_process_table (flags=0, ot_ctx=0x7f13df253690, has_prelocking_list=false, prelocking_strategy=0x7f13df253728, counter=0x7f12fc016090,
    tables=0x7f12fc002548, lex=<optimized out>, thd=0x7f12fc013e60) at /home/fungo/Projects/mysql-server-57/sql/sql_base.cc:5226
#9  open_tables (thd=thd@entry=0x7f12fc013e60, start=start@entry=0x7f13df253718, counter=<optimized out>, flags=flags@entry=0,
    prelocking_strategy=prelocking_strategy@entry=0x7f13df253728) at /home/fungo/Projects/mysql-server-57/sql/sql_base.cc:5849
#10 0x0000000000be39da in open_tables_for_query (thd=thd@entry=0x7f12fc013e60, tables=<optimized out>, tables@entry=0x7f12fc002548, flags=flags@entry=0)
    at /home/fungo/Projects/mysql-server-57/sql/sql_base.cc:6626
#11 0x0000000000737681 in execute_sqlcom_select (thd=thd@entry=0x7f12fc013e60, all_tables=0x7f12fc002548)
    at /home/fungo/Projects/mysql-server-57/sql/sql_parse.cc:5126
#12 0x0000000000c365f9 in mysql_execute_command (thd=thd@entry=0x7f12fc013e60, first_level=first_level@entry=true)
    at /home/fungo/Projects/mysql-server-57/sql/sql_parse.cc:2826
#13 0x0000000000c38c5d in mysql_parse (thd=thd@entry=0x7f12fc013e60, parser_state=parser_state@entry=0x7f13df254660)
    at /home/fungo/Projects/mysql-server-57/sql/sql_parse.cc:5584
#14 0x0000000000c39756 in dispatch_command (thd=thd@entry=0x7f12fc013e60, com_data=com_data@entry=0x7f13df254cc0, command=COM_QUERY)
    at /home/fungo/Projects/mysql-server-57/sql/sql_parse.cc:1491
#15 0x0000000000c3b160 in do_command (thd=thd@entry=0x7f12fc013e60) at /home/fungo/Projects/mysql-server-57/sql/sql_parse.cc:1032
#16 0x0000000000cfa120 in handle_connection (arg=arg@entry=0x3bf92d0) at /home/fungo/Projects/mysql-server-57/sql/conn_handler/connection_handler_per_thread.cc:313
#17 0x00000000011d2361 in pfs_spawn_thread (arg=0x3c19650) at /home/fungo/Projects/mysql-server-57/storage/perfschema/pfs.cc:2197
#18 0x00007f14e92f7e25 in start_thread () from /lib64/libpthread.so.0
#19 0x00007f14e78e5f1d in clone () from /lib64/libc.so.6

```

165         if (!(pins= top_ver % LF_PINBOX_MAX_PINS))
166         {
167           /* the stack of free elements is empty */
168           pins= my_atomic_add32((int32 volatile*) &pinbox->pins_in_array, 1)+1;
169           if (unlikely(pins >= LF_PINBOX_MAX_PINS))
170             return 0;

there is no pins returned from lf_hash to MDL system, and MDL_context::try_acquire_lock_impl() was shortcuted here,

3018    Prepare context for lookup in MDL_map container by allocating pins
3019    if necessary. This also ensures that this MDL_context has pins allocated
3020    and ready for future attempts elements from MDL_map container (which
3021    might happen during lock release).
3022  */
3023  if (fix_pins())
3024    return TRUE;

And all the way being shortcuted to upper caller, no data and no error returned to client.

As I said earlier, the root cause is same as #98624:

1. lf_hash pins num limit is reached, so no hash pins returned to MDL
2. MDL system can not get hash pins, shortcuted and return (no error msg set)
3. the upper layer behavior become strange
   - for 8.0, i.e the bug descripted in #98624, can not create new connection, because the acl_check_host() make the very first request to MDL system.
   - for 5.7, acl_check_host() won't request MDL lock, so new connetion can still go in mysql, but what ever the first requset to MDL system will failure.

1. We should at least emit a error msg, in such conndition. Wrong result with no error is really bad...
or 
2. Totally fix this hash pins number limitation, this really limit the ability of mysql, make max_connections = 10K a fake number, 64K is the real max_connections :(

Note, as this affected the correctness of database, I label it with S1.
[11 Mar 2020 15:32] MySQL Verification Team
Hi Mr. Wang,

Thank you for your bug report.

I agree with your findings, although most OSs have a limit of 64 K handles on the single socket.

Verified as reported for 5.7.
[11 Mar 2020 15:33] MySQL Verification Team
This is the same as bug #98624, but it is regarding MySQL 5.7.
[12 Mar 2020 1:34] Fungo Wang
Hi Sinisa,

Thanks for your quick verified.

1. This bug in 5.7 has noting to do with connection handing, so I don't know why you put it under "MySQL Server: Connection Handling" category.

2.  > "although most OSs have a limit of 64 K handles on the single socket."

I guess you misunderstand the socket limitation in modern linux OS.

And probably, the limitation what you said, is at the client side. At client side, every socket is allocated n unique port number to connect mysql-server:3306, and this port number is limited to 64K (16bit), because the limitation of TCP protocol. Ant the real avaliable local port range is decidec by net.ipv4.ip_local_port_range. 

*BUT* this has nothing to do at the mysql-server:3306 side, there is only one port number needed for mysql-server, and the the socket number mysql-server:3306 could handled is not limited to 64K. 

You may check the answers here https://stackoverflow.com/questions/11129212/tcp-can-two-different-sockets-share-a-port.

And actually it's not that hard to establish around 100K connections using 2 client boxes, this is one of my local testing mysqld, and I creating new connections using to client:

`show global status like 'Threads_connected'` per sec

Thu Mar 12 09:24:51 CST 2020 Threads_connected  0
Thu Mar 12 09:24:52 CST 2020 Threads_connected  3186
Thu Mar 12 09:24:53 CST 2020 Threads_connected  9461
Thu Mar 12 09:24:54 CST 2020 Threads_connected  15669
Thu Mar 12 09:24:55 CST 2020 Threads_connected  21950
Thu Mar 12 09:24:56 CST 2020 Threads_connected  28228
Thu Mar 12 09:24:57 CST 2020 Threads_connected  34499
Thu Mar 12 09:24:58 CST 2020 Threads_connected  40776
Thu Mar 12 09:24:59 CST 2020 Threads_connected  47069
Thu Mar 12 09:25:00 CST 2020 Threads_connected  50001
Thu Mar 12 09:25:01 CST 2020 Threads_connected  53298
Thu Mar 12 09:25:02 CST 2020 Threads_connected  58983
Thu Mar 12 09:25:03 CST 2020 Threads_connected  64652
Thu Mar 12 09:25:04 CST 2020 Threads_connected  70298
Thu Mar 12 09:25:05 CST 2020 Threads_connected  75935
Thu Mar 12 09:25:06 CST 2020 Threads_connected  81589
Thu Mar 12 09:25:07 CST 2020 Threads_connected  87286
Thu Mar 12 09:25:08 CST 2020 Threads_connected  92838
Thu Mar 12 09:25:09 CST 2020 Threads_connected  95002

So I may wonder that have you guys really tried to repro this bug, using more than 64K connections?Pls try it.

3. As this can lead to wrong data result, for a database product, this is totally unacceptable. It's much more severe than deny-of-service (crash, hang, flutter etc.) !!!

So I guess S1 is a reasonable bug level for 5.7.
[12 Mar 2020 13:08] MySQL Verification Team
Hi Mr. Wang,

Thank you for your contribution.
[8 Apr 2020 18:04] Paul DuBois
Posted by developer:
 
Fixed in 8.0.21.

Pinbox exhaustion in the metadata locking subsystem could produce a
misleading error message.
[9 Apr 2020 0:56] Fungo Wang
Hi Paul,

This bug is for 5.7, and there is no misleading err msg, but unexpected data result.

Maybe bug #98624, aimed for 8.0 and emit misleading err msg at connect,  is the one should be closed.

Thanks.
[9 Apr 2020 12:37] MySQL Verification Team
Thank you, Paul .....