Bug #112302 Concurrent PROCESSLIST, PURGE BINLOG Blocks New Connections
Submitted: 9 Sep 2023 0:38 Modified: 21 Sep 2023 2:31
Reporter: Jervin Real Email Updates:
Status: Can't repeat Impact on me:
Category:MySQL Server Severity:S3 (Non-critical)
Version:5.7.40 OS:Any
Assigned to: MySQL Verification Team CPU Architecture:Any

[9 Sep 2023 0:38] Jervin Real
On 5.7.40, we'd experience a deadlock on LOCK_thd_* mutexes causing an instance to completely become unresponsive. Consider the following thread PMP, labeled with lines:

1: 537 __lll_lock_wait,pthread_mutex_lock,native_mutex_lock,my_mutex_lock,inline_mysql_mutex_lock,Global_THD_manager::remove_thd,handle_connection,pfs_spawn_thread,start_thread,clone
2: 428  __lll_lock_wait,pthread_mutex_lock,native_mutex_lock,my_mutex_lock,inline_mysql_mutex_lock,Global_THD_manager::add_thd,handle_connection,pfs_spawn_thread,start_thread,clone
3:   1 __lll_lock_wait,pthread_mutex_lock,native_mutex_lock,my_mutex_lock,inline_mysql_mutex_lock,Log_in_use::operator(),Do_THD::operator(),std::for_each<THD**,,Global_THD_manager::do_for_all_thd,log_in_use,MYSQL_BIN_LOG::purge_logs_before_date,purge_master_logs_before_date,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_connection,pfs_spawn_thread,start_thread,clone
4:   1 __lll_lock_wait,pthread_mutex_lock,native_mutex_lock,my_mutex_lock,inline_mysql_mutex_lock,Global_THD_manager::do_for_all_thd_copy,fill_schema_processlist,do_fill_table,get_schema_tables_result,JOIN::prepare_result,JOIN::exec,handle_query,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_connection,pfs_spawn_thread,start_thread,clone
5:   1 __lll_lock_wait,pthread_mutex_lock,native_mutex_lock,my_mutex_lock,inline_mysql_mutex_lock,Fill_process_list::operator(),Do_THD::operator(),std::for_each<THD**,,Global_THD_manager::do_for_all_thd_copy,fill_schema_processlist,do_fill_table,get_schema_tables_result,JOIN::prepare_result,JOIN::exec,handle_query,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_connection,pfs_spawn_thread,start_thread,clone

- Line 05 is a show processlist, waiting for `LOCK_thd_data` and holds mutex `LOCK_thd_remove`.
- Line 04 is another show processlist, waiting for `LOCK_thd_remove`
- Line 03 is a PURGE BINARY LOGS BEFORE waiting for `LOCK_thd_data` and holds `LOCK_thd_list` and `LOCK_index` mutex.
- Line 02 is a connection trying to deregister from `Global_thd_manager` waiting for a mutex `LOCK_thd_remove`
- Line 01 is a connection trying to register a new thread with Global_thd_manager and waiting for a mutex `Lock_thd_list`

Possible circular dependency is 01 -> 03 -> 04 -> 05 -> 03 . There is no indication that this has been fixed in any of the 5.7 releases judging by the commits. The closest description was a bug fixed recently released on 8.0.34 https://bugs.mysql.com/bug.php?id=110494.

How to repeat:
Not sure yet.
[12 Sep 2023 14:09] MySQL Verification Team

I cannot reproduce this behavior no matter what I try.
[20 Sep 2023 1:31] Jervin Real
We've managed to reproduce this with the following, perhaps this can be simplified. We're using an m5.16xlarge instance.

Create a sysbench workload from one screen session.

/usr/local/bin/sysbench --mysql-host=<MYSQL_HOST> --mysql-port=3306 --mysql-user=username --mysql-password=password \
    --tables=32 --table-size=160000 --mysql-db=test --threads=512 \
    --events=0 --time=5 --report-interval=1 --histogram=on --thread-init-timeout=3600 \
    --db-ps-mode=disable --rate=1000 oltp_write_only prepare

while true ; do 
( /usr/local/bin/sysbench --mysql-host=<MYSQL_HOST> --mysql-port=3306 --mysql-user=username --mysql-password=password \
    --tables=32 --table-size=160000 --mysql-db=test --threads=512 \
    --events=0 --time=5 --report-interval=1 --histogram=on --thread-init-timeout=3600 \
    --db-ps-mode=disable --rate=1000 oltp_write_only run & ) ; sleep $((1 + RANDOM % 8))

On another screen window, run mysqlslap - I managed to used the test case from https://bugs.mysql.com/bug.php?id=89101 successfully.

Once the mysqlslap workload starts, execute one or more `SHOW PROCESSLIST` from another session. This will stall until the mysqlslap connections are established. Theoretically, if you overlap multiple mysqlslap sessions continously, the engine will show as frozen until it is not able to accept new connections.

We have tried the same test case on 5.7.42 and it is not repeatable, perhaps the fix on 5.7.41 below fixed this issue given the mutex on LOCK_thd_data.

Bug #34857411 : regression - slow connections/telnet block many other statements
                for connect_timeout seconds, causing pileups

This is a regression caused due to the fix made for the Bug 34094706. When a
connection somehow stalls/blocks during the authentication phase, where a mutex
is held, the other connections that are executing queries on I_S and P_S are
blocked until the first connection release the mutex.

Instead of using the mutex and checking the thd->active_vio, we now check the
value of net.vio type in the is_secure_transport() check.

Change-Id: Ic301a3a69b2bb828f2323441bdd0232e5a0ecef4
[20 Sep 2023 1:38] Jervin Real
Custom parameters:

| variable_name                | variable_value                                |
| innodb_io_capacity           | 1024                                          |
| innodb_lock_wait_timeout     | 120                                           |
| innodb_log_buffer_size       | 16777216                                      |
| innodb_log_file_size         | 4294967296                                    |
| innodb_print_all_deadlocks   | 1                                             |
| long_query_time              | 3                                             |
| max_allowed_packet           | 104857600                                     |
| max_connections              | 12000                                         |
| max_user_connections         | 6000                                          |
| performance_schema           | 1                                             |
| slave_parallel_workers       | 8                                             |
| slow_query_log               | 1                                             |
| table_open_cache_instances   | 16                                            |
| gtid-mode                    | OFF_PERMISSIVE                                |
| innodb_buffer_pool_instances | 16                                            |
| slave_parallel_type          | LOGICAL_CLOCK                                 |
| slave_preserve_commit_order  | 1                                             |
[20 Sep 2023 1:42] Jervin Real
Just want to mention https://bugs.mysql.com/bug.php?id=73168 since the weight of holding LOCK_thd_data is likely to have an impact here as well.
[20 Sep 2023 11:40] MySQL Verification Team

thanks for the update.

While I'm retrying this - the "show processlist" is the actual thing that blocks?

Have you tried querying I_S directly?
[20 Sep 2023 17:56] Jervin Real
Hey thanks for looking. SHOW PROCESSLIST is one of the queries that blocks, but not only that - this instance is monitored by Percona Monitoring and Management so there are other PROCESSLIST queries including directly to I_S.PROCESSLIST. In some cases, I can throw in PURGE BINARY LOGS and that gets blocked too as shown in the pmps.
[20 Sep 2023 18:31] Jervin Real
This is also repeatable in 8.0.31 but not in 8.0.34.
[20 Sep 2023 19:45] MySQL Verification Team

I managed to reproduce with 5.7 and not with 8.0 nor 8.1 and only with show processlist, I did not manage to reproduce select from i_s to stall. I'm not sure it is the same issue as we did have instances of show processlist stall on "heavy loaded server" and that was declared as not-a-bug by the dev team. I'll verify this but since I cannot reproduce this on 8 not sure if the patch is small enough that it can go to 5.7
[20 Sep 2023 20:37] Jervin Real
Can you confirm if you reproduced this on 5.7.40? How about 5.7.42/43?
[20 Sep 2023 21:02] MySQL Verification Team
I reproduced on 5.7.40, I did not reproduce on any 8.x (8.1.0, 8.0.34).

I am testing it as we speak with 5.7.43, did not reproduce yet, if I do not manage to reproduce with 5.7.43 I will close the bug as solved.
[21 Sep 2023 2:31] MySQL Verification Team

I cannot reproduce on any modern version of MySQL so I'm setting this bug to can't reproduce. I believe this issue is solved in latest versions of MySQL
[11 Jun 7:34] lei yue
I followed the above method and the problem did not reoccur in 8.0.24. The execution order of the three windows is as follows:
1. sysbench run
2. mysqlslap run
3. show processlist
[11 Jun 7:57] lei yue
Sorry, closing the thread pool can reproduce