Bug #109320 mysql freeze on SHOW GLOBAL STATUS
Submitted: 8 Dec 2022 13:19  Modified: 8 Dec 2022 14:39
Reporter: Ruxandra Herastrau  Email Updates:
Status: Can't repeat  Impact on me: None
Category: MySQL Server  Severity: S3 (Non-critical)
Version: 8.0.31  OS: CentOS (CentOS Linux release 7.9.2009 (Core))
Assigned to:  CPU Architecture: x86 (Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz)
Tags: freeze, lock, SHOW GLOBAL STATUS, too many connections

[8 Dec 2022 13:19] Ruxandra Herastrau
Description:
MySQL stops accepting new connections; any new connection fails with a "Too many connections" error. Restarting takes a very long time (or the server needs to be killed with -9).

This has happened multiple times, on different physical servers unrelated to each other.

I believe some global lock is not released. Please review the following details.

syslog after "systemctl stop mysqld" (a shutdown usually takes only a few seconds; here it took almost 13 minutes):
Dec  8 09:26:34 s51 systemd: Stopping MySQL Server...
Dec  8 09:39:18 s51 systemd: Stopped MySQL Server.

MySQL slow log (an extremely long time for "SHOW GLOBAL STATUS;") - the only record in that period of time. It finished just a few seconds before the shutdown completed, so this is the SQL statement the server was waiting for.

# Time: 2022-12-08T09:39:06.918627Z
# User@Host: zabbix[zabbix] @ localhost []  Id:    23
# Query_time: 6802.891509  Lock_time: 0.000001 Rows_sent: 491  Rows_examined: 491
SET timestamp=1670485544;
SHOW GLOBAL STATUS;
/usr/sbin/mysqld, Version: 8.0.31 (MySQL Community Server - GPL). started with:

MySQL error log:

2022-12-06T16:24:15.167521Z 0 [System] [MY-011323] [Server] X Plugin ready for connections. Bind-address: '::' port: 33060, socket: /home/mysql/mysqlx.sock
2022-12-08T09:26:34.609062Z 0 [System] [MY-013172] [Server] Received SHUTDOWN from user <via user signal>. Shutting down mysqld (Version: 8.0.31).
2022-12-08T09:39:08.935184Z 0 [Warning] [MY-010909] [Server] /usr/sbin/mysqld: Forcing close of thread 503369  user: 'user1'.
2022-12-08T09:39:08.935467Z 0 [Warning] [MY-010909] [Server] /usr/sbin/mysqld: Forcing close of thread 505827  user: 'user2'.
2022-12-08T09:39:08.935558Z 0 [Warning] [MY-010909] [Server] /usr/sbin/mysqld: Forcing close of thread 505828  user: 'user2'.
2022-12-08T09:39:08.935707Z 0 [Warning] [MY-010909] [Server] /usr/sbin/mysqld: Forcing close of thread 503364  user: 'user1'.
2022-12-08T09:39:17.852834Z 0 [System] [MY-010910] [Server] /usr/sbin/mysqld: Shutdown complete (mysqld 8.0.31)  MySQL Community Server - GPL.

I've tried the following:

- stopped all programs that can generate new connections (nginx, php, crond, zabbix agent)
- killed all processes already running
- checked with netstat (only a few connections remaining from remote hosts: fewer than 10 of the 600 max connections)

MySQL will not accept new connections, whatever I do.
CPU usage from mysqld is not zero, but not high either: around 30% of one core.

Server has plenty of free RAM and CPU.
strace-ing the process (after calling "systemctl stop mysqld") doesn't reveal any relevant system calls that could cause the freeze.

Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
CentOS Linux release 7.9.2009 (Core)
mysql-community-server-8.0.31-1.el7.x86_64 (official repository)
Linux [hostname] 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

How to repeat:
It happens randomly, sometimes after 10 minutes, other times after one week.

Suggested fix:
It would be better to abort the "show global status" with an error than to lock up the whole server.
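
A minimal sketch of the idea, using std::timed_mutex rather than the server's actual mutex API (illustrative only, not the server's code):

#include <chrono>
#include <mutex>

std::timed_mutex status_lock;  // stand-in for the server's LOCK_status

// Fail with an error instead of blocking forever when the lock
// cannot be acquired within a bounded time.
bool read_global_status_or_fail() {
  using namespace std::chrono_literals;
  if (!status_lock.try_lock_for(5s)) return false;  // report an error to the client
  // ... read the status variables here ...
  status_lock.unlock();
  return true;
}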
[8 Dec 2022 14:39] MySQL Verification Team
Hi Mrs. Herastrau,

Thank you for your bug report.

However, this looks like a typical case of MySQL Server misconfiguration.

You should re-configure the number of active connections, both for MySQL and for InnoDB. You should also reduce the wait timeouts, so that sleeping connections are closed automatically.
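
For example (the values are illustrative; tune them to your workload):

SET GLOBAL max_connections = 200;           -- cap concurrent client connections
SET GLOBAL wait_timeout = 60;               -- close idle non-interactive sessions after 60 s
SET GLOBAL interactive_timeout = 60;        -- the same for interactive sessions
SET GLOBAL innodb_thread_concurrency = 16;  -- bound concurrent threads inside InnoDB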

We do not see how this could be our bug, unless you send us a fully repeatable test case; we do not see a proper test case in this report. If we get a test case and we can repeat it on a properly configured MySQL server, we shall change the status of this bug report.

Can't repeat.
[9 Dec 2022 8:29] Max Irgiznov
I sometimes have the same issue, on multiple other machines with different hardware and configurations.

In my research, I came to the conclusion that this is the effect of a race condition in the thread instrumentation code added in release 8.0.27, as a result of which the LOCK_status mutex is not released.

The last time this happened, the hung thread was here:
Thread 120 (Thread 0x7f284465d700 (LWP 1080893)):
#0  0x00007f28c96eae2b in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f28c96e39d5 in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x0000559e3c6b5329 in native_mutex_lock (mutex=<optimized out>) at /mnt/build/test/mysql-server-8.0.29/include/thr_mutex.h:94
#3  my_mutex_lock (mp=<optimized out>) at /mnt/build/test/mysql-server-8.0.29/include/thr_mutex.h:190
#4  inline_mysql_mutex_lock (that=<optimized out>, src_file=0x559e3d257260 "/mnt/build/test/mysql-server-8.0.29/storage/perfschema/table_global_status.cc", src_line=103) at /mnt/build/test/mysql-server-8.0.29/include/mysql/psi/mysql_mutex.h:274
#5  table_global_status::get_row_count () at /mnt/build/test/mysql-server-8.0.29/storage/perfschema/table_global_status.cc:103
#6  0x0000559e3c659adc in ha_perfschema::info (this=0x7efeade4e030, flag=18) at /mnt/build/test/mysql-server-8.0.29/storage/perfschema/ha_perfschema.cc:1716
#7  0x0000559e3b1608ab in JOIN::init_planner_arrays (this=this@entry=0x7efeaddf1ca0) at /mnt/build/test/mysql-server-8.0.29/sql/sql_optimizer.cc:5357
#8  0x0000559e3b16ec4f in JOIN::make_join_plan (this=0x7efeaddf1ca0) at /mnt/build/test/mysql-server-8.0.29/sql/sql_optimizer.cc:5184
#9  0x0000559e3b17157c in JOIN::optimize (this=0x7efeaddf1ca0, finalize_access_paths=finalize_access_paths@entry=true) at /mnt/build/test/mysql-server-8.0.29/sql/sql_optimizer.cc:642
#10 0x0000559e3b1d6ace in Query_block::optimize (this=this@entry=0x7efeade73be8, thd=thd@entry=0x7efeadc1e000, finalize_access_paths=finalize_access_paths@entry=true) at /mnt/build/test/mysql-server-8.0.29/sql/sql_select.cc:1817
#11 0x0000559e3b2570ce in Query_expression::optimize (this=this@entry=0x7efeade73440, thd=thd@entry=0x7efeadc1e000, materialize_destination=materialize_destination@entry=0x7efd3a625498, create_iterators=create_iterators@entry=false, finalize_access_paths=finalize_access_paths@entry=true) at /mnt/build/test/mysql-server-8.0.29/sql/sql_union.cc:714
#12 0x0000559e3b1299c3 in TABLE_LIST::optimize_derived (this=0x7efeade75098, thd=0x7efeadc1e000) at /mnt/build/test/mysql-server-8.0.29/sql/sql_derived.cc:1515
#13 0x0000559e3b170945 in JOIN::optimize (this=0x7efeaddf1838, finalize_access_paths=finalize_access_paths@entry=true) at /mnt/build/test/mysql-server-8.0.29/sql/sql_optimizer.cc:348
#14 0x0000559e3b1d6ace in Query_block::optimize (this=this@entry=0x7efe25a88838, thd=thd@entry=0x7efeadc1e000, finalize_access_paths=finalize_access_paths@entry=true) at /mnt/build/test/mysql-server-8.0.29/sql/sql_select.cc:1817
#15 0x0000559e3b2570ce in Query_expression::optimize (this=this@entry=0x7efe25a88090, thd=thd@entry=0x7efeadc1e000, materialize_destination=materialize_destination@entry=0x0, create_iterators=create_iterators@entry=true, finalize_access_paths=finalize_access_paths@entry=true) at /mnt/build/test/mysql-server-8.0.29/sql/sql_union.cc:714
#16 0x0000559e3b1d58d0 in Sql_cmd_dml::execute_inner (this=0x7efe25a88bd8, thd=0x7efeadc1e000) at /mnt/build/test/mysql-server-8.0.29/sql/sql_select.cc:769
#17 0x0000559e3b1d4fee in Sql_cmd_dml::execute (this=0x7efe25a88bd8, thd=thd@entry=0x7efeadc1e000) at /mnt/build/test/mysql-server-8.0.29/sql/sql_select.cc:584
#18 0x0000559e3b1e40f0 in Sql_cmd_show::execute (this=<optimized out>, thd=thd@entry=0x7efeadc1e000) at /mnt/build/test/mysql-server-8.0.29/sql/sql_show.cc:207
#19 0x0000559e3b1e5dbc in Sql_cmd_show_status::execute (this=<optimized out>, thd=0x7efeadc1e000) at /mnt/build/test/mysql-server-8.0.29/sql/sql_show.cc:771
#20 0x0000559e3b17aba3 in mysql_execute_command (thd=0x7efeadc1e000, first_level=<optimized out>) at /mnt/build/test/mysql-server-8.0.29/sql/sql_parse.cc:4860
#21 0x0000559e3b17edfc in dispatch_sql_command (thd=thd@entry=0x7efeadc1e000, parser_state=parser_state@entry=0x7f284465c540, update_userstat=update_userstat@entry=false) at /mnt/build/test/mysql-server-8.0.29/sql/sql_parse.cc:5499
#22 0x0000559e3b1802de in dispatch_command (thd=0x7efeadc1e000, com_data=<optimized out>, command=COM_QUERY) at /mnt/build/test/mysql-server-8.0.29/sql/sql_parse.cc:2050
#23 0x0000559e3b18230a in do_command (thd=thd@entry=0x7efeadc1e000) at /mnt/build/test/mysql-server-8.0.29/sql/sql_parse.cc:1414
#24 0x0000559e3b2ca860 in handle_connection (arg=arg@entry=0x7efeb51037e0) at /mnt/build/test/mysql-server-8.0.29/sql/conn_handler/connection_handler_per_thread.cc:308
#25 0x0000559e3c663205 in pfs_spawn_thread (arg=0x7efeb510a360) at /mnt/build/test/mysql-server-8.0.29/storage/perfschema/pfs.cc:2943
#26 0x00007f28c96e0ea7 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#27 0x00007f28c8e16a2f in clone () from /lib/x86_64-linux-gnu/libc.so.6 

Frame 5 corresponds to this code; every reader of the global status table serializes on LOCK_status here, so a single leaked lock blocks all of them:
102 ha_rows table_global_status::get_row_count(void) {
103 mysql_mutex_lock(&LOCK_status);
104 ha_rows status_var_count = all_status_vars.size();
105 mysql_mutex_unlock(&LOCK_status);
106 return status_var_count;
107 } 

If we look at the com_data structure in frame #22 (dispatch_command), we will see the following. (com_data is a union, so all of its members alias the same bytes; only com_query is meaningful for a COM_QUERY.)
p com_data
$1 = {
  com_init_db = {
    db_name = 0x7efeadc81021 "show global status like 'rpl_semi_sync_%'",
    length = 41
  },
  com_refresh = {
    options = 33 '!'
  },
  com_kill = {
    id = 139632302362657
  },
  com_set_option = {
    opt_command = 2915569697
  },
  com_stmt_execute = {
    stmt_id = 139632302362657,
    open_cursor = 41,
    parameters = 0x0,
    parameter_count = 0,
    has_new_types = 224 '\340'
  },
  com_stmt_fetch = {
    stmt_id = 139632302362657,
    num_rows = 41
  },
  com_stmt_send_long_data = {
    stmt_id = 139632302362657,
    param_number = 41,
    longdata = 0x0,
    length = 0
  },
  com_stmt_prepare = {
    query = 0x7efeadc81021 "show global status like 'rpl_semi_sync_%'",
    length = 41
  },
  com_stmt_close = {
    stmt_id = 2915569697
  },
  com_stmt_reset = {
    stmt_id = 2915569697
  },
  com_query = {
    query = 0x7efeadc81021 "show global status like 'rpl_semi_sync_%'",
    length = 41,
    parameters = 0x0,
    parameter_count = 0
  },
  com_field_list = {
    table_name = 0x7efeadc81021 "show global status like 'rpl_semi_sync_%'",
    table_name_length = 41,
    query = 0x0,
    query_length = 0
  }
} 

The next thread in the dump, Thread 121, contains this request (same union layout; dump truncated):
$2 = {
  com_init_db = {
    db_name = 0x7efeac081021 "show global status like 'Uptime'",
    length = 32
  },
  com_refresh = {
    options = 33 '!'
  },
  com_kill = {
    id = 139632273002529
  },
  com_set_option = {
    opt_command = 2886209569
  },
  com_stmt_execute = {
    stmt_id = 139632273002529,
    open_cursor = 32,
    parameters = 0x0,
    parameter_count = 0,
    has_new_types = 0 '\000'
  },
  com_stmt_fetch = {
    stmt_id = 139632273002529,
    num_rows = 32
  },
  com_stmt_send_long_data = {
    stmt_id = 139632273002529,
    param_number = 32,
    longdata = 0x0,
    length = 0
  },
  com_stmt_prepare = {
    query = 0x7efeac081021 "show global status like 'Uptime'",
    length = 32
  },
  com_stmt_close = {
    stmt_id = 2886209569
  },
  com_stmt_reset = {
    stmt_id = 2886209569
  },

Unfortunately, it has not yet been possible to create a simple script to reproduce the error. Do you have any suggestions for debugging?

PS Ruxandra Herastrau, you may try these two workarounds:
1. disable thread instrumentation in the Performance Schema:
   UPDATE performance_schema.setup_consumers SET ENABLED='NO' WHERE NAME='thread_instrumentation';
2. disable the Performance Schema entirely in the MySQL config (see the snippet below).
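
For workaround 2 (performance_schema can only be set at server startup, so this requires a restart):

[mysqld]
performance_schema = OFF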

Also, can you provide more details about your instance (config, replicas, monitoring, ProxySQL)?
[9 Dec 2022 12:50] MySQL Verification Team
Hi,

This code:

mysql_mutex_lock(&LOCK_status);
ha_rows status_var_count = all_status_vars.size();
mysql_mutex_unlock(&LOCK_status);

shows clearly that the mutex is held for a very short time, hence that part of the code is not the problem.
[9 Dec 2022 15:53] Cristi Coteț
This part of the code is not the problem, but what if the lock is held by some other code, and here it simply cannot be acquired?

In my particular case I have Zabbix for monitoring; it times out, makes a new connection, and issues the "show global status" again. After some time it uses up all the connections (up to max_connections), all of them waiting for this lock to be granted, and then the server is frozen.

I see that in some locations the lock is acquired like this:

  if (m_current_thd->fill_status_recursion_level++ == 0) {
    mysql_mutex_lock(&LOCK_status);    // taken only at the outermost level
  }

[some work here]

  if (m_current_thd->fill_status_recursion_level-- == 1) {
    mysql_mutex_unlock(&LOCK_status);  // released only when leaving the outermost level
  }

But there are 3 locations where m_current_thd->fill_status_recursion_level-- is called without calling mysql_mutex_unlock, even when the value of fill_status_recursion_level was 1:

pfs_variable.cc:1259
pfs_variable.cc:1306
pfs_variable.cc:1347

All 3 calls were added recently.
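
For illustration only, here is a self-contained sketch of that pattern using std::mutex (this is NOT the actual pfs_variable.cc code; the names are borrowed from the server for readability). The hazard is an error path that restores the counter but skips the unlock; an RAII guard makes that mistake impossible:

#include <mutex>

std::mutex LOCK_status;                       // stand-in for the server mutex
thread_local int fill_status_recursion_level = 0;

bool do_work() { return false; }              // hypothetical work that fails

// Buggy shape: the error path decrements but never unlocks.
bool materialize_buggy() {
  if (fill_status_recursion_level++ == 0) LOCK_status.lock();
  if (!do_work()) {
    fill_status_recursion_level--;            // BUG: LOCK_status stays held forever
    return false;
  }
  if (fill_status_recursion_level-- == 1) LOCK_status.unlock();
  return true;
}

// Safer shape: the destructor runs on every path, so the lock cannot leak.
struct Status_lock_guard {
  Status_lock_guard()  { if (fill_status_recursion_level++ == 0) LOCK_status.lock(); }
  ~Status_lock_guard() { if (fill_status_recursion_level-- == 1) LOCK_status.unlock(); }
};

bool materialize_safe() {
  Status_lock_guard guard;
  return do_work();                           // an early return cannot leak the lock
}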
[9 Dec 2022 16:08] Max Irgiznov
>This part of the code is not the problem, but what if the lock is held by some other code and here it just cannot be granted ?
That's right! I gave that code as an example, to show what the problem is: the LOCK_status mutex is not released.
[12 Dec 2022 12:52] MySQL Verification Team
Hi,

Those locations are immaterial, since they only change the counter on a different object.
[21 Dec 2022 9:28] Cristi Coteț
I've stopped Zabbix monitoring, so I can catch the server after the problem is triggered and before all connections are used up. I've tried various investigations, and here is what I found:

 - this can happen on the master and/or a replica. This investigation was done on a replica; I can't keep a master in this state long enough to investigate.
 - If I run "show processlist" / "show full processlist" / "select * from performance_schema.processlist", that connection hangs (forever).
 - "stop replica" also hangs. (You can see "Waiting for replica mutex on exit" in the threads list below.)
 - I've tried killing the oldest thread (from performance_schema.threads - thread/sql/one_connection) and this also hangs (thread_id 2164016 in the table below, with processlist_id 2163878).
 - This also happens on "Red Hat Enterprise Linux release 9.1 (Plow)" (first reported on CentOS 7).
 
What other information would be useful for confirming and investigating this issue?

(split due to: Your comment is too long)
[21 Dec 2022 9:32] Cristi Coteț
select * from performance_schema.threads where PROCESSLIST_STATE is not null or PROCESSLIST_ID=2163878 order by thread_id;

Attachment: 1.threads.txt (text/plain), 8.03 KiB.

[21 Dec 2022 9:33] Cristi Coteț
select * from performance_schema.metadata_locks;

Attachment: 2.metadata_locks.txt (text/plain), 1.25 KiB.

[21 Dec 2022 9:33] Cristi Coteț
select * from performance_schema.replication_applier_status_by_worker;

Attachment: 3.replication_applier_status_by_worker.txt (text/plain), 7.53 KiB.

[21 Dec 2022 12:57] MySQL Verification Team
Hi,

We have run SHOW GLOBAL STATUS many times and we did not experience any problems.

Hence, we are still waiting for a proper and fully repeatable test case.
[21 Dec 2022 13:23] Cristi Coteț
The "show global status" is not the cause for this freeze. I'm not sure exactly what triggers this. show global status is one of the queries frozen after the problem occurs. As i've stated in the previous comment "show processlist" is also affected. 

I've added as many details as I could, so that someone who knows exactly how the locking system works can figure out where to start the investigation.

If any other details would help, please let me know and I will add them as soon as it happens again.
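
For instance, next time it happens I can also capture the current mutex waits (assuming the wait instrumentation and its consumers are enabled in the Performance Schema):

SELECT THREAD_ID, EVENT_NAME, TIMER_WAIT
  FROM performance_schema.events_waits_current
 WHERE EVENT_NAME LIKE '%LOCK_status%';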

Thank you.
[21 Dec 2022 13:24] MySQL Verification Team
Sorry, 

But we can only process reports that come with fully repeatable test cases.
[21 Dec 2022 15:15] Jose A.
Cristi, we have been experiencing the same problem on dozens of servers since we upgraded to 8.0.31, and also on 5.7.40.

I can't contribute much more than what you have already described; the origin is the same. For what it's worth, I was going through the bug reports and I see one that seems to be related, with status "Verified", and it even contributes a patch:

https://bugs.mysql.com/bug.php?id=109322

Maybe the MySQL team can tell us something about it, and whether it might be the same problem.

Thanks
[21 Dec 2022 15:21] MySQL Verification Team
Hi,

That bug was verified based on code analysis, but further analysis has shown that the patch does not improve things.

Hence, we still need a repeatable test case.

Also, with thousands of open connections, you must expect a certain wait (in minutes) with global status checks. Hence, the workaround is to limit the number of connections or to use some proxy intermediary. There are a number of those on the market, but we cannot recommend any, since we do not use them nor do we need them.
[27 Dec 2022 9:21] Max Irgiznov
Hi, Team.

We use a proxy; it doesn't help with this bug. After upgrading to 8.0.30, we now have random crashes on random tables and queries.

For example:
#1
stack_bottom = 7f9d183174e0 thread_stack 0x100000
/usr/sbin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x2e) [0x55a33852bfee]
/usr/sbin/mysqld(print_fatal_signal(int)+0x2fb) [0x55a33768002b]
/usr/sbin/mysqld(handle_fatal_signal+0xc5) [0x55a3376800f5]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x13140) [0x7fc8d27bd140]
/usr/sbin/mysqld(PT_item_list::contextualize(Parse_context*)+0x6b) [0x55a337586c9b]
/usr/sbin/mysqld(PT_update::make_cmd(THD*)+0xc9) [0x55a3373f4f09]
/usr/sbin/mysqld(LEX::make_sql_cmd(Parse_tree_root*)+0x27) [0x55a3374e4287]
/usr/sbin/mysqld(THD::sql_parser()+0x4b) [0x55a33748e0ab]
/usr/sbin/mysqld(parse_sql(THD*, Parser_state*, Object_creation_ctx*)+0x118) [0x55a337508038]
/usr/sbin/mysqld(dispatch_sql_command(THD*, Parser_state*, bool)+0x2c5) [0x55a33750e255]
/usr/sbin/mysqld(dispatch_command(THD*, COM_DATA const*, enum_server_command)+0xf3e) [0x55a33750f90e]
/usr/sbin/mysqld(do_command(THD*)+0x252) [0x55a337511a92]
/usr/sbin/mysqld(+0x126f620) [0x55a337671620]
/usr/sbin/mysqld(+0x2637415) [0x55a338a39415]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7fc8d27b1ea7]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fc8d1c16a2f]

#2
stack_bottom = 7f398f2434e0 thread_stack 0x100000
/usr/sbin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x2e) [0x55e9400bdfee]
/usr/sbin/mysqld(print_fatal_signal(int)+0x2fb) [0x55e93f21202b]
/usr/sbin/mysqld(handle_fatal_signal+0xc5) [0x55e93f2120f5]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x13140) [0x7f6457a8d140]
/usr/sbin/mysqld(JOIN::cleanup_item_list(mem_root_deque<Item*> const&) const+0x35) [0x55e93f0f80d5]
/usr/sbin/mysqld(JOIN::destroy()+0x20c) [0x55e93f0fe9bc]
/usr/sbin/mysqld(Query_block::cleanup(THD*, bool)+0x75) [0x55e93f179355]
/usr/sbin/mysqld(Query_expression::cleanup(THD*, bool)+0x8e) [0x55e93f17952e]
/usr/sbin/mysqld(mysql_execute_command(THD*, bool)+0xa9a) [0x55e93f09c49a]
/usr/sbin/mysqld(dispatch_sql_command(THD*, Parser_state*, bool)+0x4cc) [0x55e93f0a045c]
/usr/sbin/mysqld(dispatch_command(THD*, COM_DATA const*, enum_server_command)+0xf3e) [0x55e93f0a190e]
/usr/sbin/mysqld(do_command(THD*)+0x252) [0x55e93f0a3a92]
/usr/sbin/mysqld(+0x126f620) [0x55e93f203620]
/usr/sbin/mysqld(+0x2637415) [0x55e9405cb415]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7f6457a81ea7]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f6456e16a2f]
[3 Jan 2023 12:46] MySQL Verification Team
Hi Mr. Irgiznov,

Thank you for your comment, but it does not help at all.

Especially in your case: those two stacktraces are from vastly different commands, which means that you might have a hardware problem.

This is strange, since you are using a CPU which supports ECC (error-checking and correcting) RAM modules. If you do have such RAM modules, then you have some transient glitch in the disk or OS. If you don't, then you might have a transient glitch in RAM. Such transient errors are impossible to diagnose with existing software.
[5 Jan 2023 10:36] Max Irgiznov
Hi Team.

There are no hardware problems (checked primarily by replacement); this happens on different hardware (not virtual machines) and on different databases.
[5 Jan 2023 13:20] MySQL Verification Team
Hi Mr. Irgiznov,

Thank you for your comments.

We believe that you have checked MySQL on various hardware, so we accept that it is not a hardware problem.

However, we cannot proceed with processing this report until we have a fully repeatable test case. As we have written, it should consist of a set of SQL statements that always leads to the stacktraces that the contributors here are reporting.

All that we can conclude is that both stacktraces are caused by an UPDATE which uses expressions.

We are waiting on the  detailed and fully repeatable  test case.