Bug #118842 | Transactions get stuck in stage/sql/waiting for handler commit until mysqld restart (MySQL 8.4.3), heavy concurrent DML | |
---|---|---|---
Submitted: | 14 Aug 8:50 | Modified: | 22 Sep 3:27
Reporter: | kayukaran Parameswaran | Email Updates: |
Status: | Verified | Impact on me: |
Category: | MySQL Server: InnoDB storage engine | Severity: | S1 (Critical)
Version: | 8.4.6 | OS: | Red Hat (9.4)
Assigned to: | | CPU Architecture: | x86
Tags: | 8.4.3, binlog group commit, commit stall, innodb, replication, waiting for handler commit
[14 Aug 8:50]
kayukaran Parameswaran
[14 Aug 9:02]
kayukaran Parameswaran
We could not reproduce this issue in a non-production environment.
[15 Aug 3:08]
kayukaran Parameswaran
Can we get an update on this? What is the way forward?
[18 Aug 10:01]
MySQL Verification Team
Hello, Could you please clarify whether you are running this on bare metal or within a virtual machine (VM)? Without the ability to reproduce the issue, our options for assistance are limited. Based on the information provided, this appears to be a scaling or configuration issue rather than a bug. In such cases, MySQL Support would be the most appropriate channel for further assistance. Thank you for using MySQL.
[18 Aug 10:20]
kayukaran Parameswaran
Hello, Thank you for your feedback. This MySQL instance is running inside an OpenStack VM (not bare metal). We do not have any evidence that this issue is related to scaling or a configuration mismatch. The stall occurs only occasionally, and once it happens, all active sessions remain stuck in waiting for handler commit until we restart mysqld. After the restart, everything works without any changes to configuration or environment. This behavior makes us believe the server enters a hung state internally rather than this being purely a configuration issue. Could this be related to the issue described in MySQL Bug #117407? Thanks, Regards,
[18 Aug 10:31]
MySQL Verification Team
Hi, There are some communication issues solved in 8.4.6, so you should upgrade to the latest version and test. Some of these "stuck" issues happen only on VMs and are hard to reproduce, as they are often triggered by poor network quality at cloud providers. Please upgrade to 8.4.6 and test whether that solves the problem.
[18 Aug 11:03]
kayukaran Parameswaran
Hello, Thank you for your immediate feedback. We experienced MySQL getting stuck while flushing the binlog file to disk. During that time, replication continued and MySQL clients were able to connect to the server, but they could not perform DELETE/UPDATE/INSERT operations. Only SELECT queries worked. Given this behavior, we are unsure how it could be related to network quality on the cloud provider. We also verified disk I/O at the time and it appeared normal. Unfortunately, we could not determine the exact root cause of the hang. Thanks, Regards,
[19 Aug 0:05]
MySQL Verification Team
Hi, As a number of these types of issues are visible only on VMs and never on bare metal, it is only a guess. Without the ability to reliably reproduce, I can't say anything for sure. A number of bugfixes exist that could solve this, so please upgrade and let us know if you reproduce it with the latest MySQL. Thanks
[8 Sep 1:51]
kayukaran Parameswaran
Version: MySQL Community 8.4.6
Environment: OpenStack VM (same vCPU, RAM, and disk as the prior version), identical application traffic and schema.

Issue: After some time under steady write load, client sessions accumulate in the stage "waiting for handler commit." Throughput drops significantly. Restarting mysqld immediately clears the condition, but the issue reappears later under the same workload.

Configuration highlights:
max_connections=500
innodb_buffer_pool_size=8G
innodb_log_file_size=2G   # legacy setting
innodb_redo_log_capacity  # not explicitly set, default in use
innodb_flush_log_at_trx_commit=0
innodb_flush_method=fsync
innodb_flush_neighbors=1
innodb_max_dirty_pages_pct=75
innodb_max_dirty_pages_pct_lwm=0
innodb_autoinc_lock_mode=1

Observations during the stall: Many sessions remain in "waiting for handler commit."

Request: Please confirm whether there are any behavioral changes in 8.4.6, and recommend any additional instrumentation required for further diagnosis.

Full my.cnf / SHOW GLOBAL VARIABLES output:

max_connections 500
key_buffer_size 32M
join_buffer_size 16M
sort_buffer_size 16M
read_buffer_size 8M
read_rnd_buffer_size 16M
max_allowed_packet 16M
thread_cache_size 64
tmp_table_size 128M
max_heap_table_size 512M
bulk_insert_buffer_size 256M
table_definition_cache 10000
table_open_cache 512
secure_file_priv ""
innodb_data_home_dir /data/mysql-84
innodb_buffer_pool_size 8G
innodb_file_per_table
innodb_data_file_path ibdata1:4G;ibdata2:2G:autoextend
innodb_fast_shutdown 1
innodb_log_file_size 2G
innodb_log_buffer_size 32M
innodb_flush_log_at_trx_commit 0
innodb_lock_wait_timeout 50
long_query_time 0.4
innodb_buffer_pool_instances 4
innodb_doublewrite 0
innodb_io_capacity 2000
innodb_buffer_pool_dump_at_shutdown ON
innodb_buffer_pool_load_at_startup ON
innodb_sort_buffer_size 4M
explicit_defaults_for_timestamp 1
sql_mode ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION,NO_ZERO_DATE,NO_ZERO_IN_DATE
mysql-native-password ON
character_set_server latin1
collation_server latin1_swedish_ci
event_scheduler OFF
innodb_autoinc_lock_mode 1
innodb_flush_method fsync
innodb_flush_neighbors 1
innodb_max_dirty_pages_pct 75
innodb_max_dirty_pages_pct_lwm 0
innodb_undo_log_truncate OFF
log_error_verbosity 3
max_error_count 1024              # previous default was 64
optimizer_trace_max_mem_size 1MB  # previous default was 16KB; 1MB is the new default
performance_schema_consumer_events_transactions_current OFF
performance_schema_consumer_events_transactions_history OFF
plugin-load validate_password.so
validate-password FORCE_PLUS_PERMANENT
validate_password_length 8
validate_password_mixed_case_count 1
validate_password_number_count 1
validate_password_special_char_count 1
validate_password_policy MEDIUM
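One way to confirm which redo sizing is actually in effect (a quick sketch, not from the report itself; since 8.0.30 the server derives innodb_redo_log_capacity from innodb_log_file_size * innodb_log_files_in_group when only the legacy options are set):

mysql> SELECT @@innodb_redo_log_capacity, @@innodb_log_file_size;   -- effective capacity vs. legacy setting
mysql> SHOW GLOBAL STATUS LIKE 'Innodb_redo_log_capacity_resized';  -- capacity currently in use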
[8 Sep 6:47]
MySQL Verification Team
Hi, We are having trouble reproducing this behavior. What is the I/O load before the problem starts to manifest, and what is the I/O load while the problem is apparent?
[8 Sep 6:58]
kayukaran Parameswaran
We collected the following data while the MySQL server was in the waiting for handler commit state; I/O utilization on the VM was about 30%. I have already included the SHOW PROCESSLIST output and the InnoDB status dump in the comments.

mysql> SELECT NAME, COUNT
    ->   FROM information_schema.innodb_metrics
    ->  WHERE NAME IN ('log_lsn_current','log_lsn_checkpoint','trx_rseg_history_len');
+----------------------+-------+
| NAME                 | COUNT |
+----------------------+-------+
| trx_rseg_history_len |     1 |
| log_lsn_current      |     0 |
+----------------------+-------+
2 rows in set (0.01 sec)

mysql> SELECT trx_id, trx_state, trx_started, trx_rows_locked, trx_is_read_only
    ->   FROM information_schema.innodb_trx
    ->  ORDER BY trx_started
    ->  LIMIT 10;
+-----------+-----------+---------------------+-----------------+------------------+
| trx_id    | trx_state | trx_started         | trx_rows_locked | trx_is_read_only |
+-----------+-----------+---------------------+-----------------+------------------+
| 116163629 | RUNNING   | 2025-09-08 07:41:42 |               1 |                0 |
| 116341709 | RUNNING   | 2025-09-08 07:55:14 |               1 |                0 |
| 116341696 | RUNNING   | 2025-09-08 07:55:14 |               1 |                0 |
| 116341708 | RUNNING   | 2025-09-08 07:55:14 |               1 |                0 |
| 116341707 | RUNNING   | 2025-09-08 07:55:14 |               1 |                0 |
| 116341706 | RUNNING   | 2025-09-08 07:55:14 |               1 |                0 |
| 116341703 | RUNNING   | 2025-09-08 07:55:14 |               1 |                0 |
| 116341702 | RUNNING   | 2025-09-08 07:55:14 |               1 |                0 |
| 116341698 | RUNNING   | 2025-09-08 07:55:14 |               1 |                0 |
| 116341722 | RUNNING   | 2025-09-08 07:55:14 |               1 |                0 |
+-----------+-----------+---------------------+-----------------+------------------+
10 rows in set (0.00 sec)

mysql> SHOW GLOBAL STATUS LIKE 'Innodb_log_waits';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| Innodb_log_waits | 0     |
+------------------+-------+
1 row in set (0.02 sec)

mysql> SHOW GLOBAL STATUS LIKE 'Binlog_group_commits';
Empty set (0.01 sec)

mysql> SHOW GLOBAL STATUS LIKE 'Binlog_commits';
Empty set (0.00 sec)
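A note on the counters above (an aside for readers, not from the original post): the metric in information_schema.innodb_metrics is named log_lsn_last_checkpoint rather than log_lsn_checkpoint, which would explain the missing row, and several innodb_metrics counters report 0 unless enabled via innodb_monitor_enable. Binlog_commits and Binlog_group_commits appear to be MariaDB status counters with no MySQL Server equivalent, hence the empty sets. To time the stuck stage directly, a sketch (assuming the performance_schema stages instruments and consumers are enabled):

mysql> SELECT t.processlist_id, s.event_name, s.timer_wait/1e12 AS wait_sec
    ->   FROM performance_schema.events_stages_current s
    ->   JOIN performance_schema.threads t USING (thread_id)
    ->  WHERE s.event_name = 'stage/sql/waiting for handler commit'
    ->  ORDER BY wait_sec DESC LIMIT 10;   -- timer_wait is in picoseconds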
[9 Sep 3:27]
kayukaran Parameswaran
Can you please provide an update on the way forward for this issue? If you need any additional details, we can provide them. We are currently in the middle of a production activity, and due to this issue we are unable to move forward.
[9 Sep 6:21]
MySQL Verification Team
Hi, Do you happen to have binlog_gtid_simple_recovery=OFF? I've seen temporary hangs with that during binary log rotation, when there are many binary logs. How many binary logs are there, and what is the total size of all of them in the SHOW BINARY LOGS output? If mysqld is hung completely, please get thread stacks next time before restarting:

$ pstack `pidof mysqld` > stacks_bug118842.txt

Thanks! -- Shane Bester, MySQL Senior Principal Technical Support Engineer Oracle Corporation http://dev.mysql.com/
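If pstack is not installed (an aside; gdb is what pstack wraps on most Linux distributions), an equivalent capture would be:

$ gdb -p $(pidof mysqld) -batch -ex 'thread apply all bt' > stacks_bug118842.txt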
[9 Sep 7:10]
kayukaran Parameswaran
Hi, Thank you for the quick response. I have attached the stack trace of the hung MySQL service along with the other details you requested.

data/mysql-84]# ls -lhtr mysql-bin.*
-rw-r-----. 1 mysql mysql 1.1G Aug 31 17:34 mysql-bin.000135
-rw-r-----. 1 mysql mysql 1.1G Sep 1 09:34 mysql-bin.000136
-rw-r-----. 1 mysql mysql 1.1G Sep 1 16:14 mysql-bin.000137
-rw-r-----. 1 mysql mysql 881M Sep 2 02:10 mysql-bin.000138
-rw-r-----. 1 mysql mysql 181 Sep 2 02:22 mysql-bin.000139
-rw-r-----. 1 mysql mysql 3.8K Sep 2 02:29 mysql-bin.000140
-rw-r-----. 1 mysql mysql 181 Sep 2 02:31 mysql-bin.000141
-rw-r-----. 1 mysql mysql 1.1G Sep 2 12:30 mysql-bin.000142
-rw-r-----. 1 mysql mysql 1.1G Sep 2 18:24 mysql-bin.000143
-rw-r-----. 1 mysql mysql 1.1G Sep 3 09:44 mysql-bin.000144
-rw-r-----. 1 mysql mysql 1.1G Sep 3 17:24 mysql-bin.000145
-rw-r-----. 1 mysql mysql 1.1G Sep 4 08:54 mysql-bin.000146
-rw-r-----. 1 mysql mysql 1.1G Sep 4 15:54 mysql-bin.000147
-rw-r-----. 1 mysql mysql 1.1G Sep 5 08:04 mysql-bin.000148
-rw-r-----. 1 mysql mysql 1.1G Sep 5 15:54 mysql-bin.000149
-rw-r-----. 1 mysql mysql 1.1G Sep 6 08:24 mysql-bin.000150
-rw-r-----. 1 mysql mysql 1.1G Sep 6 18:14 mysql-bin.000151
-rw-r-----. 1 mysql mysql 1.1G Sep 7 14:24 mysql-bin.000152
-rw-r-----. 1 mysql mysql 1015M Sep 8 07:55 mysql-bin.000153

mysql> show variables like '%binlog_gtid_simple_recovery%';
+-----------------------------+-------+
| Variable_name               | Value |
+-----------------------------+-------+
| binlog_gtid_simple_recovery | ON    |
+-----------------------------+-------+
1 row in set (0.00 sec)

mysql> show binary logs;
(This did not return any output; it just kept taking time.)
[9 Sep 7:24]
MySQL Verification Team
unique/collapsed stacks
Attachment: mysql_stack_09092025.txt.collapsed (application/octet-stream, text), 15.57 KiB.
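For anyone following along, a rough sketch of how raw pstack output can be collapsed into unique stacks with counts (illustrative only; not necessarily the exact tooling used for the attachment):

$ pstack $(pidof mysqld) | awk '
    /^Thread/ { if (stack != "") count[stack]++; stack = "" }   # new thread: flush previous stack
    /^#/      { stack = stack ";" $4 }                          # append the frame function name
    END       { if (stack != "") count[stack]++;
                for (s in count) print count[s], s }' | sort -rn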
[9 Sep 7:29]
kayukaran Parameswaran
Hi, I didn’t quite understand the reason. Could you please elaborate? We still haven’t restarted the MySQL server. I can attach the current SHOW PROCESSLIST output, but the application is not receiving any traffic.
[9 Sep 7:33]
MySQL Verification Team
Hello! Can you confirm that all servers have binlog_format=ROW? Are all replicated tables InnoDB? The reason for asking is that there's a known internal Bug 38084282 where the above two aren't true. -- Shane Bester, MySQL Senior Principal Technical Support Engineer Oracle Corporation http://dev.mysql.com/
[9 Sep 7:43]
kayukaran Parameswaran
Hi, binlog format is MIXED and all replicated tables are InnoDB.
[9 Sep 7:54]
MySQL Verification Team
This looks like an active/active setup, in which case internal Bug 38084282 *could* be to blame. I suggest binlog_format=ROW everywhere to see whether that solves it. It also means that if some tables don't have a PRIMARY KEY, you should create one. Can you test it?
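A sketch of both steps (statements assumed for illustration, not taken from this thread):

mysql> SET PERSIST binlog_format = 'ROW';   -- affects new sessions; survives restart
mysql> -- find base tables without a PRIMARY KEY:
mysql> SELECT t.table_schema, t.table_name
    ->   FROM information_schema.tables t
    ->   LEFT JOIN information_schema.table_constraints c
    ->     ON c.table_schema = t.table_schema
    ->    AND c.table_name = t.table_name
    ->    AND c.constraint_type = 'PRIMARY KEY'
    ->  WHERE t.table_type = 'BASE TABLE'
    ->    AND c.constraint_name IS NULL
    ->    AND t.table_schema NOT IN ('mysql','sys','information_schema','performance_schema');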
[9 Sep 8:13]
kayukaran Parameswaran
Hi, If we switch to the ROW binlog format, web references mention the following impacts; can you please confirm whether this is acceptable for production traffic? Higher disk usage for binlogs, higher network bandwidth for replication traffic, and possibly more disk I/O; on the other hand, fewer replication conflicts and inconsistencies. ROW makes conflicts easier to detect and reduces "silent drift," but does not eliminate logical conflicts entirely. I have also attached the replicated DB schema. Please note that one database was not replicated under that schema. Could this be related to the issue described in MySQL Bug #117407? Thanks, Regards,
[9 Sep 23:54]
MySQL Verification Team
Hi, The bug is verified and linked to our internal bug that is already being worked on. As for the questions about how the ROW format will affect your system, I suggest you contact our support team or check out the MySQL forums for support. Thank you for using MySQL
[9 Sep 23:55]
MySQL Verification Team
If you change to ROW and this does not help, please update the report.
[10 Sep 2:45]
kayukaran Parameswaran
Hi, Thank you for the information. I reviewed the stack trace, and most of the threads appear to be waiting or idle. Please find the details below.

Top stack signatures:

- clone3;start_thread;pfs_spawn_thread;handle_connection;do_command;dispatch_command;dispatch_sql_command;mysql_execute_command;trans_commit_stmt;ha_commit_trans;MYSQL_BIN_LOG::commit;MYSQL_BIN_LOG::ordered_commit;MYSQL_BIN_LOG::change_stage;Commit_stage_manager::enroll_for;inline_mysql_cond_wait;my_cond_wait;native_cond_wait;pthread_cond_wait@@GLIBC_2.3.2;__futex_abstimed_wait_common — 244 threads
- clone3;start_thread;pfs_spawn_thread;handle_connection;Per_thread_connection_handler::block_until_new_connection;inline_mysql_cond_wait;my_cond_wait;native_cond_wait;pthread_cond_wait@@GLIBC_2.3.2;__futex_abstimed_wait_common — 57 threads
- clone3;start_thread;pfs_spawn_thread;handle_connection;do_command;dispatch_command;dispatch_sql_command;mysql_execute_command;trans_commit;ha_commit_trans;MYSQL_BIN_LOG::commit;MYSQL_BIN_LOG::ordered_commit;MYSQL_BIN_LOG::change_stage;Commit_stage_manager::enroll_for;inline_mysql_cond_wait;my_cond_wait;native_cond_wait;pthread_cond_wait@@GLIBC_2.3.2;__futex_abstimed_wait_common — 16 threads
- clone3;start_thread;execute_native_thread_routine;std::thread::_State_impl;std::thread::_Invoker;std::thread::_Invoker;std::__invoke;std::__invoke_impl;Detached_thread::operator;std::invoke;std::__invoke;std::__invoke_impl;io_handler_thread;fil_aio_wait;os_aio_handler;os_aio_linux_handler;LinuxAIOHandler::poll;LinuxAIOHandler::collect;__io_getevents_0_4;syscall — 9 threads
- clone3;start_thread;pfs_spawn_thread;handle_connection;do_command;Protocol_classic::get_command;Protocol_classic::read_packet;my_net_read;net_read_uncompressed_packet;net_read_packet;net_read_packet_header;net_read_raw_loop;vio_ssl_read;vio_socket_io_wait;vio_io_wait;ppoll;ppoll — 8 threads
- clone3;start_thread;pfs_spawn_thread;handle_connection;do_command;dispatch_command;dispatch_sql_command;mysql_execute_command;lock_tables_open_and_lock_tables;MDL_context::upgrade_shared_lock;MDL_context::acquire_lock;MDL_wait::timed_wait;inline_mysql_cond_timedwait;my_cond_timedwait;native_cond_timedwait;pthread_cond_timedwait@@GLIBC_2.3.2;__futex_abstimed_wait_common — 4 threads
- clone3;start_thread;execute_native_thread_routine;Detached_thread::operator;std::invoke;std::__invoke;std::__invoke_impl;buf_flush_page_cleaner_thread;os_event_wait;os_event_wait_low;os_event::wait_low;os_event::wait;pthread_cond_wait@@GLIBC_2.3.2;__futex_abstimed_wait_common — 3 threads

Detected wait patterns:

- pthread_cond_wait — 330 threads
- __futex_abstimed_wait_common — 20 threads
- poll — 20 threads
- read — 6 threads
- MYSQL_BIN_LOG::ordered_commit — 3 threads

At that time, the SHOW PROCESSLIST output showed the following number of threads waiting for commit:

$ grep "waiting for handler commit" full_processlist_09092025.txt | wc -l
265

My concern is whether MySQL is unable to allocate more threads to handle ordered_commit. Could this be an issue with queries being handed over to the ordered_commit thread? From your earlier comments, I understand that you are working on a fix for this. However, do you think the suggested workaround is a reliable solution, or is it more of a trial-and-error approach? We already upgraded MySQL from 8.4.3 to 8.4.6 as suggested, but the issue persisted, and we had to revert our application traffic. Could you please recommend a stable, recent MySQL version that addresses this problem?
[10 Sep 7:44]
Simon Mudd
Hi kayukaran Parameswaran, I'm lurking as your bug report *may* be similar/related to one I've reported. I'm a heavy user of MySQL, and since you ask about using ROW format, a few comments from what I see in traditional / GR usage:

- ROW format works really well. There was some comment about master-master setups; I've not used that for a very long time, and if you use it I would suggest you move away.
- I use minimal RBR, which reduces binlog sizes considerably by sending, for each change, only the table/PK/columns changed. This is very efficient and works well.
- You can use compressed binlogs, and this can also save a huge amount of binlog space.

I'm not aware of any reason not to use minimal RBR (with binlog compression). The config change is trivial and the end effect is a much more performant replication setup. Clearly you'll need to test in your setup, but it's a good change to make and I think you'll find it worthwhile. I use this on all the systems I manage.
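For concreteness, the settings being described are along these lines (a sketch, not copied from any particular system; verify the defaults for your version):

binlog_format = ROW
binlog_row_image = MINIMAL            # log only the PK and changed columns per row
binlog_transaction_compression = ON   # zstd-compress transaction payloads in the binlog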
[10 Sep 8:20]
kayukaran Parameswaran
Hi Simon Mudd, Thank you for your valuable feedback. You mentioned that a similar bug had been reported. Could you please confirm if this refers to the MySQL server hanging while in the “waiting for handler commit” state? Additionally, I understand that the suggested workaround was to switch the binlog format to ROW — could you kindly share the bug reference details for further review? Furthermore, in our environment we have database replication configured, where data is periodically loaded from files (using LOAD DATA INFILE) and replicated from one database node to another. Could you please confirm whether using the ROW binlog format would have any negative impact on this type of replication? Thank you in advance for your clarification and guidance.
[10 Sep 9:13]
MySQL Verification Team
I suspected internal Bug 38084282 because it is only repeatable when binlog_format != ROW and the server also acts as both a master and a slave. Yes, the symptom is also a long 'waiting for handler commit'. So please try ROW binlog_format and see if it still hangs. regards, -- Shane Bester, MySQL Senior Principal Technical Support Engineer Oracle Corporation http://dev.mysql.com/
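A quick check of those two conditions on each server (a sketch; SHOW REPLICA STATUS returns rows only where the server is also acting as a replica):

mysql> SELECT @@global.binlog_format;
mysql> SHOW REPLICA STATUS\G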
[11 Sep 2:28]
kayukaran Parameswaran
Hi Shane Bester, Thank you for your feedback. We are using MySQL master–master replication and frequently use LOAD DATA INFILE to load data into tables, with those records replicated to the other database node. I assume changing the binlog format to ROW will not impact MySQL performance in this setup. Thanks, Regards,
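One caveat worth noting (an aside, not from the thread): under ROW format, LOAD DATA INFILE is logged as row events rather than as the statement plus its data file, so binlog volume grows roughly with the amount of data loaded. A way to gauge this on an existing ROW-format binlog (the file name here is illustrative):

$ mysqlbinlog --base64-output=decode-rows -vv mysql-bin.000153 | grep -c 'Write_rows'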
[15 Sep 3:48]
kayukaran Parameswaran
When are you planning to release MySQL 8.4.7, and will that release contain the above bug fix?
[17 Sep 8:09]
kayukaran Parameswaran
Hi All, As suggested, we changed the MySQL binlog format from MIXED to ROW. Under the same application load, performance held up for more than an hour, whereas with the MIXED binlog format the MySQL server would hang within 10 minutes and show the status "waiting for handler commit." This is a significant improvement. I have attached the pstack output and the I/O output from the server. Could you please provide a solution for this issue?
[18 Sep 5:40]
kayukaran Parameswaran
Hi All, Can you please update us on the way forward for this issue? Thanks, Regards,
[22 Sep 3:27]
kayukaran Parameswaran
Hi All, Can you please update us on the way forward for this issue? We are stuck on this. Thanks, Regards,