Bug #118842 Transactions get stuck in stage/sql/waiting for handler commit until mysqld restart (MySQL 8.4.3), heavy concurrent DML
Submitted: 14 Aug 8:50 Modified: 19 Aug 0:05
Reporter: kayukaran Parameswaran Email Updates:
Status: Need Feedback Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S1 (Critical)
Version:8.4.3 OS:Red Hat (9.4)
Assigned to: MySQL Verification Team CPU Architecture:x86
Tags: 8.4.3, binlog group commit, commit stall, innodb, replication, waiting for handler commit

[14 Aug 8:50] kayukaran Parameswaran
Description:
On MySQL 8.4.3 we observe periodic stalls where many client sessions hang in the "waiting for handler commit" stage for thousands of seconds, and application connections start aborting ("Got an error reading communication packets"). The condition persists until the MySQL server is restarted, after which the same workload proceeds normally.

Key observations during an incident:

Dozens/hundreds of threads show State = waiting for handler commit on short transactions that do single-row UPDATE/DELETE/INSERT against an InnoDB table radius_auth.re_auth_ctx (partitioned by HASH on the PK). Example processlist snippet (trimmed):

| 31098 | sdpuser | ... | radius_auth | Query | 7651 | waiting for handler commit | update re_auth_ctx set next_re_auth_id=... |
| 31100 | sdpuser | ... | radius_auth | Query | 7651 | waiting for handler commit | update re_auth_ctx set next_re_auth_id=... |
...
Server error log during the stall shows repeated aborted connections from the app while threads are stuck in commit:

[Note] [MY-010914] Aborted connection <id> to db: 'radius_auth' user: 'sdpuser' host: '192.168.1.x' (Got an error reading communication packets).
Replication is configured on this MySQL 8.4 instance (async). SHOW PROCESSLIST shows the source thread idle (“Source has sent all binlog to replica; waiting for more updates”). We do not use Group Replication in this environment.

Workload: very high QPS of short transactions each doing a single PK UPDATE/INSERT/DELETE in re_auth_ctx (RADIUS re-auth context store). Multiple app nodes (sdpuser from 192.168.1.2/.3).

Relevant my.cnf items (8.4):

innodb_flush_log_at_trx_commit=1
innodb_doublewrite=0
innodb_flush_method=O_DIRECT
innodb_log_file_size=2G
innodb_buffer_pool_size=8G
max_connections=500
binlog enabled (async replication; standard settings)

How to repeat:
We could not reproduce this issue on a non-production environment.

Suggested fix:
Unknown root cause. Based on symptoms:

Threads are stuck at the storage engine commit stage (waiting for handler commit) rather than row-lock waits or redo fsync backlog.

Replication appears idle and does not unblock the latch; restart clears internal commit state.
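To confirm the stall point during the next occurrence, a diagnostic query like the following could help (a sketch, assuming performance_schema is enabled with default 8.4 instrumentation; TIMER_WAIT is reported in picoseconds):

```sql
-- List sessions currently stuck in the commit stage, longest waiter first
SELECT t.processlist_id,
       t.processlist_time,
       esc.event_name,
       esc.timer_wait / 1e12 AS stage_wait_seconds
FROM performance_schema.threads t
JOIN performance_schema.events_stages_current esc
  ON esc.thread_id = t.thread_id
WHERE esc.event_name = 'stage/sql/waiting for handler commit'
ORDER BY esc.timer_wait DESC;
```

If the stage wait times all cluster around the same value, that would support the theory that one leader thread is blocking the whole group commit queue.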
[14 Aug 9:02] kayukaran Parameswaran
We could not reproduce this issue on a non-production environment.
[15 Aug 3:08] kayukaran Parameswaran
Could we get an update on this? What is the way forward?
[18 Aug 10:01] MySQL Verification Team
Hello, 

Could you please clarify whether you are running this on bare metal or within a virtual machine (VM)? Without the ability to reproduce the issue, our options for assistance are limited. 

Based on the information provided, this appears to be a scaling or configuration issue rather than a bug. In such cases, MySQL Support would be the most appropriate channel for further assistance. 

Thank you for using MySQL.
[18 Aug 10:20] kayukaran Parameswaran
Hello,

Thank you for your feedback.

This MySQL instance is running inside an OpenStack VM (not bare metal). We do not have any evidence that this issue is related to scaling or a configuration mismatch. The stall occurs only occasionally, and once it happens, all active sessions remain stuck in waiting for handler commit until we restart mysqld. After the restart, it works without any changes to configuration or environment.

This behavior suggests that the server enters a hung state internally, rather than this being purely a configuration issue.

Could this be related to the issue described in MySQL Bug #117407?

Thanks,
Regards,
[18 Aug 10:31] MySQL Verification Team
Hi,

There are some communication issues solved in 8.4.6, so you should upgrade to the latest version and test.

Some of these "stuck" issues happen only on VMs and are hard to reproduce, as they are often triggered by poor network quality on cloud providers.

Please upgrade to 8.4.6 and test if that will solve the problem.
[18 Aug 11:03] kayukaran Parameswaran
Hello,

Thank you for your immediate feedback. We experienced MySQL getting stuck while flushing the binlog file to disk. During that time, replication continued and MySQL clients were able to connect to the server, but they could not perform DELETE/UPDATE/INSERT operations. Only SELECT queries worked.

Given this behavior, we are unsure how it could be related to network quality on the cloud provider. We also verified disk I/O at the time and it appeared normal. Unfortunately, we could not determine the exact root cause of the hang.
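When only SELECTs work and all DML queues behind the binlog flush, a useful companion check is whether sessions are piling up on the binary log group-commit locks. A sketch of such a query, assuming the standard 8.x wait/synch instruments are enabled:

```sql
-- Count waiters on binlog-related mutexes/conditions during the stall;
-- e.g. wait/synch/mutex/sql/MYSQL_BIN_LOG::LOCK_commit showing many
-- waiters would point at the group-commit path rather than InnoDB redo
SELECT event_name, COUNT(*) AS waiters
FROM performance_schema.events_waits_current
WHERE event_name LIKE 'wait/synch/%MYSQL_BIN_LOG%'
GROUP BY event_name
ORDER BY waiters DESC;
```

Capturing this output (together with SHOW ENGINE INNODB STATUS) during a live incident would give the verification team concrete evidence of where the commit pipeline is blocked.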

Thanks,
Regards,
[19 Aug 0:05] MySQL Verification Team
Hi,
As a number of these types of issues are visible only on VMs and never on bare metal, it is only a guess. Without the ability to reliably reproduce, I can't say anything for sure.

A number of bug fixes exist that could solve this, so please upgrade and let us know if you can reproduce it with the latest MySQL.

Thanks