Bug #118842 Transactions get stuck in stage/sql/waiting for handler commit until mysqld restart (MySQL 8.4.3), heavy concurrent DML
Submitted: 14 Aug 8:50 Modified: 19 Aug 0:05
Reporter: kayukaran Parameswaran Email Updates:
Status: Need Feedback Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S1 (Critical)
Version:8.4.3 OS:Red Hat (9.4)
Assigned to: MySQL Verification Team CPU Architecture:x86
Tags: 8.4.3, binlog group commit, commit stall, innodb, replication, waiting for handler commit

[14 Aug 8:50] kayukaran Parameswaran
Description:
On MySQL 8.4.3 we observe periodic stalls where many client sessions hang in the "waiting for handler commit" stage for thousands of seconds, and application connections start aborting ("Got an error reading communication packets"). The condition persists until the MySQL server is restarted, after which the same workload proceeds normally.

Key observations during an incident:

Dozens/hundreds of threads show State = waiting for handler commit on short transactions that do single-row UPDATE/DELETE/INSERT against an InnoDB table radius_auth.re_auth_ctx (partitioned by HASH on the PK). Example processlist snippet (trimmed):

| 31098 | sdpuser | ... | radius_auth | Query | 7651 | waiting for handler commit | update re_auth_ctx set next_re_auth_id=... |
| 31100 | sdpuser | ... | radius_auth | Query | 7651 | waiting for handler commit | update re_auth_ctx set next_re_auth_id=... |
...
Server error log during the stall shows repeated aborted connections from the app while threads are stuck in commit:

[Note] [MY-010914] Aborted connection <id> to db: 'radius_auth' user: 'sdpuser' host: '192.168.1.x' (Got an error reading communication packets).
Replication is configured on this MySQL 8.4 instance (async). SHOW PROCESSLIST shows the source thread idle (“Source has sent all binlog to replica; waiting for more updates”). We do not use Group Replication in this environment.

Workload: very high QPS of short transactions each doing a single PK UPDATE/INSERT/DELETE in re_auth_ctx (RADIUS re-auth context store). Multiple app nodes (sdpuser from 192.168.1.2/.3).

Relevant my.cnf items (8.4):

innodb_flush_log_at_trx_commit=1
innodb_doublewrite=0
innodb_flush_method=O_DIRECT
innodb_log_file_size=2G
innodb_buffer_pool_size=8G
max_connections=500
binlog enabled (async replication; standard settings)

How to repeat:
We could not reproduce this issue on a non-production environment.

Suggested fix:
Unknown root cause. Based on symptoms:

Threads are stuck at the storage engine commit stage (waiting for handler commit) rather than row-lock waits or redo fsync backlog.

Replication appears idle and does not unblock the latch; restart clears internal commit state.
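To confirm the stall point during the next occurrence, a diagnostic query like the following could help (a sketch, assuming performance_schema is enabled with default 8.4 instrumentation; TIMER_WAIT is reported in picoseconds):

```sql
-- List sessions currently stuck in the commit stage, longest waiter first
SELECT t.processlist_id,
       t.processlist_time,
       esc.event_name,
       esc.timer_wait / 1e12 AS stage_wait_seconds
FROM performance_schema.threads t
JOIN performance_schema.events_stages_current esc
  ON esc.thread_id = t.thread_id
WHERE esc.event_name = 'stage/sql/waiting for handler commit'
ORDER BY esc.timer_wait DESC;
```

If the stage wait times all cluster around the same value, that would support the theory that one leader thread is blocking the whole group commit queue.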
[14 Aug 9:02] kayukaran Parameswaran
We could not reproduce this issue on a non-production environment.
[15 Aug 3:08] kayukaran Parameswaran
Could we get an update on this? What is the way forward?
[18 Aug 10:01] MySQL Verification Team
Hello, 

Could you please clarify whether you are running this on bare metal or within a virtual machine (VM)? Without the ability to reproduce the issue, our options for assistance are limited. 

Based on the information provided, this appears to be a scaling or configuration issue rather than a bug. In such cases, MySQL Support would be the most appropriate channel for further assistance. 

Thank you for using MySQL.
[18 Aug 10:20] kayukaran Parameswaran
Hello,

Thank you for your feedback.

This MySQL instance is running inside an OpenStack VM (not bare metal). We do not have any evidence that this issue is related to scaling or a configuration mismatch. The stall occurs only occasionally, and once it happens, all active sessions remain stuck in waiting for handler commit until we restart mysqld. After the restart, it works without any changes to configuration or environment.

This behavior suggests that the server enters a hung state internally, rather than this being purely a configuration issue.

Could this be related to the issue described in MySQL Bug #117407?

Thanks,
Regards,
[18 Aug 10:31] MySQL Verification Team
Hi,

There are some communication issues solved in 8.4.6, so you should upgrade to the latest version and test.

Some of these "stuck" issues happen only on VMs and are hard to reproduce, as they are often triggered by poor network quality on cloud providers.

Please upgrade to 8.4.6 and test if that will solve the problem.
[18 Aug 11:03] kayukaran Parameswaran
Hello,

Thank you for your immediate feedback. We experienced MySQL getting stuck while flushing the binlog file to disk. During that time, replication continued and MySQL clients were able to connect to the server, but they could not perform DELETE/UPDATE/INSERT operations. Only SELECT queries worked.

Given this behavior, we are unsure how it could be related to network quality on the cloud provider. We also verified disk I/O at the time and it appeared normal. Unfortunately, we could not determine the exact root cause of the hang.
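When only SELECTs work and all DML queues behind the binlog flush, a useful companion check is whether sessions are piling up on the binary log group-commit locks. A sketch of such a query, assuming the standard 8.x wait/synch instruments are enabled:

```sql
-- Count waiters on binlog-related mutexes/conditions during the stall;
-- e.g. wait/synch/mutex/sql/MYSQL_BIN_LOG::LOCK_commit showing many
-- waiters would point at the group-commit path rather than InnoDB redo
SELECT event_name, COUNT(*) AS waiters
FROM performance_schema.events_waits_current
WHERE event_name LIKE 'wait/synch/%MYSQL_BIN_LOG%'
GROUP BY event_name
ORDER BY waiters DESC;
```

Capturing this output (together with SHOW ENGINE INNODB STATUS) during a live incident would give the verification team concrete evidence of where the commit pipeline is blocked.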

Thanks,
Regards,
[19 Aug 0:05] MySQL Verification Team
Hi,
As a number of these types of issues are visible only on VMs and never on bare metal, it is only a guess. Without the ability to reliably reproduce, I can't say anything for sure.

A number of bug fixes exist that could solve this, so please upgrade and let us know if you can reproduce it with the latest MySQL.

Thanks