MySQL Bugs: #116813: The transaction commit takes a long time even if it doesn't affect any rows

Bug #116813	The transaction commit takes a long time even if it doesn't affect any rows
Submitted:	28 Nov 2024 2:57	Modified:	11 Jul 2:28
Reporter:	Ruyi Zhang (OCA)	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S5 (Performance)
Version:	8.0.40, 8.0.42	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
We recently found that if a transaction scans a large number of rows under the Read Committed (RC) transaction isolation level, even if the transaction itself doesn't make any modifications, it will still take a significant amount of time when the transaction is committed. 

This causes the commits of other transactions to be blocked, and the Transactions Per Second (TPS) of MySQL drops to zero in a short period of time.

How to repeat:
innodb_buffer_pool_size=42949672960 
transaction_isolation=READ-COMMITTED 
long_query_time=0

mysql> show create table a;
+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table                                                                                                                                                                                       |
+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| a     | CREATE TABLE `a` (
  `id` int NOT NULL AUTO_INCREMENT,
  `val` int DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1471064131 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci |
+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.01 sec)

mysql> select count(1) from a;
+------------+
| count(1)   |
+------------+
| 1073741824 |
+------------+
1 row in set (21.49 sec)

mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> delete from a where val=-1;
Query OK, 0 rows affected (21 min 51.39 sec)

mysql> commit;
Query OK, 0 rows affected (0.34 sec)

In the slow log:
# Time: 2024-11-28T02:10:51.823738Z
# User@Host: root[root] @ localhost []  Id: 289344
# Query_time: 0.000137  Lock_time: 0.000000 Rows_sent: 0  Rows_examined: 0
SET timestamp=1732759851;
begin;
# Time: 2024-11-28T02:32:48.404241Z
# User@Host: root[root] @ localhost []  Id: 289344
# Query_time: 1311.384721  Lock_time: 0.000004 Rows_sent: 0  Rows_examined: 1073741824
SET timestamp=1732759857;
delete from a where val=-1;
# Time: 2024-11-28T02:32:53.921264Z
# User@Host: root[root] @ localhost []  Id: 289344
# Query_time: 0.348172  Lock_time: 0.000000 Rows_sent: 0  Rows_examined: 0
SET timestamp=1732761173;
commit;

Even though the DELETE statement didn't make any modifications, the transaction commit took 348 milliseconds. 

This duration is positively correlated with the number of scanned rows. 

In the scenario we encountered, the transaction commit took more than 40 seconds.

Suggested fix:
Under the Read Committed (RC) isolation level, MySQL will only lock the records that meet the conditions. 

However, it seems that MySQL creates a large number of row lock structures during the scanning process.

Hello Ruyi Zhang!
 
Indeed, InnoDB avoids deallocating lock_t structs allocated for a transaction until it commits. Instead, in case of LOCK_REC locks, to release the lock request for a given record, it merely clears the bit corresponding to the heap_no of the record from the bitmap attached to the lock. The reason lock structs are kept around is to avoid various problems with dangling pointers, ABA, and data-races. In particular there are places in code which obtain a pointer to the lock_t struct while holding required mutexes, then release the mutexes, and assume the pointer is still valid (for example: we still can take a look at its type and mode fields), as long as the transaction is not committed.

In the scenario that you have provided a lock is initially acquired by InnoDB on a row to gain the right to look at its value, then, once the Server layer determines that `val=-1` condition doesn't hold, the Server asks InnoDB to release the lock most recently acquired by this transaction, which is done by clearing the bit only.

One might ponder: 
Could InnoDB, once the bitmap is zeroed, keep the lock allocated (leave the memory allocated from trx->lock.lock_heap intact), but remove them from trx->lock.trx_locks, so that they don't have to be iterated over in try_release_all_locks()?
Effectively it means the lock struct would be "leaked", and thus one should ask: 
Can it lead to running out of memory? 
The good news is that the whole trx->lock.lock_heap is freed at once in lock_trx_release_locks(trx), when the transaction commits, so no permanent leak is possible. 
However, the very bad news is: you'd run out of memory *during* the transaction. 

In particular, let's ponder the reason behind having a bitmap associated with the lock_t struct in the first place: during a scan it's typical to request same type of a lock (say, LOCK_REC|LOCK_REC_NOT_GAP|LOCK_X) on many rows within same page. Instead of allocating new struct for each of them, we allocate one lock_t struct instance per "type+page_no" combination, attaching a bitmap to it, where we merely mark which records from that page the struct is concerned about (each record on a page has a small heap_no identifier which we use as offset in the bitmap).
So, in scenario like yours, where there are thousands of records per page, we allocate one struct, and one bitmap, then set/reset various bits in it. You can see it by placing a breakpoint in line
```
lock_rec_release(lock, heap_no);
``` 
of `lock_rec_unlock(..)` called from `row_prebuilt_t::try_unlock(..)` called from `ha_innobase::unlock_row(..)` during DELETE FROM a WHERE val=-1.
You'll see the same `lock` address over and over, but heap_no changing thousands of times, until you get to next leaf page.

If we instead "leaked" the lock whenever its bitmap is zeroed, we would have to allocate a new lock for each visited *row*, not page.
This would increase memory consumption - you seem to have 1G of rows, and lock_t struct is something around 96 bytes, so you'd need 96GB for lock structs.
This memory would be stolen by trx->lock.lock_heap from Buffer Pool, probably leading to eviction of real useful data pages causing increased I/O, or outright abort the transaction (or maybe even crash or stall) if your buffer pool is too small.
You seem to have 42GB Buffer Pool, so the DELETE wouldn't succeed.

So, I don't see an easy way to change the way these structs are allocated.

But, the part of the report which most concerns me, is the claim that 
> MySQL drops to zero in a short period of time
We've did all we could to make `try_release_all_locks` not only fast, but also not blocking other threads.
In particular you will see that not only it tries to latch just one of the 512 shards, it also monitors if some other thread is waiting for an x-latch on the whole system and gives it a priority.
Thus, if possible, I'd like to know more about the stall.
What is the thread doing at that moment?
What are the other threads waiting for?

Hello, Jakub Lopuszanski,

Thank you for your detailed explanation. Indeed, MySQL's Transactions Per Second (TPS) dropped to zero for a while in the scenario we encountered. I think this might be because:

1. Our transaction scanned two billion rows of data but only deleted eight thousand rows. Since the average row length is more than 1 kilobyte, the transaction thus created at least 200 million lock structures. And the time spent on cleaning up these lock structures when the transaction was committed was over 40 seconds.

2. Since we enabled the binary log (binlog), when this transaction was committed, it joined(or led) the commit group (binlog group commit). The large number of lock structures made it impossible for other transactions in the group to complete the commit for a long time as well.

I've tried to construct this scenario, and you can reproduce the scenario by following these steps:

Start the Sysbench load

/usr/bin/sysbench oltp_write_only --mysql-socket=/tmp/mysql.sock --mysql-user=root --mysql-db=test --tables=1 --table-size=1000 --report-interval=1 --percentile=99 --max-time=3000000 --threads=100 --rand-type=uniform prepare/run

Execute the special transaction

mysql> show create table a;
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| a | CREATE TABLE a (
id bigint NOT NULL AUTO_INCREMENT,
val varbinary(3000) DEFAULT NULL,
PRIMARY KEY (id)
) ENGINE=InnoDB AUTO_INCREMENT=67828711 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci |
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.01 sec)
insert into a values (null, repeat('a',3000)); // loop
mysql> select max(id) from a;
+----------+
| max(id) |
+----------+
| 67763687 |
+----------+
1 row in set (0.00 sec)
mysql> update a set val='b' where id=67763687;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> begin;
Query OK, 0 rows affected (0.00 sec)
mysql> delete from a where val='b';
Query OK, 1 row affected (6 min 8.69 sec)
mysql> commit;
Query OK, 0 rows affected (3.18 sec)

It can be observed that during the execution of the COMMIT, the TPS of the Sysbench load dropped to zero.

[ 456s ] thds: 100 tps: 35926.94 qps: 215490.63 (r/w/o: 0.00/143644.76/71845.88) lat (ms,99%): 12.52 err/s: 2.00 reconn/s: 0.00
[ 457s ] thds: 100 tps: 34449.96 qps: 206749.78 (r/w/o: 0.00/137834.85/68914.93) lat (ms,99%): 13.70 err/s: 7.00 reconn/s: 0.00
[ 458s ] thds: 100 tps: 19493.31 qps: 116943.86 (r/w/o: 0.00/77951.24/38992.62) lat (ms,99%): 9.39 err/s: 4.00 reconn/s: 0.00
[ 459s ] thds: 100 tps: 6.00 qps: 152.00 (r/w/o: 0.00/139.00/13.00) lat (ms,99%): 1376.60 err/s: 1.00 reconn/s: 0.00
[ 460s ] thds: 100 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,99%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 461s ] thds: 100 tps: 6845.85 qps: 41017.12 (r/w/o: 0.00/27327.42/13689.71) lat (ms,99%): 3267.19 err/s: 1.00 reconn/s: 0.00
[ 462s ] thds: 100 tps: 36843.18 qps: 221055.08 (r/w/o: 0.00/147363.72/73691.36) lat (ms,99%): 11.87 err/s: 3.00 reconn/s: 0.00
[ 463s ] thds: 100 tps: 33417.34 qps: 200541.05 (r/w/o: 0.00/133699.36/66841.68) lat (ms,99%): 10.84 err/s: 7.00 reconn/s: 0.00

Also, you mentioned that directly removing the lock structure from trx_locks during the try_unlock phase might lead to a dangling pointer problem. 

I analyzed the references to lock_sys->rec_hash and trx_locks, and it seems that they are both accessed under the protection of the relevant lock, and the rec - lock is not further cached after the lock is released. 

Does this mean that it's possible to move the lock structures that will no longer be valid from trx_locks into a cache during try_unlock and reuse them during alloc_lock?

Hello Zhang Ruyi,

Thank you for the report and test case.
My sincere apologies, I missed your last update. 

regards,
Umesh

8.0.42 test results

Attachment: 116813_8.0.42.results (application/octet-stream, text), 95.46 KiB.

Hello, here are some of our subsequent findings. This issue is only triggered when `binlog_order_commits=ON`. The reason is that when it is set to ON, `MYSQL_BIN_LOG::ordered_commit` holds the `LOCK_commit` lock during the `commit_stage`. If the `commit_stage` needs to handle a large number of lock releases, the submission of subsequent transactions will be blocked for a long time.