Description:
The current InnoDB B+Tree index insert path has three major performance bottlenecks:
1. Repeated B+Tree descent
A pessimistic insert may go through the B+Tree descent path multiple times:
1. Optimistic insert attempt:
btr_cur_search_to_nth_level(BTR_MODIFY_LEAF) -> btr_cur_optimistic_insert
2. Pessimistic insert with page split:
btr_cur_search_to_nth_level(BTR_MODIFY_TREE) -> btr_cur_optimistic_insert -> btr_cur_pessimistic_insert
Each path requires btr_cur_search_to_nth_level() to descend the B+Tree again.
2. BTR_MODIFY_TREE serializes SMO through SX(dict_index_t::lock)
The pessimistic insert path acquires SX(dict_index_t::lock), which means all SMO operations are effectively serialized.
3. The pessimistic BTR_MODIFY_TREE phase holds X latches on the affected subtree
MySQL introduced SX(dict_index_t::lock) in 8.0. Although it is compatible with the S(dict_index_t::lock) held by readers and optimistic writers, and therefore improves concurrency between SMO and other operations to some extent, the subtree involved in the pessimistic BTR_MODIFY_TREE phase is still held with X latches until the SMO completes.
During this period, no other read or write operation can enter that subtree. This becomes even worse when cascading SMO reaches higher levels of the tree.
How to repeat:
1. The second bottleneck can be reproduced with a high-concurrency insert workload that triggers frequent B+Tree page splits.
For example, prepare a table so that many leaf pages are close to the split boundary, then run multiple concurrent insert threads with randomized keys. In this workload, many sessions enter the pessimistic insert path and contend on SX(dict_index_t::lock), which serializes SMO operations.
2. The third bottleneck can be reproduced by extending the above workload with concurrent read requests.
While concurrent inserts are triggering frequent cascading splits, run read queries against the same index ranges. These reads may be blocked by X-latched subtrees held during the BTR_MODIFY_TREE pessimistic phase, especially when the split cascades to higher internal levels.
Suggested fix:
To optimize these issues, in principle we need:
1. A new B+Tree descent, ascent, and page-latching protocol, so that after an optimistic insert fails, we do not need to release the current leaf page latch and restart from the root. Instead, we should be able to start the pessimistic insert process in place.
2. Pessimistic insert should no longer hold SX(dict_index_t::lock). It should only need S(dict_index_t::lock), so that SMO operations can run in parallel.
3. SMO should not pre-lock the whole affected subtree. The latch granularity should be reduced. The split should proceed bottom-up, one level at a time: latch, split, and release. This allows the subtree being modified by SMO to remain readable and writable as much as possible.
Combining these three points, we can borrow ideas from B-link Trees:
https://www.csd.uoc.gr/~hy460/pdf/p650-lehman.pdf
and perform a more complete redesign.
------
--- High-Level Design ---
## 1. Introduce a new B-link-style descent path: blink_search_to_nth_level. The new search path works as follows:
1. Acquire S(dict_index_t::lock).
2. Starting from the root page, acquire an S latch on the current page, get the child page number, release the current page latch, then descend and acquire an S latch on the child page. Repeat this until reaching the level immediately above the target level. During descent, record the page numbers of the internal pages in path[].
3. After obtaining the target page number at the target level, acquire either an S latch or an X latch on the target page depending on whether the operation is a read or write.
There is no longer a fundamental distinction between BTR_MODIFY_LEAF and BTR_MODIFY_TREE in the descent protocol. The main difference is only the target level.
After descent completes, the operation holds the latch on the target page, and also has the page numbers of all internal pages on the path. These can later be used to optimistically locate the parent page if needed.
## 2. Latching-order protocol
1. During descent, latch coupling is not required. We latch one level at a time: parent -> child page number -> release parent -> latch child.
2. During ascent, latch coupling is also not required.
3. When moving right on the same level through the B-link high key, latch coupling is required. The protocol is current -> right: acquire the right page latch while still holding the current page latch, then release the current page latch.
4. Pages are split only to the right, which fits the B-link Tree right-move logic.
## 3. Full insert flow
1. Use blink_search_to_nth_level() to descend to the target leaf page L, acquire an X latch on L, and return the internal-page path path[].
2. Try an optimistic insert. If it succeeds, the insert is complete and returns.
3. If the optimistic insert fails, continue holding the X latch on L, acquire an X latch on L’s right sibling R, allocate a new page N, move part of L’s records to N, and insert N between L and R.
At the same time, set the INCOMPLETE_SPLIT flag on L, and set L’s high key.
The semantics of INCOMPLETE_SPLIT are:
- concurrent readers and optimistic writers can still enter L;
- they can use the high key to decide whether they need to move right to continue the operation;
- concurrent pessimistic writers are blocked by this flag and return retry, waiting for the previous unfinished cascade to make progress.
After the split is complete, obtain the node_ptr for N, then release all latches on L, N, and R.
4. Use path[] to optimistically acquire an X latch on the parent page P. If P has changed, descend again to locate the correct P and acquire its X latch.
If P has enough space for an optimistic insert, insert the node_ptr, then reacquire the latch on L, clear L’s INCOMPLETE_SPLIT flag, and return success.
5. If P cannot accept the node_ptr, perform step 3 at P’s level and split P. Then reacquire the X latch on L and clear L’s INCOMPLETE_SPLIT flag.
6. Perform step 4 at P’s level and insert the node_ptr of P’s new right page into P’s parent.
7. Continue this process upward until the cascading split completes.
With this design, we can effectively address the three bottlenecks described above.
I implemented a PoC on MySQL 9.7.0. In a carefully constructed workload with 32 concurrent insert threads that triggers a large number of concurrent page splits, almost every insert event triggers a leaf split, the comparison is:
| Version | TPS | Avg latency | P95 latency |
| ----------------- | --------- | ----------- | ----------- |
| MySQL 9.7.0 | 5,666.22 | 5.65 ms | 17.01 ms |
| Optimized version | 91,523.92 | 0.35 ms | 0.40 ms |
This is a 16.2x improvement in TPS, a 16.1x reduction in average latency, and a 42.5x reduction in P95 latency.
I also collected perf samples on the optimized version. The hotspot has shifted from index-lock contention to redo log commit wait. In particular, log_wait_for_write accounts for about 40% of the samples.
This indicates that SMO concurrency inside the B-link-style protocol is no longer the bottleneck.
The improvement is significant, so I believe this proposal is worth further investigation.
------
--- InnoDB Low-Level Design ---
Due to the length limit of this field, the full InnoDB low-level design is available here:
https://kernelmaker.github.io/MySQL-proposal-1
This proposal is large. It involves major changes to core InnoDB modules and is challenging. Also, this is only the split part. Merge support will definitely be needed later. I plan to put the merge design into a separate follow-up proposal, or as an extension of this proposal.
However, the PoC at least proves the feasibility of this direction and shows significant potential benefits.
If the proposal is accepted, the actual development can be designed, refined, and iterated in phases.
I am also willing to continue participating in the follow-up work and contribute to the implementation.
Description: The current InnoDB B+Tree index insert path has three major performance bottlenecks: 1. Repeated B+Tree descent A pessimistic insert may go through the B+Tree descent path multiple times: 1. Optimistic insert attempt: btr_cur_search_to_nth_level(BTR_MODIFY_LEAF) -> btr_cur_optimistic_insert 2. Pessimistic insert with page split: btr_cur_search_to_nth_level(BTR_MODIFY_TREE) -> btr_cur_optimistic_insert -> btr_cur_pessimistic_insert Each path requires btr_cur_search_to_nth_level() to descend the B+Tree again. 2. BTR_MODIFY_TREE serializes SMO through SX(dict_index_t::lock) The pessimistic insert path acquires SX(dict_index_t::lock), which means all SMO operations are effectively serialized. 3. The pessimistic BTR_MODIFY_TREE phase holds X latches on the affected subtree MySQL introduced SX(dict_index_t::lock) in 8.0. Although it is compatible with the S(dict_index_t::lock) held by readers and optimistic writers, and therefore improves concurrency between SMO and other operations to some extent, the subtree involved in the pessimistic BTR_MODIFY_TREE phase is still held with X latches until the SMO completes. During this period, no other read or write operation can enter that subtree. This becomes even worse when cascading SMO reaches higher levels of the tree. How to repeat: 1. The second bottleneck can be reproduced with a high-concurrency insert workload that triggers frequent B+Tree page splits. For example, prepare a table so that many leaf pages are close to the split boundary, then run multiple concurrent insert threads with randomized keys. In this workload, many sessions enter the pessimistic insert path and contend on SX(dict_index_t::lock), which serializes SMO operations. 2. The third bottleneck can be reproduced by extending the above workload with concurrent read requests. While concurrent inserts are triggering frequent cascading splits, run read queries against the same index ranges. These reads may be blocked by X-latched subtrees held during the BTR_MODIFY_TREE pessimistic phase, especially when the split cascades to higher internal levels. Suggested fix: To optimize these issues, in principle we need: 1. A new B+Tree descent, ascent, and page-latching protocol, so that after an optimistic insert fails, we do not need to release the current leaf page latch and restart from the root. Instead, we should be able to start the pessimistic insert process in place. 2. Pessimistic insert should no longer hold SX(dict_index_t::lock). It should only need S(dict_index_t::lock), so that SMO operations can run in parallel. 3. SMO should not pre-lock the whole affected subtree. The latch granularity should be reduced. The split should proceed bottom-up, one level at a time: latch, split, and release. This allows the subtree being modified by SMO to remain readable and writable as much as possible. Combining these three points, we can borrow ideas from B-link Trees: https://www.csd.uoc.gr/~hy460/pdf/p650-lehman.pdf and perform a more complete redesign. ------ --- High-Level Design --- ## 1. Introduce a new B-link-style descent path: blink_search_to_nth_level. The new search path works as follows: 1. Acquire S(dict_index_t::lock). 2. Starting from the root page, acquire an S latch on the current page, get the child page number, release the current page latch, then descend and acquire an S latch on the child page. Repeat this until reaching the level immediately above the target level. During descent, record the page numbers of the internal pages in path[]. 3. After obtaining the target page number at the target level, acquire either an S latch or an X latch on the target page depending on whether the operation is a read or write. There is no longer a fundamental distinction between BTR_MODIFY_LEAF and BTR_MODIFY_TREE in the descent protocol. The main difference is only the target level. After descent completes, the operation holds the latch on the target page, and also has the page numbers of all internal pages on the path. These can later be used to optimistically locate the parent page if needed. ## 2. Latching-order protocol 1. During descent, latch coupling is not required. We latch one level at a time: parent -> child page number -> release parent -> latch child. 2. During ascent, latch coupling is also not required. 3. When moving right on the same level through the B-link high key, latch coupling is required. The protocol is current -> right: acquire the right page latch while still holding the current page latch, then release the current page latch. 4. Pages are split only to the right, which fits the B-link Tree right-move logic. ## 3. Full insert flow 1. Use blink_search_to_nth_level() to descend to the target leaf page L, acquire an X latch on L, and return the internal-page path path[]. 2. Try an optimistic insert. If it succeeds, the insert is complete and returns. 3. If the optimistic insert fails, continue holding the X latch on L, acquire an X latch on L’s right sibling R, allocate a new page N, move part of L’s records to N, and insert N between L and R. At the same time, set the INCOMPLETE_SPLIT flag on L, and set L’s high key. The semantics of INCOMPLETE_SPLIT are: - concurrent readers and optimistic writers can still enter L; - they can use the high key to decide whether they need to move right to continue the operation; - concurrent pessimistic writers are blocked by this flag and return retry, waiting for the previous unfinished cascade to make progress. After the split is complete, obtain the node_ptr for N, then release all latches on L, N, and R. 4. Use path[] to optimistically acquire an X latch on the parent page P. If P has changed, descend again to locate the correct P and acquire its X latch. If P has enough space for an optimistic insert, insert the node_ptr, then reacquire the latch on L, clear L’s INCOMPLETE_SPLIT flag, and return success. 5. If P cannot accept the node_ptr, perform step 3 at P’s level and split P. Then reacquire the X latch on L and clear L’s INCOMPLETE_SPLIT flag. 6. Perform step 4 at P’s level and insert the node_ptr of P’s new right page into P’s parent. 7. Continue this process upward until the cascading split completes. With this design, we can effectively address the three bottlenecks described above. I implemented a PoC on MySQL 9.7.0. In a carefully constructed workload with 32 concurrent insert threads that triggers a large number of concurrent page splits, almost every insert event triggers a leaf split, the comparison is: | Version | TPS | Avg latency | P95 latency | | ----------------- | --------- | ----------- | ----------- | | MySQL 9.7.0 | 5,666.22 | 5.65 ms | 17.01 ms | | Optimized version | 91,523.92 | 0.35 ms | 0.40 ms | This is a 16.2x improvement in TPS, a 16.1x reduction in average latency, and a 42.5x reduction in P95 latency. I also collected perf samples on the optimized version. The hotspot has shifted from index-lock contention to redo log commit wait. In particular, log_wait_for_write accounts for about 40% of the samples. This indicates that SMO concurrency inside the B-link-style protocol is no longer the bottleneck. The improvement is significant, so I believe this proposal is worth further investigation. ------ --- InnoDB Low-Level Design --- Due to the length limit of this field, the full InnoDB low-level design is available here: https://kernelmaker.github.io/MySQL-proposal-1 This proposal is large. It involves major changes to core InnoDB modules and is challenging. Also, this is only the split part. Merge support will definitely be needed later. I plan to put the merge design into a separate follow-up proposal, or as an extension of this proposal. However, the PoC at least proves the feasibility of this direction and shows significant potential benefits. If the proposal is accepted, the actual development can be designed, refined, and iterated in phases. I am also willing to continue participating in the follow-up work and contribute to the implementation.