MySQL Bugs: #120434: InnoDB B+Tree Performance Optimization, Part 1: Insert Path Improvements and Concurrent Split Handling

Bug #120434	InnoDB B+Tree Performance Optimization, Part 1: Insert Path Improvements and Concurrent Split Handling
Submitted:	10 May 21:26	Modified:	14 Jul 9:45
Reporter:	ZHAO SONG
Status:	Open
Category:	MySQL Server: InnoDB storage engine	Severity:	S4 (Feature request)
Version:	9.7.0	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	ROADMAP_CANDIDATE

Description:
The current InnoDB B+Tree index insert path has three major performance bottlenecks:

1. Repeated B+Tree descent

   A pessimistic insert may go through the B+Tree descent path multiple times:

   1. Optimistic insert attempt:
      btr_cur_search_to_nth_level(BTR_MODIFY_LEAF) -> btr_cur_optimistic_insert
   2. Pessimistic insert with page split:
      btr_cur_search_to_nth_level(BTR_MODIFY_TREE) -> btr_cur_optimistic_insert -> btr_cur_pessimistic_insert

   Each path requires btr_cur_search_to_nth_level() to descend the B+Tree again.

2. BTR_MODIFY_TREE serializes SMO through SX(dict_index_t::lock)

   The pessimistic insert path acquires SX(dict_index_t::lock), which means all SMO operations are effectively serialized.

3. The pessimistic BTR_MODIFY_TREE phase holds X latches on the affected subtree

   MySQL introduced SX(dict_index_t::lock) in 5.7. Although it is compatible with the S(dict_index_t::lock) held by readers and optimistic writers, and therefore improves concurrency between SMO and other operations to some extent, the pessimistic BTR_MODIFY_TREE phase still holds X latches on the affected subtree until the SMO completes.

   During this period, no other read or write operation can enter that subtree. This becomes even worse when cascading SMO reaches higher levels of the tree.

How to repeat:
1. The second bottleneck can be reproduced with a high-concurrency insert workload that triggers frequent B+Tree page splits.

   For example, prepare a table so that many leaf pages are close to the split boundary, then run multiple concurrent insert threads with randomized keys. In this workload, many sessions enter the pessimistic insert path and contend on SX(dict_index_t::lock), which serializes SMO operations.

2. The third bottleneck can be reproduced by extending the above workload with concurrent read requests.

   While concurrent inserts are triggering frequent cascading splits, run read queries against the same index ranges. These reads may be blocked by X-latched subtrees held during the BTR_MODIFY_TREE pessimistic phase, especially when the split cascades to higher internal levels.

Suggested fix:
To optimize these issues, in principle we need:

1. A new B+Tree descent, ascent, and page-latching protocol, so that after an optimistic insert fails, we do not need to release the current leaf page latch and restart from the root. Instead, we should be able to start the pessimistic insert process in place.
2. Pessimistic insert should no longer hold SX(dict_index_t::lock). It should only need S(dict_index_t::lock), so that SMO operations can run in parallel.
3. SMO should not pre-lock the whole affected subtree. The latch granularity should be reduced. The split should proceed bottom-up, one level at a time: latch, split, and release. This allows the subtree being modified by SMO to remain readable and writable as much as possible.

Combining these three points, we can borrow ideas from B-link Trees:

https://www.csd.uoc.gr/~hy460/pdf/p650-lehman.pdf

and perform a more complete redesign.

------
--- High-Level Design ---

## 1. Introduce a new B-link-style descent path: blink_search_to_nth_level

The new search path works as follows:

1. Acquire S(dict_index_t::lock).
2. Starting from the root page, acquire an S latch on the current page, get the child page number, release the current page latch, then descend and acquire an S latch on the child page. Repeat this until reaching the level immediately above the target level. During descent, record the page numbers of the internal pages in path[].
3. After obtaining the target page number at the target level, acquire either an S latch or an X latch on the target page depending on whether the operation is a read or write.

There is no longer a fundamental distinction between BTR_MODIFY_LEAF and BTR_MODIFY_TREE in the descent protocol. The main difference is only the target level.

After descent completes, the operation holds the latch on the target page, and also has the page numbers of all internal pages on the path. These can later be used to optimistically locate the parent page if needed.
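The descent in steps 1 to 3 can be sketched as a toy model in Python. This is only an illustration, not InnoDB code: the `Page` class, the `descend` function, and the in-memory `pages` dictionary are hypothetical, latches appear only as comments, and the move-right check is left out.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Page:
    page_no: int
    level: int                                     # 0 = leaf
    keys: list                                     # separators (internal) or user keys (leaf)
    children: list = field(default_factory=list)   # child page numbers, internal pages only


def descend(pages, root_no, key, target_level=0):
    """Return (target page number, path[] of internal page numbers visited)."""
    path = []
    page = pages[root_no]                  # an S latch on the current page would be taken here
    while page.level > target_level:
        path.append(page.page_no)
        child_no = page.children[bisect_right(page.keys, key)]
        # the current page latch would be released here, before the child is latched
        page = pages[child_no]
    # the target page would be latched in S for a read or X for a write
    return page.page_no, path


pages = {
    1: Page(1, 1, keys=[10], children=[2, 3]),
    2: Page(2, 0, keys=[1, 5]),
    3: Page(3, 0, keys=[10, 20]),
}
print(descend(pages, root_no=1, key=5))    # (2, [1])
```

The returned path is what a later parent lookup could start from, instead of starting again at the root.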

## 2. Latching-order protocol

1. During descent, latch coupling is not required. We latch one level at a time: parent -> child page number -> release parent -> latch child.
2. During ascent, latch coupling is also not required.
3. When moving right on the same level through the B-link high key, latch coupling is required. The protocol is current -> right: acquire the right page latch while still holding the current page latch, then release the current page latch.
4. Pages are split only to the right, which fits the B-link Tree right-move logic.
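Rule 3 is the only place where two page latches are held at once. A minimal sketch of that hand-over-hand step, assuming a plain `threading.Lock` stands in for a page latch (so there is no S/X distinction) and assuming a key equal to the high key belongs to the right sibling. `Page` and `move_right` are hypothetical names.

```python
import threading


class Page:
    def __init__(self, keys, high_key=None, right=None):
        self.latch = threading.Lock()
        self.keys = keys
        self.high_key = high_key   # None marks the rightmost page on the level
        self.right = right


def move_right(page, key):
    """Caller holds page.latch. Returns the page covering key, with its latch held."""
    while page.high_key is not None and key >= page.high_key:
        nxt = page.right
        nxt.latch.acquire()        # latch the right sibling while still holding current
        page.latch.release()       # only then release the current page
        page = nxt
    return page


c = Page([20, 25])
b = Page([10, 15], high_key=20, right=c)
a = Page([1, 5], high_key=10, right=b)

a.latch.acquire()
target = move_right(a, 25)         # ends on c; a and b are no longer latched
target.latch.release()
```

Because latches are always taken left to right on a level, two threads moving right cannot wait on each other in a cycle.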

## 3. Full insert flow

1. Use blink_search_to_nth_level() to descend to the target leaf page L, acquire an X latch on L, and return the internal-page path path[].

2. Try an optimistic insert. If it succeeds, the insert is complete and returns.

3. If the optimistic insert fails, continue holding the X latch on L, acquire an X latch on L’s right sibling R, allocate a new page N, move part of L’s records to N, and insert N between L and R.

   At the same time, set the INCOMPLETE_SPLIT flag on L, and set L’s high key.

   The semantics of INCOMPLETE_SPLIT are:

   - concurrent readers and optimistic writers can still enter L;
   - they can use the high key to decide whether they need to move right to continue the operation;
   - concurrent pessimistic writers are blocked by this flag and return retry, waiting for the previous unfinished cascade to make progress.

   After the split is complete, obtain the node_ptr for N, then release all latches on L, N, and R.

4. Use path[] to optimistically acquire an X latch on the parent page P. If P has changed, descend again to locate the correct P and acquire its X latch.

   If P has enough space for an optimistic insert, insert the node_ptr, then reacquire the latch on L, clear L’s INCOMPLETE_SPLIT flag, and return success.

5. If P cannot accept the node_ptr, perform step 3 at P’s level and split P. Then reacquire the X latch on L and clear L’s INCOMPLETE_SPLIT flag.

6. Perform step 4 at P’s level and insert the node_ptr of P’s new right page into P’s parent.

7. Continue this process upward until the cascading split completes.
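Steps 3 and 4 can be sketched as two small functions over an in-memory model. This is an illustration under stated assumptions, not the PoC code: `Leaf`, `split_right`, and `install_in_parent` are hypothetical names, the parent is modeled as two parallel lists, the split point is simply the middle of the page, and no latching or redo logging is shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Leaf:
    keys: list
    high_key: Optional[int] = None       # None = rightmost page
    right: Optional["Leaf"] = None
    incomplete_split: bool = False


def split_right(left):
    """Step 3: move the upper half of `left` into a new right sibling N."""
    mid = len(left.keys) // 2
    new = Leaf(keys=left.keys[mid:], high_key=left.high_key, right=left.right)
    left.keys = left.keys[:mid]
    left.high_key = new.keys[0]          # fence: first key now living on N
    left.right = new                     # N is linked between L and R
    left.incomplete_split = True         # parent does not know about N yet
    return new.keys[0], new              # separator key for the parent, and N


def install_in_parent(parent_keys, parent_children, left, sep, new):
    """Step 4: add the pointer to N and clear the flag on L in one step."""
    i = parent_children.index(left)
    parent_keys.insert(i, sep)
    parent_children.insert(i + 1, new)
    left.incomplete_split = False


R = Leaf(keys=[10, 11])
L = Leaf(keys=[1, 2, 3, 4], high_key=10, right=R)
parent_keys, parent_children = [10], [L, R]

sep, N = split_right(L)                  # L=[1,2] -> N=[3,4] -> R, flag set on L
install_in_parent(parent_keys, parent_children, L, sep, N)
```

Between the two calls, a search for key 3 that arrives at L can still find it by comparing against `L.high_key` and following `L.right`.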

With this design, we can effectively address the three bottlenecks described above.

I implemented a PoC on MySQL 9.7.0. I tested it with a carefully constructed workload of 32 concurrent insert threads in which almost every insert triggers a leaf split, so a large number of page splits run concurrently. The comparison is:

| Version           | TPS       | Avg latency | P95 latency |
| ----------------- | --------- | ----------- | ----------- |
| MySQL 9.7.0       | 5,666.22  | 5.65 ms     | 17.01 ms    |
| Optimized version | 91,523.92 | 0.35 ms     | 0.40 ms     |

This is a 16.2x improvement in TPS, a 16.1x reduction in average latency, and a 42.5x reduction in P95 latency.
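The ratios follow directly from the table. A short check, using only the numbers reported above (the dictionary names are just for this example):

```python
baseline = {"tps": 5666.22, "avg_ms": 5.65, "p95_ms": 17.01}
optimized = {"tps": 91523.92, "avg_ms": 0.35, "p95_ms": 0.40}

tps_gain = optimized["tps"] / baseline["tps"]
avg_gain = baseline["avg_ms"] / optimized["avg_ms"]
p95_gain = baseline["p95_ms"] / optimized["p95_ms"]

print(f"TPS {tps_gain:.1f}x, avg latency {avg_gain:.1f}x, P95 latency {p95_gain:.1f}x")
# TPS 16.2x, avg latency 16.1x, P95 latency 42.5x
```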

I also collected perf samples on the optimized version. The hotspot has shifted from index-lock contention to redo log commit wait. In particular, log_wait_for_write accounts for about 40% of the samples.

This indicates that SMO concurrency inside the B-link-style protocol is no longer the bottleneck.

The improvement is significant, so I believe this proposal is worth further investigation.

------

--- InnoDB Low-Level Design ---

Due to the length limit of this field, the full InnoDB low-level design is available here:

https://kernelmaker.github.io/MySQL-proposal-1

This proposal is large. It involves major changes to core InnoDB modules and is challenging. Also, this is only the split part. Merge support will definitely be needed later. I plan to put the merge design into a separate follow-up proposal, or as an extension of this proposal.

However, the PoC at least proves the feasibility of this direction and shows significant potential benefits.

If the proposal is accepted, the actual development can be designed, refined, and iterated in phases.

I am also willing to continue participating in the follow-up work and contribute to the implementation.

MySQL Contributor Summit Presentation Slides

Attachment: MySQL-proposal_Zhao-Song.pdf (application/pdf, text), 1.20 MiB.

The proposal includes promising performance numbers, but the document does not provide enough detail about the code changes made in the proof of concept. I understand that the PoC may not be in a publishable or production-ready state yet. However, since the performance results are being shared, it would be helpful to also share more information about the implementation scope of the PoC and the remaining gaps.

In particular, could you please provide more details on the following points?

1. What changes were made to the insert path?
   - Which parts of the optimistic and pessimistic insert flows were changed?
   - Was the POC able to transition from optimistic insert to the split path without restarting the full descent?

2. What changes were made to the search/select path?
   - Was the search path modified to understand high keys and right links?
   - If so, which lookup paths were covered?

3. Were delete and update paths changed?
   - If they were not changed, how does the POC prevent those paths from interacting incorrectly with pages using the new B-link-style metadata?

4. Can the old and new algorithms operate on the same table or index?
   - For example, can some pages/indexes use the existing InnoDB B+Tree behavior while others use the new B-link-style behavior?
   - If mixed operation is not supported, what feature switches or preconditions were used?

5. Were there changes related to index latching in the pessimistic path?
   - In particular, were there changes to how `dict_index_t::lock`, root page latching, or page X-latches are acquired and released during SMOs?

6. What was the result of the full MTR test suite after the POC changes?
   - Were there known failures?
   - If some tests were not run, which areas remain unverified?

7. What functionality was intentionally left incomplete in the POC?
   - For example: merge handling, delete/update support, crash recovery, incomplete split cleanup, FSP/page allocation changes, or compatibility with existing page formats.

These details would make it easier to evaluate the proposal beyond the benchmark results and understand which parts are already implemented, which parts are simulated or simplified, and which parts still need design work before the approach can be considered for integration.

Q1. What changes were made to the insert path?

At a high level, the optimistic insert path mostly keeps the same behavior. The pessimistic/split path is the part that was rewritten to use a B-link-style right split. The two paths are also connected, so when optimistic insert fails on a full leaf page, we can start the split from the leaf page we already reached, instead of going back to the root and doing a second descent.

* What was changed:

1. Optimistic insert

The behavior is mostly unchanged. It still follows the same logic as before.

The main difference is the failure case. When it finds the target leaf page is full, it does not simply release everything and fall back to the old pessimistic path. Instead, it keeps the leaf page latched and directly enters the new split path.

2. Pessimistic insert path

This part was replaced.

The split path no longer starts with a new descent from the root to find the leaf again. It works on the leaf page that the optimistic insert already latched. It first splits that leaf page to the right in B-link style, and then propagates the new child pointer upward, one level at a time.

3. SMO structure

The two paths are joined so the descent can be reused, and the old large SMO mini-transaction is broken into a chain of smaller ones.

In upstream InnoDB, the pessimistic path holds one mtr with SX(index) and X-latches on the affected subtree until commit. In the B-link path, this is split into one mtr per tree level. The index latch kept across these mtr boundaries is only S(index), passed through a new mtr_t::transfer_to() helper.

Once the leaf-split mtr commits, the split is visible even before the parent pointer has been installed. A concurrent descent that lands on the old page can check the high key and follow the right link if needed.

INCOMPLETE_SPLIT marks a page whose parent propagation is still in progress. It is cleared in the same mtr that installs the parent pointer, so the parent pointer and the flag are not seen out of sync. This also gives crash recovery a clear state to recognize and finish an interrupted cascade.

* Was the PoC able to transition from optimistic insert to the split path without restarting the full descent?

Yes, for this specific transition. When optimistic insert reaches a full leaf page, the split starts from that same leaf page. There is no second descent just to reach the leaf again.

There are two clarifications to avoid overstating this:

1. In my PoC, the upward cascade still re-descends from the root to locate each parent level. This is a PoC simplification and is also shown in the flow. The intended proposal is to use the cached descent path for this part.
2. Under contention, if the leaf we are trying to split is already in the middle of another thread's split cascade (its INCOMPLETE_SPLIT flag is set), the insert backs out and retries. This check happens before we modify anything, so the retry is clean. Higher up, if a parent is found in the same state mid-cascade, the retry is internal: a short backoff, then a re-descent to that level only.

Here is the flow for your reference:

row_ins_clust_index_entry(entry):
  retry-loop:
    row_ins_clust_index_entry_low(BTR_MODIFY_LEAF):
      mtr_start(mtr_1)
      
      blink_search_to_nth_level(leaf, MODIFY_LEAF)
      [ S(index) on mtr_1; leaf taken in X ]
      // non-coupled: release parent before fetching child
      // self-healing INSIDE the descent:
      //   right-link chase: if key drifted past a page, latch R then release L
      //   root S->X re-fetch if the target level turns out to be the root
      
      blink_optimistic_insert_at_cursor(cursor, mtr_1)
        fits? yes: mtr_commit(mtr_1) -> done
        no fit (DB_FAIL):
          
          blink_pessimistic_insert(cursor, mtr_1)
          (same cursor, same mtr, leaf still X-latched)
          
            CASE A L is the root: 
              root raise (single mtr, atomic; NO cascade, NO hand-off)
            
            CASE B L is a non-root leaf:
              split leaf L -> new right sibling N
                holds S(index), X(L), X(N), X(old R) if L not rightmost
                link L->N + high-key fence
                set INCOMPLETE_SPLIT(L)
            
                // HAND-OFF mtr_1 -> holder
                mtr_start(holder);
                transfer_to(mtr_1 -> holder, S(index));
                mtr_commit(mtr_1)
              
              cascade UP, one level per mtr (blink_insert_into_level, prev_child_no = L):
                level k:
                HAND-OFF holder -> mtr_k
                 // PROPOSAL vs POC: the parent at level k could be reached directly from the
                 // non-leaf page numbers cached during the initial descent. The POC instead
                 // re-descends from the root to level k here, for simplicity; caching the path
                 // is a planned optimization, not a correctness requirement.
                descend to parent P (S(index) already held, X(P))
                if INCOMPLETE_SPLIT(P):
                  park S(index), mtr_commit(mtr_k), backoff, retry
                install child pointer in P
                  P has room:
                    X(prev_child); clear INCOMPLETE_SPLIT(prev_child)
                    mtr_commit(mtr_k)
                  P overflows:
                    if P is root:
                      root raise inside mtr_k; X(prev_child); clear INCOMPLETE_SPLIT(prev_child), mtr_commit(mtr_k)
                    else:
                      split P (X(N_p), X(P's right sibling) if present)
                      set INCOMPLETE_SPLIT(P); 
                      X(prev_child); clear INCOMPLETE_SPLIT(prev_child);
                      HAND-OFF mtr_k -> holder'
                      recurse level k+1 (blink_insert_into_level, prev_child_no = P)
              mtr_start(mtr_1)
      mtr_commit(mtr_1)

Q2. What changes were made to the search/select path?

At a high level, the search path was changed to understand high keys and right links. There are two parts to this: how the tree descent finds the correct page, and how record-level code handles a page that contains a high-key record.

Most of the changes are in shared code used by normal lookups, so I did not need to patch every read path one by one.

1. Tree descent

InnoDB uses one shared search function for key-based lookups, btr_cur_search_to_nth_level(). I added the dispatch there. If the index is B-link enabled and the search mode is supported, it uses the new B-link descent. Otherwise it still goes through the original code path.

The new descent is different from the old one in two main ways:

a. It does not use parent-child latch coupling. It reads the child pointer from the parent, releases the parent page, and then fetches the child page. So the tree may change between these two steps.

b. Because of that, every page we land on is checked with the move-right rule. If the search key is beyond the page’s high key, we latch the right sibling, release the current page, and keep moving right.

This is safe because a split only moves records to the right. When the split becomes visible, the left page already has the new high key and the right link. So if a descent follows a stale parent pointer, it may land on the correct page or on a page to the left of it, but not to the right. Following right links can still find the correct page.
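The stale-pointer case can be replayed on a small in-memory model. This is a sketch under assumptions: pages are plain dictionaries, `find_from` is a hypothetical helper, latching is omitted, and a key equal to the high key is taken to belong to the right sibling.

```python
def find_from(pages, start_no, key):
    """Follow right links from start_no until the page's range covers key."""
    page = pages[start_no]
    hops = 0
    while page["high_key"] is not None and key >= page["high_key"]:
        page = pages[page["right"]]
        hops += 1
    return page["no"], hops


# A descent read "child = 2" for key 70 from the parent and released the parent.
# Before it fetched page 2, page 2 (range up to 100) split to the right into
# page 2 (up to 50) and the new page 4 (50 up to 100). The parent pointer the
# descent used is now stale, but page 2 already carries the new fence and link.
pages = {
    2: {"no": 2, "keys": [10, 30], "high_key": 50, "right": 4},
    4: {"no": 4, "keys": [50, 70], "high_key": 100, "right": 3},
    3: {"no": 3, "keys": [100, 120], "high_key": None, "right": None},
}

print(find_from(pages, start_no=2, key=70))   # (4, 1)
```

The search lands one page too far left and recovers with a single hop to the right; it never needs to look left.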

Readers do not check INCOMPLETE_SPLIT, and they do not wait for an in-flight parent propagation. For readers, the high key and right link are enough. INCOMPLETE_SPLIT is only used by pessimistic writers and crash recovery.

The descent flow is roughly:

blink_search_to_nth_level(K, target_level):
  S(index)   // readers take the same shared index latch as before
  page = root, latched in S
  loop:
    // move-right rule, applied on every page we land on
    while page is a blink page
          and page is not rightmost
          and K is beyond high_key(page):
      latch right sibling in S
      release current page
      page = right sibling
    if level(page) == target_level:
      position cursor in page by binary search
      done
      // target latch is S for reads, X for writes
    child = child pointer covering K
    release page   // before fetching the child; no latch coupling
    page = fetch child, latched in S
      // child identity is re-validated here
      // index id, level, page type, etc.

One special case is root modification. If the target of a write turns out to be the root, the code re-fetches it with X latch. If a root raise happened in between, the descent is restarted.

2. Reading records on a page

The high key is stored as a real record, just before supremum. That means any code that walks records could otherwise see it as a normal user record.

I did not want to patch every record loop separately, so the shared record helpers were changed to treat the high-key record like supremum. Most scan code already sits on top of these helpers, so it works without local changes.

There were still a few places that needed special handling:

a. The SELECT row fetch loop skips the high-key record and continues to the next page.

b. Persistent cursor restore never anchors the stored position on the high-key record.

c. Locking never places a record lock on the high-key record. Gap inheritance and next-key locking step over it to supremum.

d. Duplicate-key and FK checks skip the high-key record. The fence key is the same as the first key on the right sibling, so otherwise it could create false duplicate-key matches or false “referenced row exists” results.

e. Statistics sampling and parallel scan also skip it when counting or iterating records.
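The shared-helper approach and the duplicate-key case in (d) can be illustrated with a toy page layout. This is a sketch under assumptions, not the InnoDB record format: a page is a Python list, `user_records` and `has_duplicate` are hypothetical helpers, and every B-link page in the example is assumed to carry a high key.

```python
def user_records(records, is_blink_page):
    """Records between infimum and supremum, minus the high key on a B-link page."""
    body = records[1:-1]                       # drop infimum and supremum
    return body[:-1] if is_blink_page else body


def has_duplicate(records, key, is_blink_page):
    return key in user_records(records, is_blink_page)


# Left page after a split: user keys 10 and 20, high key 30 stored just before
# supremum. Key 30 itself lives on the right sibling.
left = ["infimum", 10, 20, 30, "supremum"]

print(30 in left[1:-1])                        # True: a naive walk sees the fence as a row
print(has_duplicate(left, 30, True))           # False: the helper hides the fence
```

A caller that needs to know whether key 30 really exists has to continue on the right sibling, which is where the row would be stored.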

3. Which lookup paths were covered

Covered in the PoC:

* point and range SELECT
* DML positioning descents
* duplicate-key checks
* FK checks
* persistent cursor restore for both forward and backward scans
* optimizer dives
* statistics sampling
* parallel scan

Backward scan is a bit special. In the PoC it re-positions through the old descent. I think this is still correct because a right split does not move the first record of the original page, and that is the anchor used for re-positioning.

Not covered in the PoC:

1. The internal query-graph read path used by FTS auxiliary tables does not skip the high-key record. These tables are excluded by the feature gate.
2. “Open at index edge” still uses the old descent. The left edge is safe because splits only move records to the right. The right edge, for example auto-increment initialization, has a narrow race with an in-flight split of the rightmost page. This is debug-asserted in the PoC.
3. Range estimates currently count the high-key record, so the estimate can be off by roughly one record per page.
4. AHI is disabled for B-link indexes.

Q3. Were delete and update paths changed? If not, how does the PoC prevent them from interacting incorrectly with pages using the new B-link-style metadata?

First, the scope: this PoC focuses on the insert/split path. Update and delete were not converted to the new protocol.

For the PoC, I handled them in a conservative way so they can still run correctly together with concurrent B-link inserts.

1. Update path

Delete-mark and in-place update are unchanged. They only modify records inside one leaf page, so they do not need special handling for the B-link metadata.

Secondary-index updates are delete-mark plus insert, so the insert part naturally goes through the new insert path described in Q1.

The case that needed special handling is a clustered-index update where the new row version becomes larger and no longer fits the page. That can trigger a split from the update path. The old split code does not maintain high keys and right links, so it must not be used on a B-link page.

The temporary PoC solution is to make this path synchronous under X(index):

pessimistic UPDATE (row grew, page is full):
  mtr_start(mtr)
  X(index)   // wait for all in-flight B-link writers to finish
  position back on the record (leaf X)
  remove old version
  try to insert new version
  if it does not fit:
    split leaf L -> N
      - set high key
      - set right link
      - set INCOMPLETE_SPLIT
    install parent pointers synchronously, level by level
    clear INCOMPLETE_SPLIT
  mtr_commit(mtr)

This works because every B-link writer holds S(index) from the beginning of the operation until its split cascade is finished. So when this path gets X(index), it has waited for all in-flight B-link cascades to finish. Inside that window, there is no concurrent B-link writer on the same index.
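The draining effect of X(index) can be shown with a toy shared/exclusive lock. This is a simplified stand-in, not InnoDB's rw-lock: the `IndexLock` class is hypothetical, it has no SX mode, and it makes no fairness guarantees.

```python
import threading


class IndexLock:
    def __init__(self):
        self._cv = threading.Condition()
        self._s_holders = 0
        self._x_held = False

    def s_lock(self):
        with self._cv:
            while self._x_held:
                self._cv.wait()
            self._s_holders += 1

    def s_unlock(self):
        with self._cv:
            self._s_holders -= 1
            self._cv.notify_all()

    def x_lock(self, timeout=None):
        """Wait until no S or X holder remains. Returns False on timeout."""
        with self._cv:
            free = self._cv.wait_for(
                lambda: not self._x_held and self._s_holders == 0, timeout)
            if free:
                self._x_held = True
            return free

    def x_unlock(self):
        with self._cv:
            self._x_held = False
            self._cv.notify_all()


lock = IndexLock()
lock.s_lock()                         # a B-link writer in the middle of its cascade
blocked = lock.x_lock(timeout=0.01)   # the update bridge cannot enter yet
lock.s_unlock()                       # cascade finished
entered = lock.x_lock(timeout=0.01)   # now the bridge runs alone on the index
lock.x_unlock()
```

Here `blocked` is False and `entered` is True: the exclusive request succeeds only after the last shared holder has left.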

So this path is basically the old pessimistic SMO shape, but it writes the B-link page format.

A few rare paths use the same X(index) bridge in the PoC: rollback re-inserting an old row version, the log-apply phase of online table rebuild, and storing very large BLOB fields.

The cost is that these operations serialize with all writers on that index. I accepted this for the PoC because the target of this work is the insert/split path, and these update-triggered split cases are much less frequent than normal insert splits.

I do not think there is a fundamental reason they cannot use the concurrent B-link split path later. The remaining work is mostly around the exact atomicity of remove-old/insert-new and the allocation order for large BLOB fields.

2. Delete path

Delete-mark is unchanged. It is just a leaf-page change.

Purge and rollback, which physically remove records, still use the original delete code. They already run under S(index), like the B-link writers, so the record-removal part is compatible with the new protocol.

What I disabled is the tree-shrinking part: page merge and empty-page discard. So in the PoC, a leaf page can become completely empty and still remain in the tree. It keeps its high key and right link, and descents or scans can pass through it normally. Deleting user records does not delete the high key, so the fence remains valid even if all user records on the page are purged.

I deferred this intentionally because page reclamation needs a separate protocol in a B-link tree [TODO].

So the plan is not to re-enable the old inline merge directly. Page reclamation should be added later as its own B-link-compatible merge/delete protocol.

3. How delete/update are kept away from B-link metadata

There are three main protections in the PoC:

a. Cursor positioning

Delete and update find their target records through the same descent and record helpers as reads, described in Q2. Those helpers skip the high-key record, so normal delete/update cursors should not land on the fence.

The only code that creates, rewrites, or removes a high key is the split code itself.

b. Locking rule

No record lock is placed on the high-key record.

When the record just before the high key is deleted, gap-lock inheritance skips over the high key and goes to supremum. Without this, a gap lock could end up attached to the fence record, where later readers or inserters would not expect to find it.

This was one of the trickier parts in the PoC. Debug builds assert this rule in the places where a record lock could be attached, and in the split code where a high key is removed or rewritten.

c. Index-lock exclusion

Any path that cannot safely follow the normal S(index) B-link writer protocol takes X(index) first.

That drains all in-flight B-link writers on the same index. So legacy structure-changing code does not run concurrently with B-link splits. On B-link pages, the old structure-changing behavior is either rerouted through the synchronous B-link-format split path, as in update, or disabled, as in merge and page discard.

Q4. Can the old and new algorithms operate on the same table or index? If mixed operation is not supported, what feature switches or preconditions were used?

On the same table: yes. The switch is per index, so B-link and normal indexes can coexist in one table.

On the same index: no. Structural changes are owned by one algorithm, selected when the index is created and persisted in the data dictionary.

The PoC uses a global variable, innodb_blink_enabled, which is OFF by default. It only controls whether newly created empty indexes are stamped with the durable B-link flag. Runtime dispatch depends only on that persisted flag, not on the current global setting.

This is intentionally one-way. Once an index may contain B-link pages, disabling the variable must not redirect it to legacy split, merge, discard, or change-buffer paths. To disable B-link for an index, it must be rebuilt with the variable OFF. Existing populated indexes are not converted in the PoC.

Pages also carry a page-level B-link flag. This lets readers distinguish normal and B-link pages locally and prepares for possible future lazy conversion, although that path is not enabled or validated yet.
In principle, an existing populated B+Tree could be stamped as B-link at the index level, and pages could be converted lazily when they split. Readers would be able to handle old and new pages side by side because the checks are page-local.

The PoC enables B-link only for indexes meeting these preconditions:

* COMPACT row format
* Not compressed
* Not system, temporary, intrinsic, FTS auxiliary, spatial, or SDI indexes
* Not being built by online DDL

Anything outside these conditions continues using the original code unchanged.
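The gating rules can be summarized in a few lines. This is a sketch of the decision logic only, under assumptions: `IndexDef`, its fields, and the three functions are hypothetical and do not correspond to InnoDB structures; only `innodb_blink_enabled` is a name from the PoC.

```python
from dataclasses import dataclass

EXCLUDED_KINDS = {"system", "temporary", "intrinsic", "fts_aux", "spatial", "sdi"}


@dataclass
class IndexDef:
    row_format: str = "COMPACT"
    compressed: bool = False
    kind: str = "user"                 # hypothetical classification
    built_by_online_ddl: bool = False
    blink_flag: bool = False           # stands in for the flag persisted in the data dictionary


def eligible(ix):
    return (ix.row_format == "COMPACT" and not ix.compressed
            and ix.kind not in EXCLUDED_KINDS and not ix.built_by_online_ddl)


def create_index(ix, innodb_blink_enabled):
    # The global variable is consulted only here, when the empty index is created.
    ix.blink_flag = innodb_blink_enabled and eligible(ix)
    return ix


def uses_blink(ix):
    # Runtime dispatch looks at the persisted flag only, never at the global.
    return ix.blink_flag


a = create_index(IndexDef(), innodb_blink_enabled=True)
b = create_index(IndexDef(kind="spatial"), innodb_blink_enabled=True)
c = create_index(IndexDef(), innodb_blink_enabled=False)
```

Index `a` stays a B-link index even if the variable is later switched OFF, while `b` and `c` keep using the original code.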

Q5. Were there changes to how dict_index_t::lock, root page latching, or page X-latches are acquired and released during SMOs?

Yes. All three changed.

1. dict_index_t::lock

In the B-link path, a writer holds only S(index), from the beginning of the operation until the split cascade is finished. This covers the descent, the leaf split, and the upward parent propagation.

This S(index) latch is the one latch carried across the mtr chain. Page latches are released at mtr boundaries, but the index S latch is not.
X(index) keeps its old meaning, and the PoC also uses it for the temporary bridge paths.

2. Root page latching

The difficult part is page allocation.

In InnoDB, the index file-segment headers live on the root page, so allocating a new page during a split has to go through the root page. This was the main blocker in the design.

The PoC solution is to move allocation out of the insert path. Each B-link index has a pool of pre-allocated pages, and a background allocator refills the pool using the normal FSP latch order. The split path then takes a page from the pool instead of calling the FSP allocator while holding B-tree page latches.
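The shape of the pool can be sketched as follows. This is an illustration under assumptions, not the PoC allocator: `PagePool` is a hypothetical class, `allocate_page` stands in for the real FSP allocation call, and returning `None` from an empty pool (so the caller can back off) is an assumed policy that the description above does not specify.

```python
from collections import deque
from itertools import count


class PagePool:
    def __init__(self, capacity):
        self.capacity = capacity
        self._free = deque()

    def refill(self, allocate_page):
        """Run by the background allocator, outside any B-tree page latch."""
        while len(self._free) < self.capacity:
            self._free.append(allocate_page())

    def take(self):
        """Run by the split path. None means the pool is currently empty."""
        return self._free.popleft() if self._free else None


next_page_no = count(100)              # pretend page numbers handed out by the allocator
pool = PagePool(capacity=2)
pool.refill(lambda: next(next_page_no))
```

The split path only ever calls `take()`, so it never has to enter the allocator while holding B-tree page latches.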

This worked in the benchmark. After this change, the bottleneck moved away from the index latch and page allocation path, and the next visible hotspot became redo commit.

But I see this as a working PoC solution, not necessarily the final design. Pool sizing, wasted space, crash handling for unused pre-allocated pages, and interaction with tablespace management all need more discussion. This is one of the areas where I would most like to hear the team’s opinion.

3. Page X-latches during SMO

In the B-link path, X-latches are held level by level.

The leaf-split mtr X-latches the old leaf, the new right sibling, and the old right neighbor whose left link needs to be updated.

Each parent-install mtr X-latches the parent, and briefly latches the child page whose INCOMPLETE_SPLIT flag is being cleared. If the parent also overflows, that same mtr performs the parent split and latches the pages needed for that level.

So at any moment, the X-latch footprint is limited to a few pages around one level, instead of the whole affected path from leaf to root.

Q6. What was the result of the full MTR test suite after the POC changes?

I have not spent much time on broad MTR coverage yet, since this is still a PoC.

What I did add is a dedicated MTR suite for the B-link protocol itself. Its tests cover the main new behavior, and all of them pass.

For regression, I just ran the innodb suite with the feature OFF, which is the default setting. That also passed in my run.

Q7. What functionality was intentionally left incomplete in the PoC?

* Merge: not implemented. Merge and empty-page discard are disabled. An emptied leaf stays linked with its high key. Future work should be a B-link-compatible merge protocol.
* Delete/update: Q3.
* Crash recovery: the main interrupted-split case is handled, but discovery is still buffer-pool based. If an INCOMPLETE_SPLIT page was flushed and checkpointed long before crash, recovery may miss it.
* Incomplete split cleanup: the writer that starts a split finishes it. Other writers only back off and retry. No writer-side helping yet.
* FSP/page allocation: the pre-allocator is PoC-grade. Pool sizing, crash reclamation, and DISCARD TABLESPACE fencing need a real design. This is the main one I'd like to discuss with the team.

I see that you mentioned this: "the upward cascade still re-descends from the root to locate each parent level". Can you please elaborate on this further?

After splitting the leaf, the proposal first tries the parent page cached during the original descent. If it is still the correct parent, it inserts the new node_ptr there.

Otherwise, it falls back to descending from the root to find the current parent, similar to existing InnoDB behavior. The PoC only implements the fallback path.
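The two paths can be sketched side by side. This is an illustration under assumptions: `find_parent` is a hypothetical helper, pages are plain dictionaries, latching and move-right at the parent level are omitted, and "still the correct parent" is modeled simply as "still contains a pointer to the child", which is an assumed validation rule.

```python
from bisect import bisect_right


def find_parent(pages, root_no, cached_parent_no, child_no, key):
    """Return (parent page number, how it was found)."""
    cached = pages.get(cached_parent_no)
    if cached is not None and child_no in cached["children"]:
        return cached_parent_no, "cached"          # proposal: optimistic path
    page_no = root_no                              # PoC: fallback path only
    while child_no not in pages[page_no]["children"]:
        page = pages[page_no]
        page_no = page["children"][bisect_right(page["keys"], key)]
    return page_no, "re-descended"


pages = {
    1: {"keys": [50], "children": [2, 3]},
    2: {"keys": [20], "children": [4, 5]},
    3: {"keys": [70], "children": [6, 7]},
}

print(find_parent(pages, 1, cached_parent_no=2, child_no=5, key=25))   # (2, 'cached')
print(find_parent(pages, 1, cached_parent_no=3, child_no=5, key=25))   # (2, 're-descended')
```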