Bug #120634 buf_stat_per_index inc/dec is not paired: root page inc is never decremented, causing unbounded growth of ut_lock_free_h
Submitted: 9 Jun 4:07
Reporter: Demon Chen
Status: Open
Category: MySQL Server: InnoDB storage engine
Severity: S1 (Critical)
Version: 8.0.44, 8.4.9
OS: Any
Assigned to:
CPU Architecture: Any

[9 Jun 4:07] Demon Chen
Description:
`buf_stat_per_index` (a `ut_lock_free_hash_t` keyed by `(space_id, index_id)`)
is updated through inc() and dec() paths that are not symmetric when an index
is dropped, so the hash grows monotonically over the lifetime of the server.

The contract:

  inc() side
    - btr_create()       -> buf_stat_per_index->inc(index_id_t(space, id));
    - btr_page_create()  -> same inc() call for every newly created B-tree page.

  dec() side
    - buf_LRU_block_remove_hashed() reads the page's *current* PAGE_INDEX_ID
      and calls buf_stat_per_index->dec(index_id_t(space, that_id)).
      So the design assumes: "whatever index_id the page reports at LRU
      removal time, decrement that key."
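
The contract above can be sketched with a toy model. This is not InnoDB
code: ToyPerIndexStats, ToyPage and balanced_round_trip() are hypothetical
names, and an ordinary std::map stands in for ut_lock_free_hash_t. It only
illustrates that dec() derives its key from the page's current index id.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// Toy stand-in for buf_stat_per_index: key is (space_id, index_id),
// value is the number of cached pages attributed to that index.
using ToyIndexKey = std::pair<std::uint32_t, std::uint64_t>;

struct ToyPage {
  std::uint32_t space;
  std::uint64_t index_id;  // models the PAGE_INDEX_ID header field
};

class ToyPerIndexStats {
 public:
  // Models the inc() issued by btr_create() / btr_page_create().
  void inc(const ToyIndexKey &key) { ++m_counts[key]; }

  // Models the dec() issued at LRU removal: the key is built from
  // whatever index id the page reports at that moment.
  void dec_on_evict(const ToyPage &page) {
    --m_counts[ToyIndexKey(page.space, page.index_id)];
  }

  std::int64_t get(const ToyIndexKey &key) const {
    auto it = m_counts.find(key);
    return it == m_counts.end() ? 0 : it->second;
  }

 private:
  std::map<ToyIndexKey, std::int64_t> m_counts;
};

// Balanced case: the page keeps its index id until eviction, so the
// inc() and the dec() hit the same key.
inline std::int64_t balanced_round_trip() {
  ToyPerIndexStats stats;
  ToyPage page{5, 42};
  stats.inc(ToyIndexKey(page.space, page.index_id));
  stats.dec_on_evict(page);
  return stats.get(ToyIndexKey(5, 42));
}
```

In this model the counter returns to 0 only because the page's index id is
unchanged between inc() and eviction.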

The asymmetry on DROP INDEX / TRUNCATE:

When an index tree is freed, btr_free_root_invalidate() (storage/innobase/btr/
btr0btr.cc) overwrites the root page's PAGE_INDEX_ID with the sentinel
BTR_FREED_INDEX_ID = 0:

    static const space_index_t BTR_FREED_INDEX_ID = 0;

    static void btr_free_root_invalidate(buf_block_t *block, mtr_t *mtr) {
      ut_ad(page_is_root(block->frame));
      btr_page_set_index_id(buf_block_get_frame(block),
                            buf_block_get_page_zip(block),
                            BTR_FREED_INDEX_ID, mtr);
    }

This function does NOT call buf_stat_per_index->dec() / erase() for the
original (space, real_index_id) key. It only rewrites the in-page identifier.

After this point, when the freed root block is eventually evicted by the LRU
and buf_LRU_block_remove_hashed() runs:

  1. The page's PAGE_INDEX_ID is now 0 (sentinel), or the page has been reused
     for FSP_FREE / a different segment.
  2. The dec() call therefore either targets a different key (key 0, or a
     newly assigned id), or is skipped entirely because of page-type checks.
  3. The original (space, real_index_id) entry never receives a matching
     dec() and is never erase()d.
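
The three steps above can be reproduced in a toy model. The sketch below is
not InnoDB code; it assumes the "different key" variant of step 2 (the
eviction decrements the sentinel key), and all names (ToyRootPage,
kToyFreedIndexId, orphans_after_churn) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>

using ToyIndexKey = std::pair<std::uint32_t, std::uint64_t>;
constexpr std::uint64_t kToyFreedIndexId = 0;  // models BTR_FREED_INDEX_ID

struct ToyRootPage {
  std::uint32_t space;
  std::uint64_t index_id;  // models PAGE_INDEX_ID
};

// Creates and drops `n` indexes in one tablespace, then counts the
// entries for real index ids that still hold a positive count.
inline std::size_t orphans_after_churn(std::uint32_t space, std::uint64_t n) {
  std::map<ToyIndexKey, std::int64_t> counts;
  for (std::uint64_t id = 1; id <= n; ++id) {
    ToyRootPage root{space, id};
    ++counts[ToyIndexKey(space, root.index_id)];  // inc() at create time
    root.index_id = kToyFreedIndexId;             // root invalidated on drop
    --counts[ToyIndexKey(space, root.index_id)];  // eviction hits the sentinel key
  }
  std::size_t orphans = 0;
  for (const auto &entry : counts) {
    if (entry.first.second != kToyFreedIndexId && entry.second > 0) ++orphans;
  }
  return orphans;
}
```

Under these assumptions every create/drop cycle leaves one entry with
value 1 that nothing decrements again.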

Net effect per dropped index: at least one orphan entry with value == 1 is
permanently retained in buf_stat_per_index. Repeated CREATE/DROP INDEX and
TRUNCATE therefore grow the hash monotonically. On long-running instances,
INFORMATION_SCHEMA.INNODB_CACHED_INDEXES diverges from
INFORMATION_SCHEMA.INNODB_INDEXES over time (cached >> live); on busy
production servers the gap reaches millions of entries.

This is consistent with the "leaks memory by design" property already
acknowledged in Bug#35357691 for ut_lock_free_hash_t, but the problem
reported here is at a higher level: the *caller's* inc/dec contract is
broken, not the hash's internal erase() behavior.

Relationship to Bug#120441 (Patch approved 2025-06-08):

Bug#120441 addresses a different facet of the same hash. Its problem is
that the very first inc() for a new key inside btr_create() is issued while
the mini-transaction holds FSP / page SX latches; if that inc() happens to
trigger ut_lock_free_hash_t::grow() -> optimize(), the optimize runs
synchronously under those latches and a single btr_create() can stall for
hundreds of seconds. The approved fix pre-registers the key
(ensure_present(key, 0)) before mtr_start(), so the in-mtr inc() only
mutates an already-present entry.
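
The pre-registration idea can be illustrated with a toy hash that records
whether a grow happened while a latch was held. This is a sketch, not the
approved patch: ToyGrowableHash, its `latched` flag, its doubling policy and
latched_grows() are hypothetical and only model the ordering of the calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

class ToyGrowableHash {
 public:
  explicit ToyGrowableHash(std::size_t capacity) : m_capacity(capacity) {}

  // Inserts `key` with `initial` if it is absent. Inserting into a full
  // table triggers grow(); an already-present key is left untouched.
  void ensure_present(std::uint64_t key, std::int64_t initial, bool latched) {
    if (m_counts.count(key) != 0) return;
    if (m_counts.size() >= m_capacity) grow(latched);
    m_counts.emplace(key, initial);
  }

  void inc(std::uint64_t key, bool latched) {
    ensure_present(key, 0, latched);
    ++m_counts[key];
  }

  std::size_t grows_under_latch() const { return m_grows_under_latch; }

 private:
  // Toy growth policy: double the capacity and note whether the caller
  // held a latch at that moment.
  void grow(bool latched) {
    m_capacity = (m_capacity == 0) ? 1 : m_capacity * 2;
    if (latched) ++m_grows_under_latch;
  }

  std::size_t m_capacity;
  std::size_t m_grows_under_latch = 0;
  std::unordered_map<std::uint64_t, std::int64_t> m_counts;
};

// Fills a 2-slot hash, then performs the "in-mtr" inc() for a new key,
// optionally pre-registering that key before the latch is taken.
inline std::size_t latched_grows(bool pre_register) {
  ToyGrowableHash hash(2);
  hash.inc(1, /*latched=*/false);
  hash.inc(2, /*latched=*/false);  // the hash is now full
  if (pre_register) hash.ensure_present(3, 0, /*latched=*/false);
  hash.inc(3, /*latched=*/true);   // models the inc() inside the mtr
  return hash.grows_under_latch();
}
```

With pre-registration the grow still happens, only earlier and outside the
latched section; the entry itself is never removed in either variant.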

That fix moves *when* the first insertion happens; it does not change
*whether* the entry is ever removed. After Bug#120441's fix is applied:

  Aspect                                              Before 120441 fix     After 120441 fix      This bug
  Where the key is first inserted                     inside mtr (latched)  before mtr_start()    unchanged
  Can inc() trigger grow()/optimize() under latch?    Yes (the stall)       No                    unchanged
  Is (space, index_id) ever erased on DROP/TRUNCATE?  No                    No                    No -- this bug
  Long-term hash size                                 grows monotonically   grows monotonically   grows monotonically
  INNODB_CACHED_INDEXES vs INNODB_INDEXES             diverges              diverges              diverges

So Bug#120441's fix removes the acute latch stall but leaves the underlying
memory growth -- and the fact that optimize() will eventually be invoked
again as the orphan-inflated hash crosses the next grow() threshold --
untouched. The two bugs are orthogonal:

  - Bug#120441   = "do not call inc() at the wrong moment"
  - this report  = "the matching dec() / erase() for the drop path is missing"

Both fixes are required. Neither subsumes the other.

Affected files / functions:

  - storage/innobase/include/buf0stats.h
      buf_stat_per_index_t (wraps ut_lock_free_hash_t)
  - storage/innobase/btr/btr0btr.cc
      btr_create()                  -- calls inc()
      btr_page_create()             -- calls inc()
      btr_free_root_invalidate()    -- rewrites PAGE_INDEX_ID, no dec()/erase()
      btr_free_if_exists()          -- drop-path entry point
  - storage/innobase/buf/buf0lru.cc
      buf_LRU_block_remove_hashed() -- calls dec() based on current PAGE_INDEX_ID
  - storage/innobase/handler/i_s.cc
      INFORMATION_SCHEMA.INNODB_CACHED_INDEXES (the symptom signal)

Impact:

  1. Memory leak: ~one entry per dropped index, retained for the lifetime of
     the server. Unbounded on instances with frequent DDL / TRUNCATE.
  2. Indirect amplification of Bug#120441: the inflated hash crosses
     ut_lock_free_hash_t::grow() thresholds earlier and more often. Even
     after Bug#120441 moves the first insertion of a key out of the latched
     section, optimize() itself still runs, costs CPU, and prolongs
     DDL-heavy windows. Fixing this leak directly reduces how often
     optimize() is triggered.
  3. Diagnostic noise: INNODB_CACHED_INDEXES becomes unusable as an
     indicator of actual cached B-tree footprint.
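
To make impact item 2 concrete, the following back-of-the-envelope helper
counts how many capacity doublings are needed to hold a given number of
entries. It assumes a simple doubling policy and an arbitrary starting
capacity; the real grow() thresholds of ut_lock_free_hash_t may differ, and
doublings_needed() is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Number of capacity doublings required before `entries` keys fit,
// starting from `initial_capacity`. Assumes capacity doubles on each
// grow; returns 0 for a zero starting capacity to avoid looping forever.
inline unsigned doublings_needed(std::uint64_t entries,
                                 std::uint64_t initial_capacity) {
  if (initial_capacity == 0) return 0;
  unsigned n = 0;
  std::uint64_t cap = initial_capacity;
  while (cap < entries) {
    cap *= 2;
    ++n;
  }
  return n;
}
```

Under these assumptions, 1,000 live indexes fit in a 1,024-slot table with
no growth, while 1,000 live indexes plus 1,000,000 orphans need 10
doublings, each of which is an opportunity for optimize() to run.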

How to repeat:
Any workload that repeatedly creates and drops indexes (or truncates tables)
on an affected server (8.0 or 8.4, per the Version field) reproduces it.
Minimal SQL:

    CREATE DATABASE IF NOT EXISTS leak_db;
    USE leak_db;

    DROP TABLE IF EXISTS t;
    CREATE TABLE t (id INT PRIMARY KEY, a INT, b INT, c INT) ENGINE=InnoDB;

    -- baseline
    SELECT COUNT(*) AS cached_before FROM information_schema.innodb_cached_indexes;
    SELECT COUNT(*) AS live_before   FROM information_schema.innodb_indexes;

    -- repeatedly create/drop a secondary index
    -- (each iteration consumes a new index_id, so each iteration leaks 1 entry)
    -- drop any leftover procedure so the script can be rerun
    DROP PROCEDURE IF EXISTS churn;
    DELIMITER $$
    CREATE PROCEDURE churn(IN n INT)
    BEGIN
      DECLARE i INT DEFAULT 0;
      WHILE i < n DO
        SET @s = CONCAT('CREATE INDEX ix_', i, ' ON t(a)');
        PREPARE stmt FROM @s; EXECUTE stmt; DEALLOCATE PREPARE stmt;
        SET @s = CONCAT('DROP INDEX ix_', i, ' ON t');
        PREPARE stmt FROM @s; EXECUTE stmt; DEALLOCATE PREPARE stmt;
        SET i = i + 1;
      END WHILE;
    END$$
    DELIMITER ;

    CALL churn(10000);

    -- after
    SELECT COUNT(*) AS cached_after FROM information_schema.innodb_cached_indexes;
    SELECT COUNT(*) AS live_after   FROM information_schema.innodb_indexes;

Observed:

    cached_before  ~ N0
    live_before    ~ M0
    cached_after   ~ N0 + 10000    <-- grows by ~ number of drops
    live_after     ~ M0            <-- unchanged

Equivalent reproductions:

  - TRUNCATE TABLE on a table with N indexes leaks ~N entries per truncation,
    because innobase_truncate goes through delete_impl() -> create_impl()
    and each freed B-tree root takes the same btr_free_root_invalidate()
    path.
  - Any DDL workload that drops indexes/tables (sysbench prepare/cleanup
    cycles, partition exchange/drop, etc.) shows the same monotonic growth.

On a long-running production instance:

    SELECT
      (SELECT COUNT(*) FROM information_schema.innodb_cached_indexes) AS cached,
      (SELECT COUNT(*) FROM information_schema.innodb_indexes)        AS live,
      (SELECT COUNT(*) FROM information_schema.innodb_cached_indexes)
      - (SELECT COUNT(*) FROM information_schema.innodb_indexes)      AS leaked;

`leaked` only increases over the server's uptime; neither buffer pool
pressure nor idle time ever reduces it.

Suggested fix:
Close the inc/dec asymmetry on the index-drop path. Two equivalent options;
(A) is the smallest and most local.

Option A -- explicit erase in btr_free_root_invalidate()

    static void btr_free_root_invalidate(buf_block_t *block, mtr_t *mtr) {
      ut_ad(page_is_root(block->frame));

      const space_id_t    space   = block->page.id.space();
      const space_index_t orig_id = btr_page_get_index_id(block->frame);

      btr_page_set_index_id(buf_block_get_frame(block),
                            buf_block_get_page_zip(block),
                            BTR_FREED_INDEX_ID, mtr);

      if (orig_id != BTR_FREED_INDEX_ID) {
        /* Symmetric counterpart of the inc() in btr_create() /
           btr_page_create(). After this point PAGE_INDEX_ID is the sentinel
           and buf_LRU_block_remove_hashed() will not produce a paired
           dec() against the original key any more. */
        buf_stat_per_index->erase(index_id_t(space, orig_id));
      }
    }

This is the symmetric counterpart of the inc() in btr_create() and is
invoked exactly once per freed root, on the only code path that overwrites
PAGE_INDEX_ID with the sentinel.
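
The effect of Option A can be checked in a toy model. This is a sketch under
the same simplifications as the report's description, not a patch: a
std::map stands in for buf_stat_per_index, and ToyRoot,
toy_free_root_invalidate() and entries_after_fixed_churn() are hypothetical
names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>

using ToyIndexKey = std::pair<std::uint32_t, std::uint64_t>;
using ToyStatsMap = std::map<ToyIndexKey, std::int64_t>;
constexpr std::uint64_t kToyFreedIndexId = 0;  // models BTR_FREED_INDEX_ID

struct ToyRoot {
  std::uint32_t space;
  std::uint64_t index_id;  // models PAGE_INDEX_ID
};

// Models Option A: remember the original id, write the sentinel, then
// erase the original key. Calling it twice on the same root is harmless
// because the second call sees the sentinel and skips the erase.
inline void toy_free_root_invalidate(ToyRoot &root, ToyStatsMap &stats) {
  const std::uint64_t orig_id = root.index_id;
  root.index_id = kToyFreedIndexId;
  if (orig_id != kToyFreedIndexId) {
    stats.erase(ToyIndexKey(root.space, orig_id));
  }
}

// Creates and drops `n` indexes and returns how many entries remain.
inline std::size_t entries_after_fixed_churn(std::uint32_t space,
                                             std::uint64_t n) {
  ToyStatsMap stats;
  for (std::uint64_t id = 1; id <= n; ++id) {
    ToyRoot root{space, id};
    ++stats[ToyIndexKey(space, id)];        // inc() at create time
    toy_free_root_invalidate(root, stats);  // drop path with the erase
    toy_free_root_invalidate(root, stats);  // repeated call is a no-op
  }
  return stats.size();
}
```

In this model the map is empty after any number of create/drop cycles,
which is the behavior Option A aims for.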

Option B -- drop-path callback

If keeping btr_free_root_invalidate() purely structural is preferred, do
the erase() one level up, in the callers that free the root
(btr_free_if_exists() and the truncate paths in dict0crea.cc), iterating
over each index of the table being dropped or recreated.

In either case the invariant becomes: every (space, index_id) produced by
btr_create() is removed by exactly one drop-path erase(), and the LRU dec()
path becomes a no-op for already-erased keys (the existing code already
tolerates this because LRU removal of a freed root sees BTR_FREED_INDEX_ID
and will not match the original key anyway).
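
The second half of this invariant (a late dec() must not resurrect an erased
key) can be sketched as follows. This is a toy model with hypothetical names
(toy_dec_if_present, entries_after_late_evictions); whether the real dec()
behaves this way for absent keys is an assumption that should be verified
against buf0stats.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>

using ToyIndexKey = std::pair<std::uint32_t, std::uint64_t>;
using ToyStatsMap = std::map<ToyIndexKey, std::int64_t>;

// Decrements only when the key exists; never creates an entry.
// Returns true if a decrement was applied.
inline bool toy_dec_if_present(ToyStatsMap &stats, const ToyIndexKey &key) {
  auto it = stats.find(key);
  if (it == stats.end()) return false;
  --it->second;
  return true;
}

// One index is created and erased on drop; afterwards two evictions
// arrive late: the freed root (sentinel id 0) and a page that still
// reports the original id.
inline std::size_t entries_after_late_evictions() {
  ToyStatsMap stats;
  const ToyIndexKey original(7, 42);
  ++stats[original];       // inc() at create time
  stats.erase(original);   // drop-path erase()
  toy_dec_if_present(stats, ToyIndexKey(7, 0));  // freed root, sentinel key
  toy_dec_if_present(stats, original);           // stale page, erased key
  return stats.size();
}
```

Both late decrements find no entry and leave the map empty, so the erased
key stays erased.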