Description:
`buf_stat_per_index` (a `ut_lock_free_hash_t` keyed by `(space_id, index_id)`)
is updated through two paths that are not symmetric on the index-drop path,
which makes the hash grow monotonically over the lifetime of the server.
The contract:
  inc() side
    - btr_create()      -> buf_stat_per_index->inc(index_id_t(space, id));
    - btr_page_create() -> same inc() call for every newly created B-tree page.
  dec() side
    - buf_LRU_block_remove_hashed() reads the page's *current* PAGE_INDEX_ID
      and calls buf_stat_per_index->dec(index_id_t(space, that_id)).
So the design assumes: "whatever index_id the page reports at LRU
removal time, decrement that key."
The asymmetry on DROP INDEX / TRUNCATE:
When an index tree is freed, btr_free_root_invalidate() (storage/innobase/btr/
btr0btr.cc) overwrites the root page's PAGE_INDEX_ID with the sentinel
BTR_FREED_INDEX_ID = 0:
  static const space_index_t BTR_FREED_INDEX_ID = 0;

  static void btr_free_root_invalidate(buf_block_t *block, mtr_t *mtr) {
    ut_ad(page_is_root(block->frame));
    btr_page_set_index_id(buf_block_get_frame(block),
                          buf_block_get_page_zip(block),
                          BTR_FREED_INDEX_ID, mtr);
  }
This function does NOT call buf_stat_per_index->dec() / erase() for the
original (space, real_index_id) key. It only rewrites the in-page identifier.
After this point, when the freed root block is eventually evicted by the LRU
and buf_LRU_block_remove_hashed() runs:
1. The page's PAGE_INDEX_ID is now 0 (sentinel), or the page has been reused
for FSP_FREE / a different segment.
2. The dec() call therefore either targets a different key (key 0, or a
newly assigned id), or is skipped entirely because of page-type checks.
3. The original (space, real_index_id) entry never receives a matching
dec() and is never erase()d.
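The three steps above can be reproduced in a toy model. The sketch below is
not InnoDB code: ToyStats, ToyPage, evict() and orphans_after() are
hypothetical names, a std::map stands in for the lock-free hash, and eviction
models only the "skipped" outcome from step 2. It shows one create / drop /
evict cycle per index id and counts the entries left behind.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>

// Toy stand-in for buf_stat_per_index: (space_id, index_id) -> page count.
// Hypothetical model for illustration only, not the InnoDB implementation.
using ToyKey = std::pair<std::uint32_t, std::uint64_t>;

constexpr std::uint64_t kFreedIndexId = 0;  // mirrors BTR_FREED_INDEX_ID

struct ToyPage {
  std::uint32_t space;
  std::uint64_t index_id;  // plays the role of PAGE_INDEX_ID
};

struct ToyStats {
  std::map<ToyKey, std::int64_t> counts;
  void inc(ToyKey k) { ++counts[k]; }
  void dec(ToyKey k) { --counts[k]; }  // never removes an entry
};

// LRU-removal side: decrement whatever id the page reports *now*.
// Assumption of this model: pages carrying the sentinel are skipped.
void evict(ToyStats &stats, const ToyPage &page) {
  if (page.index_id == kFreedIndexId) return;
  stats.dec({page.space, page.index_id});
}

// One create / drop / evict cycle for a single root page.
void create_drop_evict(ToyStats &stats, std::uint32_t space, std::uint64_t id) {
  assert(id != kFreedIndexId);
  ToyPage root{space, id};
  stats.inc({root.space, root.index_id});  // create side: inc() on the real id
  root.index_id = kFreedIndexId;           // drop side: id overwritten, no dec()
  evict(stats, root);                      // eviction no longer sees the real id
}

// Runs n cycles with fresh index ids and returns the number of retained keys.
std::size_t orphans_after(std::uint64_t n) {
  ToyStats stats;
  for (std::uint64_t id = 1; id <= n; ++id) create_drop_evict(stats, 7, id);
  return stats.counts.size();
}
```

In this model orphans_after(n) returns n: every cycle leaves exactly one key
with value 1 that nothing can reach again.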
Net effect per dropped index: at least one orphan entry with value == 1 is
permanently retained in buf_stat_per_index. Repeated CREATE/DROP INDEX and
TRUNCATE make the hash grow monotonically.
INFORMATION_SCHEMA.INNODB_CACHED_INDEXES diverges from
INFORMATION_SCHEMA.INNODB_INDEXES over time (cached >> live) on long-running
instances; on busy production servers the gap reaches millions of entries.
This is consistent with the "leaks memory by design" property already
acknowledged in Bug#35357691 for ut_lock_free_hash_t, but the problem
reported here is at a higher level: the *caller's* inc/dec contract is
broken, not the hash's internal erase() behavior.
Relationship to Bug#120441 (Patch approved 2025-06-08):
Bug#120441 addresses a different facet of the same hash. Its problem is
that the very first inc() for a new key inside btr_create() is issued while
the mini-transaction holds FSP / page SX latches; if that inc() happens to
trigger ut_lock_free_hash_t::grow() -> optimize(), the optimize runs
synchronously under those latches and a single btr_create() can stall for
hundreds of seconds. The approved fix pre-registers the key
(ensure_present(key, 0)) before mtr_start(), so the in-mtr inc() only
mutates an already-present entry.
That fix moves *when* the first insertion happens; it does not change
*whether* the entry is ever removed. After Bug#120441's fix is applied:
Aspect                                             | Before 120441 fix    | After 120441 fix    | This bug
---------------------------------------------------+----------------------+---------------------+--------------------
Where inc() runs                                   | inside mtr (latched) | before mtr_start()  | unchanged
Can inc() trigger grow()/optimize() under latch?   | Yes (the stall)      | No                  | unchanged
Is (space, index_id) ever erased on DROP/TRUNCATE? | No                   | No                  | No -- this bug
Long-term hash size                                | grows monotonically  | grows monotonically | grows monotonically
INNODB_CACHED_INDEXES vs INNODB_INDEXES            | diverges             | diverges            | diverges
So Bug#120441's fix removes the acute latch stall but leaves the underlying
memory growth -- and the fact that optimize() will eventually be invoked
again as the orphan-inflated hash crosses the next grow() threshold --
untouched. The two bugs are orthogonal:
- Bug#120441 = "do not call inc() at the wrong moment"
- this report = "the matching dec() / erase() for the drop path is missing"
Both fixes are required. Neither subsumes the other.
Affected files / functions:
- storage/innobase/include/buf0stats.h
    buf_stat_per_index_t (wraps ut_lock_free_hash_t)
- storage/innobase/btr/btr0btr.cc
    btr_create()               -- calls inc()
    btr_page_create()          -- calls inc()
    btr_free_root_invalidate() -- rewrites PAGE_INDEX_ID, no dec()/erase()
    btr_free_if_exists()       -- drop-path entry point
- storage/innobase/buf/buf0lru.cc
    buf_LRU_block_remove_hashed() -- calls dec() based on current PAGE_INDEX_ID
- storage/innobase/handler/i_s.cc
    INFORMATION_SCHEMA.INNODB_CACHED_INDEXES (the symptom signal)
Impact:
1. Memory leak: ~one entry per dropped index, retained for the lifetime of
the server. Unbounded on instances with frequent DDL / TRUNCATE.
2. Indirect amplification of Bug#120441: the inflated hash crosses
ut_lock_free_hash_t::grow() thresholds earlier and more often. Even
after Bug#120441 moves inc() out of the latch, optimize() itself still
runs, costs CPU, and prolongs DDL-heavy windows. Fixing this leak
directly reduces how often optimize() is triggered.
3. Diagnostic noise: INNODB_CACHED_INDEXES becomes unusable as an
indicator of actual cached B-tree footprint.
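To put rough numbers on item 2, the sketch below counts how many extra grow
steps a table needs once leaked entries are added to the live ones. It
assumes a table that doubles its capacity whenever it is full; that policy,
the starting capacity, and the helper names grow_events() and
extra_grow_events() are assumptions of this example, not the actual
ut_lock_free_hash_t thresholds.

```cpp
#include <cassert>
#include <cstddef>

// Number of doublings needed before a table that starts at
// `initial_capacity` can hold `entries` keys.
// Assumed policy: capacity doubles each time the table is full.
std::size_t grow_events(std::size_t initial_capacity, std::size_t entries) {
  assert(initial_capacity > 0);
  std::size_t capacity = initial_capacity;
  std::size_t grows = 0;
  while (capacity < entries) {
    capacity *= 2;
    ++grows;
  }
  return grows;
}

// Grow steps attributable to orphan entries alone: the difference between
// holding live + leaked keys and holding only the live ones.
std::size_t extra_grow_events(std::size_t initial_capacity, std::size_t live,
                              std::size_t leaked) {
  return grow_events(initial_capacity, live + leaked) -
         grow_events(initial_capacity, live);
}
```

Under these assumptions, a table sized for 1024 keys with 1000 live indexes
never needs to grow, while the same table carrying one million orphans has
gone through 10 doublings, each of which is an opportunity for optimize() to
run.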
How to repeat:
Any workload that repeatedly creates and drops indexes (or truncates tables)
on an 8.0 server reproduces it. Minimal SQL:
CREATE DATABASE IF NOT EXISTS leak_db;
USE leak_db;
DROP TABLE IF EXISTS t;
CREATE TABLE t (id INT PRIMARY KEY, a INT, b INT, c INT) ENGINE=InnoDB;
-- baseline
SELECT COUNT(*) AS cached_before FROM information_schema.innodb_cached_indexes;
SELECT COUNT(*) AS live_before FROM information_schema.innodb_indexes;
-- repeatedly create/drop a secondary index
-- (each iteration consumes a new index_id, so each iteration leaks 1 entry)
DELIMITER $$
CREATE PROCEDURE churn(IN n INT)
BEGIN
  DECLARE i INT DEFAULT 0;
  WHILE i < n DO
    SET @s = CONCAT('CREATE INDEX ix_', i, ' ON t(a)');
    PREPARE stmt FROM @s; EXECUTE stmt; DEALLOCATE PREPARE stmt;
    SET @s = CONCAT('DROP INDEX ix_', i, ' ON t');
    PREPARE stmt FROM @s; EXECUTE stmt; DEALLOCATE PREPARE stmt;
    SET i = i + 1;
  END WHILE;
END$$
DELIMITER ;
CALL churn(10000);
-- after
SELECT COUNT(*) AS cached_after FROM information_schema.innodb_cached_indexes;
SELECT COUNT(*) AS live_after FROM information_schema.innodb_indexes;
Observed:
  cached_before ~ N0
  live_before   ~ M0
  cached_after  ~ N0 + 10000   <-- monotonically grows by ~ #drops
  live_after    ~ M0           <-- unchanged
Equivalent reproductions:
- TRUNCATE TABLE on a table with N indexes leaks ~N entries per truncation,
because innobase_truncate goes through delete_impl() -> create_impl()
and each freed B-tree root takes the same btr_free_root_invalidate()
path.
- Any DDL workload that drops indexes/tables (sysbench prepare/cleanup
cycles, partition exchange/drop, etc.) shows the same monotonic growth.
On a long-running production instance:
SELECT
(SELECT COUNT(*) FROM information_schema.innodb_cached_indexes) AS cached,
(SELECT COUNT(*) FROM information_schema.innodb_indexes) AS live,
(SELECT COUNT(*) FROM information_schema.innodb_cached_indexes)
- (SELECT COUNT(*) FROM information_schema.innodb_indexes) AS leaked;
`leaked` only ever increases between restarts; neither buffer pool pressure
nor idle time brings it back down.
Suggested fix:
Close the inc/dec asymmetry on the index-drop path. Two equivalent options;
(A) is the smallest and most local.
Option A -- explicit erase in btr_free_root_invalidate()
  static void btr_free_root_invalidate(buf_block_t *block, mtr_t *mtr) {
    ut_ad(page_is_root(block->frame));

    const space_id_t space = block->page.id.space();
    const space_index_t orig_id = btr_page_get_index_id(block->frame);

    btr_page_set_index_id(buf_block_get_frame(block),
                          buf_block_get_page_zip(block),
                          BTR_FREED_INDEX_ID, mtr);

    if (orig_id != BTR_FREED_INDEX_ID) {
      /* Symmetric counterpart of the inc() in btr_create() /
      btr_page_create(). After this point PAGE_INDEX_ID is the sentinel
      and buf_LRU_block_remove_hashed() will not produce a paired
      dec() against the original key any more. */
      buf_stat_per_index->erase(index_id_t(space, orig_id));
    }
  }
This is the symmetric counterpart of the inc() in btr_create() and is
invoked exactly once per freed root, on the only code path that overwrites
PAGE_INDEX_ID with the sentinel.
Option B -- drop-path callback
If keeping btr_free_root_invalidate() purely structural is preferred, do
the erase() one level up, in the drop-index / free-root callers
(btr_free_if_exists() and the truncate / TRUNCATE-equivalent paths in
dict0crea.cc), iterating over each index of the table being dropped or
recreated.
In either case the invariant becomes: every (space, index_id) produced by
btr_create() is removed by exactly one drop-path erase(), and the LRU dec()
path becomes a no-op for already-erased keys (the existing code already
tolerates this because LRU removal of a freed root sees BTR_FREED_INDEX_ID
and will not match the original key anyway).
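The invariant can be checked in a toy model of Option A. As before, this is
only a sketch under stated assumptions: ToyStats, ToyPage,
free_root_invalidate() and entries_after_churn() are hypothetical names, a
std::map replaces the lock-free hash, and only the root page of each index is
modelled (non-root pages are outside the scope of this example).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>

// Toy stand-in for buf_stat_per_index; not the InnoDB implementation.
using ToyKey = std::pair<std::uint32_t, std::uint64_t>;

constexpr std::uint64_t kFreedIndexId = 0;  // mirrors BTR_FREED_INDEX_ID

struct ToyPage {
  std::uint32_t space;
  std::uint64_t index_id;  // plays the role of PAGE_INDEX_ID
};

struct ToyStats {
  std::map<ToyKey, std::int64_t> counts;
  void inc(ToyKey k) { ++counts[k]; }
  void erase(ToyKey k) { counts.erase(k); }
};

// Option A shape: remember the original id, write the sentinel, then erase
// the original key. The guard makes a repeated call on the same root a no-op.
void free_root_invalidate(ToyStats &stats, ToyPage &root) {
  const std::uint64_t orig_id = root.index_id;
  root.index_id = kFreedIndexId;
  if (orig_id != kFreedIndexId) {
    stats.erase({root.space, orig_id});
  }
}

// Creates and drops n indexes with fresh ids; returns the retained key count.
std::size_t entries_after_churn(std::uint64_t n) {
  ToyStats stats;
  for (std::uint64_t id = 1; id <= n; ++id) {
    ToyPage root{3, id};
    stats.inc({root.space, root.index_id});  // create side
    free_root_invalidate(stats, root);       // drop side: erases the real key
    free_root_invalidate(stats, root);       // second call changes nothing
    assert(root.index_id == kFreedIndexId);
  }
  return stats.counts.size();
}
```

With the erase in place, entries_after_churn(n) returns 0 for any n in this
model, in contrast to the one-orphan-per-drop behaviour described in the
report.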