Bug #120441 InnoDB btr_create() can hold page latches for a long time because buf_stat_per_index->inc() may synchronously run ut_loc
Submitted: 11 May 11:29 Modified: 12 May 2:31
Reporter: Demon Chen Email Updates:
Status: Open Impact on me: None
Category: MySQL Server: InnoDB storage engine Severity: S2 (Serious)
Version: trunk, 8.0.24, 8.4.9 OS: Any
Assigned to: CPU Architecture:Any

[11 May 11:29] Demon Chen
Description:
InnoDB can hold page latches for a very long time during B-tree creation because btr_create() calls buf_stat_per_index->inc() before the caller commits the mini-transaction.

We observed this in production with a TRUNCATE TABLE statement stuck for hundreds of seconds in the InnoDB transaction state “creating index”.

The observed statement was:

TRUNCATE TABLE hvbi.ODS_C_CAT;

SHOW ENGINE INNODB STATUS showed:

TRANSACTION ..., ACTIVE ... sec creating index
MySQL thread id 25531586, OS thread handle 281451691169904
TRUNCATE TABLE hvbi.ODS_C_CAT

The SEMAPHORES section showed a different thread waiting for a page latch:

Thread 281453378132080 has waited at buf0flu.cc line 1413 ... SX-lock on RW-latch ...
a writer (thread id 281451691169904) has reserved it in mode SX
Last time write locked in .../storage/innobase/fsp/fsp0fsp.cc line 203

The latch holder thread 281451691169904 was the TRUNCATE thread. The waiting thread 281453378132080 was a flush/page-cleaner thread waiting at buf0flu.cc:1413.

The relevant wait point is:

rw_lock_sx_lock_gen(rw_lock, BUF_IO_WRITE)

The latch was last write locked at fsp0fsp.cc:203:

blk = buf_page_get(page_id_t(id, 0), page_size, RW_SX_LATCH, mtr);

This indicates that the TRUNCATE thread was holding a tablespace page 0 / FSP header page latch, while a flush thread was waiting for the same page latch.

A core dump showed that the TRUNCATE thread was not stuck in file extension or I/O. Instead, it was stuck inside the per-index buffer pool statistics hash table:

#0  ut_lock_free_hash_t::insert_or_get_position_in_array(...)
#1  ut_lock_free_hash_t::insert_or_update(...)
#2  ut_lock_free_hash_t::copy_to_another_array(...)
#3  ut_lock_free_hash_t::optimize()
#4  ut_lock_free_hash_t::insert_or_update(...)
#5  buf_stat_per_index_t::inc(...)
#6  btr_create(...)
#7  dict_create_index_tree_in_mem(...)
#8  row_create_index_for_mysql(...)
#9  create_index(...)
#10 create_table_info_t::create_table(...)
#11 innobase_basic_ddl::create_impl(...)
#12 innobase_truncate::truncate(...)
#13 innobase_truncate::exec(...)
#14 ha_innobase::truncate_impl(...)
#15 ha_innobase::create(...)
#16 handler::ha_create(...)
#17 ha_create_table(...)
#18 Sql_cmd_truncate_table::truncate_base(...)
#19 Sql_cmd_truncate_table::execute(...)

In frame copy_to_another_array(), GDB showed:

k = 138019246553550166
v = 1
i = 5957568

The key corresponds to buf_stat_per_index’s key format:

key = (space_id << 32) | index_id

Decoded:

space_id = key >> 32 = 321139
index_id = key & 0xffffffff = 532119

The value i = 5957568 means that copy_to_another_array() had already scanned about 5.9 million hash slots. This indicates that ut_lock_free_hash_t::optimize() was synchronously migrating a very large hash array.

The problematic code structure is as follows.

dict_create_index_tree_in_mem() starts an mtr, calls btr_create(), and commits the mtr only after btr_create() returns:

mtr_start(&mtr);

page_no = btr_create(index->type, index->space, dict_table_page_size(index->table),
                     index->id, index, &mtr);

index->page = page_no;
index->trx_id = trx->id;

mtr_commit(&mtr);

However, btr_create() calls buf_stat_per_index->inc() before returning:

ut_ad(page_get_max_insert_size(page, 2) > 2 * BTR_PAGE_MAX_REC_SIZE);

buf_stat_per_index->inc(index_id_t(space, index_id));

return (page_no);

Therefore, buf_stat_per_index->inc() is executed before the caller commits the mtr.

buf_stat_per_index_t::inc() calls the lock-free hash:

void inc(const index_id_t &id) {
  if (should_skip(id)) {
    return;
  }

  m_store->inc(id.conv_to_int());
}

ut_lock_free_hash_t::insert_or_update() may grow the hash table and then synchronously call optimize():

next = arr->grow(DELETED, &grown_by_this_thread);

if (grown_by_this_thread) {
  call_optimize = true;
}

...

if (optimize_allowed && call_optimize) {
  optimize();
}

optimize() synchronously migrates entries from the old array to the new array:

void optimize() {
  mutex_enter(&m_optimize_latch);

  for (;;) {
    arr_node_t *arr = m_data.load(std::memory_order_relaxed);
    arr_node_t *next = arr->m_next.load(std::memory_order_relaxed);

    if (next == nullptr) {
      break;
    }

    copy_to_another_array(arr, next);

    ...
  }

  mutex_exit(&m_optimize_latch);
}

copy_to_another_array() scans the entire old array:

void copy_to_another_array(arr_node_t *src_arr, arr_node_t *dst_arr) {
  for (size_t i = 0; i < src_arr->m_n_base_elements; i++) {
    key_val_t *t = &src_arr->m_base[i];

    ...

    insert_or_update(k, v, false, dst_arr, false);
  }
}

In our case, this scan had already reached i = 5957568.

This means a non-critical statistics update can run a large synchronous hash table migration while page latches acquired by the mtr are still held. This can prolong latch ownership for hundreds of seconds and block flush/page-cleaner threads.

This is not TRUNCATE-specific. TRUNCATE was only the trigger in this case because InnoDB implements it through a recreate path. Any operation that creates an index tree and reaches btr_create() may trigger the same issue, including CREATE TABLE, CREATE INDEX, ALTER TABLE ADD INDEX, ALTER TABLE FORCE, OPTIMIZE TABLE, table rebuild, and partition rebuild/truncate paths.

Current MySQL trunk status: The current MySQL trunk still appears to have the same structural issue. btr_create() still calls:

buf_stat_per_index->inc(index_id_t(space, index_id));

before returning, and btr_create() itself does not commit the mtr. The caller still commits the mtr after btr_create() returns.

The current trunk has improved ut_lock_free_hash_t::guess_position() by using:

ut::hash_uint64(key) & (arr_size - 1)

instead of the older:

ut_fold_ull(key) & (arr_size - 1)

This should reduce hash collisions and reduce the probability of hitting this issue. However, it does not eliminate the structural problem that a synchronous ut_lock_free_hash_t::optimize() can still happen inside btr_create() before mtr_commit(). The current trunk also does not appear to provide a runtime switch to disable buf_stat_per_index tracking.

Expected behavior: A non-critical per-index buffer pool statistics update should not be able to keep InnoDB page latches held for a long time during B-tree creation. Expensive hash table migration should not happen inside the mtr/latch critical path.

Actual behavior: buf_stat_per_index->inc() can trigger ut_lock_free_hash_t::optimize() inside btr_create() before the caller executes mtr_commit(). If the hash table has grown large, optimize() may synchronously copy millions of slots. During this time, page latches acquired earlier in the same mtr remain held, causing page cleaner / flush threads to wait.

Impact:

- Long-running DDL, for example TRUNCATE TABLE stuck in “creating index”.
- Long semaphore waits in buf0flu.cc.
- Page cleaner / flush threads blocked by a DDL thread holding a page latch.
- Potentially wider flushing stalls if the page cleaner is blocked for a long time.

How to repeat:
This issue is workload dependent and may require the per-index buffer pool statistics hash table to become large before the problem is visible.

A possible way to reproduce or stress the problem is:

1. Use a MySQL/InnoDB build where ut_lock_free_hash_t::guess_position() uses ut_fold_ull(key) & (arr_size - 1), and where btr_create() calls buf_stat_per_index->inc() before the caller commits the mtr.

2. Create a workload that causes high churn in buf_stat_per_index keys. The key format is:

key = (space_id << 32) | index_id

Such churn can be produced by repeatedly creating, dropping, truncating, and rebuilding many InnoDB tables/indexes, especially with multiple secondary indexes and partitions.

3. Keep the server under buffer pool page read/evict pressure, because normal page reads and evictions also update buf_stat_per_index through inc() and dec().

4. Once the hash has grown large, run DDL that creates an index tree, for example:

CREATE TABLE t (..., PRIMARY KEY(...), KEY k1(...), KEY k2(...)) ENGINE=InnoDB;

or:

TRUNCATE TABLE existing_innodb_table;

or:

ALTER TABLE t ADD INDEX k_new(...);

5. If the DDL thread happens to trigger ut_lock_free_hash_t::optimize() from btr_create(), the thread may spend a long time in:

ut_lock_free_hash_t::copy_to_another_array()
ut_lock_free_hash_t::optimize()
buf_stat_per_index_t::inc()
btr_create()

6. While this is happening, use SHOW ENGINE INNODB STATUS. The DDL transaction may show “creating index”, and the SEMAPHORES section may show a flush/page-cleaner thread waiting at buf0flu.cc for a latch last write locked by the DDL thread at fsp0fsp.cc.

The exact reproduction depends on hash table size, key distribution, deleted entries, and concurrent buffer pool activity. In our production core dump, copy_to_another_array() had already scanned 5,957,568 slots.

Suggested fix:
Suggested fix 1: Move buf_stat_per_index->inc(index_id_t(space, index_id)) out of btr_create(), or at least ensure it is executed only after the caller commits the mtr and releases page latches.

For example, conceptually in dict_create_index_tree_in_mem():

page_no = btr_create(..., &mtr);
index->page = page_no;
index->trx_id = trx->id;
mtr_commit(&mtr);

if (page_no != FIL_NULL) {
  buf_stat_per_index->inc(index_id_t(index->space, index->id));
}

This would avoid executing a potentially expensive statistics hash update while page latches are still held by the mtr.

Suggested fix 2: Make ut_lock_free_hash_t::optimize() asynchronous or lazy, so the thread that triggers grow() does not synchronously migrate a potentially very large array.

Suggested fix 3: Add a runtime switch to disable buf_stat_per_index tracking in high-churn or high-contention environments.

Suggested fix 4: Keep the improved hash function currently present in trunk, but also add instrumentation for:

- grow count
- optimize count
- optimize duration
- copied array size
- maximum linear probe length
- current array size
- array chain length

This would help detect the problem before it manifests as a long semaphore wait.
[12 May 2:31] Demon Chen