Bug #77401 btr_search_drop_page_hash_index() accesses potentially freed dict_index_t
Submitted: 18 Jun 2015 8:48 Modified: 12 Aug 2015 13:50
Reporter: Marko Mäkelä Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S2 (Serious)
Version:5.0.3 OS:Any
Assigned to: CPU Architecture:Any

[18 Jun 2015 8:48] Marko Mäkelä
Description:
I realized that I introduced a potential race condition in MySQL 5.0.3 when implementing ROW_FORMAT=COMPACT.

The race condition goes like this:

1. Thread 1 starts executing btr_search_drop_page_hash_index() on a block.
2. Thread 2 starts executing btr_search_drop_page_hash_index() on the same block.
3. Thread 2 finishes, setting block->index = NULL, and also setting index->search_info->ref_count = 0.
4. The table is dropped (or starting from MySQL 5.6, evicted from the dict_sys cache).
5. Thread 1 resumes, dereferencing the block->index that it cached to a local variable, while computing the folds[]

At step 5, we would have block->index == NULL, and the previous value of block->index points to memory that was freed at step 4.

How to repeat:
Read the code.
For an easier-to-understand repetition scenario, replace Thread 2 with btr_search_disable().

Suggested fix:
Never dereference block->index while not holding the adaptive hash index latch.
After acquiring the latch, check if block->index==NULL (we shold always do this already).
[18 Jun 2015 9:04] Marko Mäkelä
Posted by developer:
 
Workaround: --skip-innodb-adaptive-hash-index
[12 Aug 2015 13:47] Marko Mäkelä
Posted by developer:
 
I wanted to fix this perceived problem by introducing an optional
parameter "bool hazard=false" to the function
btr_search_drop_page_hash_index(). It would only be passed as
hazard=true by buf_LRU_free_page(), which I verified to be the only
caller that is not necessarily holding any MDL or table handle on the
table that the being-freed page belongs to.

To obtain proof that my patch is needed for correctness, I applied a
simpler patch, adding some diagnostics to
btr_search_drop_page_hash_index():

diff --git a/storage/innobase/btr/btr0sea.cc b/storage/innobase/btr/btr0sea.cc
index 6aa0650..3452816 100644
--- a/storage/innobase/btr/btr0sea.cc
+++ b/storage/innobase/btr/btr0sea.cc
@@ -1218,6 +1218,15 @@ retry:
 	      || rw_lock_own(&block->lock, RW_LOCK_S)
 	      || rw_lock_own(&block->lock, RW_LOCK_X));
 
+	if (index->table->n_ref_count == 0 && !mutex_own(&dict_sys->mutex)) {
+		fprintf(stderr, "InnoDB: Warning: hazard: %s\n",
+			index->table->name.m_name);
+		if (strstr(index->table->name.m_name, "test/t") ||
+		    strstr(index->table->name.m_name, "mysql/")) {
+			os_thread_sleep(100000000);/*100 seconds*/
+		}
+	}
+
 	/* We must not dereference index here, because it could be freed
 	if (index->table->n_ref_count == 0 && !mutex_own(&dict_sys->mutex)).
 	Determine the ahi_slot based on the block contents. */

A variation of this instrumentation (replacing the sleep with an
assertion failure) was triggered by some tests, such as these ones:

innodb_zip.wl6347_comp_indx_stat         [ pass ]  26504
innodb_zip.wl6470_2                      [ fail ]
        Test ended at 2015-08-12 15:15:13

CURRENT_TEST: innodb_zip.wl6470_2
mysqltest: At line 245: query 'reap' failed: 2013: Lost connection to MySQL server during query

The current statement is:
update t5 set c1 = c1 + 1000 where c1 > 10;

I was able to merge these two test files and reduce them to a single
test case that runs with --manual-gdb, for a different statement:
delete from t5 where c1 > 10

Both times, we try to evict AHI entries for the table
mysql.innodb_index_stats.  In the active gdb session, we have two
active threads (the others are idle, waiting for user input or
internal events):

Thread 26: delete from t5 where c1 > 10
Thread 25: delete from t2 where c1 > 10

The stack trace of Thread 26 is like this:

#2 btr_search_drop_page_hash_index
#3 buf_LRU_free_page (bpage=0x7fffde6c1778, zip=true)
#4 buf_LRU_free_from_common_LRU_list (buf_pool=0x2d799f8, scan_all=false)
#5 buf_LRU_scan_and_free_block (buf_pool=0x2d799f8, scan_all=false)
#6 buf_LRU_get_free_block (buf_pool=0x2d799f8)
#7 buf_page_create
#8 fsp_page_create
#9 fsp_alloc_free_page (space=22, ...)
#10 fseg_alloc_free_page_low
#11 fseg_alloc_free_page_general
#12 trx_undo_add_page

Note: the space=22 corresponds to srv_tmp_space.m_space_id (allocating
undo log for an operation on a TEMPORARY TABLE). It has nothing to do
with the tablespace (14) of the table mysql.innodb_index_stats. Only
the clustered index root page (14:3) has AHI entries. It is a
single-page index with 7 entries.

I let Thread 26 resume from the breakpoint that I had set at the
sleep, and concurrently initiated a new connection like this:

echo 'drop table mysql.innodb_index_stats;'|
mysql --defaults-file=/dev/shm/t/mysql-test/var/my.cnf -A -uroot

Then, the gdb session would show the following for the new Thread 27:

#2 dict_index_remove_from_cache_low
#3 dict_table_remove_from_cache_low
#4 dict_table_remove_from_cache
#5 row_drop_table_from_cache (tablename="mysql/innodb_index_stats",...)
#6 row_drop_table_for_mysql (name="mysql/innodb_index_stats",...)
#7 ha_innobase::delete_table (name="./mysql/innodb_index_stats")

[New Thread 0x7fffde495700 (LWP 11029)]
2015-08-12T12:49:47.178869Z 5 [ERROR] InnoDB: Waited for 5 secs for hash index ref_count (1) to drop to 0. index: `PRIMARY` table: `mysql`.`innodb_index_stats`
2015-08-12T12:49:52.273839Z 5 [ERROR] InnoDB: Waited for 10 secs for hash index ref_count (1) to drop to 0. index: `PRIMARY` table: `mysql`.`innodb_index_stats`

So, my reproduction attempt failed due to the above check in
dict_index_remove_from_cache_low().

Let us check if it is possible (although improbable) that another
thread invokes btr_search_drop_page_hash_index() on the same page
(14:3) to make the dict_index_remove_from_cache_low() return. If we
look at Thread 26, it is not holding any of the following:

* MDL (otherwise, the DROP TABLE would not have reached ha_innobase::delete_table)
* dict_sys->mutex
* block->mutex (block->mutex.m_impl.is_owned() == false)
* block->lock (block->lock->debug_list is empty)
* block->page.buf_fix_count (it is 0)

It would first seem that another thread could drop the AHI before our
Thread 26 acquires the AHI latch for this slot.

However, in this test scenario, we also have
block->page.state==BUF_BLOCK_REMOVE_HASH and
block->page.in_page_hash==false. The are set in
buf_LRU_block_remove_hashed().

Now, the question is: Can buf_LRU_free_page() invoke
btr_search_drop_page_hash_index() without first invoking
buf_LRU_block_remove_hashed()? If not, we are safe. And the answer is
indeed no, we are safe.

(Clarification: buf_LRU_block_remove_hashed() makes the buf_block_t
unaccessible to other threads by disconnecting it from the
(space_id,page_no) lookup table and by changing its state to
BUF_BLOCK_REMOVE_HASH.)

So, it looks like this was an unfounded concern.

I would not call it a wasted effort, because it was not obvious to me
how this would work.

It is hard to repeat this kind of scenarios in mysql-test-run (mtr),
because mtr tests tend to be rather short, and they will always
execute DROP TABLE to clean up after themselves. Therefore, there
usually are not many big 'idle' tables in the buffer pool, and you do
not tend to get this type of AHI eviction by buffer pool eviction; you
would get it evicted by DROP TABLE or DROP DATABASE instead. In mtr,
this type of scenario is basically only possible for shared predefined
tables, such as data dictionary tables, the persistent statistics
tables, or (theoretically) the GTID table if do not have too frequent
updates to it.
[12 Aug 2015 14:10] Marko Mäkelä
Posted by developer:
 
It could seem that my analysis did not cover the scenario in the Description.
The instrumentation that I introduced was at the start of the function
btr_search_drop_page_hash_index(), before it acquires the AHI latch for the first time.
The AHI latch would be released during the computation of the folds[] array,
and reacquired after that.

Most calls to btr_search_drop_page_hash_index() are protected by block->lock.
The only exception is buf_LRU_free_page(), which would have disconnected the
block from buf_page_get_gen() and invalidated the block state. Because of this,
the only way to get multiple concurrent calls to btr_search_drop_page_hash_index()
on the same block would be when each thread is holding block->lock S-latch.
But, in this case the threads would also be holding an open table handle or MDL,
preventing table metadata eviction and DROP TABLE.