Bug #63196 set innodb_adaptive_hash_index=0 hangs server
Submitted: 11 Nov 2011 6:19 Modified: 17 Jul 2013 14:14
Reporter: Mark Callaghan Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Server: InnoDB Plugin storage engine Severity:S1 (Critical)
Version:5.1.52 OS:Any
Assigned to: CPU Architecture:Any
Tags: adaptive, hang, hash, innodb

[11 Nov 2011 6:19] Mark Callaghan
Description:
This was done on a running server that was too slow with lots of stalls where it appeared that:
1) many threads were waiting to lock btr_search_latch
2) holder of btr_search_latch was waiting on other mutexes

So I took a chance and disabled the AHI. That didn't work out too well

This is the stack for the thread trying to disable the AHI

Thread 284 (Thread 0x5e1e5940 (LWP 7526)):
#0  0x0000003278c0a496 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000007aabf5 in os_event_wait_low (event=0x34841810, reset_sig_count=<value optimized out>) at os/os0sync.c:399
#2  0x00000000007ebaa0 in sync_array_wait_event (arr=0x1f474e20, index=11) at sync/sync0arr.c:451
#3  0x00000000007ec21a in rw_lock_s_lock_spin (lock=0x2aaabed26648, pass=<value optimized out>, file_name=0x9b8272 "btr/btr0sea.c", line=1244) at sync/sync0rw.c:436
#4  0x000000000082873a in rw_lock_s_lock_func (space=34, zip_size=8192, offset=263269, rw_latch=1, guess=0x0, mode=11, file=0x9b8272 "btr/btr0sea.c", line=1244, mtr=0x5e1e33c0) at ./incl
ude/sync0rw.ic:418
#5  buf_page_get_gen (space=34, zip_size=8192, offset=263269, rw_latch=1, guess=0x0, mode=11, file=0x9b8272 "btr/btr0sea.c", line=1244, mtr=0x5e1e33c0) at buf/buf0buf.c:2535
#6  0x000000000081f28d in btr_search_drop_page_hash_when_freed (space=34, zip_size=8192, page_no=263269) at btr/btr0sea.c:1243
#7  0x0000000000856895 in fseg_free_page_low (seg_inode=0x2aaaef21c272 "", space=34, zip_size=0, page=18446744073709551615, mtr=0x34841810) at fsp/fsp0fsp.c:3258
#8  0x0000000000856ea8 in fseg_free_page (seg_header=0x2aaaef25c04a "", space=34, page=263269, mtr=0x5e1e4590) at fsp/fsp0fsp.c:3395
#9  0x000000000080df20 in btr_compress (cursor=0x5e1e4000, mtr=0x5e1e4590) at btr/btr0btr.c:2810
#10 0x0000000000811ad3 in btr_cur_compress_if_useful (cursor=0x34841854, mtr=0xa) at btr/btr0cur.c:2769
#11 0x000000000081a2a5 in btr_cur_pessimistic_delete (err=0x5e1e40e8, has_reserved_extents=<value optimized out>, cursor=0x5e1e4000, rb_ctx=RB_NONE, mtr=0x5e1e4590) at btr/btr0cur.c:3012
#12 0x00000000007d2fe4 in row_purge_remove_sec_if_poss_low (node=<value optimized out>, index=0x2ab654065ab8, entry=0x2ab65819b748, mode=33) at row/row0purge.c:300
#13 0x00000000007d3ded in row_purge (thr=0x2ab689d4b0b0) at row/row0purge.c:335
#14 row_purge_step (thr=0x2ab689d4b0b0) at row/row0purge.c:700
#15 0x00000000007be64b in que_thr_step (thr=0x2ab689d4b0b0) at que/que0que.c:1248
#16 que_run_threads_low (thr=0x2ab689d4b0b0) at que/que0que.c:1306
#17 que_run_threads (thr=0x2ab689d4b0b0) at que/que0que.c:1343
#18 0x00000000007f0cff in trx_purge_worker (worker_id=<value optimized out>) at trx/trx0purge.c:1238
#19 0x00000000007e4f71 in srv_purge_worker_thread (arg=<value optimized out>) at srv/srv0srv.c:3587
#20 0x0000003278c062f7 in start_thread () from /lib64/libpthread.so.0
#21 0x00000032780d1e3d in clone () from /lib64/libc.so.6
#22 0x0000000000000000 in ?? ()

In that case thread was at buf_page_get_gen in
btr_search_drop_page_hash_when_freed(
/*=================================*/
        ulint   space,          /*!< in: space id */
        ulint   zip_size,       /*!< in: compressed page size in bytes
                                or 0 for uncompressed pages */
        ulint   page_no)        /*!< in: page number */
{
        buf_block_t*    block;
        mtr_t           mtr;

        if (!buf_page_peek_if_search_hashed(space, page_no)) {

                return;
        }

        mtr_start(&mtr);

        /* We assume that if the caller has a latch on the page, then the
        caller has already dropped the hash index for the page, and we never
        get here. Therefore we can acquire the s-latch to the page without
        having to fear a deadlock. */

        block = buf_page_get_gen(space, zip_size, page_no, RW_S_LATCH, NULL,
                                BUF_GET_IF_IN_POOL, __FILE__, __LINE__,
                                &mtr);

I think it was really hung because a few minutes later this was on the same page number.

#6  0x000000000081f28d in btr_search_drop_page_hash_when_freed (space=34, zip_size=8192, page_no=263269) at btr/btr0sea.c:1243

How to repeat:
disable AHI on a busy server
[11 Nov 2011 6:19] Mark Callaghan
And this is a summary of the other thread stacks

    38 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,os_aio_simulated_handle,fil_aio_wait,io_handler_thread,start_thread,clone,??
     11 pthread_cond_wait@@GLIBC_2.3.2,MYSQL_BIN_LOG::wait_for_update,mysql_binlog_send,dispatch_command,do_command,handle_one_connection,start_thread,clone,??
     10 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,rw_lock_s_lock_spin,rw_lock_s_lock_func,row_mysql_freeze_data_dictionary_func,row_undo,row_undo_step,que_thr_step,que_run_threads_low,que_run_threads,trx_general_rollback_for_mysql,row_mysql_handle_errors,row_insert_for_mysql,ha_innobase::write_row,handler::ha_write_row,write_record,mysql_insert,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone,??
      6 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,rw_lock_s_lock_spin,rw_lock_s_lock_func,mtr_s_lock_func,btr_cur_search_to_nth_level,btr_pcur_open_func,row_search_index_entry,row_purge_remove_sec_if_poss_low,row_purge,row_purge_step,que_thr_step,que_run_threads_low,que_run_threads,trx_purge_worker,srv_purge_worker_thread,start_thread,clone,??
      4 read,my_real_read,my_net_read,do_command,handle_one_connection,start_thread,clone,??
      1 select,os_thread_sleep,sync_array_print_long_waits,srv_error_monitor_thread,start_thread,clone,??
      1 select,os_thread_sleep,srv_monitor_thread,start_thread,clone,??
      1 select,os_thread_sleep,srv_LRU_dump_restore_thread,start_thread,clone,??
      1 select,os_thread_sleep,srv_lock_timeout_thread,start_thread,clone,??
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,rw_lock_x_lock_low,rw_lock_x_lock_func,row_mysql_lock_data_dictionary_func,ha_innobase::info,get_schema_tables_record,get_all_tables,get_schema_tables_result,JOIN::exec,mysql_select,handle_select,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone,??
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,rw_lock_x_lock_func,buf_pool_drop_hash_index,btr_search_disable,sys_var_pluginvar::update,set_var::update,sql_set_variables,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone,??
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,rw_lock_s_lock_spin,rw_lock_s_lock_func,mtr_s_lock_func,btr_cur_search_to_nth_level,row_ins_index_entry_low,row_ins_index_entry,row_ins_index_entry_step,row_ins,row_ins_step,row_insert_for_mysql,ha_innobase::write_row,handler::ha_write_row,write_record,mysql_insert,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone,??
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,rw_lock_s_lock_spin,rw_lock_s_lock_func,mtr_s_lock_func,btr_cur_search_to_nth_level,btr_pcur_open_func,row_search_index_entry,row_purge_remove_sec_if_poss_low,row_purge,row_purge_step,que_thr_step,que_run_threads_low,que_run_threads,trx_purge,srv_purge_thread,start_thread,clone,??
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,rw_lock_s_lock_spin,rw_lock_s_lock_func,buf_page_get_gen,btr_search_drop_page_hash_when_freed,fseg_free_page_low,fseg_free_page,btr_compress,btr_cur_compress_if_useful,btr_cur_pessimistic_delete,row_purge_remove_sec_if_poss_low,row_purge,row_purge_step,que_thr_step,que_run_threads_low,que_run_threads,trx_purge_worker,srv_purge_worker_thread,start_thread,clone,??
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,rw_lock_s_lock_spin,buf_flush_page,buf_flush_try_neighbors,buf_flush_batch,srv_master_thread,start_thread,clone,??
      1 pthread_cond_timedwait@@GLIBC_2.3.2,kill_server,kill_server_thread,start_thread,clone,??
      1 __lll_mutex_lock_wait,_L_mutex_lock_1133,pthread_mutex_lock,THD::init,THD::THD,handle_connections_sockets,main
      1 do_sigwait,sigwait,signal_hand,start_thread,clone,??
[11 Nov 2011 6:28] Sunny Bains
pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,rw_lock_s_lock_spin,rw_lock_s_lock_func,mtr_s_lock_func,btr_cur_search_to_nth_level,btr_pcur_open_func,row_search_index_entry,row_purge_remove_sec_if_poss_low,row_purge,row_purge_step,que_thr_step,que_run_threads_low,que_run_threads,trx_purge_worker,srv_purge_worker_thread,start_thread,clone,??

This doesn't look like Oracle code:

  trx_purge_worker
  srv_purge_worker_thread
[11 Nov 2011 6:30] Mark Callaghan
Sure it is. It just isn't 5.1.52 code. That is the innodb multi-threaded purge patch. I think it originated from one of your branches and then you redid it. Note that without this patch we might not be running mysql today.
[11 Nov 2011 6:36] Sunny Bains
Which branch, do you remember? The reason is that I don't remember writing a patch that
worked like this. All versions of MT purge have used the work queue. So, I'm curious.  All
worker threads always call trx_purge() and always have, I don't recall having a special function
trx_purge_worker().
[11 Nov 2011 6:47] Mark Callaghan
You would have to ask Percona. I thought that an official branch (trunk or something) was the basis for their patch. I don't want to take credit away from them for code they might have originated though. But when I needed multi-threaded purge, their patch was much easier to port than the official fix in 5.6.
[11 Nov 2011 6:52] MySQL Verification Team
could this be related?

http://bugs.mysql.com/bug.php?id=59225
(Deadlock between DML and innodb_adaptive_hash_index=OFF)
[11 Nov 2011 6:53] Sunny Bains
ok, then it is not our patch. I can see why back porting the 5.6 patch is more difficult. The kernel
mutex split and code reorganisation would make it problematic. Though I don't think purge is
related to this bug.
[11 Nov 2011 8:55] MySQL Verification Team
i got a hang.  check thread 23.

Attachment: bug63196_5.5.17_hang.txt (text/plain), 195.44 KiB.

[11 Nov 2011 9:05] MySQL Verification Team
thread apply all bt full

Attachment: bug63196_5.5.17_hang_thread_apply_all_bt_full.txt.zip (application/octet-stream, text), 49.28 KiB.

[11 Nov 2011 14:56] MySQL Verification Team
same testcase as above, with 5.5.13, also hangup after a while.

Attachment: gdb_5.5.13_hang_stacks.txt (text/plain), 201.64 KiB.

[11 Nov 2011 15:11] MySQL Verification Team
repeated a hang quicker on 5.1.59+plugin.... stacks.

Attachment: gdb_5.1.59_plugin_hang_stacks.txt (text/plain), 115.82 KiB.

[14 Nov 2011 4:50] MySQL Verification Team
this is not repeatable on mysql-trunk after 48 hours.
[17 Jul 2013 14:14] MySQL Verification Team
Apparently the bug is not repeatable on current versions due to the fix for http://bugs.mysql.com/bug.php?id=62487 having fixed this issue also.