Bug #69276 fil_system->mutex contention limits block read rate with fast storage
Submitted: 18 May 2013 19:23 Modified: 19 May 2013 5:01
Reporter: Mark Callaghan Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S5 (Performance)
Version:5.6.11 OS:Any
Assigned to: Assigned Account CPU Architecture:Any

[18 May 2013 19:23] Mark Callaghan
Description:
With really fast storage (simulated by using RAM via OS filesystem cache) the block read rate on my 32-core test servers is limited by mutex contention on fil_system->mutex after fixing a few other things in 5.6.11 including http://bugs.mysql.com/bug.php?id=69275 and a change to make the page cleaner thread run more than once per second.

By "limits" I mean that I can't get more than 240,000 block reads/second from a simple read-only sysbench workload. While there are worse problems to have than this, I think InnoDB can do better without major changes.

The problem is that this is done at least 4 times per block read from fil_space_get_version, fil_inc_pending_ops, fil_decr_pending_ops, fil_space_get_size:
  lock fil_system->mutex
  fil_space_get_by_id(id)
  do some simple work
  unlock fil_system->mutex

Can this be optimized to only do the hash table lookup once per block read? Have the first of the callers increment a reference count on the space to keep it from being deleted and then somehow save the pointer to the tablespace so the other callers can skip the hash table lookup.

How to repeat:
Run sysbench with a simple read-only workload -- lookup one row by PK with the database cached in the OS filesystem cache and the InnoDB buffer pool much smaller than the database. I hacked sysbench to optionally use HANDLER READ and then I hacked InnoDB to use the read-only optimization in transaction start for HANDLER READ. I also changed the page cleaner thread to run 10 times/second rather than once to keep ahead of the demand for free pages from the LRU tail.

Suggested fix:
use a reference count on the tablespace and save the pointer after the first lookup
[18 May 2013 19:25] Mark Callaghan
Aggregated thread stacks from one example. The count prior to each call stack is the number of threads with that stack.

     47 pthread_cond_wait@@GLIBC_2.3.2,os_cond_wait,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,mutex_enter_func,fil_space_get_version,buf_read_page,buf_page_get_gen,btr_cur_search_to_nth_level,btr_pcur_open_with_no_init_func,row_search_for_mysql,ha_innobase::index_read,Sql_cmd_handler_read::execute,mysql_execute_command,mysql_parse,mysql_parse,dispatch_command,do_handle_one_connection,handle_one_connection,start_thread,clone
     10 __io_getevents_0_4,os_aio_linux_collect,os_aio_linux_handle,fil_aio_wait,io_handler_thread,start_thread,clone
      2 pthread_cond_wait@@GLIBC_2.3.2,os_cond_wait,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,mutex_enter_func,fil_io,buf_read_page_low,buf_read_page,buf_page_get_gen,btr_cur_search_to_nth_level,btr_pcur_open_with_no_init_func,row_search_for_mysql,ha_innobase::index_read,Sql_cmd_handler_read::execute,mysql_execute_command,mysql_parse,mysql_parse,dispatch_command,do_handle_one_connection,handle_one_connection,start_thread,clone
      2 pthread_cond_wait@@GLIBC_2.3.2,os_cond_wait,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,mutex_enter_func,fil_inc_pending_ops,ibuf_merge_or_delete_for_page,buf_page_io_complete,buf_read_page_low,buf_read_page,buf_page_get_gen,btr_cur_search_to_nth_level,btr_pcur_open_with_no_init_func,row_search_for_mysql,ha_innobase::index_read,Sql_cmd_handler_read::execute,mysql_execute_command,mysql_parse,mysql_parse,dispatch_command,do_handle_one_connection,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_cond_wait,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,mutex_enter_func,fil_space_get_size,buf_flush_try_neighbors,buf_flush_page_and_try_neighbors,buf_do_flush_list_batch,buf_flush_batch,buf_flush_list,page_cleaner_do_flush_batch,page_cleaner_flush_pages_if_needed,buf_flush_page_cleaner_thread,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_cond_wait,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,mutex_enter_func,fil_mutex_enter_and_prepare_for_io,fil_io,buf_read_page_low,buf_read_page,buf_page_get_gen,btr_cur_search_to_nth_level,btr_pcur_open_with_no_init_func,row_search_for_mysql,ha_innobase::index_read,Sql_cmd_handler_read::execute,mysql_execute_command,mysql_parse,mysql_parse,dispatch_command,do_handle_one_connection,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_cond_wait,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,mutex_enter_func,fil_decr_pending_ops,ibuf_merge_or_delete_for_page,buf_page_io_complete,buf_read_page_low,buf_read_page,buf_page_get_gen,btr_cur_search_to_nth_level,btr_pcur_open_with_no_init_func,row_search_for_mysql,ha_innobase::index_read,Sql_cmd_handler_read::execute,mysql_execute_command,mysql_parse,mysql_parse,dispatch_command,do_handle_one_connection,handle_one_connection,start_thread,clone
      1 mutex_reset_lock_word,mutex_exit_func,fil_inc_pending_ops,ibuf_merge_or_delete_for_page,buf_page_io_complete,buf_read_page_low,buf_read_page,buf_page_get_gen,btr_cur_search_to_nth_level,btr_pcur_open_with_no_init_func,row_search_for_mysql,ha_innobase::index_read,Sql_cmd_handler_read::execute,mysql_execute_command,mysql_parse,mysql_parse,dispatch_command,do_handle_one_connection,handle_one_connection,start_thread,clone
[19 May 2013 2:16] Sunny Bains
I like the reference count idea, will investigate.
[15 Sep 2014 8:01] Marko Mäkelä
I pushed a somewhat related patch to MySQL 5.7.6, in Bug#19149177.
That patch replaces fil_inc_pending_ops() and fil_decr_pending_ops() with fil_space_acquire() and fil_space_release(), and makes a bit more use of these functions. The fil_space_release() is directly operating on a pointer.

Many of the lookups and mutex operations that you mention are still remaining.

I think that it would be best to cache the fil_space_t* pointer in dict_index_t and in the undo log data structures. With that in place, maybe we could even replace the buf_pool->page_hash with a per-tablespace hash table.
[5 Nov 2014 15:15] Marko Mäkelä
Posted by developer:
 
Slightly related:
Bug#19949683 SPEED UP FSP_GET_AVAILABLE_SPACE_IN_FREE_EXTENTS()
removes an unnecessary buffer pool lookup and paves the way for
further optimizations, by caching some tablespace header fields
in the fil_space_t object. It also removed one space_id lookup,
but that was not a busy code path.