MySQL Bugs: #62030: trx_sys_flush_max_trx_id stalls transaction start

Bug #62030	trx_sys_flush_max_trx_id stalls transaction start
Submitted:	29 Jul 2011 18:17	Modified:	14 May 2014 12:37
Reporter:	Mark Callaghan	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: InnoDB Plugin storage engine	Severity:	S5 (Performance)
Version:	5.6.3	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	innodb, stall

Description:
I am running the innodb features preview for 5.6.3 and trying to understand what keeps me from saturating the IO capacity of a flash device. One of the stalls occurs on transaction start even for a read-only workload because of trx_sys_flush_max_trx_id. Every 256 transaction starts this function is called and no transactions can start/end until it is done and in this case it looks like the call to:
  sys_header = trx_sysf_get(&mtr);
in trx_sys_flush_max_trx_id gets blocked on this in buf_page_get_gen
  rw_lock_s_lock(hash_lock);

#0  0x0000003cc7c0a496 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000082b672 in os_event_wait_low (event=0x16f21ba0, reset_sig_count=5) at /s/bld/563inno/storage/innobase/os/os0sync.c:214
#2  0x000000000086fac5 in sync_array_wait_event (arr=0x8fbc5e0, index=40) at /s/bld/563inno/storage/innobase/sync/sync0arr.c:421
#3  0x00000000008705b9 in rw_lock_s_lock_spin (lock=0x16f21498, pass=<value optimized out>, file_name=0xa58260 "/s/bld/563inno/storage/innobase/buf/buf0buf.c", line=2445) at /s/bld/563inno/storage/innobase/sync/sync0rw
.c:458
#4  0x00000000008b2230 in buf_page_get_gen (space=0, zip_size=0, offset=5, rw_latch=2, guess=0x0, mode=10, file=0xa420e0 "/s/bld/563inno/storage/innobase/include/trx0sys.ic", line=<value optimized out>, mtr=0x4b516030)
 at /s/bld/563inno/storage/innobase/include/sync0rw.ic:418
#5  0x000000000087ec4c in trx_sys_flush_max_trx_id () at /s/bld/563inno/storage/innobase/include/trx0sys.ic:102
#6  0x0000000000882a45 in trx_start_low (trx=0x2aaf780d4558) at /s/bld/563inno/storage/innobase/include/trx0sys.ic:456
#7  0x000000000085db1f in row_search_for_mysql (buf=0x2aaf82c19390 "", mode=2, prebuilt=0x2aaf6d80d048, match_mode=1, direction=0) at /s/bld/563inno/storage/innobase/row/row0sel.c:3924
#8  0x00000000007f2955 in ha_innobase::index_read (this=0x2aaf6d7cac30, buf=0x2aaf82c19390 "", key_ptr=<value optimized out>, key_len=<value optimized out>, find_flag=<value optimized out>) at /s/bld/563inno/storage/in
nobase/handler/ha_innodb.cc:6245
#9  0x0000000000504df1 in handler::index_read_idx_map (this=0x2aaf6d7cac30, buf=0x2aaf82c19390 "", index=<value optimized out>, key=0x2aaf8c0b59d0 "Tj\016", keypart_map=1, find_flag=HA_READ_KEY_EXACT) at /s/bld/563inno
/sql/handler.cc:5608
#10 0x00000000006478b7 in join_read_const (tab=0x2aaf8c0b5188) at /s/bld/563inno/sql/sql_select.cc:19273
#11 0x0000000000656bce in join_read_const_table (tab=0x2aaf8c0b5188, pos=0x2aaf8c0b5638) at /s/bld/563inno/sql/sql_select.cc:19168
#12 0x00000000006666c9 in make_join_statistics (join=0x2aaf8c0b48b0, tables_arg=0x2aaf8c0b3fc0, conds=0x2aaf8c0b4f70, keyuse_array=0x2aaf8c0b4b28, first_optimization=true) at /s/bld/563inno/sql/sql_select.cc:5248
#13 0x000000000066be12 in JOIN::optimize (this=0x2aaf8c0b48b0) at /s/bld/563inno/sql/sql_select.cc:2124
#14 0x00000000006702d6 in mysql_select (thd=0x2aaf74140c80, tables=0x2aaf8c0b3fc0, wild_num=0, fields=@0x2aaf74142e68, conds=0x2aaf8c0b4690, og_num=0, order=0x0, group=0x0, having=0x0, proc_param=0x0, select_options=21
47748608, result=0x2aaf8c0b4890, unit=0x2aaf74142728, select_lex=0x2aaf74142d58) at /s/bld/563inno/sql/sql_select.cc:3776
#15 0x0000000000670bae in handle_select (thd=0x2aaf74140c80, lex=0x2aaf74142678, result=0x2aaf8c0b4890, setup_tables_done_option=0) at /s/bld/563inno/sql/sql_select.cc:412
#16 0x0000000000623602 in execute_sqlcom_select (thd=0x2aaf74140c80, all_tables=0x2aaf8c0b3fc0) at /s/bld/563inno/sql/sql_parse.cc:4679
#17 0x0000000000629ae0 in mysql_execute_command (thd=0x2aaf74140c80) at /s/bld/563inno/sql/sql_parse.cc:2311
#18 0x000000000062debb in mysql_parse (thd=0x2aaf74140c80, rawbuf=<value optimized out>, length=<value optimized out>, parser_state=0x4b518e00) at /s/bld/563inno/sql/sql_parse.cc:5741
#19 0x000000000062f507 in dispatch_command (command=COM_QUERY, thd=0x2aaf74140c80, packet=<value optimized out>, packet_length=49) at /s/bld/563inno/sql/sql_parse.cc:1181
#20 0x000000000062f8b4 in do_command (thd=0x2aaf74140c80) at /s/bld/563inno/sql/sql_parse.cc:903
#21 0x00000000006077bf in do_handle_one_connection (thd_arg=0x2aaf74140c80) at /s/bld/563inno/sql/sql_connect.cc:787
#22 0x0000000000607869 in handle_one_connection (arg=0x16f21be4) at /s/bld/563inno/sql/sql_connect.cc:706
#23 0x0000003cc7c062f7 in start_thread () from /lib64/libpthread.so.0
#24 0x0000003cc74d1e3d in clone () from /lib64/libc.so.6

I think this thread has an x lock on hash_lock and is blocked on fil_system->mutex
      pthread_cond_wait,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,fil_tablespace_deleted_or_being_deleted_in_mem,buf_page_init_for_read,buf_read_page_low,buf_read_page,buf_page_get_gen,btr_cur_search_to_nth_level,row_search_for_mysql,ha_innobase::index_read,handler::index_read_idx_map,join_read_const,join_read_const_table,make_join_statistics,JOIN::optimize,mysql_select,handle_select,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone

I don't see a thread that holds fil_system->mutex, perhaps it was just released

How to repeat:
run read-only sysbench on ramfs or fast flash device
use PMP to view stalls

The max trx id needs to be persistent, currently we do it after 256 allocations. Increasing this size to 1024 may help. The thread is stuck in the buffer pool. The rw-lock around the trx_list has been
replaced by a mutex. 

I've seen this in my tests too and the main problem I face has to do with the order of the transaction list.
This stall feeds into the read view creation too. If I can get rid of the total order required on the
trx_list then one solution is to make this operation asynchronous and add a barrier. Allocating a
 transaction id that is not persistent is OK provided that id doesn't make it to persistent store (do
inserts and updates). In other words we can add the trx to the trx list and do other trx start operations
in parallel while another thread makes it persistent and wait for that thread to complete IO.

The trx_start()/trx_commit() and read view create/destroy is under the scanner.

Running with it set to 1024 did not make a difference. I will look at it again in a few days.

Thanks for testing with 1024. No easy options then.

I have a patch that "allocates" blocks of transactions ids and does it in a background thread. When
starting new transactions (or serialising them in the UNDO log) it allocates from this block. When the
allocation block is down to 1/3 a new block allocation is triggered. This should remove the bottleneck
around next id generation. I will probably test it for performance and correctness next. After staring
at the code, it looks like we don't really need to be too strict about this either. We can lazily make
the max trx id persistent. The recovery code should check for highest trx id seen and adjust accordingly.
If that holds then that option will be explored too.

This was fixed in 5.7.2, see http://dev.mysql.com/worklog/task/?id=6047