Bug #59291 | InnoDB purge lag much worse for 5.5.8 versus 5.1 | | |
---|---|---|---|
Submitted: | 4 Jan 2011 20:32 | Modified: | 3 Oct 2013 6:10 |
Reporter: | Mark Callaghan | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: InnoDB storage engine | Severity: | S5 (Performance) |
Version: | 5.5 | OS: | Any |
Assigned to: | Sunny Bains | CPU Architecture: | Any |
Tags: | innodb, lag, multiple rollback segments, purge |
[4 Jan 2011 20:32]
Mark Callaghan
[4 Jan 2011 20:37]
Mark Callaghan
Before realizing that purge lag was the problem, I created http://bugs.mysql.com/bug.php?id=57489 for significant performance regressions between MySQL 5.5 and 5.1. See https://www.facebook.com/notes/mysql-at-facebook/mysql-cpu-bound-performance-51-55-and-the... for details on my tests.
[5 Jan 2011 1:48]
James Day
Mark, query_cache_type=OFF should be used for 5.5 when there's lots of concurrency. It reduces mutex contention for the query cache even when its size is 0. I think we're changing the default from on to off for 5.5, though I'm not sure when. I always wonder about innodb_io_capacity when background threads get behind, because Oracle InnoDB tries to respect that setting and the default of 200 is often too low for high-performance systems. Why not fast mutexes in 5.5 as well as 5.1?
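As a sketch of what that change looks like on a running 5.5 server (assuming a client with SUPER privilege; both variables are dynamic, so no restart is needed):

# Turn the query cache off and release its memory at runtime.
mysql -u root -e "SET GLOBAL query_cache_type = OFF; SET GLOBAL query_cache_size = 0;"

# Verify the result.
mysql -u root -e "SHOW GLOBAL VARIABLES LIKE 'query_cache%'"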
[5 Jan 2011 20:45]
Inaam Rana
Mark,

8      16      32      64       128       256       threads
15592  118953  91617   8017707  15197122  15214778  5.5.8-0
359    284     280     6376105  12897579  11738230  5.5.8-1
71260  80561   310605  17609    86113     52303     5.1.52

Above, is 5.1.52 the FB version of 5.1? If so, have you made any changes in the purge code or in how often purge is scheduled in the FB patch? Do you have numbers for stock 5.1.47?
[5 Jan 2011 20:53]
Inaam Rana
I tried with stock sysbench on my modest 8 core box with 1 point select and 1 index update in equal ratio and uniform distribution on a single 5 million row table with 32 clients. 5.5 shines compared to 5.1: the TPS is higher and the purge lag is only slightly higher because we are able to do more work in 5.5.

5.5: TPS: 11541
History list length growth:
History list length 2941465
History list length 3219597
History list length 3499632
History list length 3779928
History list length 4055433
History list length 4333017
History list length 4598088
History list length 4880252
History list length 5160025

5.1: TPS: 9916
History list length growth:
History list length 2069883
History list length 2275163
History list length 2482399
History list length 2627458
History list length 2838499
History list length 3043952
History list length 3242746
History list length 3449381
History list length 3645951
History list length 3821100

Next I have to try multi-table sysbench or some variant of it to see if I can reproduce this.
[5 Jan 2011 21:01]
Inaam Rana
Mark, a quick question about your test setup: do you restart the server between each run, or by 'giving time to recover' do you mean that you let the server idle for 20 minutes and then start the next run without a restart? I ask because if the purge lag runs into the millions, it is possible that the large number of UNDO pages filled your entire 16G buffer pool. After that point, if the server is not restarted before every run, then every subsequent run, strictly speaking, cannot be called an in-memory workload. I am not sure how much impact triggering LRU eviction will have on the results, but I suspect it will not be a lot given that most of the UNDO pages are 'cold' pages. In any case, it would be good to know if the buffer pool spilled at any point.
[6 Jan 2011 5:32]
Mark Callaghan
I will repeat the test and collect purge lag throughout the benchmark. I would have that data already if Innodb reported purge lag in SHOW STATUS (hint, hint). I don't restart the server between runs for each concurrency level.
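Since there is no SHOW STATUS counter for it, here is a minimal sketch of how the purge lag can be sampled during a run (assumes a local server with default credentials; the interval and output format are arbitrary):

# Print a timestamped history list length sample once per minute.
while true; do
    mysql -e "SHOW ENGINE INNODB STATUS\G" |
        awk -v ts="$(date +%s)" '/History list length/ { print ts, $NF }'
    sleep 60
done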
[6 Jan 2011 15:31]
Mark Callaghan
You get ~10,000 TPS for 32 concurrent clients. I get between 27,000 and 35,000. Are you sure your database is cached at the start of the test? It is in my case -- 16G buffer pool and ~8G of data, and I prime the buffer pool at test start. If your test has IO stalls then it is more likely that purge can keep up.
[6 Jan 2011 16:13]
Inaam Rana
Mark, there were no IO stalls. The difference in TPS is perhaps because I am running stock sysbench on a single table while you were running your own flavor. The test was CPU bound throughout the run. Also, can you answer James's questions about the query cache and the fast mutexes?
[6 Jan 2011 16:36]
Mark Callaghan
query_cache_type=ON for all tests. I will repeat with it OFF, but if the problem is limited to 5.5.8 with the query cache ON, then that is a performance regression. Dump the query cache or make it off by default. I didn't build with fast mutexes for 5.5.8 because you don't build with it -- I used the cmake scripts that MySQL claims to use for its binaries. I get lots of different messages from people at MySQL about the future of fast mutexes. Is it staying or going? Why don't your builds use it? I will try it with fast mutexes enabled, but a few changes on your side could make life easier for people like me.
[6 Jan 2011 16:37]
Mark Callaghan
Inaam, You don't need my sysbench hacks. All of the magic is in the shell scripts. Just run 8 concurrent instances of it. Also, the problem appears to start with 64 concurrent clients.
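For illustration, a minimal sketch of running 8 concurrent instances with sysbench 0.4 (a sketch only: the table names, thread counts, and connection options here are placeholders, not the exact test mix used for the numbers in this report):

#!/bin/sh
# Start 8 sysbench instances in parallel, one per table, then wait for all of them.
for i in 1 2 3 4 5 6 7 8; do
    sysbench --test=oltp --oltp-table-name=sbtest$i --num-threads=8 \
        --max-requests=0 --max-time=1200 \
        --mysql-host=dbhost --mysql-user=test run > sysbench.$i.log &
done
wait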
[6 Jan 2011 16:39]
Mark Callaghan
And now 5.5.9 will be built with --with-fast-mutexes, when it wasn't used for the 5.5.8 builds: http://bugs.mysql.com/bug.php?id=58766
[6 Jan 2011 23:47]
Alexey Stroganov
Mark,

To get the same workload I tried a slightly different way: I modified the oltp_rw.lua workload from sysbench 0.5 and added a --tables parameter that lets you specify how many tables to use for testing. Prepare/run then look like the following:

prepare:
sysbench-0.5.0 --test=multi_table_rw.lua --oltp-table-size=4000000 --mysql-table-engine=innodb \
  --db-ps-mode=disable --mysql-engine-trx=yes --rand-type=uniform --tables=8 \
  --mysql-host=<host> --mysql-user=<user> --mysql-port=<port> prepare

run:
sysbench-0.5.0 --test=multi_table_rw.lua --oltp-table-size=4000000 --mysql-table-engine=innodb \
  --db-ps-mode=disable --mysql-engine-trx=yes --rand-type=uniform --tables=8 \
  --mysql-host=<host> --mysql-user=<user> --mysql-port=<port> \
  --max-requests=0 --max-time=600 --num-threads=<threads> run

I ran this workload with 256 threads/8 tables, that means 64 threads per table, and your exact configs for 5.5.8/5.1. It seems that I don't see any notable difference in terms of purge lag:

5.1.50 TPS: 13729.27
5.5.8  TPS: 13885.19

5.1.50: size of ibdata1/history length right after sysbench stopped
- /data0/data/ibdata1 1.679.818.752
- History length: 3212664

5.5.8: size of ibdata1/history length right after sysbench stopped
- /data0/data/ibdata1 1.696.595.968
- History length: 3251717

If you have a chance, can you please try my variant of the workload to check whether you can repeat the same behavior with it.
[6 Jan 2011 23:48]
Alexey Stroganov
Test case for the issue
Attachment: multi_table_rw.lua (text/x-lua), 3.51 KiB.
[7 Jan 2011 1:39]
Mark Callaghan
I have never used the Lua version of sysbench. I know I should, but I have a few nice patches for sysbench 0.4 that I like. So I will do this, but in return can you repeat your test with 8 threads per table for 64 threads total? Where did you get sysbench from? Can you send me a URL for that?

I repeated the test for 64 concurrent threads with 8 per table and the history list lengths at 1-minute intervals are listed below. The test ran for 20 minutes. The results for 558 are much worse than for the 51 releases. Results with query_cache_type=0 and builds using --with-fast-mutexes for 558 were no better than what is listed here.

122 12
5147     5152     558
27225    45827    8713
51899    330980   356086
110242   326102   410195
576066   12713    572020
132556   84705    656031
37837    11501    1158614
1272295  361269   1690426
504      79095    1700081
118360   156928   2072686
1544348  244717   2665173
152711   138543   2838506
341397   30268    3166159
1521737  249165   3820811
441446   167466   4130098
958994   32983    4231953
1065207  240991   4838093
313345   80180    5444631
802280   354299   5501184
544112   160487   5864877

My TPS numbers are still much higher than yours (25,000 to 35,000). You report ~10,000. For my tests begin/commit are not run. Each transaction is select/update and the update is auto-commit.
[7 Jan 2011 1:47]
Mark Callaghan
The columns are MySQL 5.1.47, 5.1.52, and 5.5.8. The last column (5.5.8) has by far the largest values for purge lag.
[7 Jan 2011 14:01]
Alexey Stroganov
Mark,

Sure, I will do that testing on my side. The Lua version of sysbench: bzr branch lp:sysbench. This version is actually much simpler to extend with any new query or flow modification. It is synced with 0.4 so it has all the fixes from there, like:

--tx-rate=N          target transaction rate (tps) [0]
--tx-jitter=N        target transaction variation, in microseconds [0]
--report-interval=N  periodically report intermediate statistics with a specified interval in seconds. 0 disables intermediate reports

Please note that in 0.5.0 the dist type options have different names (--rand* instead of --oltp-dist*):

--rand-init=[on|off]  initialize random number generator [off]
--rand-type=STRING    random numbers distribution {uniform,gaussian,special} [special]
--rand-spec-iter=N    number of iterations used for numbers generation [12]
--rand-spec-pct=N     percentage of values to be treated as 'special' (for special distribution) [1]
--rand-spec-res=N     percentage of 'special' values to use (for special distribution) [75]
--rand-seed=N         seed for random number generator, ignored when 0 [0]
[7 Jan 2011 14:48]
Inaam Rana
Mark, I presume that you ran 'show innodb status' periodically to get the history list length. If you have that output, can you send it over? If you also have 'show engine innodb mutex', that would be fabulous.
[7 Jan 2011 21:57]
Mark Callaghan
I provided history list length at 1 minute intervals from a 20-minute test in my previous message. I have to repeat tests to get mutex status because MySQL continues to change the syntax for SHOW commands. InnoDB also changed the format ('os_waits=2' now; it used to just list the number).

Aggregated mutex stats from 5.5.8:
121461 /s/bld/558orig/storage/innobase/trx/trx0rseg.c:207
94302 /s/bld/558orig/storage/innobase/log/log0log.c:771
40327 /s/bld/558orig/storage/innobase/srv/srv0srv.c:996
19525 /s/bld/558orig/storage/innobase/btr/btr0sea.c:182
10604 /s/bld/558orig/storage/innobase/buf/buf0buf.c:1208
6269 /s/bld/558orig/storage/innobase/fil/fil0fil.c:1251
428 /s/bld/558orig/storage/innobase/log/log0log.c:775
101 /s/bld/558orig/storage/innobase/fil/fil0fil.c:1559
52 /s/bld/558orig/storage/innobase/log/log0log.c:832
11 /s/bld/558orig/storage/innobase/buf/buf0buf.c:1243
10 /s/bld/558orig/storage/innobase/dict/dict0dict.c:695
1 /s/bld/558orig/storage/innobase/trx/trx0sys.c:190
1 /s/bld/558orig/storage/innobase/trx/trx0purge.c:238
1 /s/bld/558orig/storage/innobase/srv/srv0srv.c:999

More for 5.5.8 from another 20-minute test run:
154189 /s/bld/558orig/storage/innobase/log/log0log.c:771
112734 /s/bld/558orig/storage/innobase/trx/trx0rseg.c:207
59097 /s/bld/558orig/storage/innobase/srv/srv0srv.c:996
37855 /s/bld/558orig/storage/innobase/buf/buf0buf.c:1208
25252 /s/bld/558orig/storage/innobase/btr/btr0sea.c:182
6863 /s/bld/558orig/storage/innobase/fil/fil0fil.c:1251
1225 /s/bld/558orig/storage/innobase/log/log0log.c:775
341 /s/bld/558orig/storage/innobase/fil/fil0fil.c:1559
94 /s/bld/558orig/storage/innobase/buf/buf0buf.c:1243
64 /s/bld/558orig/storage/innobase/log/log0log.c:832
10 /s/bld/558orig/storage/innobase/dict/dict0dict.c:695
3 /s/bld/558orig/storage/innobase/trx/trx0sys.c:190

Top-N for 5.1.52 + facebook patch:
2957178 trx/trx0rseg.c:210
182466 srv/srv0srv.c:1135
110395 log/log0log.c:759
30822 btr/btr0sea.c:183
1952 buf/buf0buf.c:1050
322 ha/hash0hash.c:192
315 fil/fil0fil.c:1388
112 log/log0log.c:826
8 srv/srv0start.c:980
2 dict/dict0dict.c:655

Top-N for 5.1.47:
5284161 trx/trx0rseg.c:210
254172 srv/srv0srv.c:938
109009 log/log0log.c:759
46978 btr/btr0sea.c:170
28260 buf/buf0buf.c:957
1990 fil/fil0fil.c:1517
199 log/log0log.c:815
10 srv/srv0start.c:980
1 dict/dict0dict.c:667
1 dict/dict0dict.c:655
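A minimal sketch of how an aggregation like the above can be produced (assuming the 5.5 output format where the Name column of SHOW ENGINE INNODB MUTEX is a file:line and the Status column is 'os_waits=N'; the parsing details are an assumption, not the exact script used here):

# Sum os_waits per mutex creation site and print the hottest ones first.
mysql -N -e "SHOW ENGINE INNODB MUTEX" |
awk -F'\t' '{ n = $3; sub(/^os_waits=/, "", n); waits[$2] += n }
            END { for (m in waits) printf "%d %s\n", waits[m], m }' |
sort -rn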
[7 Jan 2011 22:01]
Mark Callaghan
The other thing I have are PMP (http://poormansprofiler.org) samples during the run. I grep the stacks for those with purge and the results from several samples are: --- pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,buf_flush_wait_batch_end,log_check_margins,que_run_threads,trx_purge,srv_master_thread,start_thread,clone -- 1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,rw_lock_x_lock_func,buf_page_get_gen,flst_add_first,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,trans_commit_stmt,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone 1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,trx_purge_choose_next_log,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone 1 __lll_mutex_unlock_wake,_L_mutex_unlock_59,__pthread_mutex_unlock_usercnt,os_event_wait_low,sync_array_wait_event,rw_lock_x_lock_func,buf_page_get_gen,flst_add_first,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,trans_commit_stmt,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone 1 buf_page_get_gen,flst_add_first,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,trans_commit_stmt,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone ---- 1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,buf_flush_wait_batch_end,log_check_margins,que_run_threads,trx_purge,srv_master_thread,start_thread,clone ----- 1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,trx_purge_choose_next_log,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone ------------------- And then I have this from two samples for 5.1.52 1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,log_preflush_pool_modified_pages,log_check_margins,que_run_threads,trx_purge,srv_master_thread,start_thread,clone 1 flst_add_first,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,ha_autocommit_or_rollback,dispatch_command,do_command,handle_one_connection,start_thread,clone
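For anyone repeating this, a sketch close to the published poormansprofiler.org one-liner with the same purge filter appended (assumes gdb is installed and a single mysqld process):

# Sample all mysqld thread stacks once, collapse each stack to one line,
# count duplicates, and keep only stacks that mention purge.
gdb -ex "set pagination 0" -ex "thread apply all bt" -batch -p $(pidof mysqld) |
awk 'BEGIN { s = "" }
     /^Thread/ { print s; s = "" }
     /^\#/ { if (s != "") { s = s "," $4 } else { s = $4 } }
     END { print s }' |
sort | uniq -c | sort -rn -k 1,1 | grep purge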
[7 Jan 2011 23:54]
Mark Callaghan
I repeated the tests for 20 minutes on 5147 and 558 and collected PMP samples every 30 seconds. The results below are filtered to only list stack traces with "purge". There were many more for 558 and there appear to be more cases where 558 waits on a mutex or condition variable than 5147. The format for this is #occurrences, stack Results for 558 10 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,trx_purge_choose_next_log,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone 8 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,rw_lock_x_lock_func,buf_page_get_gen,flst_add_first,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,trans_commit_stmt,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone 8 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,buf_flush_wait_batch_end,log_check_margins,que_run_threads,trx_purge,srv_master_thread,start_thread,clone 5 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,trx_purge_truncate_rseg_history,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone 3 trx_purge_choose_next_log,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone 3 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,mtr_commit,trx_purge_truncate_rseg_history,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone 2 sched_yield,rw_lock_x_lock_func,buf_page_get_gen,flst_add_first,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,trans_commit_stmt,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone 2 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,rw_lock_x_lock_func,buf_page_get_gen,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,trans_commit_stmt,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone 2 buf_page_get_gen,trx_purge_truncate_rseg_history,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone 2 buf_calc_page_new_checksum,buf_flush_init_for_writing,buf_flush_try_neighbors,buf_flush_page_and_try_neighbors,buf_flush_batch,buf_flush_list,log_check_margins,que_run_threads,trx_purge,srv_master_thread,start_thread,clone 1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,rw_lock_x_lock_func,buf_page_get_gen,trx_purge_truncate_rseg_history,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone 1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,trans_commit_stmt,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone 1 
pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,buf_flush_buffered_writes,buf_flush_try_neighbors,buf_flush_page_and_try_neighbors,buf_flush_batch,buf_flush_list,log_check_margins,que_run_threads,trx_purge,srv_master_thread,start_thread,clone 1 log_write_low,mtr_commit,trx_purge_truncate_rseg_history,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone Results for 5147 3 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,rw_lock_x_lock_func,buf_page_get_gen,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,ha_autocommit_or_rollback,dispatch_command,do_command,handle_one_connection,start_thread,clone 2 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,rw_lock_x_lock_func,buf_page_get_gen,flst_add_first,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,ha_autocommit_or_rollback,dispatch_command,do_command,handle_one_connection,start_thread,clone 2 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,trx_purge_truncate_history,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone 2 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,trx_purge_choose_next_log,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone 1 ut_delay,rw_lock_x_lock_func,buf_page_get_gen,flst_add_first,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,ha_autocommit_or_rollback,dispatch_command,do_command,handle_one_connection,start_thread,clone 1 trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,ha_autocommit_or_rollback,dispatch_command,do_command,handle_one_connection,start_thread,clone 1 sched_yield,rw_lock_x_lock_func,buf_page_get_gen,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,ha_autocommit_or_rollback,dispatch_command,do_command,handle_one_connection,start_thread,clone 1 rw_lock_x_lock_func,buf_page_get_gen,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,ha_autocommit_or_rollback,dispatch_command,do_command,handle_one_connection,start_thread,clone 1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,rw_lock_x_lock_func,buf_page_get_gen,trx_purge_truncate_history,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone 1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,trx_purge_rseg_get_next_history_log,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone 1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,buf_page_get_gen,trx_purge_rseg_get_next_history_log,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone 1 
mtr_read_ulint,flst_add_first,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,ha_autocommit_or_rollback,dispatch_command,do_command,handle_one_connection,start_thread,clone 1 mlog_write_ulint,trx_purge_truncate_history,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone 1 mlog_write_ulint,flst_insert_before,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,ha_autocommit_or_rollback,dispatch_command,do_command,handle_one_connection,start_thread,clone 1 buf_page_get_gen,trx_purge_truncate_history,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone
[8 Jan 2011 1:19]
Inaam Rana
Mark, thanks for the info. What I meant was whether you can send the entire 'show innodb status' output taken periodically over the run. Maybe we'll get some hint from there.
[8 Jan 2011 1:55]
Mark Callaghan
After patching 5.5.8 to use one rollback segment rather than 128 and recreating the database, the history list length growth problem appears to have stopped. Can you ask Dimitri to run tests for this using a cached database for dbstress and measure history list length over time?
[8 Jan 2011 2:22]
Sunny Bains
" 10 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,trx_purge_choose_next_log,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone " As the number of UNDO segments increase the function trx_purge_choose_next_log() becomes more expensive. This has been fixed in 5.6.
[8 Jan 2011 2:31]
Mark Callaghan
The thread doing purge holds purge_sys->mutex while iterating on the list of rollback segments. Do any other threads need to lock purge_sys->mutex and thus get blocked?
[8 Jan 2011 4:57]
Sunny Bains
No other thread needs to block on purge_sys->mutex except the thread doing the purge. Strictly speaking I don't think the mutex is even required.
[8 Jan 2011 15:22]
Mark Callaghan
Another problem we have had in production with 5.1 is purge lag on a few servers that doesn't decrease. Looking at srv_master_thread makes it clear why that can happen when the purge delay option is not used -- trx_purge is called once per 10 seconds for up to 20 undo log pages. 5.5 with innodb_purge_threads=1 should improve that.
[10 Jan 2011 17:43]
Dimitri KRAVTCHUK
Mark, any more details about the server you're using to reproduce the observed issue? (I'm pretty sure it has more than 8 cores, no?) Then, are you running your test locally on the same server or via the network? (I'm curious why the issue was not reproduced easily.)

Then regarding the problem:

#1) I think there are many reasons purge may not keep up with the load, and the only possible solutions here are: purge harder OR use some kind of throttling on DML statements.
- Purge harder: have more purge threads (but it's not ready yet in 5.5, and it will not be an easy task as it may yet again reduce performance, as I've observed before on XtraDB).
- DML throttling: using innodb_max_purge_lag=N will help here, but the code should be fixed (the check for the consistent read view should be removed). Mark, you may try it; it may well work and give a better result than having more purge threads (http://dimitrik.free.fr/blog/archives/2010/05/mysql-performance-improving-stability.html).

#2) According to Mark, the issue comes from the bigger number of rollback segments available in 5.5 (it looks to me like contention on the rollback segment mutex kept InnoDB from reaching a higher load before). Now, what if we simply add a configuration parameter like "innodb_rollback_segments"? I mean, we would still create 128 segments as we do currently, but then use *only* the given number, that's all. So, in case other users meet the same problem, it can easily be worked around by using a lower number of RBS.

Rgds, -Dimitri
[10 Jan 2011 17:47]
Mark Callaghan
My setup is:
* server host has 8 cores
* client host has 16 cores with HT on (same HW as server)
* server and client are in the same rack

I don't understand why my peak TPS is much higher than yours. My guess is that your tests ran with the clients and server on the same host. But I am not sure if your tests used 8 instances of sysbench.
[10 Jan 2011 23:00]
Mark Callaghan
When you ran your lua/sysbench test did you confirm that QPS/TPS was relatively stable throughout the test. I think the latest version of sysbench or the lua script is unstable (sysbench 0.5 with Lua). I built it from launchpad and this is the rate of Questions from SHOW STATUS reported over 20 second intervals. There are repeated intervals of 1 to 2 minutes where sysbench+lua isues ~0 queries. 35013.1 67222.4 68018.45 72725.3 74538.75 74902.65 74107.75 75325.7 71333.8 10345.75 130.25 0.5 0.35 0.35 0.35 0.35 42404.45 73697.05 74900.55 74858.9 74921.3 73783.7 74735.95 73921.3 60418.95 0.35 0.35 0.35 510.15 0.35 0.35 14594.15 73156.15 All of the threads in sysbench get stuck on network read, and when hung there are queries displayed in SHOW PROCESSLIST but the Questions count stops growing, so it looks like sysbench+lua pauses and stops reading data from sockets that have data. The thread stack is: read,vio_read_buff,my_real_read,my_net_read,cli_safe_read,cli_read_query_result,mysql_real_query,mysql_drv_query,mysql_drv_execute,db_execute,sb_lua_db_execute,luaD_precall,luaV_execute,luaD_call,luaD_rawrunprotected,luaD_pcall,lua_pcall,sb_lua_op_execute_request,runner_thread,start_thread,clone But the binary uses the thread safe mysql client libs: gcc -W -Wall -Wextra -Wpointer-arith -Wbad-function-cast -Wstrict-prototypes -Wnested-externs -Winline -funroll-loops -Wundef -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wredundant-decls -Wcast-align -pthread -O2 -ggdb3 -o sysbench --all-static sysbench.o sb_timer.o sb_options.o sb_logger.o db_driver.o tests/fileio/libsbfileio.a tests/threads/libsbthreads.a tests/memory/libsbmemory.a tests/cpu/libsbcpu.a tests/mutex/libsbmutex.a scripting/libsbscript.a drivers/mysql/libsbmysql.a -L/data/5152orig/lib/mysql /data/5152orig/lib/mysql/libmysqlclient_r.so -lcrypt -lnsl -lpthread -lz scripting/lua/src/liblua.a -ldl -lrt -lm -Wl,--rpath -Wl,/data/5152orig/lib/mysql -Wl,--rpath -Wl,/data/5152orig/lib/mysql final result from sysbench+lua --> ~21,000 TPS The max per-request time is ~146 seconds. That is really bad. OLTP test statistics: queries performed: read: 24962864 write: 24962864 other: 0 total: 49925728 transactions: 0 (0.00 per sec.) deadlocks: 0 (0.00 per sec.) read/write requests: 49925728 (40102.80 per sec.) other operations: 0 (0.00 per sec.) Test execution summary: total time: 1244.9436s total number of events: 24962864 total time taken by event execution: 78830.5580s per-request statistics: min: 0.29ms avg: 3.16ms max: 146571.19ms approx. 95 percentile: 7.17ms Threads fairness: events (avg/stddev): 390044.7500/2477.07 execution time (avg/stddev): 1231.7275/1.98 ------------------ I didn't look at this before but the max request times for my sysbench read-write tests are also bad in 5.5.8 (65 seconds) compared to ~1 second for 5.1.52+facebook and ~400ms for 5.1.47. This only a problem for my sysbench read-write tests. It is not a problem for my read-only tests. Here is more data from my use of 8 concurrent instances of sysbench with 8 threads each. 
The only runs that have high max per-request values are the read-write on 5.5.8 max per instance for 5.5.8 read-only [1293100742] max: 15.95ms [1293101344] max: 11.84ms [1293101945] max: 102.63ms [1293102546] max: 299.48ms [1293103147] max: 397.00ms [1293103749] max: 254.50ms [1293104351] max: 416.42ms [1293104954] max: 421.48ms max per instance for 5.1.52 read-only [1293076036] max: 103.12ms [1293076637] max: 27.39ms [1293077238] max: 162.98ms [1293077839] max: 998.04ms [1293078440] max: 1506.41ms [1293079042] max: 2302.87ms [1293079648] max: 1832.13ms [1293080268] max: 738.39ms max per-instance for 5.1.47 read-only [1293090886] max: 86.99ms [1293091487] max: 10.64ms [1293092088] max: 78.61ms [1293092689] max: 476.69ms [1293093291] max: 362.65ms [1293093893] max: 365.18ms [1293094498] max: 512.64ms [1293095105] max: 272.17ms max per-instance for 5.1.47 read-write [1293631822] max: 232.57ms [1293633044] max: 201.62ms [1293634266] max: 182.75ms [1293635488] max: 1129.07ms [1293636711] max: 195.82ms [1293637913] max: 228.27ms [1293640238] max: 1505.17ms [1293641462] max: 239.67ms max per-instance for 5.1.52 read-write [1293582169] max: 121.25ms [1293583392] max: 146.69ms [1293584614] max: 348.14ms [1293585836] max: 480.16ms [1293587059] max: 705.35ms [1293588281] max: 562.35ms [1293589504] max: 743.61ms [1293590728] max: 918.94ms max per instance for 5.5.8 read-write [1293664157] max: 1228.81ms [1293665379] max: 9414.58ms [1293666602] max: 14973.30ms [1293667824] max: 20148.61ms [1293669135] max: 45227.82ms [1293670498] max: 31623.50ms [1293671863] max: 36253.58ms [1293673154] max: 24608.35ms
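For reference, a sketch of how per-interval rates like these can be pulled without custom tooling (mysqladmin with -r prints the difference between successive samples; the first report is absolute; counter names are the ones from SHOW GLOBAL STATUS):

# Report the change in Questions and Innodb_rows_updated every 20 seconds.
mysqladmin -r -i 20 extended-status |
    egrep 'Questions|Innodb_rows_updated'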
[10 Jan 2011 23:29]
Mark Callaghan
These numbers make no sense to me right now so I will recompile binaries and then repeat the test on a different pair of servers.
[11 Jan 2011 0:22]
Mark Callaghan
Repeated test using sysbench+lua with better 5.5.8 binary (compiled with fast mutexes, not hacked to use 1 rollback segment). Still has bad max per-request time (120 seconds). Purge lag grew to 6M during the test. This does ~37 QPS on average which is ~18,500 TPS on average. That is still much better than the 10,000 TPS that everyone but me has reported. But there are many stalls where TPS goes to zero for 1+ minute. I think that sysbench+lua are fine given the 5.1.52 results below. I think the problem is either a bug in 5.5.8 or a problem with my 5.5.8 binary. Results for 5.5.8 OLTP test statistics: queries performed: read: 22605256 write: 22605256 other: 0 total: 45210512 transactions: 0 (0.00 per sec.) deadlocks: 0 (0.00 per sec.) read/write requests: 45210512 (37364.76 per sec.) other operations: 0 (0.00 per sec.) Test execution summary: total time: 1209.9774s total number of events: 22605256 total time taken by event execution: 77142.7893s per-request statistics: min: 0.29ms avg: 3.41ms max: 121380.38ms approx. 95 percentile: 9.55ms Threads fairness: events (avg/stddev): 353207.1250/2298.50 execution time (avg/stddev): 1205.3561/1.17 QPS every 20 seconds has a lot of variance: 0.35 677.35 3411.9 3931.15 4762.5 5935.85 7527.85 11319.45 21171.8 55009.55 69177.25 73735.7 75611.2 75752.5 75581.95 71883.3 16848.05 200.3 0.35 0.35 0.5 21806.55 72472.8 74143.45 75860.05 76030.55 74360.2 73787.85 72129.95 22247.15 0.35 469.65 0.35 0.35 0.35 72072.8 69042.95 73679.35 72285.8 70123.35 73747.85 71243.05 61629.8 0.35 0.35 0.35 0.35 0.35 57218.65 71939.8 70774.65 70116.6 71240.5 72391.15 69894.85 59205.8 18088.65 0.65 0.35 0.35 ------------------- Results for 5.1.52 are much better: average, max and fewer stalls, Peak purge lag was less than 400,000 versus 6M for 5.5.58 OLTP test statistics: queries performed: read: 37414066 write: 37414066 other: 0 total: 74828132 transactions: 0 (0.00 per sec.) deadlocks: 0 (0.00 per sec.) read/write requests: 74828132 (62356.24 per sec.) other operations: 0 (0.00 per sec.) Test execution summary: total time: 1200.0104s total number of events: 37414066 total time taken by event execution: 76769.7165s per-request statistics: min: 0.28ms avg: 2.05ms max: 1612.07ms approx. 95 percentile: 10.55ms Threads fairness: events (avg/stddev): 584594.7812/2517.86 execution time (avg/stddev): 1199.5268/0.01 QPS every 20 seconds: 56329.3 67086.5 68839.8 68499.8 68483.0 70003.8 72789.1 73814.6 74485.3 74132.3 75499.6 75058.5 74627.8 75160.9 75146.9 75262.3 74457.4 31643.5 726.3 1840.3 40750.1 60005.0 65113.0 72981.6 72749.3 71929.1 71321.7 74930.2 72290.9 76580.1 74689.8 73368.5 74931.6 74967.9 75683.0 75039.9 74697.4 47255.0 1252.0 2186.5 31514.9 58356.0 62233.5 74019.7 72598.3 73788.0 72976.5 74370.8 71570.8 74253.1 75616.2 75012.9 70646.4 75856.0 76486.6 74392.1 73610.9 62255.5 1928.3 2198.7 16901.4 57148.2 62231.2 70456.2 70519.3 70772.2 72650.6 75979.9 75779.7 72831.3 75493.9 75682.7 75455.6 74966.3 76342.6 74573.4 74517.9 74070.6 5822.1 2549.5 10215.7 58248.9 65300.1 66148.1 70590.6 70040.3 73672.6 72493.9 74786.4 73297.4 73530.1 76199.3 74989.2 75767.1 75089.1 73468.7 75305.0 75738.6 17530.9 2459.5 4430.2 51282.5 61999.1 62736.2 67090.9 73165.2 71576.5 72746.4 72631.5 73728.7 74813.7 74847.0 75925.4 73384.8 74791.2 73393.5 75102.1 73586.8 5731.1
[11 Jan 2011 0:54]
Mark Callaghan
When 5.5.8 stalls, the thread stacks are listed below 32 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,buf_flush_wait_batch_end,log_check_margins,row_upd_step,row_update_for_mysql,ha_innobase::update_row,handler::ha_update_row,mysql_update,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone 20 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,ha_innobase::index_read,handler::index_read_idx_map,join_read_const,join_read_const_table,make_join_statistics,JOIN::optimize,mysql_select,handle_select,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone 12 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,ha_innobase::index_read,handler::read_range_first,handler::read_multi_range_first,QUICK_RANGE_SELECT::get_next,rr_quick,mysql_update,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone 10 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,os_aio_simulated_handle,fil_aio_wait,io_handler_thread,start_thread,clone 1 pwrite64,os_file_write_func,os_aio_func,fil_io,buf_flush_buffered_writes,buf_flush_try_neighbors,buf_flush_page_and_try_neighbors,buf_flush_batch,buf_flush_list,log_check_margins,que_run_threads,trx_purge,srv_master_thread,start_thread,clone 1 pthread_cond_timedwait@@GLIBC_2.3.2,os_event_wait_time_low,srv_monitor_thread,start_thread,clone 1 pthread_cond_timedwait@@GLIBC_2.3.2,os_event_wait_time_low,srv_lock_timeout_thread,start_thread,clone 1 pthread_cond_timedwait@@GLIBC_2.3.2,os_event_wait_time_low,srv_error_monitor_thread,start_thread,clone 1 poll,handle_connections_sockets,mysqld_main,__libc_start_main,_start,poll 1 do_sigwait,sigwait,signal_hand,start_thread,clone And the problem is this thread: pwrite64,os_file_write_func,os_aio_func,fil_io,buf_flush_buffered_writes,buf_flush_try_neighbors,buf_flush_page_and_try_neighbors,buf_flush_batch,buf_flush_list,log_check_margins,que_run_threads,trx_purge,srv_master_thread,start_thread,clone I used the mstat binary to plot these values at 20 second intervals * qps - the rate for Questions from SHOW STATUS * ups - the rate for Innodb_rows_updated from SHOW STATUS * pdirty - the value of Innodb_buffer_pool_pages_dirty from SHOW STATUS * pflush - the rate of Innodb_buffer_pool_pages_flushed from SHOW STATUS * data_wr - the rate of Innodb_data_written from SHOW STATUS * page_wr - the rate of Innodb_pages_written from SHOW STATUS You can see the stalls in this data. Is 5.5.8 stalled by slow disks? But 5.1.52 is not? Or is there a problem in the page flush code? SHOW ENGINE INNODB STATUS does not report latency for disk reads/writes/fsync so this is harder than it should be to debug. From the 5.1.52+facebook run I know that disk writes by Innodb take about 0.01ms each (thanks to HW RAID card) and fsync takes ~3.5ms. 
echo " qps ups pdirty pflush data_wr page_wr"; awk '{ if (NF > 500) {printf "%6d %6d %7d %7d %8d %8d\n", $563, $477, $396, $399, $431, $433 } }' o.ms | head -100 qps ups pdirty pflush data_wr page_wr 0 0 445208 0 0 0 71991 35990 455369 0 33002 36411340 73518 36757 461879 0 33683 36997990 75361 37675 466692 0 34622 37943756 75431 37711 470472 0 34701 38037350 72669 36329 473990 0 33541 36629248 74165 37081 477053 0 34121 37369190 77005 38498 480777 0 35419 38738688 75715 37852 483470 0 34894 38313574 75997 37999 486483 0 34946 38331033 76164 38077 489882 0 35042 38356275 71613 35798 480698 25 32665 87789260 7125 3560 457895 0 3321 81876889 0 0 434599 0 83 76336332 0 0 413287 0 83 69835161 0 0 388775 0 95 80320921 422 211 368582 24861 271 68220313 0 0 345542 0 71 75287756 0 0 319782 0 82 84724940 0 0 294726 0 97 81998643 0 0 271686 0 90 75497472 0 0 247236 0 96 80320921 0 0 223174 0 93 78643200 0 0 201286 0 85 71722598 0 0 178758 0 88 73819750 0 0 154182 0 96 80530636 0 0 135494 0 73 61236838 0 0 111686 0 93 78014054 0 0 84806 0 106 88080384 0 0 61382 0 92 76755763 49531 24769 233608 65934 22701 82348185 74975 37490 377977 614 34397 47274752 75992 37986 444059 307 34908 43205222 76393 38191 473434 460 35075 45949593 76547 38273 488931 588 35119 48029747 75528 37758 496404 640 34703 48535705 75320 37655 502052 614 34640 48117862 75556 37776 507798 691 34669 49136742 75034 37509 511400 563 34471 46937907 75586 37790 512880 793 34761 50985420 76346 38192 516677 486 35084 46345164 75558 37752 519362 640 34662 48433766 76361 38174 522625 640 35074 48904908 72629 36313 518191 2585 33222 79934515 71084 35537 515839 1894 32422 91468236 24923 12458 499821 0 11401 84649881 0 0 477165 0 88 74029465 0 0 451565 0 100 84095795 0 0 425069 0 103 86822092 0 0 402413 0 88 74448896 0 0 379437 0 89 75078041 0 0 356845 0 88 74029465 0 0 334317 0 88 73610035 0 0 310381 0 94 78643200 0 0 285677 0 96 80950067 0 0 263021 0 88 74239180 0 0 240621 0 87 73400320 0 0 217957 0 88 74448896 0 0 196281 0 84 70883737 0 0 172397 0 93 78223769 0 0 146029 0 103 86402662 0 0 121325 0 97 80950067 39601 19804 257156 85702 18178 86207283 74225 37114 409716 1331 34103 59077734 75162 37571 475519 1228 34497 58350028 75007 37501 502091 1484 34444 61632972 75723 37864 513106 1280 34737 59022592 74600 37289 518156 1280 34247 58364313 74448 37221 519504 1561 34159 63365068 75774 37883 520525 1356 34784 59951769 75881 37934 522953 972 34877 54294732 75324 37663 523845 1203 34563 57319782 72340 36162 524486 1203 33099 56194867 74158 37075 528526 1228 34099 57367398 74059 37028 528789 1996 33925 73507584 72635 36311 527478 3456 33159 96810035 69623 34808 530054 3302 31858 95451289 46373 23181 525342 0 21195 88891084
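One way to get the missing IO latency picture from outside the server is sysstat's iostat (a sketch; the 20-second interval is chosen to line up with the samples above):

# Extended per-device stats every 20 seconds: watch w/s, wkB/s and await during the stalls.
iostat -dxk 20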
[11 Jan 2011 1:15]
Mark Callaghan
Repasting as some columns were wrong: * qps - rate of Questions * ups - rate of Innodb_rows_updated * pdirty - value of Innodb_buffer_pool_pages_dirty * pflush - rate of Innodb_buffer_pool_pages_flushed * data_wr - rate of Innodb_data_writes * page_wr - rate of Innodb_dblwr_pages_written * dsync - rate of Innodb_data_fsyncs * lsync - rate of Innodb_os_log_fsyncs Here is one example of a stall. When the stall happens, * data_wr drops to ~100, which is about the speed of a single disk (odd) * page_wr spikes - given the drop in data_wr, a lot of large IOs are being done But why is InnoDB staying in this state? qps ups pdirty pflush data_wr page_wr dsync lsync 0 0 477165 0 88 2252 69 0 0 0 451565 0 100 2572 81 0 0 0 425069 0 103 2649 80 0 0 0 402413 0 88 2265 68 0 0 0 379437 0 89 2291 68 0 0 0 356845 0 88 2265 68 0 0 0 334317 0 88 2240 68 0 0 0 310381 0 94 2406 73 0 0 0 285677 0 96 2470 74 0 0 0 263021 0 88 2265 68 0 0 0 240621 0 87 2240 67 0 --- and all of the data # echo " qps ups pdirty pflush data_wr page_wr dsync lsync"; awk '{ if (NF > 500) {printf "%6d %6d %7d %7d %8d %8d %5d %5d\n", $563, $477, $396, $399, $431, $435, $419, $445 } }' o.ms | head -80 qps ups pdirty pflush data_wr page_wr dsync lsync 0 0 445208 0 0 0 0 0 71991 35990 455369 0 33002 0 1 1 73518 36757 461879 0 33683 0 1 1 75361 37675 466692 0 34622 0 1 1 75431 37711 470472 0 34701 0 1 1 72669 36329 473990 0 33541 0 1 1 74165 37081 477053 0 34121 0 1 1 77005 38498 480777 0 35419 0 1 1 75715 37852 483470 0 34894 0 1 1 75997 37999 486483 0 34946 0 1 1 76164 38077 489882 0 35042 0 1 1 71613 35798 480698 25 32665 1574 47 8 7125 3560 457895 0 3321 2393 62 1 0 0 434599 0 83 2329 68 0 0 0 413287 0 83 2124 64 0 0 0 388775 0 95 2457 74 0 422 211 368582 24861 271 2075 60 0 0 0 345542 0 71 2291 60 0 0 0 319782 0 82 2585 67 0 0 0 294726 0 97 2508 75 0 0 0 271686 0 90 2304 70 0 0 0 247236 0 96 2444 74 0 0 0 223174 0 93 2406 75 0 0 0 201286 0 85 2188 67 0 0 0 178758 0 88 2252 67 0 0 0 154182 0 96 2457 74 0 0 0 135494 0 73 1868 56 0 0 0 111686 0 93 2380 71 0 0 0 84806 0 106 2688 81 0 0 0 61382 0 92 2342 65 0 49531 24769 233608 65934 22701 1747 46 3 74975 37490 377977 614 34397 294 6 1 75992 37986 444059 307 34908 153 3 1 76393 38191 473434 460 35075 230 4 1 76547 38273 488931 588 35119 294 6 1 75528 37758 496404 640 34703 320 6 1 75320 37655 502052 614 34640 307 6 1 75556 37776 507798 691 34669 345 6 1 75034 37509 511400 563 34471 281 5 1 75586 37790 512880 793 34761 396 7 1 76346 38192 516677 486 35084 243 5 1 75558 37752 519362 640 34662 320 6 1 76361 38174 522625 640 35074 320 6 1 72629 36313 518191 2585 33222 1331 38 6 71084 35537 515839 1894 32422 1689 53 9 24923 12458 499821 0 11401 2201 61 3 0 0 477165 0 88 2252 69 0 0 0 451565 0 100 2572 81 0 0 0 425069 0 103 2649 80 0 0 0 402413 0 88 2265 68 0 0 0 379437 0 89 2291 68 0 0 0 356845 0 88 2265 68 0 0 0 334317 0 88 2240 68 0 0 0 310381 0 94 2406 73 0 0 0 285677 0 96 2470 74 0 0 0 263021 0 88 2265 68 0 0 0 240621 0 87 2240 67 0 0 0 217957 0 88 2265 67 0 0 0 196281 0 84 2163 65 0 0 0 172397 0 93 2393 71 0 0 0 146029 0 103 2636 81 0 0 0 121325 0 97 2470 73 0
[11 Jan 2011 1:24]
Mark Callaghan
the stall happens when the main thread is in log_preflush_pool_modified_pages on a line that changed between 5.5.8: n_pages = buf_flush_list(ULINT_MAX, new_oldest); ...and 5.1.52 n_pages = buf_flush_batch(BUF_FLUSH_LIST, ULINT_MAX, new_oldest); The full stack trace for the stall: Thread 67 (Thread 1215478080 (LWP 12505)): #0 0x0000003f7dc0d838 in pwrite64 () from /lib64/libpthread.so.0 #1 0x0000000000896d38 in os_file_write_func (name=0x309b8198 "./ibdata1", file=3, buf=0x2aaf677e8000, offset=2097152, offset_high=0, n=1048576) at /s/bld/558orig/storage/innobase/os/os0 file.c:2298 #2 0x000000000089a450 in os_aio_func (type=11, mode=<value optimized out>, name=0x309b8198 "./ibdata1", file=3, buf=0x2aaf677e8000, offset=2097152, offset_high=0, n=1048576, message1=0x 309b8098, message2=0x0) at /s/bld/558orig/storage/innobase/include/os0file.ic:354 #3 0x000000000085c9b8 in fil_io (type=<value optimized out>, sync=<value optimized out>, space_id=<value optimized out>, zip_size=0, block_offset=2097152, byte_offset=0, len=1048576, bu f=0x2aaf677e8000, message=0x0) at /s/bld/558orig/storage/innobase/include/os0file.ic:246 #4 0x000000000082eefe in buf_flush_buffered_writes () at /s/bld/558orig/storage/innobase/buf/buf0flu.c:800 #5 0x000000000082fc58 in buf_flush_try_neighbors (space=15, offset=11741, flush_type=BUF_FLUSH_LIST, n_flushed=230474, n_to_flush=18446744073709551614) at /s/bld/558orig/storage/innobas e/buf/buf0flu.c:952 #6 0x00000000008303d2 in buf_flush_page_and_try_neighbors (bpage=<value optimized out>, flush_type=BUF_FLUSH_LIST, n_to_flush=18446744073709551614, count=0x4872ad88) at /s/bld/558orig/s torage/innobase/buf/buf0flu.c:1499 #7 0x000000000083091e in buf_flush_batch (buf_pool=0x159b4708, flush_type=BUF_FLUSH_LRU, min_n=18446744073709551614, lsn_limit=69091037851) at /s/bld/558orig/storage/innobase/buf/buf0fl u.c:1621 #8 0x0000000000832560 in buf_flush_list (min_n=18446744073709551614, lsn_limit=69091037851) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1929 #9 0x000000000088a7f5 in log_check_margins () at /s/bld/558orig/storage/innobase/log/log0log.c:1658 #10 0x00000000008af335 in que_run_threads (thr=<value optimized out>) at /s/bld/558orig/storage/innobase/include/log0log.ic:443 #11 0x00000000007ecf47 in trx_purge (limit=3757588) at /s/bld/558orig/storage/innobase/trx/trx0purge.c:1184 #12 0x00000000007e1692 in srv_master_thread (arg=<value optimized out>) at /s/bld/558orig/storage/innobase/srv/srv0srv.c:2584 #13 0x0000003f7dc062f7 in start_thread () from /lib64/libpthread.so.0 #14 0x0000003f7d4d1e3d in clone () from /lib64/libc.so.6
[11 Jan 2011 1:26]
Sunny Bains
Looks like row0purge.c may be forcing extra flushing. Can you verify with the following patch?

=== modified file 'storage/innobase/row/row0purge.c'
--- storage/innobase/row/row0purge.c	revid:vasil.dimov@oracle.com-20110108145236-ffp48jfkfiq0n46r
+++ storage/innobase/row/row0purge.c	2011-01-11 01:24:11 +0000
@@ -282,7 +282,6 @@
 	mtr_t		mtr;
 	enum row_search_result	search_result;
 
-	log_free_check();
 	mtr_start(&mtr);
 
 	search_result = row_search_index_entry(index, entry, BTR_MODIFY_TREE,
@@ -362,8 +361,6 @@
 	btr_pcur_t	pcur;
 	enum row_search_result	search_result;
 
-	log_free_check();
-
 	mtr_start(&mtr);
 
 	/* Set the purge node for the call to row_purge_poss_sec(). */

This may possibly break correctness, I haven't checked, but I would like to see its effect on flushing. The problem with the way it is currently written is that it will do it multiple times if row_purge_remove_sec_if_poss_leaf() fails.
[11 Jan 2011 1:45]
Mark Callaghan
The benchmark doesn't update indexed columns
[11 Jan 2011 3:00]
Sunny Bains
#5 0x000000000082fc58 in buf_flush_try_neighbors (space=15, offset=11741, flush_type=BUF_FLUSH_LIST, n_flushed=230474, n_to_flush=18446744073709551614) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:952 230474 pages is a serious number of pages for one flush.
[11 Jan 2011 3:11]
Mark Callaghan
log_preflush_pool_modified_pages calls buf_flush_list with no limit on the number of pages to flush. The limit is on max LSN, and when dirty neighbor pages are also flushed, that adds up. n_pages = buf_flush_list(ULINT_MAX, new_oldest);
[11 Jan 2011 3:17]
Sunny Bains
Yes, I realize that, but 230474 pages flushed to LSN is still a lot of pages, 230474 * 16 * 1024 = ~3G.
[11 Jan 2011 3:21]
Inaam Rana
This note is about the stalls (not about purge lag).

* data_wr - rate of Innodb_data_writes
* page_wr - rate of Innodb_dblwr_pages_written

qps ups pdirty pflush data_wr page_wr dsync lsync
...
...
0 0 434599 0 83 2329 68 0
0 0 413287 0 83 2124 64 0
0 0 388775 0 95 2457 74 0
422 211 368582 24861 271 2075 60 0
0 0 345542 0 71 2291 60 0
0 0 319782 0 82 2585 67 0

The data_wr in this case is the number of write IOs, i.e. calls to os_file_pwrite(). As you are using simulated AIO it is possible, indeed very likely, that IO requests are coalesced, giving us this rather small number, i.e. each IO operation was much bigger than a single page. What also points in this direction is that the page_wr values, which represent the number of pages written from the doublewrite buffer, are much higher. The pattern above looks like a classic case where background flushing was unable to keep pace with the dirtying activity. The question to answer is why we lag behind so woefully with 5.5.8. There are two possible explanations:

1) In 5.5 we enforce innodb_io_capacity more strictly. If we flush innodb_io_capacity pages in one batch we sleep for the rest of the second; in 5.1 we skip the sleep. In cases where innodb_io_capacity is set very low (like the default value of 200), the effect in 5.1 is that we are able to flush much more than 200 because of the skipped sleep. As we fixed this subtle bug in 5.5, the result is that we flush at most 200 pages in one second. If you want more pages flushed because you think your IO subsystem has more capacity, then you should set innodb_io_capacity to a higher value. I see that in your benchmark you have not set this parameter, which means the default value of 200 is used. If you can retry with a higher value of, say, 4000, maybe that will resolve the stall problem.

2) Can you retry with native aio enabled, just to see the difference? We have mostly tested with native aio on Linux and it may be that when writing the native aio code I introduced some regression in the simulated aio code.

#7 0x000000000083091e in buf_flush_batch (buf_pool=0x159b4708, flush_type=BUF_FLUSH_LRU, min_n=18446744073709551614, lsn_limit=69091037851) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1621

This is strange. flush_type should be BUF_FLUSH_LIST, not BUF_FLUSH_LRU. Maybe optimization has screwed this up. Further up in the stack we see BUF_FLUSH_LIST, which is the right value.
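A sketch of trying suggestion 1) without a restart (innodb_io_capacity is dynamic in 5.5 and needs SUPER privilege; native AIO, by contrast, is controlled by the startup-only innodb_use_native_aio option and requires a server restart):

# Raise the background flushing budget on the running server and confirm it.
mysql -e "SET GLOBAL innodb_io_capacity = 4000"
mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_io_capacity'"

# Native AIO has to be set in my.cnf before startup, e.g. innodb_use_native_aio=ON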
[11 Jan 2011 3:27]
Mark Callaghan
I didn't document it, but I have been using innodb_io_capacity=1000. The real capacity is closer to 1600. There is also a HW RAID card with a 256MB write cache which can absorb write bursts at a higher rate. I will try with innodb_io_capacity=2000
[11 Jan 2011 4:12]
Sunny Bains
Going by the n_flushed=230474 value, it is impossible for purge to catch up. Assuming it is the only one doing these massive flushes. Why isn't the master thread doing the flushing? Purge shouldn't have to do such massive flushes.
[11 Jan 2011 4:12]
Mark Callaghan
I am not sure if innodb_io_capacity is the reason. QPS per 20-second interval for 5.5.8 with innodb_io_capacity=2000, still lots of stalls 0 0.7 0.3 1825.8 3267.5 3568.2 3798.3 4165.7 4593.4 5031.2 5649.7 6037.9 7239.8 8424.5 10527.2 13817.4 18446.3 29599.5 51319.3 66186.0 70229.9 72659.1 74321.8 74712.6 75891.7 77234.2 77156.7 76265.0 75825.5 75748.8 74873.9 59240.6 0.3 481.6 0.3 0.7 0.3 38550.2 71336.0 73959.1 76251.5 76752.0 75159.6 76384.6 76460.2 75394.1 76457.2 73363.8 70129.8 76130.6 73811.6 70250.4 49935.2 0.7 0.3 0.7 0.3 0.7 --------- and 5.1.47 with innodb_io_capacity=1000, no stalls 1484.0 1451.5 4135.1 4672.7 4863.4 5289.3 5392.4 6019.1 6448.3 7269.2 8204.6 9099.6 10779.1 13292.0 17081.2 24265.7 41039.9 58226.6 62334.3 64135.1 65280.2 67507.3 67796.1 67166.9 67981.7 69349.0 70122.1 68304.7 69416.4 69188.9 71185.7 69237.3 67757.3 68835.3 67961.7 66663.1 66443.3 69977.4 69987.3 67996.5 68416.1 67625.7 69960.2 69583.8 69037.4 68291.8 69989.4 67236.0
[11 Jan 2011 4:28]
Inaam Rana
Mark, can you try setting io_capacity to unlimited (something like 999999)? If that doesn't work, then try enabling native aio. Thanks for your effort to nail this down.

Sunny, it is in fact the master thread that is doing the purge, and during purge it encountered the log_free_check() call. When we cross this limit, one thread does the flushing and all other threads wait for the huge batch to finish. Therefore, I don't think it really matters which thread is doing the batch.
[11 Jan 2011 4:35]
Mark Callaghan
Anything calling log_free_check can trigger this, as log_free_check -> log_check_margins -> log_checkpoint_margins -> log_preflush_pool_modified_pages -> buf_flush_list(ULINT_MAX, ...)

------

After starting 5.5.8, this message is in the db error log:

110110 20:33:39 InnoDB: Warning: purge reached the head of the history list,
InnoDB: but its length is still reported as 3819087! Make a detailed bug
[11 Jan 2011 5:32]
Sunny Bains
Inaam, even if it is the master thread the problem is that such a huge flush by purge will prevent it from catching up with user threads.
[11 Jan 2011 6:46]
Mark Callaghan
The comment on the return value is wrong. This returns FALSE when buf_pool_mutex was not released:

/********************************************************************//**
Check if the block is modified and ready for flushing. If the the block
is ready to flush then flush the page and try o flush its neighbors.
@return TRUE if buf_pool mutex was not released during this function.
This does not guarantee that some pages were written as well. Number
of pages written are incremented to the count. */
static
ibool
buf_flush_page_and_try_neighbors(
...
	buf_pool_mutex_enter(buf_pool);
	flushed = TRUE;
...
	return(flushed)

--------

buf_flush_flush_list_batch does more work in the inner loop as it gets/releases the flush list mutex on each iteration after a block has been flushed.

--------

I don't think neighbor pages should be flushed in this case as there is no limit on the number of pages that can be flushed. The limit is on the oldest_modification value.

--------

buf_flush_flush_list_batch has an optimization to not start over from the end of the flush list after finding a block to flush. When the optimization is used, there is no check to confirm that the lsn of the current block is large enough that the function can return. This is different from what 5.1 did, and I have confirmed that this code continues past the block where it could have stopped.
[11 Jan 2011 8:02]
Mark Callaghan
What I see under load is:
1) srv_master_thread does its normal work with many calls to buf_flush_list from adaptive flushing
2) then it gets stuck in purge. I have been running with innodb_purge_threads=0.

pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,trx_purge_choose_next_log,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone

It stays in purge for a long time, so it stops flushing dirty pages, leading to the eventual problem, which is a huge amount of checkpoint IO that must eventually get done.
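For completeness, a quick way to confirm which purge mode a 5.5 server is running in (innodb_purge_threads is a startup option, not dynamic):

# 0 means purge runs inside srv_master_thread; 1 means a dedicated purge thread.
mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_purge_threads'"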
[11 Jan 2011 8:08]
Sunny Bains
We are certainly flushing more than we need to. That is most likely the bug. The 3G flush that I noted earlier is just too much for a single flush. Now onto the why:

1. The comment is wrong.
2. There is an additional flush list mutex enter/exit in 5.5; we can ignore that.
3. The 5.5 optimisation is the same as 5.1.
4. In 5.1 we have:

	mutex_enter(block_mutex);
	ready = buf_flush_ready_for_flush(bpage, flush_type);
	mutex_exit(block_mutex);

	if (ready) {
		space = buf_page_get_space(bpage);
		offset = buf_page_get_page_no(bpage);

		buf_pool_mutex_exit();

		/* Try to flush also all the neighbors */
		page_count += buf_flush_try_neighbors(
			space, offset, flush_type);

		buf_pool_mutex_enter();
		goto flush_next;

In 5.5 we have:

	/* The list may change during the flushing and we cannot
	safely preserve within this function a pointer to a
	block in the list! */
	while (bpage != NULL
	       && len > 0
	       && !buf_flush_page_and_try_neighbors(
			bpage, BUF_FLUSH_LIST, min_n, &count)) {

		buf_flush_list_mutex_enter(buf_pool);

		/* If we are here that means that buf_pool->mutex
		was not released in buf_flush_page_and_try_neighbors()
		above and this guarantees that bpage didn't get
		relocated since we released the flush_list mutex above.
		There is a chance, however, that the bpage got removed
		from flush_list (not currently possible because
		flush_list_remove() also obtains buf_pool mutex but
		that may change in future). To avoid this scenario we
		check the oldest_modification and if it is zero we
		start all over again. */
		if (bpage->oldest_modification == 0) {
			buf_flush_list_mutex_exit(buf_pool);
			break;
		}

		bpage = UT_LIST_GET_PREV(list, bpage);

		ut_ad(!bpage || bpage->in_flush_list);

		buf_flush_list_mutex_exit(buf_pool);

		--len;
	}

We are doing the same thing as 5.1; the check for "ready" in 5.1 is the return value of buf_flush_page_and_try_neighbors().
[11 Jan 2011 8:12]
Sunny Bains
Where exactly in purge is it stuck? Is purge just slower with 128 rollback segments due to trx_purge_choose_next_log() ?
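For context, a rough self-contained sketch (not the actual trx0purge.c code) of what choosing the next undo log implies if the record with the smallest trx_no has to be found by scanning every rollback segment: the cost per purged record grows roughly linearly with the number of segments (1 in 5.1 vs 128 in 5.5).

#include <stdio.h>

#define N_RSEGS 128

typedef struct {
    unsigned long long last_trx_no;  /* trx_no of the oldest unpurged log */
    int                has_work;
} rseg_t;

/* Return the index of the rollback segment whose oldest unpurged undo log
   has the minimum trx_no, or -1 if none has work. */
static int choose_next_log(const rseg_t *rsegs, int n)
{
    int best = -1;

    for (int i = 0; i < n; i++) {
        if (rsegs[i].has_work
            && (best < 0 || rsegs[i].last_trx_no < rsegs[best].last_trx_no)) {
            best = i;
        }
    }
    return best;
}

int main(void)
{
    rseg_t rsegs[N_RSEGS];

    for (int i = 0; i < N_RSEGS; i++) {
        rsegs[i].last_trx_no = 1000 + (unsigned long long)((i * 37) % 128);
        rsegs[i].has_work    = 1;
    }

    /* Every purged record pays for one scan like this; with 128 segments
       that is roughly 128x the work of the single-rseg case in 5.1. */
    printf("next rseg to purge: %d\n", choose_next_log(rsegs, N_RSEGS));
    return 0;
}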
[11 Jan 2011 8:38]
Dimitri KRAVTCHUK
It seems that Mark did not use the purge thread in his tests.. - the purge thread feature exists exactly for this reason, and honestly, I think the purge thread should *always* be enabled (and probably there should not even be an option to disable it).. But that does not explain to me why a similar issue was not observed on 5.1?.. Mark, could you please post your my.cnf settings here too?.. Rgds, -Dimitri
[11 Jan 2011 16:28]
Mark Callaghan
I still think 5.5 does more work, as lock/unlock on the flush list mutex should encounter more contention than lock/unlock on block->mutex as was done in 5.1.

In 5.5, after a page and its neighbors have been flushed, there is an optimization to not start over from the end of the flush list. In 5.1, the search always goes back to the end of the flush list. While I think this is a good change, it also skips the check to halt the search by comparing oldest_modification with lsn_limit and callers can stay in longer than they need to. I don't think that is good.

Dimitri -- this problem was less of an issue in 5.1 because with only 1 rollback segment to iterate over there is less work to do in trx_purge.

Inaam suggested trying innodb_io_capacity=99999. I don't think that will help, as in some cases it will call buf_flush_batch with much larger values and those calls will take more time and disrupt the balance of the main background thread.

I supplied several thread stacks above to show where purge gets stuck, but will save the next round of data for my next update.

my.cnf settings that I use:

innodb_buffer_pool_size=16G
innodb_log_file_size=1900M
innodb_flush_log_at_trx_commit=2
innodb_doublewrite=1
innodb_flush_method=O_DIRECT
innodb_thread_concurrency=0
innodb_max_dirty_pages_pct=80
innodb_file_format=barracuda
innodb_file_per_table
max_connections=2000
table_cache=2000
key_buffer_size=2000M
innodb_io_capacity=1000
innodb_thread_concurrency=32
[11 Jan 2011 16:53]
Mark Callaghan
I started with a new database (dropped all files, ran sysbench prepare) and a warm buffer pool. Then I waited for the sysbench clients to become idle from the stall and ran PMP in a loop to get thread stacks. I also added printfs to some of the buffer flush related routines. In this case I see srv_master_thread -> trx_purge -> buf_flush_list taking a very long time because on consecutive calls to trx_purge, buf_flush_list flushes 140,000 and then 280,000 pages. A typical thread stack: Thread 67 (Thread 1232353600 (LWP 15022)): #0 0x0000003f7dc0a496 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000089ace9 in os_event_wait_low (event=0x20886b70, reset_sig_count=9493) at /s/bld/558orig/storage/innobase/os/os0sync.c:207 #2 0x000000000082f29d in buf_flush_buffered_writes () at /s/bld/558orig/storage/innobase/buf/buf0flu.c:655 #3 0x00000000008310b8 in buf_flush_try_neighbors (space=1, offset=20602, flush_type=BUF_FLUSH_LIST, n_flushed=84441, n_to_flush=18446744073709551614) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:952 #4 0x0000000000831832 in buf_flush_page_and_try_neighbors (bpage=<value optimized out>, flush_type=BUF_FLUSH_LIST, n_to_flush=18446744073709551614, count=0x49742d88) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1499 #5 0x0000000000831e3b in buf_flush_batch (buf_pool=0x208ba708, flush_type=BUF_FLUSH_LIST, min_n=18446744073709551614, lsn_limit=11704623228) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1621 #6 0x0000000000832703 in buf_flush_list (min_n=18446744073709551614, lsn_limit=11704623228) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1943 #7 0x000000000088a9e5 in log_check_margins () at /s/bld/558orig/storage/innobase/log/log0log.c:1658 #8 0x00000000008af525 in que_run_threads (thr=<value optimized out>) at /s/bld/558orig/storage/innobase/include/log0log.ic:443 #9 0x00000000007ecfd7 in trx_purge (limit=8906108) at /s/bld/558orig/storage/innobase/trx/trx0purge.c:1184 #10 0x00000000007e16d0 in srv_master_thread (arg=<value optimized out>) at /s/bld/558orig/storage/innobase/srv/srv0srv.c:2586 #11 0x0000003f7dc062f7 in start_thread () from /lib64/libpthread.so.0 #12 0x0000003f7d4d1e3d in clone () from /lib64/libc.so.6 But I don't think the 5.5 optimization is the problem. With more printfs, this is the work done in buf_flush_flush_list batch when 70,000 pages are flushed in one call. I added printfs to display the page flush count each time the search starts over and each time the optimization is used. Almost all of the work is done from starting the search over at the end of the list. buf_flush_start for LIST buf_flush_batch LIST enter 18446744073709551614 min, 23896715046 lsn buf_flush_flush_list_batch start at end 64 count buf_flush_flush_list_batch start at end 128 count buf_flush_flush_list_batch start at end 192 count buf_flush_flush_list_batch start at end 256 count buf_flush_flush_list_batch start at end 320 count buf_flush_flush_list_batch start at end 384 count buf_flush_flush_list_batch start at end 448 count buf_flush_flush_list_batch start at end 512 count buf_flush_flush_list_batch start at end 576 count buf_flush_flush_list_batch start at end 640 count ... 
buf_flush_flush_list_batch start at end 73259 count buf_flush_flush_list_batch start at end 73323 count buf_flush_flush_list_batch start at end 73387 count buf_flush_flush_list_batch start at end 73451 count buf_flush_flush_list_batch start at end 73515 count buf_flush_flush_list_batch optimization when lsn limit reached, 73515 count buf_flush_flush_list_batch start at end 73579 count buf_flush_flush_list_batch optimization when lsn limit reached, 73579 count buf_flush_flush_list_batch optimization when lsn limit reached, 73579 count buf_flush_flush_list_batch start at end 73643 count buf_flush_batch flush writes buf_flush_batch LIST flushed 73643 buf_flush_end for LIST
[11 Jan 2011 17:05]
Mark Callaghan
I still see stalls first in pure purge code that occur prior to the stalls from trx_purge -> buf_flush_list. This is the problem that stalls srv_master_thread in trx_purge and prevents it from doing anything else which can explain why the next call to buf_flush_list after this stall requires a huge amount of IO. Examples are: pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,trx_purge_choose_next_log,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,trx_purge_truncate_rseg_history,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone #0 0x0000003f7dc0a496 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000089ad99 in os_event_wait_low (event=0x2d55f060, reset_sig_count=33) at /s/bld/558orig/storage/innobase/os/os0sync.c:207 #2 0x00000000007e8742 in sync_array_wait_event (arr=0x10feddb0, index=1) at /s/bld/558orig/storage/innobase/sync/sync0arr.c:457 #3 0x00000000007e966f in mutex_spin_wait (mutex=0x2d55ef50, file_name=0xa33108 "/s/bld/558orig/storage/innobase/trx/trx0purge.c", line=521) at /s/bld/558orig/storage/innobase/sync/sync0sync.c:621 #4 0x00000000007ef181 in trx_purge_truncate_rseg_history (rseg=0x2d55ef48, limit_trx_no=138501018, limit_undo_no=0) at /s/bld/558orig/storage/innobase/include/sync0sync.ic:222 #5 0x00000000007ef8ac in trx_purge_fetch_next_rec (roll_ptr=<value optimized out>, cell=0x2d56ae10, heap=0x2d56b2c0) at /s/bld/558orig/storage/innobase/trx/trx0purge.c:636 #6 0x00000000008b9bf1 in row_purge_step (thr=0x2d56ac10) at /s/bld/558orig/storage/innobase/row/row0purge.c:725 #7 0x00000000008af8c7 in que_run_threads (thr=<value optimized out>) at /s/bld/558orig/storage/innobase/que/que0que.c:1247 #8 0x00000000007ecfd7 in trx_purge (limit=1696160) at /s/bld/558orig/storage/innobase/trx/trx0purge.c:1184 #9 0x00000000007e16d0 in srv_master_thread (arg=<value optimized out>) at /s/bld/558orig/storage/innobase/srv/srv0srv.c:2586 #10 0x0000003f7dc062f7 in start_thread () from /lib64/libpthread.so.0 #11 0x0000003f7d4d1e3d in clone () from /lib64/libc.so.6 #0 0x0000003f7dc0a496 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000089ad99 in os_event_wait_low (event=0x2d549c90, reset_sig_count=43) at /s/bld/558orig/storage/innobase/os/os0sync.c:207 #2 0x00000000007e8742 in sync_array_wait_event (arr=0x10feddb0, index=1) at /s/bld/558orig/storage/innobase/sync/sync0arr.c:457 #3 0x00000000007e966f in mutex_spin_wait (mutex=0x2d549b80, file_name=0xa33108 "/s/bld/558orig/storage/innobase/trx/trx0purge.c", line=792) at /s/bld/558orig/storage/innobase/sync/sync0sync.c:621 #4 0x00000000007edb09 in trx_purge_choose_next_log () at /s/bld/558orig/storage/innobase/include/sync0sync.ic:222 #5 0x00000000007ef572 in trx_purge_fetch_next_rec (roll_ptr=<value optimized out>, cell=<value optimized out>, heap=0x2d56b2c0) at /s/bld/558orig/storage/innobase/trx/trx0purge.c:899 #6 0x00000000008b9bf1 in row_purge_step (thr=0x2d56ac10) at /s/bld/558orig/storage/innobase/row/row0purge.c:725 #7 0x00000000008af8c7 in que_run_threads (thr=<value optimized out>) at /s/bld/558orig/storage/innobase/que/que0que.c:1247 #8 0x00000000007ecfd7 in trx_purge (limit=2447520) at /s/bld/558orig/storage/innobase/trx/trx0purge.c:1184 #9 0x00000000007e16d0 in srv_master_thread (arg=<value optimized out>) at 
/s/bld/558orig/storage/innobase/srv/srv0srv.c:2586 #10 0x0000003f7dc062f7 in start_thread () from /lib64/libpthread.so.0 #11 0x0000003f7d4d1e3d in clone () from /lib64/libc.so.6
[11 Jan 2011 17:11]
Mark Callaghan
I added more instrumentation and print out the values used by log_checkpoint_margin when computing arguments for log_preflush_pool_modified_pages. In this case, the async checkpoint limits are used ("0 sync"). Note that advance has a large value (almost 1M) and buf_flush_list ends up flushing almost 100,000 pages. But the huge flush done by buf_flush_list is a symptom of the problem. The cause is the earlier stalls in trx_purge from the overhead of 128 rollback segments. I will try innodb_purge_threads=1 next. log_preflush 0 sync, 35792609753 lsn, 32769015802 oldest, 2822766225 age, 962638 advance, 32769978440 new buf_flush_list 99085 return, 32769978440 lsn_limit, 0 skipped, 99085 total_page_count for 1 instances 0x413e2858 ptr
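To make the arithmetic behind these numbers explicit, here is a small self-contained sketch (not the actual log_checkpoint_margin() source; the stock code may scale 'advance' differently) showing how the preflush target follows from the values above:

#include <stdio.h>

typedef unsigned long long lsn_t;

int main(void)
{
    /* Numbers from the trace above. */
    lsn_t lsn           = 35792609753ULL;  /* current log lsn        */
    lsn_t oldest        = 32769015802ULL;  /* oldest dirty page lsn  */
    lsn_t age           = 2822766225ULL;   /* dirty-page age         */
    lsn_t max_age_async = 2821803587ULL;   /* async flush limit      */

    if (age > max_age_async) {
        lsn_t advance    = age - max_age_async;   /* 962638 in the trace */
        lsn_t new_oldest = oldest + advance;      /* 32769978440         */

        printf("preflush up to lsn %llu (advance %llu bytes of redo)\n",
               new_oldest, advance);
        /* Every dirty page with oldest_modification < new_oldest must be
           flushed before the checkpoint can advance this far, which is
           how a single buf_flush_list() call ends up covering ~100,000
           pages in this run. */
    }

    (void) lsn;
    return 0;
}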
[11 Jan 2011 17:32]
Dimitri KRAVTCHUK
Mark, you are simply hitting the issue I've described many times in the past (and which nobody believed was even possible) - have a look at: http://dimitrik.free.fr/db_STRESS_MySQL_InnoDB_dirty_pages_and_log_size_impact_May2009.htm... and then the final fix: http://dimitrik.free.fr/blog/archives/2009/08/mysql-performance-final-fix-for-ahead-flushi... Again: you should always keep the purge thread enabled to keep InnoDB working "as designed".. - otherwise, under heavy purge you'll block the Master thread from doing anything else (and that's why I think there should no longer be an option within InnoDB to disable the purge thread).. Rgds, -Dimitri
[11 Jan 2011 17:37]
Mark Callaghan
With innodb_purge_threads=1 I still get stalls where QPS drops to almost 0 for 60+ seconds. And huge flushes (more than 80,000 pages) still occur from buf_flush_list. Here are thread stacks when that occurs: os_event_wait_low,buf_flush_wait_batch_end,srv_master_thread,start_thread,clone os_event_wait_low,buf_flush_wait_batch_end,log_check_margins,que_run_threads,trx_purge,srv_purge_thread,start_thread,clone os_event_wait_low,buf_flush_buffered_writes,buf_flush_try_neighbors,buf_flush_page_and_try_neighbors,buf_flush_batch,buf_flush_list,log_check_margins,row_upd_step,row_update_for_mysql,ha_innobase::update_row,handler::ha_update_row,mysql_update,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone And the full stack for the flushing thread #0 0x0000003f7dc0a496 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000089ad99 in os_event_wait_low (event=0x422eb70, reset_sig_count=6078) at /s/bld/558orig/storage/innobase/os/os0sync.c:207 #2 0x000000000082f29d in buf_flush_buffered_writes () at /s/bld/558orig/storage/innobase/buf/buf0flu.c:655 #3 0x00000000008310b8 in buf_flush_try_neighbors (space=6, offset=27020, flush_type=BUF_FLUSH_LIST, n_flushed=98933, n_to_flush=18446744073709551614) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:952 #4 0x0000000000831832 in buf_flush_page_and_try_neighbors (bpage=<value optimized out>, flush_type=BUF_FLUSH_LIST, n_to_flush=18446744073709551614, count=0x4a322788) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1499 #5 0x0000000000831e48 in buf_flush_batch (buf_pool=0x4262708, flush_type=BUF_FLUSH_LIST, min_n=18446744073709551614, lsn_limit=45321000437) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1622 #6 0x0000000000832733 in buf_flush_list (min_n=18446744073709551614, lsn_limit=45321000437) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1944 #7 0x000000000088aaff in log_check_margins () at /s/bld/558orig/storage/innobase/log/log0log.c:1658 #8 0x00000000007dc93e in row_upd_step (thr=0x2aaf5815fce8) at /s/bld/558orig/storage/innobase/include/log0log.ic:443 #9 0x00000000007c86d7 in row_update_for_mysql (mysql_rec=<value optimized out>, prebuilt=0x2aaf580da6f8) at /s/bld/558orig/storage/innobase/row/row0mysql.c:1453 #10 0x00000000007b622e in ha_innobase::update_row (this=0x2aaf6000d4c0, old_row=0x2aaf6000d780 "<FF>2\233\037", new_row=0x2aaf6000d6c0 "<FF>2\233\037") at /s/bld/558orig/storage/innobase/handler/ha_innodb.cc:5262 #11 0x0000000000682f77 in handler::ha_update_row (this=0x2aaf6000d4c0, old_data=0x2aaf6000d780 "<FF>2\233\037", new_data=0x2aaf6000d6c0 "<FF>2\233\037") at /s/bld/558orig/sql/handler.cc:4782 #12 0x00000000005e2ef7 in mysql_update (thd=0x1f3b9400, table_list=0x2aaf60007868, fields=@0x1f3bb518, values=@0x1f3bb8f0, conds=0x2aaf60008228, order_num=<value optimized out>, order=) at /s/bld/558orig/sql/sql_update.cc:707 #13 0x000000000057390a in mysql_execute_command (thd=0x1f3b9400) at /s/bld/558orig/sql/sql_parse.cc:2669 #14 0x000000000057835b in mysql_parse (thd=0x1f3b9400, rawbuf=<value optimized out>, length=<value optimized out>, parser_state=0x4a324de0) at /s/bld/558orig/sql/sql_parse.cc:5496 #15 0x00000000005794ea in dispatch_command (command=COM_QUERY, thd=0x1f3b9400, packet=<value optimized out>, packet_length=159) at /s/bld/558orig/sql/sql_parse.cc:1032 #16 0x0000000000579854 in do_command (thd=0x1f3b9400) at /s/bld/558orig/sql/sql_parse.cc:772 #17 0x000000000060e294 in do_handle_one_connection (thd_arg=0x1f3b9400) at 
/s/bld/558orig/sql/sql_connect.cc:745 #18 0x000000000060e384 in handle_one_connection (arg=<value optimized out>) at /s/bld/558orig/sql/sql_connect.cc:684 Also, loop counts from srv_master_thread don't increase as they should. These are taken at 5 second intervals: srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush srv_master_thread loops: 606 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush In one case, it got stuck behind a huge flush. Thread stacks in the next post.
[11 Jan 2011 17:44]
Mark Callaghan
Per my last update, innodb_purge_threads=1 still has the problem. I still get stalls (60+ seconds with near zero QPS). I still get huge flushes even with it enabled. That is the current mystery. Thread stacks in one case where srv_master_thread is stuck waiting for a long (100,000) page flush to end: the flusher: #0 0x0000003f7dc0a496 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000089ad99 in os_event_wait_low (event=0x422eb70, reset_sig_count=14464) at /s/bld/558orig/storage/innobase/os/os0sync.c:207 #2 0x000000000082f29d in buf_flush_buffered_writes () at /s/bld/558orig/storage/innobase/buf/buf0flu.c:655 #3 0x00000000008310b8 in buf_flush_try_neighbors (space=2, offset=46947, flush_type=BUF_FLUSH_LIST, n_flushed=185174, n_to_flush=18446744073709551614) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:952 #4 0x0000000000831832 in buf_flush_page_and_try_neighbors (bpage=<value optimized out>, flush_type=BUF_FLUSH_LIST, n_to_flush=18446744073709551614, count=0x4a2a0788) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1499 #5 0x0000000000831e48 in buf_flush_batch (buf_pool=0x4262708, flush_type=BUF_FLUSH_LIST, min_n=18446744073709551614, lsn_limit=51175735319) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1622 #6 0x0000000000832733 in buf_flush_list (min_n=18446744073709551614, lsn_limit=51175735319) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1944 #7 0x000000000088aaff in log_check_margins () at /s/bld/558orig/storage/innobase/log/log0log.c:1658 #8 0x00000000007dc93e in row_upd_step (thr=0x1f4cfad8) at /s/bld/558orig/storage/innobase/include/log0log.ic:443 #9 0x00000000007c86d7 in row_update_for_mysql (mysql_rec=<value optimized out>, prebuilt=0x1f491de8) at /s/bld/558orig/storage/innobase/row/row0mysql.c:1453 #10 0x00000000007b622e in ha_innobase::update_row (this=0x1f499090, old_row=0x1f499350 "<FF>\032<E8>\034", new_row=0x1f499290 "<FF>\032<E8>\034") at /s/bld/558orig/storage/innobase/handler/ha_innodb.cc:5262 #11 0x0000000000682f77 in handler::ha_update_row (this=0x1f499090, old_data=0x1f499350 "<FF>\032<E8>\034", new_data=0x1f499290 "<FF>\032<E8>\034") at /s/bld/558orig/sql/handler.cc:4782 #12 0x00000000005e2ef7 in mysql_update (thd=0x2aaf581137d0, table_list=0x1f3bc988, fields=@0x2aaf581158e8, values=@0x2aaf58115cc0, conds=0x1f3bd348, order_num=<value optimized out>, order=) at /s/bld/558orig/sql/sql_update.cc:707 #13 0x000000000057390a in mysql_execute_command (thd=0x2aaf581137d0) at /s/bld/558orig/sql/sql_parse.cc:2669 #14 0x000000000057835b in mysql_parse (thd=0x2aaf581137d0, rawbuf=<value optimized out>, length=<value optimized out>, parser_state=0x4a2a2de0) at /s/bld/558orig/sql/sql_parse.cc:5496 #15 0x00000000005794ea in dispatch_command (command=COM_QUERY, thd=0x2aaf581137d0, packet=<value optimized out>, packet_length=159) at /s/bld/558orig/sql/sql_parse.cc:1032 #16 0x0000000000579854 in do_command (thd=0x2aaf581137d0) at /s/bld/558orig/sql/sql_parse.cc:772 #17 0x000000000060e294 in do_handle_one_connection (thd_arg=0x2aaf581137d0) at /s/bld/558orig/sql/sql_connect.cc:745 #18 0x000000000060e384 in handle_one_connection (arg=<value optimized out>) at /s/bld/558orig/sql/sql_connect.cc:684 srv_master_thread #0 0x0000003f7dc0a496 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000089ad99 in os_event_wait_low (event=0x1f262ea0, reset_sig_count=2238) at /s/bld/558orig/storage/innobase/os/os0sync.c:207 #2 0x000000000082e5dc in buf_flush_wait_batch_end (buf_pool=<value optimized out>, type=BUF_FLUSH_LIST) at 
/s/bld/558orig/storage/innobase/buf/buf0flu.c:1845 #3 0x00000000007e1a6a in srv_master_thread (arg=<value optimized out>) at /s/bld/558orig/storage/innobase/srv/srv0srv.c:2956 #0 0x0000003f7dc0a496 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000089ad99 in os_event_wait_low (event=0x1f262ea0, reset_sig_count=2238) at /s/bld/558orig/storage/innobase/os/os0sync.c:207 #2 0x000000000082e5dc in buf_flush_wait_batch_end (buf_pool=<value optimized out>, type=BUF_FLUSH_LIST) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1845 #3 0x000000000088adfc in log_check_margins () at /s/bld/558orig/storage/innobase/log/log0log.c:1661 #4 0x00000000008af5d5 in que_run_threads (thr=<value optimized out>) at /s/bld/558orig/storage/innobase/include/log0log.ic:443 #5 0x00000000007ecfd7 in trx_purge (limit=18623221) at /s/bld/558orig/storage/innobase/trx/trx0purge.c:1184 #6 0x00000000007e445c in srv_purge_thread (arg=<value optimized out>) at /s/bld/558orig/storage/innobase/srv/srv0srv.c:3125
[11 Jan 2011 18:24]
Inaam Rana
Mark,

I still think adaptive_flushing is the problem. The following also points to that:

srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush
srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush
srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush
srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush
srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush
srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush

So out of 605 1_second loops we end up sleeping in 584 iterations. That is when most of the work piles up. How many sleeps do you see in 5.1?

If you can resurrect the 'skip_sleep' code of 5.1, then my guess is that even with a lower io_capacity of 1000 or 2000 you'll see the problem being mitigated/delayed (yes, I suspect that even in 5.1 we'll probably hit this issue, but it is delayed because of aggressive flushing and overall more contention in 5.1).

If you try the above, please make sure that we are running a separate purge thread.
[11 Jan 2011 19:39]
Mark Callaghan
I restored the old behavior where skip_sleep is set to true when dirty page flushing is done. I also made it skip sleep when the suggested adaptive flush rate was 10% or more of innodb_io_capacity. The stalls are not as bad with this change. But they are still bad (20 seconds of zero QPS rather than 60+ seconds). Here is an example of output scraped from SHOW ENGINE INNODB STATUS. The age column is the checkpoint age divided by 1M. The data is scraped every 5 seconds and immediately prior to the stall (where 0 rows updated/second) there are two rounds where srv_master_thread sleeps 5 times in the one second loop. This test will never exceed the max dirty pages pct as that is set to 80%, the buffer pool has ~1M pages and the data fits in less than half of that. So it appears that in the middle of the test when the server is very busy, that buf_flush_get_desired_rate() returns a value that is less than 10% of innodb_io_capacity. History list length 277068 srv_master_thread loops: 910 1_second, 178 sleeps, 90 10_second, 194 background, 194 flush 0.00 inserts/s, 37861.23 updates/s, 0.00 deletes/s, 75705.46 reads/s Database pages 489256 Modified db pages 426194 Age 2420 History list length 312354 srv_master_thread loops: 916 1_second, 178 sleeps, 91 10_second, 194 background, 194 flush 0.00 inserts/s, 38147.77 updates/s, 0.00 deletes/s, 76261.35 reads/s Database pages 490254 Modified db pages 428179 Age 2518 History list length 348031 srv_master_thread loops: 920 1_second, 179 sleeps, 91 10_second, 194 background, 194 flush 0.00 inserts/s, 31539.91 updates/s, 0.00 deletes/s, 63060.82 reads/s Database pages 491304 Modified db pages 430507 Age 2615 History list length 381109 srv_master_thread loops: 936 1_second, 180 sleeps, 93 10_second, 194 background, 194 flush 0.00 inserts/s, 36643.27 updates/s, 0.00 deletes/s, 73267.75 reads/s Database pages 492264 Modified db pages 429442 Age 2709 History list length 359585 srv_master_thread loops: 986 1_second, 185 sleeps, 98 10_second, 194 background, 194 flush 0.00 inserts/s, 35649.67 updates/s, 0.00 deletes/s, 71268.75 reads/s Database pages 492328 Modified db pages 429126 Age 2802 History list length 367433 srv_master_thread loops: 1036 1_second, 190 sleeps, 103 10_second, 194 background, 194 flush 0.00 inserts/s, 31083.78 updates/s, 0.00 deletes/s, 62150.97 reads/s Database pages 492584 Modified db pages 424842 Age 2883 History list length 367433 srv_master_thread loops: 1046 1_second, 192 sleeps, 104 10_second, 195 background, 195 flush 0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s Database pages 492584 Modified db pages 391818 Age 2883 History list length 367433 srv_master_thread loops: 1046 1_second, 192 sleeps, 104 10_second, 195 background, 195 flush 0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s Database pages 492584 Modified db pages 355850 Age 2883 History list length 367433 srv_master_thread loops: 1046 1_second, 192 sleeps, 104 10_second, 195 background, 195 flush 0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s Database pages 492584 Modified db pages 321851 Age 2883 History list length 367433 srv_master_thread loops: 1046 1_second, 192 sleeps, 104 10_second, 195 background, 195 flush 0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s Database pages 492584 Modified db pages 284992 Age 2883
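For reference, a hedged sketch of that hack -- not the actual patch. The stub functions and constants are placeholders; only the control flow in the 1 second loop matters:

#include <stdio.h>

#define SRV_IO_CAPACITY 2000UL   /* innodb_io_capacity used in the test */

/* Stubs standing in for the real adaptive-flushing machinery. */
static unsigned long buf_flush_get_desired_rate_stub(void) { return 300; }
static unsigned long do_adaptive_flush_stub(unsigned long n) { return n; }
static void srv_master_sleep_stub(void) { puts("sleep 1s"); }

static void one_second_loop(int iterations)
{
    int skip_sleep = 0;

    for (int i = 0; i < iterations; i++) {
        if (!skip_sleep) {
            srv_master_sleep_stub();
        }
        skip_sleep = 0;

        unsigned long rate = buf_flush_get_desired_rate_stub();
        if (rate > SRV_IO_CAPACITY) {
            rate = SRV_IO_CAPACITY;
        }

        unsigned long flushed = do_adaptive_flush_stub(rate);

        /* Old 5.1-style behavior: if we flushed something, don't sleep. */
        if (flushed > 0) {
            skip_sleep = 1;
        }
        /* Additional hack: also skip the sleep when the suggested rate is
           at least 10% of innodb_io_capacity, even if this batch was a
           no-op because another thread already had a batch running. */
        if (rate >= SRV_IO_CAPACITY / 10) {
            skip_sleep = 1;
        }
    }
}

int main(void)
{
    one_second_loop(10);
    return 0;
}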
[11 Jan 2011 20:29]
Mark Callaghan
More data from a stall with old style (skip_sleep) management of sleep in 5.5. The data below is scraped at 5 second intervals. The pattern is: * initially the 1_second counter increases ~5 times per sample * initially there are 0, 1 or 2 sleeps per sample When the checkpoint age approaches the async checkpoint limit, then the 1_second counter is incremented many more times than once per second. This is because I changed the code to set skip_sleep=TRUE when buf_flush_get_desired_rate() > 10% of innodb_io_capacity, but buf_flush_list returns immediately because another thread is doing a flush. At that time, buf_flush_get_desired_rate() was returning a value great than 2000, and it was reduced to 2000 because innodb_io_capacity=2000. Eventually, there were ~7 consecutive calls where it returned a small value (~30) and sleep was not skipped. After that the main background loop stopped for 10+ seconds and was probably stuck waiting for a flush to end. Extra printfs show that one flush was for more than 130,000 pages. History list length 134694 srv_master_thread loops: 221 1_second, 97 sleeps, 21 10_second, 2 background, 2 flush 0.00 inserts/s, 35941.81 updates/s, 0.00 deletes/s, 71858.83 reads/s Database pages 448181 Modified db pages 419196 Age 2552 0.00 reads/s, 0 avg bytes/read, 32887.82 writes/s, 84.38 fsyncs/s 0.00 reads/s, 89.58 creates/s, 2669.07 writes/s History list length 132789 srv_master_thread loops: 227 1_second, 98 sleeps, 22 10_second, 2 background, 2 flush 0.00 inserts/s, 34695.66 updates/s, 0.00 deletes/s, 69371.13 reads/s Database pages 448245 Modified db pages 419222 Age 2642 0.00 reads/s, 0 avg bytes/read, 32019.40 writes/s, 74.19 fsyncs/s 0.00 reads/s, 12.80 creates/s, 2171.37 writes/s History list length 185957 srv_master_thread loops: 242 1_second, 99 sleeps, 24 10_second, 2 background, 2 flush 0.00 inserts/s, 35411.92 updates/s, 0.00 deletes/s, 70791.64 reads/s Database pages 450164 Modified db pages 420233 Age 2732 0.00 reads/s, 0 avg bytes/read, 32371.33 writes/s, 83.18 fsyncs/s 0.00 reads/s, 383.72 creates/s, 2521.70 writes/s History list length 376732 srv_master_thread loops: 302 1_second, 105 sleeps, 30 10_second, 2 background, 2 flush 0.00 inserts/s, 31787.70 updates/s, 0.00 deletes/s, 63552.07 reads/s Database pages 455384 Modified db pages 425393 Age 2822 0.00 reads/s, 0 avg bytes/read, 28886.35 writes/s, 80.15 fsyncs/s 0.00 reads/s, 869.86 creates/s, 2132.98 writes/s History list length 505579 srv_master_thread loops: 342 1_second, 110 sleeps, 34 10_second, 3 background, 3 flush 0.00 inserts/s, 25760.65 updates/s, 0.00 deletes/s, 51512.50 reads/s Database pages 458901 Modified db pages 417026 Age 2883 0.00 reads/s, 0 avg bytes/read, 23482.30 writes/s, 134.17 fsyncs/s 0.00 reads/s, 703.26 creates/s, 4082.38 writes/s History list length 505579 srv_master_thread loops: 342 1_second, 110 sleeps, 34 10_second, 3 background, 3 flush 0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s Database pages 458901 Modified db pages 394235 Age 2883 0.00 reads/s, 0 avg bytes/read, 181.16 writes/s, 139.57 fsyncs/s 0.00 reads/s, 0.00 creates/s, 4557.29 writes/s History list length 505579 srv_master_thread loops: 342 1_second, 110 sleeps, 34 10_second, 3 background, 3 flush 0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s Database pages 458901 Modified db pages 370103 Age 2883 0.00 reads/s, 0 avg bytes/read, 191.36 writes/s, 146.77 fsyncs/s 0.00 reads/s, 0.00 creates/s, 4825.43 writes/s History list length 505579 srv_master_thread loops: 342 1_second, 110 
sleeps, 34 10_second, 3 background, 3 flush 0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s Database pages 458901 Modified db pages 347842 Age 2883 0.00 reads/s, 0 avg bytes/read, 175.56 writes/s, 132.77 fsyncs/s 0.00 reads/s, 0.00 creates/s, 4451.31 writes/s
[11 Jan 2011 20:44]
Inaam Rana
I can explain the numbers and build a hypothesis about what is happening in the master thread. But please don't yell at me for the convoluted logic of the master thread (we are working on it in 5.6). So here is the story: -- During first fifteen second of the readings the master thread is trying hard to keep up with page dirtying pace. The adaptive flushing is working. We can say that because there are no sleeps for master thread. We can also say that master thread is not able to keep pace with the dirtying pace because modified pages and checkpoint age is increasing. History list length 277068 srv_master_thread loops: 910 1_second, 178 sleeps, 90 10_second, 194 background, 194 flush 0.00 inserts/s, 37861.23 updates/s, 0.00 deletes/s, 75705.46 reads/s Database pages 489256 Modified db pages 426194 Age 2420 History list length 312354 srv_master_thread loops: 916 1_second, 178 sleeps, 91 10_second, 194 background, 194 flush 0.00 inserts/s, 38147.77 updates/s, 0.00 deletes/s, 76261.35 reads/s Database pages 490254 Modified db pages 428179 Age 2518 History list length 348031 srv_master_thread loops: 920 1_second, 179 sleeps, 91 10_second, 194 background, 194 flush 0.00 inserts/s, 31539.91 updates/s, 0.00 deletes/s, 63060.82 reads/s Database pages 491304 Modified db pages 430507 Age 2615 -- Somewhere in the next 5 second interval the rot sets in. From some user thread log_preflush_buffer_pool with a huge flush batch is called in. We can deduce that from the fact that dirty pages are now decreasing and during this five second period the master thread actually shows going through 1_second loop 16 times (more on this below) History list length 381109 srv_master_thread loops: 936 1_second, 180 sleeps, 93 10_second, 194 background, 194 flush 0.00 inserts/s, 36643.27 updates/s, 0.00 deletes/s, 73267.75 reads/s Database pages 492264 Modified db pages 429442 Age 2709 Note the difference above 936 - 920 between in 1_second loops. -- In next two readings the master thread is fully choked because of the huge flush batch that is happening in some user thread. Here is what is happening: * master thread gets some value from adaptive_flushing * master thread tries to start a batch calling buf_flush_list() * because another batch is running master thread returns with failure immediately * because adaptive_flushing has provided a value > 10% of io_capacity therefore we skip sleep * master thread immediately starts next iteration * after going through 10 iterations master thread goes to 10 second loop * when doing 10 second stuff master thread restarts the 1 second loop again it sets skip_sleep = FALSE and therefore in first iteration it sleeps * If above assumptions are true then master thread should do 1 second loop 50 times, 10 second loop 5 times and 5 sleeps in the 5 second interval. The next two entries exactly show this History list length 359585 srv_master_thread loops: 986 1_second, 185 sleeps, 98 10_second, 194 background, 194 flush 0.00 inserts/s, 35649.67 updates/s, 0.00 deletes/s, 71268.75 reads/s Database pages 492328 Modified db pages 429126 Age 2802 History list length 367433 srv_master_thread loops: 1036 1_second, 190 sleeps, 103 10_second, 194 background, 194 flush 0.00 inserts/s, 31083.78 updates/s, 0.00 deletes/s, 62150.97 reads/s Database pages 492584 Modified db pages 424842 Age 2883 -- Although in above ten seconds things are coming to a screeching halt but there are still some threads (may be purge guy) working. This can be deduced from changes in the history list. 
What this means is that there is still some activity that keeps the master thread in the 1 and 10 second loops. But by now all threads are waiting for the huge flush batch to finish. There is no server activity. Accordingly, the master thread assumes that the server is 'idle', leaves the 10 second loop and goes to the background and flush loops. In the flush loop the master thread hits buf_flush_wait_batch_end() and blocks there.

If this is true then there should be no activity reported (like updates or changes in history length), there should be no increment to the 1 and 10 second loops, and there should be an increment of 1 from the previous reading of the background and flush loops. This is exactly what we see in the next readings:

History list length 367433
srv_master_thread loops: 1046 1_second, 192 sleeps, 104 10_second, 195 background, 195 flush
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
Database pages 492584 Modified db pages 391818 Age 2883

History list length 367433
srv_master_thread loops: 1046 1_second, 192 sleeps, 104 10_second, 195 background, 195 flush
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
Database pages 492584 Modified db pages 355850 Age 2883

History list length 367433
srv_master_thread loops: 1046 1_second, 192 sleeps, 104 10_second, 195 background, 195 flush
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
Database pages 492584 Modified db pages 321851 Age 2883

So much for the explanation. I hope it is clear as mud. Now let's think about why we don't see this in 5.1 and what we can do to avoid it.
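To make the counter arithmetic concrete, here is a toy simulation (not the server code) of the scenario above: buf_flush_list() fails immediately on every attempt, skip_sleep stays set inside the 1 second loop, and each pass adds 10 to the 1_second counter, 1 sleep and 1 visit to the 10 second loop. Five passes reproduce the jump from 986/185/98 to 1036/190/103 in the readings.

#include <stdio.h>

int main(void)
{
    long n_1_second = 986, n_sleeps = 185, n_10_second = 98;

    for (int pass = 0; pass < 5; pass++) {        /* ~5 seconds of passes  */
        int skip_sleep = 0;                        /* reset after 10s loop  */

        for (int i = 0; i < 10; i++) {             /* the "1 second" loop   */
            if (!skip_sleep) {
                n_sleeps++;                        /* only the first sleeps */
            }
            n_1_second++;
            /* buf_flush_list() returns at once because another batch is
               running, and the adaptive-flushing value stays > 10% of
               io_capacity, so the sleep is skipped next iteration. */
            skip_sleep = 1;
        }
        n_10_second++;                             /* the "10 second" work  */
    }

    printf("srv_master_thread loops: %ld 1_second, %ld sleeps, %ld 10_second\n",
           n_1_second, n_sleeps, n_10_second);     /* 1036, 190, 103        */
    return 0;
}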
[11 Jan 2011 20:47]
Sunny Bains
Mark Callaghan said: "In 5.5, after a page and its neighbors have been flushed, there is an optimization to not start over from the end of the flush list. In 5.1, the search always goes back to the end of the flush list. While I think this is a good change, it also skips the check to halt the search by comparing oldest_modification with lsn_limit and callers can stay in longer than they need to. I don't think that is good."

The claim about the oldest_modification check is not correct. If you look at the code snippet I posted yesterday, the check is done in exactly the same way. If we flush the neighbour pages then buf_flush_page_and_try_neighbors() will return TRUE, and this will force the thread to break out of the while loop. It will start from the tail of the list again. The check for oldest modification in both 5.1 and 5.5 is in the outer loop.

I need to reproduce this to trace it now. By adding artificial delays in the purge code it should in theory reproduce the symptoms.
[11 Jan 2011 20:51]
Mark Callaghan
I thought perhaps that the first checkpoint after restart was the worst. But that is not the case. Here is an interesting example. I am using the purge thread and hacked srv_master_thread to not sleep when work was done. The async limit has been reached and advance is ~84M when log_preflush_pool_modified_pages is called log_preflush 0 sync, 113358852695 lsn, 110261884096 oldest, 2905956188 age, 2913351011 checkpoint_age, 84152601 advance, 110346036697 new, 3023360986 sync_limit, 2821803587 async_limit buf_flush_list then flushes ~140k pages buf_flush_list 138757 return, 110346036697 lsn_limit, 0 skipped, 138757 total_page_count for 1 instances 0x4a7b0838 ptr purge thread blocks waiting for the flush above to end #1 0x000000000089aeb9 in os_event_wait_low (event=0x21f56ea0, reset_sig_count=893) at /s/bld/558orig/storage/innobase/os/os0sync.c:207 #2 0x000000000082e6bc in buf_flush_wait_batch_end (buf_pool=<value optimized out>, type=BUF_FLUSH_LIST) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1846 #3 0x000000000088af1c in log_check_margins () at /s/bld/558orig/storage/innobase/log/log0log.c:1661 #4 0x00000000008af6f5 in que_run_threads (thr=<value optimized out>) at /s/bld/558orig/storage/innobase/include/log0log.ic:443 #5 0x00000000007ed0b7 in trx_purge (limit=15164416) at /s/bld/558orig/storage/innobase/trx/trx0purge.c:1184 #6 0x00000000007e409e in srv_purge_thread (arg=<value optimized out>) at /s/bld/558orig/storage/innobase/srv/srv0srv.c:3140 srv_master_thread goes into the flush_loop code because it does not detect server activity #1 0x000000000089aeb9 in os_event_wait_low (event=0x21f56ea0, reset_sig_count=893) at /s/bld/558orig/storage/innobase/os/os0sync.c:207 #2 0x000000000082e6bc in buf_flush_wait_batch_end (buf_pool=<value optimized out>, type=BUF_FLUSH_LIST) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1846 #3 0x00000000007e192a in srv_master_thread (arg=<value optimized out>) at /s/bld/558orig/storage/innobase/srv/srv0srv.c:2969 and a user connection has flushed more than 80,000 pages #0 buf_calc_page_new_checksum (page=<value optimized out>) at /s/bld/558orig/storage/innobase/include/ut0rnd.ic:222 #1 0x000000000082eafb in buf_flush_init_for_writing (page=0x2aaceded4000 "\022\233,<EE>", page_zip_=<value optimized out>, newest_lsn=<value optimized out>) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1023 #2 0x000000000083155d in buf_flush_try_neighbors (space=3, offset=50450, flush_type=BUF_FLUSH_LIST, n_flushed=78703, n_to_flush=18446744073709551614) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1117 #3 0x0000000000831912 in buf_flush_page_and_try_neighbors (bpage=<value optimized out>, flush_type=BUF_FLUSH_LIST, n_to_flush=18446744073709551614, count=0x4a7b0768) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1499 #4 0x0000000000831f1b in buf_flush_batch (buf_pool=0x6f56708, flush_type=BUF_FLUSH_LIST, min_n=18446744073709551614, lsn_limit=110346036697) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1622 #5 0x0000000000832803 in buf_flush_list (min_n=18446744073709551614, lsn_limit=110346036697) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1945 #6 0x000000000088abe7 in log_check_margins () at /s/bld/558orig/storage/innobase/log/log0log.c:1658 #7 0x00000000007dc93e in row_upd_step (thr=0x221ba058) at /s/bld/558orig/storage/innobase/include/log0log.ic:443 #8 0x00000000007c86d7 in row_update_for_mysql (mysql_rec=<value optimized out>, prebuilt=0x221bebe8) at /s/bld/558orig/storage/innobase/row/row0mysql.c:1453 #9 0x00000000007b622e in ha_innobase::update_row 
(this=0x220c2980, old_row=0x220c2c40 "<FF><BC>x/", new_row=0x220c2b80 "<FF><BC>x/") at /s/bld/558orig/storage/innobase/handler/ha_innodb.cc:5262 #10 0x0000000000682f77 in handler::ha_update_row (this=0x220c2980, old_data=0x220c2c40 "<FF><BC>x/", new_data=0x220c2b80 "<FF><BC>x/") at /s/bld/558orig/sql/handler.cc:4782 #11 0x00000000005e2ef7 in mysql_update (thd=0x2aaf581346e0, table_list=0x2aaf581a2d38, fields=@0x2aaf581367f8, values=@0x2aaf58136bd0, conds=0x2aaf581a36f8, order_num=<value optimized out>, order=0x0, limit=) at /s/bld/558orig/sql/sq l_update.cc:707 #12 0x000000000057390a in mysql_execute_command (thd=0x2aaf581346e0) at /s/bld/558orig/sql/sql_parse.cc:2669 #13 0x000000000057835b in mysql_parse (thd=0x2aaf581346e0, rawbuf=<value optimized out>, length=<value optimized out>, parser_state=0x4a7b2de0) at /s/bld/558orig/sql/sql_parse.cc:5496 #14 0x00000000005794ea in dispatch_command (command=COM_QUERY, thd=0x2aaf581346e0, packet=<value optimized out>, packet_length=159) at /s/bld/558orig/sql/sql_parse.cc:1032 #15 0x0000000000579854 in do_command (thd=0x2aaf581346e0) at /s/bld/558orig/sql/sql_parse.cc:772 #16 0x000000000060e294 in do_handle_one_connection (thd_arg=0x2aaf581346e0) at /s/bld/558orig/sql/sql_connect.cc:745 #17 0x000000000060e384 in handle_one_connection (arg=<value optimized out>) at /s/bld/558orig/sql/sql_connect.cc:684 History list length 2336785 srv_master_thread loops: 3887 1_second, 480 sleeps, 388 10_second, 7 background, 7 flush 0.00 inserts/s, 36703.46 updates/s, 0.00 deletes/s, 73375.52 reads/s Database pages 527393 Modified db pages 472350 Age 2826 History list length 2358018 srv_master_thread loops: 3927 1_second, 485 sleeps, 392 10_second, 8 background, 8 flush 0.00 inserts/s, 18607.07 updates/s, 0.00 deletes/s, 37206.30 reads/s Database pages 528033 Modified db pages 464622 Age 2883 History list length 2358018 srv_master_thread loops: 3927 1_second, 485 sleeps, 392 10_second, 8 background, 8 flush 0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s Database pages 528033 Modified db pages 442542 Age 2883 History list length 2358018 srv_master_thread loops: 3927 1_second, 485 sleeps, 392 10_second, 8 background, 8 flush 0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s Database pages 528033 Modified db pages 416803 Age 2883 History list length 2358018 srv_master_thread loops: 3927 1_second, 485 sleeps, 392 10_second, 8 background, 8 flush 0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s Database pages 528033 Modified db pages 389678 Age 2883 History list length 2358018 srv_master_thread loops: 3927 1_second, 485 sleeps, 392 10_second, 8 background, 8 flush 0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s Database pages 528033 Modified db pages 357742 Age 2883 History list length 2344766 srv_master_thread loops: 3958 1_second, 488 sleeps, 395 10_second, 8 background, 8 flush 0.00 inserts/s, 23056.39 updates/s, 0.00 deletes/s, 46082.78 reads/s Database pages 528097 Modified db pages 378078 Age 2865
[11 Jan 2011 20:52]
Inaam Rana
Mark, I posted my last comment before looking at your latest note. I think we have both reached roughly the same conclusion. Are we on the same page about what is happening here?
[11 Jan 2011 21:00]
Mark Callaghan
Yes, we are. And I even agree with Sunny about the 'start from flush list end' behavior not changing. I have been using sysbench+lua with the lua script provided by your side. Do you have hardware to reproduce this on your side and compare 5.5 with 5.1? While people have claimed they could not reproduce this on your side, all of the TPS numbers are far lower than what I get with 5.1 or 5.5 so I wonder what is different about your repro HW. I have 2 hosts (16-core client, 8-core server, and server has HW RAID 10 + 8 SAS disks)
[11 Jan 2011 21:49]
Inaam Rana
As even with the skip_sleep hack we are unable to catch up on flushing, we can safely assume that there is more to it than just the adaptive_flushing heuristic and the enforced sleep. Some plausible ideas are:

1) In 5.5 we are doing more work because of less contention and therefore we hit the wall.
-- Not very likely to have such a huge impact. But then reverting rbs removed the stalls, which points in that direction.
-- Sometimes small scalability improvements can have a cumulative effect. For example, if we are working more then purge falls behind a little more, which means more dirty UNDO pages in the buffer pool.

2) In 5.5 we are generating more redo compared to 5.1.
-- Possibly because of change buffering (unlikely as the whole dataset is in memory)
-- Maybe multiple rbs somehow changed the amount of redo
-- Any changes in purge code that could cause more redo?

3) We are dirtying more pages in 5.5.
-- Again, change buffering should not be it because of the memory-resident dataset
-- UNDO log pages stuff. Not sure about this, but we have seen a bug in 5.6 where more pages than required are allocated for UNDO, meaning more dirty pages to be flushed.
[11 Jan 2011 21:57]
Inaam Rana
Missed another likely culprit: 4) buf_flush_list() is somehow less efficient in 5.5.
[11 Jan 2011 22:13]
Mark Callaghan
Here is the prefix of the rate for Innodb_os_log_written for innodb_purge_threads=0, 64 concurrent threads and my sysbench (no lua). The rates for 5.1.47 and 5.5.58 are about the same from 5.5.8, note the stalls 36406272.0 36681779.2 34856243.2 35550668.8 36225689.6 36595200.0 38077798.4 38436300.8 38426316.8 27824896.0 26976307.2 33760972.8 33155584.0 21431552.0 26237388.8 29718169.6 30882764.8 56217.6 0.0 0.0 0.0 14287616.0 36970240.0 37363046.4 37629952.0 37632460.8 37352857.6 38316032.0 38275891.2 37844224.0 38198988.8 38642995.2 38260121.6 38405427.2 37520435.2 35800320.0 21494630.4 0.0 0.0 0.0 from 5.1.47, no stalls 33570508.8 33567232.0 33452697.6 32866048.0 33323776.0 32619827.2 34948403.2 33984614.4 34770124.8 34721945.6 35243980.8 35481702.4 35339673.6 36613529.6 35518617.6 35636736.0 34742425.6 34868275.2 35366400.0 36071014.4 35994982.4 35894169.6 35740979.2 36201062.4 35822950.4 36118067.2 36815052.8 35276185.6 35970201.6 35788236.8 37049702.4 36561766.4 36245708.8 35397222.4 35373414.4 35375206.4 35613440.0 35516364.8 36310425.6 36300646.4 35225804.8 36576614.4 36192204.8 36178790.4 33087795.2 35442380.8 37365862.4 36126003.2 35294259.2 35714150.4 35303936.0
[11 Jan 2011 23:18]
Mark Callaghan
5.1.47 also experiences the huge flushes, but for whatever reason, they don't cause stalls as they do for 5.5.8 buf_flush_batch LIST has 1024 flushes for 1000 min and 18446744073709551615 lsn buf_flush_batch LIST has 1024 flushes for 1000 min and 18446744073709551615 lsn buf_flush_batch LIST has 128 flushes for 100 min and 18446744073709551615 lsn log_preflush 0 sync, 1679265076702 lsn, 1676267915463 oldest, 2829357857 age, 7554270 advance, 1676275469733 new, 3023360986 max_sync, 2821803587 max_async buf_flush_batch LIST has 43268 flushes for 18446744073709551614 min and 1676275469733 lsn log_preflush flushed 43268 for lsn 1676275469733 and 0 sync log_preflush 0 sync, 1679390981442 lsn, 1676275629391 oldest, 2989447675 age, 167644088 advance, 1676443273479 new, 3023360986 max_sync, 2821803587 max_async log_preflush flushed 30908 for lsn 1676443273479 and 0 sync buf_flush_batch LIST has 30908 flushes for 18446744073709551614 min and 1676443273479 lsn
[12 Jan 2011 0:03]
Mark Callaghan
It seems like 5.5.8 takes longer to do the first call to log_preflush_pool_modified_pages with sync=0 so when it does make that call, there is much more work to do as the value of the "advance" variable in log_checkpoint_margin is much larger. Is log_free_check being called less frequently in 5.5.8? 5.1.47 with a huge flush -- 4.5 seconds buf_flush_batch LIST has 24840 flushes for 18446744073709551614 min and 1694941425550 lsn, 4.58 seconds log_preflush flushed 24840 for lsn 1694941425550 and 0 sync log_preflush 0 sync, 1697840633932 lsn, 1694937772605 oldest, 2825456532 age, 3652945 advance, 1694941425550 new, 3023360986 max_sync, 2821803587 max_async The large flush for 5.5.8 was much larger (160k+ pages) and took much longer (34.31 seconds) The value for "advance" was also much larger 24M versus 3M, thus more pages were flushed buf_flush_flush_list_batch done with 162598 count, 2548 times in 34.31 seconds buf_flush_list 162598 return, 163852193457 lsn_limit, 0 skipped, 162598 total_page_count for 1 instances 0x4aa64838 ptr log_preflush 0 sync, 166850905821 lsn, 163827046822 oldest, 2846950222 age, 2847512440 checkpoint_age, 25146635 advance, 163852193457 new, 3023360986 sync_limit, 2821803587 async_limit I also think that the values for max_modified_age_async, max_modified_age_sync should be much larger given that I am using 3.8GB total for innodb transaction logs. The current values are ~2.8GB for async and 3GB for sync. The current behavior is 'stop the world' when the async limit is reached. I don't think it should be when called by user connections. In that case it might be better to have them do at most a fixed amount of IO when the async limit is exceeded. If there is a large flush to be done, it should be done by background threads. I added srv_background_checkpoint to the Facebook patch for that. Finally, this has a huge cliff in performance. As soon as the async limit is reached, the next caller is forced to do way too much IO.
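As a sketch of what I mean by capping the IO done by user connections (not the actual srv_background_checkpoint code from the Facebook patch; USER_FLUSH_CAP, buf_flush_up_to and wake_background_checkpointer are invented names):

#include <stdio.h>

typedef unsigned long long lsn_t;

#define USER_FLUSH_CAP 128   /* max pages one user thread flushes per call */

/* Stubs standing in for the real flushing machinery. */
static unsigned long buf_flush_up_to(lsn_t target, unsigned long max_pages)
{
    printf("flushing up to lsn %llu, at most %lu pages\n", target, max_pages);
    return max_pages;   /* pretend we flushed the cap */
}

static void wake_background_checkpointer(lsn_t target)
{
    printf("background thread asked to preflush up to lsn %llu\n", target);
}

static void preflush_for_user_thread(lsn_t age, lsn_t max_age_async,
                                     lsn_t oldest)
{
    if (age <= max_age_async) {
        return;                       /* nothing to do for this caller */
    }

    lsn_t target = oldest + (age - max_age_async);

    /* The user thread pays a small, bounded cost instead of a
       100,000+ page batch. */
    buf_flush_up_to(target, USER_FLUSH_CAP);

    /* The bulk of the work is handed to a dedicated background thread. */
    wake_background_checkpointer(target);
}

int main(void)
{
    /* Values from the 5.5.8 trace above. */
    preflush_for_user_thread(2846950222ULL, 2821803587ULL, 163827046822ULL);
    return 0;
}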
[12 Jan 2011 1:43]
Mark Callaghan
The other thing I noticed is that 5.5.8 takes longer to flush a similar number of pages from buf_flush_batch and ends up calling log_preflush_pool_modified_pages earlier in the test (probably because it takes too long to flush pages). This is work done from the 1 second loop in srv_master_thread and 5.5.8 takes ~0.75 seconds per call to flush 2000 pages versus ~0.5 seconds per call for 5.1.47. From imprecise samples, after ~250 seconds the checkpoint age for 5.5.8 is ~2B versus ~1.7B for 5.1.47 From 5.1.47, the last buf_flush_batch times before the first call to log_preflush_pool_modified_pages. It takes ~0.5 seconds per 2000 pages. buf_flush_batch LIST has 2016 flushes for 2000 min and 18446744073709551615 lsn, 0.51 seconds buf_flush_batch LIST has 2048 flushes for 2000 min and 18446744073709551615 lsn, 0.46 seconds buf_flush_batch LIST has 2048 flushes for 2000 min and 18446744073709551615 lsn, 0.53 seconds buf_flush_batch LIST has 2048 flushes for 2000 min and 18446744073709551615 lsn, 0.39 seconds buf_flush_batch LIST has 2047 flushes for 2000 min and 18446744073709551615 lsn, 0.61 seconds buf_flush_batch LIST has 2047 flushes for 2000 min and 18446744073709551615 lsn, 0.64 seconds buf_flush_batch LIST has 2048 flushes for 2000 min and 18446744073709551615 lsn, 0.50 seconds buf_flush_batch LIST has 2048 flushes for 2000 min and 18446744073709551615 lsn, 0.45 seconds buf_flush_batch LIST has 2048 flushes for 2000 min and 18446744073709551615 lsn, 0.48 seconds buf_flush_batch LIST has 2048 flushes for 2000 min and 18446744073709551615 lsn, 0.66 seconds buf_flush_batch LIST has 256 flushes for 200 min and 18446744073709551615 lsn, 0.05 seconds buf_flush_batch LIST has 2048 flushes for 2000 min and 18446744073709551615 lsn, 0.61 seconds buf_flush_batch LIST has 2048 flushes for 2000 min and 18446744073709551615 lsn, 0.69 seconds buf_flush_batch LIST has 2048 flushes for 2000 min and 18446744073709551615 lsn, 0.47 seconds buf_flush_batch LIST has 2048 flushes for 2000 min and 18446744073709551615 lsn, 0.65 seconds from 5.5.8, it takes 0.7 to 0.8 seconds per 2000 pages buf_flush_flush_list_batch done with 2001 count, 32 times in 0.88 seconds buf_flush_flush_list_batch done with 2000 count, 32 times in 0.86 seconds buf_flush_flush_list_batch done with 2000 count, 32 times in 0.77 seconds buf_flush_flush_list_batch done with 2001 count, 32 times in 0.74 seconds buf_flush_flush_list_batch done with 2000 count, 32 times in 0.79 seconds buf_flush_flush_list_batch done with 2001 count, 32 times in 0.62 seconds buf_flush_flush_list_batch done with 2001 count, 32 times in 0.62 seconds buf_flush_flush_list_batch done with 2001 count, 32 times in 0.68 seconds buf_flush_flush_list_batch done with 2001 count, 32 times in 0.72 seconds buf_flush_flush_list_batch done with 2000 count, 32 times in 0.74 seconds
[12 Jan 2011 4:31]
Sunny Bains
I can reproduce the symptoms with the following settings:

REDO log size 32M
Buffer pool size 16G
Sysbench OLTP RW 128 threads

I can see that when all the user threads simultaneously discover that a flush is required, they stall and the master thread goes into the background loop. I have a separate purge thread and that thread ends up doing more purge than the user threads because it calls log_free_check() more frequently than user threads. This I think is something we just have to live with.

What we can and should fix is that the wait for flush should wait for free space and not for the flush to end. Waiting for the flush to end is too coarse. Also, I think 5.5 does more flushing (or did more flushing) than 5.1; Inaam can comment on that.

One thing that the master thread can do is to check the activity count and also the number of threads active within InnoDB. A quick hack is to add a check for srv_conc_n_threads == 0 before suspending or going to the flush loop. Currently this will only work for innodb_thread_concurrency > 0.
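A sketch of that quick hack, not the actual srv0srv.c change; server_is_active() and the globals are simplified stand-ins, and as noted the srv_conc_n_threads counter is only maintained when innodb_thread_concurrency > 0:

#include <stdio.h>

static long srv_activity_count  = 0;  /* bumped by user activity       */
static long last_activity_count = 0;  /* what the master thread saw    */
static long srv_conc_n_threads  = 3;  /* user threads inside InnoDB    */

static int server_is_active(void)
{
    if (srv_activity_count != last_activity_count) {
        last_activity_count = srv_activity_count;
        return 1;
    }
    /* The hack: even if the activity counter did not move (everyone is
       blocked on the big flush), don't call the server idle while user
       threads are still inside the storage engine. */
    return srv_conc_n_threads > 0;
}

int main(void)
{
    if (!server_is_active()) {
        puts("master thread: idle -> background/flush loop");
    } else {
        puts("master thread: stay in the 1/10 second loops");
    }
    return 0;
}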
[12 Jan 2011 4:37]
Sunny Bains
Should have been: I have separate purge thread and that thread ends up doing more flushing than the user threads because it calls log_free_check() more frequently than user threads.
[12 Jan 2011 12:44]
Alexey Stroganov
Mark, See below for my findings/observations: I've run sysbench_lua scenario with 64 threads for 20 minutes with 5 minutes warmup with your config files/parameters: scenario: - start server with fresh datadir with 8 tables (4M rows in each table) - run POINT_SELECT_ALL_COLUMNS for 5 minutes to warmup - run POINT_SELECT/UPDATE_NO_KEY for 20 minutes - after the test wait while history length decreases I ran tests on box with RAID0 and 24 Cores and result pattern for 5.1 was not really different from 5.5. Then I've bound mysqld to 8 cores and sysbench to 16 cores and got exact results that you have observed. See below tps, sizes of ibdata1 and history length # Server 1 - 5.1.47 # Server 2 - 5.5.8-purge0 # Server 3 - 5.5.8-purge1 # # Server 1 Server 2 Server 3 # Thread INNODB INNODB INNODB 64 10160.00 37151.68 34525.43 # Server 1 - 5.1.47-taskset:mysqld8.sysbench16 # Server 2 - 5.5.8-taskset:mysqld8.sysbench16.purge0 # Server 3 - 5.5.8-taskset:mysqld8.sysbench16.purge1 # # Server 1 Server 3 Server 4 # Thread INNODB INNODB INNODB 64 26154.53 30784.68 36419.39 5.1.47 5.5.8 5.5.8 purge0 purge1 2.2GB 9GB 8.1GB no taskset - 24 Cores 0.97GB 6.6GB 7.1GB taskset: mysqld:8 sysbench:16 History list length (30sec interval) ------------------------------------------------------- task set mysqld:8cores | no task set 24 Cores sysbench:16 cores | -------------------------------------------------------- 5.1.47 5.5.8 5.5.8 | 5.1.47 5.5.8 5.5.8 purge0 purge1 | purge0 purge1 1842 12970 995 | 448 5894 1957 123087 506472 386988 | 128810 634217 402325 165539 935820 794198 | 250617 1111405 948795 184994 1341715 1200092 | 370444 1607104 1459331 133886 1792488 1558891 | 495441 2204015 2014718 135782 2260436 1941357 | 618949 2804505 2584142 110466 2737207 2301905 | 740097 3393790 3161552 111057 3216861 2676161 | 862050 4000312 3760839 130766 3697016 3051781 | 982639 4609270 4354120 130553 4150425 3441934 | 1104660 5177329 4936754 156050 4346137 3831758 | 1226815 5381039 5255374 163308 4348558 4196715 | 1350083 5381039 5255374 191986 4669859 4588007 | 1472785 5768203 5521208 209534 5075578 4891004 | 1597586 6316360 6090445 228700 5435242 4981820 | 1721242 6906792 6427128 267935 5878905 4981820 | 1844370 7505044 6771906 288848 6384443 5220351 | 1965695 8111471 7235152 335026 6884865 5606313 | 2087810 8724499 7581476 325376 7392254 5948588 | 2206316 9338437 7924890 417389 7876254 6337510 | 2325080 9940157 8264284 612274 8304442 6774621 | 2444471 10487672 8692110 806752 8661950 7276047 | 2563422 10794299 9270781 996647 8661950 7794486 | 2683284 10797333 9827396 931796 8855504 8305764 | 2808315 11192350 10330990 801311 9343227 8834988 | 2915054 11792906 10380368 790607 9820083 9307020 | 3029193 12398997 10526311 823107 10296457 9681316 | 3148762 13006574 11105807 698139 10793111 9681316 | 3269419 13610343 11692256 366234 11284027 9804518 | 3387346 14221039 12273521 104141 11784270 10336998| 3505152 14824285 12857632 10134 12284047 10855595| 3625502 15402246 13437536 2918 12761845 11376463| 3745770 15930156 14012999 7341 12943961 11918201| 3862317 16194402 14591772 413413 12957935 12440025| 3983269 16228078 15115281 971979 13424751 12972485| 4105675 16832049 15538626 1497021 13911070 13498213| 4226988 17442612 15538626 2017967 14406625 13917190| 4349110 18050606 15994001 1951350 14887142 14159856| 4475181 18649876 16557206 1638291 15346455 14159856| 4601556 19260133 17126295 1338088 15839828 14484727| 4727873 19873176 17683785
[12 Jan 2011 12:46]
Alexey Stroganov
chart for purge lag observed on 24 core box with and without taskset
Attachment: img_13685_3.png (image/png, text), 4.76 KiB.
[12 Jan 2011 13:55]
Alexey Stroganov
In my tests I've also observed stalls for 5.5.8:

with taskset and purge0
[ 375s] threads: 64, tps: 0.00, read_ops: 14623.13, write_ops: 14622.93
[ 390s] threads: 64, tps: 0.00, read_ops: 11753.74, write_ops: 11753.34
[ 405s] threads: 64, tps: 0.00, read_ops: 11851.14, write_ops: 11851.54
[ 420s] threads: 64, tps: 0.00, read_ops: 1066.73, write_ops: 1065.87
[ 435s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00
[ 450s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00
[ 465s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00
[ 480s] threads: 64, tps: 0.00, read_ops: 12105.95, write_ops: 12106.75
[ 495s] threads: 64, tps: 0.00, read_ops: 14639.41, write_ops: 14639.01
...
[ 780s] threads: 64, tps: 0.00, read_ops: 20566.65, write_ops: 20567.05
[ 795s] threads: 64, tps: 0.00, read_ops: 10874.67, write_ops: 10872.40
[ 810s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00
[ 825s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00
[ 840s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00
[ 855s] threads: 64, tps: 0.00, read_ops: 12976.46, write_ops: 12977.66
[ 870s] threads: 64, tps: 0.00, read_ops: 22077.18, write_ops: 22077.38

with taskset and purge1
[ 240s] threads: 64, tps: 0.00, read_ops: 20639.13, write_ops: 20639.66
[ 255s] threads: 64, tps: 0.00, read_ops: 17347.44, write_ops: 17346.91
[ 270s] threads: 64, tps: 0.00, read_ops: 18699.77, write_ops: 18701.57
[ 285s] threads: 64, tps: 0.00, read_ops: 19427.00, write_ops: 19426.20
[ 300s] threads: 64, tps: 0.00, read_ops: 9540.32, write_ops: 9539.32
[ 315s] threads: 64, tps: 0.00, read_ops: 176.07, write_ops: 176.07
[ 330s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00
[ 345s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00
[ 360s] threads: 64, tps: 0.00, read_ops: 17429.05, write_ops: 17429.05
[ 375s] threads: 64, tps: 0.00, read_ops: 20981.24, write_ops: 20983.04
...
[ 630s] threads: 64, tps: 0.00, read_ops: 18966.92, write_ops: 18968.06
[ 645s] threads: 64, tps: 0.00, read_ops: 13793.60, write_ops: 13791.60
[ 660s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00
[ 675s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00
[ 690s] threads: 64, tps: 0.00, read_ops: 1477.07, write_ops: 1477.07
[ 705s] threads: 64, tps: 0.00, read_ops: 16986.43, write_ops: 16986.76
[ 720s] threads: 64, tps: 0.00, read_ops: 21377.22, write_ops: 21378.22
...
[ 960s] threads: 64, tps: 0.00, read_ops: 21117.10, write_ops: 21117.76
[ 975s] threads: 64, tps: 0.00, read_ops: 3871.53, write_ops: 3870.86
[ 990s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00
[1005s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00
[1020s] threads: 64, tps: 0.00, read_ops: 14073.07, write_ops: 14074.07
[1035s] threads: 64, tps: 0.00, read_ops: 21732.61, write_ops: 21733.61
[12 Jan 2011 13:56]
Alexey Stroganov
without taskset and purge0
[ 285s] threads: 64, tps: 0.00, read_ops: 20316.78, write_ops: 20316.85
[ 300s] threads: 64, tps: 0.00, read_ops: 1281.13, write_ops: 1279.53
[ 315s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00
[ 330s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00
[ 345s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00
[ 360s] threads: 64, tps: 0.00, read_ops: 21001.61, write_ops: 21003.01
...
[ 615s] threads: 64, tps: 0.00, read_ops: 20174.48, write_ops: 20175.81
[ 630s] threads: 64, tps: 0.00, read_ops: 16316.18, write_ops: 16314.44
[ 645s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00
[ 660s] threads: 64, tps: 0.00, read_ops: 216.93, write_ops: 216.93
[ 675s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00
[ 690s] threads: 64, tps: 0.00, read_ops: 16075.96, write_ops: 16077.49
[ 705s] threads: 64, tps: 0.00, read_ops: 21081.37, write_ops: 21081.24
...
[ 945s] threads: 64, tps: 0.00, read_ops: 20676.27, write_ops: 20677.07
[ 960s] threads: 64, tps: 0.00, read_ops: 17416.77, write_ops: 17414.83
[ 975s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00
[ 990s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00
[1005s] threads: 64, tps: 0.00, read_ops: 5125.33, write_ops: 5125.33
[1020s] threads: 64, tps: 0.00, read_ops: 21622.08, write_ops: 21622.28

without taskset and purge1
[ 285s] threads: 64, tps: 0.00, read_ops: 18991.20, write_ops: 18991.87
[ 300s] threads: 64, tps: 0.00, read_ops: 12301.98, write_ops: 12300.65
[ 315s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00
[ 330s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00
[ 345s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00
[ 360s] threads: 64, tps: 0.00, read_ops: 13395.09, write_ops: 13396.56
[ 375s] threads: 64, tps: 0.00, read_ops: 21595.69, write_ops: 21594.82
...
[ 690s] threads: 64, tps: 0.00, read_ops: 20325.56, write_ops: 20327.29
[ 705s] threads: 64, tps: 0.00, read_ops: 16040.44, write_ops: 16038.71
[ 720s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00
[ 735s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00
[ 750s] threads: 64, tps: 0.00, read_ops: 1872.34, write_ops: 1872.34
[ 765s] threads: 64, tps: 0.00, read_ops: 17628.53, write_ops: 17629.73
[ 780s] threads: 64, tps: 0.00, read_ops: 21576.42, write_ops: 21576.62
...
[1020s] threads: 64, tps: 0.00, read_ops: 20592.50, write_ops: 20591.50
[1035s] threads: 64, tps: 0.00, read_ops: 10460.08, write_ops: 10460.08
[1050s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00
[1065s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00
[1080s] threads: 64, tps: 0.00, read_ops: 15247.00, write_ops: 15247.93
[1095s] threads: 64, tps: 0.00, read_ops: 21673.43, write_ops: 21672.70
[12 Jan 2011 18:17]
Mark Callaghan
Sunny, I think that flushing is done faster with 5.5, but I need to update my stats collection tool to also report checkpoint age per interval before I repeat my tests and confirm. I also think that the buf_flush_batch equivalent in 5.5 has more overhead than in 5.1. I have instrumentation in the 5.1 Facebook patch to time all tasks done by srv_master_thread and buf_flush_batch. Having similar timers in official MySQL would make it easier to notice and debug this.
[12 Jan 2011 18:20]
Mark Callaghan
Alexey - thanks for the details. If I learn Lua then I can upgrade to the new sysbench. Were your 5.1.47 numbers for the built-in or plug-in InnoDB?
[12 Jan 2011 19:22]
Sunny Bains
Mark, I think I've found the bug, I need to verify the fix and run it by the other devs. Once it passes muster then I can send you a patch to confirm that it fixes what you observe too. The problem is in the flush code and the changes made to handle multiple buffer pools. Thanks for your help in nailing this down.
[12 Jan 2011 22:40]
Sunny Bains
Mark, The overhead in buf_flush_batch() is quite possibly real with removal of the global buffer pool mutex. It should however be offset by benefits in the mtr commit code. Once I confirm the fix for this I will take a look.
[13 Jan 2011 0:54]
Alexey Stroganov
Mark, my results for 5.1.47 are for the InnoDB plugin.
[13 Jan 2011 6:08]
Inaam Rana
Mark, "The current behavior is 'stop the world' when the async limit is reached. I don't think it should be when called by user connections." You mean 'sync' limit or 'async' limit. All user threads are blocked when 'sync' limit is reached. The calling user thread is blocked when 'async' limit is reached. "In that case it might be better to have them do at most a fixed amount of IO when the async limit is exceeded. If there is a large flush to be done, it should be done by background threads. I added srv_background_checkpoint to the Facebook patch for that." We have done something similar in 5.6. The async stuff is totally off loaded to a new dedicated flushing thread called page_cleaner.
[13 Jan 2011 8:21]
Mark Callaghan
I have thread stacks somewhere from poor man's profiler that show no progress. I also have benchmark results showing 30-second stalls where the async limit is reached, a huge flush occurs, srv_master_thread then gets stuck (maybe waiting for the flush to end), and then all other user connections get stuck. I can't tell you exactly where all the other threads get stuck. Perhaps checkpoint age grows quickly at that point and then they hit the sync limit. They will certainly run slower regardless because of the furious flushing.
[13 Jan 2011 14:45]
Mikhail Izioumtchenko
Mark, it would be interesting to see those stacks. A sync checkpoint should be visible (log_free_check() -> ...). Periodic output of SHOW STATUS through the stall would also be useful, specifically the LSNs it reports. The idea of async checkpoint processing somehow triggering the sync checkpoint is sound; however, it wasn't showing up in my tests, which means there may be something interesting about your configuration parameters. Could you tell us:
1. redo log size
2. buffer pool size
3. active database size and/or average buffer pool hit ratio
4. max_dirty_pages_pct (sp?)
[13 Jan 2011 15:03]
Mark Callaghan
Michael - some people at Oracle can reproduce my 5.5.8 results, so you can get the thread stacks by running poor man's profiler (http://poormansprofiler.org) on their setup. I posted stacks from two stalls earlier in this thread. In one case everything was stuck in purge code and in the other case on buf_flush_wait_batch_end. I suspect the second case was from sync checkpoints. 5.5.8 is not able to manage checkpoint age on this test, so the sync checkpoint limit is hit soon after the async checkpoint limit and it is difficult to get stacks in between the two. You also have to guess at which limit is being enforced because SHOW ENGINE INNODB STATUS doesn't report the checkpoint age, the async limit, or the sync limit. The async checkpoint limit is definitely 'stop the world' for the one thread that gets stuck flushing 100,000 pages in buf_flush_batch.
[13 Jan 2011 16:20]
Mark Callaghan
I ran 5.5.8 until the first stall occurred. The stacks are below, and the stall comes quickly from hitting the sync checkpoint limit. I missed this at first because you can't compute the checkpoint age from SHOW ENGINE INNODB STATUS; the value of buf_pool_get_oldest_modification() is not provided:

	oldest_lsn = log_buf_pool_get_oldest_modification();
	age = log->lsn - oldest_lsn;

 32 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,buf_flush_wait_batch_end,log_check_margins,row_upd_step,row_update_for_mysql,ha_innobase::update_row,handler::ha_update_row,mysql_update,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone
 20 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,ha_innobase::index_read,handler::read_range_first,handler::read_multi_range_first,QUICK_RANGE_SELECT::get_next,rr_quick,mysql_update,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone
 12 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,ha_innobase::index_read,handler::index_read_idx_map,join_read_const,join_read_const_table,make_join_statistics,JOIN::optimize,mysql_select,handle_select,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone
 10 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,os_aio_simulated_handle,fil_aio_wait,io_handler_thread,start_thread,clone
  2
  1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,buf_flush_wait_batch_end,srv_master_thread,start_thread,clone
  1 pthread_cond_timedwait@@GLIBC_2.3.2,os_event_wait_time_low,srv_monitor_thread,start_thread,clone
  1 pthread_cond_timedwait@@GLIBC_2.3.2,os_event_wait_time_low,srv_lock_timeout_thread,start_thread,clone
  1 pthread_cond_timedwait@@GLIBC_2.3.2,os_event_wait_time_low,srv_error_monitor_thread,start_thread,clone
  1 poll,handle_connections_sockets,mysqld_main,__libc_start_main,_start,poll
  1 do_sigwait,sigwait,signal_hand,start_thread,clone
  1 buf_calc_page_new_checksum,buf_flush_init_for_writing,buf_flush_try_neighbors,buf_flush_page_and_try_neighbors,buf_flush_batch,buf_flush_list,log_check_margins,que_run_threads,trx_purge,srv_purge_thread,start_thread,clone
[13 Jan 2011 17:06]
Mikhail Izioumtchenko
Sorry, I failed to read the entire bug; I'll post some comments separately. Mark, irrespective of how people at Oracle know how to reproduce this, could you still post your entire sysbench invocation, buffer pool size and redo log size? The time before you get the first stall, too. I'm trying to collect all variants in order to create the simplest possible test case to use in the future. As for the SHOW STATUS output, I find the following derived value useful:

LSN: 555555555
Last checkpoint at: 444444444

LSN - Last checkpoint at is what the sync/async/none checkpoint decision is made on; the exact values for the margins can be calculated from the code (not too easily, unfortunately, given the various extra margins added here and there). I found a graph of this value especially useful. For max_dirty_pages_pct, a similar metric would be Modified DB pages / Database pages * 100.
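[Editor's note] As a worked example of those derived values, here is a small sketch; the LSN constants are the sample numbers from this comment, and the page counts are the approximate figures Inaam quotes later in the thread, so the output is illustrative only:

/* Compute the two derived metrics described above from values copied out of
   SHOW ENGINE INNODB STATUS. The inputs are hard-coded sample numbers. */
#include <stdio.h>

int main(void)
{
	unsigned long long	lsn            = 555555555ULL;	/* "Log sequence number" */
	unsigned long long	checkpoint_lsn = 444444444ULL;	/* "Last checkpoint at"  */
	unsigned long long	modified_pages = 600000ULL;	/* "Modified db pages"   */
	unsigned long long	total_pages    = 800000ULL;	/* "Database pages"      */

	printf("checkpoint age: %llu\n", lsn - checkpoint_lsn);
	printf("dirty pages:    %.1f%%\n",
	       100.0 * (double) modified_pages / (double) total_pages);
	return 0;
}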
[13 Jan 2011 17:51]
Mikhail Izioumtchenko
Here's what I see from rereading the entire bug (I was on vacation when most of the action happened); some of this is just a shorthand version of the above contents:

* It is hard to tell from the stacks whether the flush is sync, because of inlining.
* Irrespective of that, the immediate problem is the size of the flush, 80k to 230k pages. This is quite amazing given how we calculate 'advance' in log_check_margins(). When this is fixed, everything may follow suit.
* The simplest test will be Sunny's: redo log size 32M, buffer pool size 16G, sysbench OLTP RW, 128 threads. The missing values are the number of rows and max_dirty_pages_pct (default?), but the important thing is that it is not a multi-table test. I would also guess the dataset fits in the buffer pool a few times over.
* Mark's test is like this: multi-table sysbench with uniform distribution, 16G buffer pool ('about 1M pages'), 32M rows or so altogether ('data fits in less than half the buffer pool'), max_dirty_pages_pct=80. Missing information: redo log size. My guess is that it's not very large, based on the data from my tests.
* What I run: 10G buffer pool, single-table sysbench with 25M rows (just over half the buffer pool), max_dirty_pages_pct=default.

The observed flushing behaviour in my tests is quite OK:
* Flushing is driven by the async flushing limit, with some help from adaptive flushing.
* Somewhat unexpectedly, the async flushing is called mostly from user threads. Somewhat modified code showed there are literally millions of those calls, mostly hitting the 'someone else is doing the flush' case, with a minority advancing the checkpoint little by little.
* No sync checkpoints, and definitely no huge flushes. Everything goes quite smoothly; even though my sysbench lacks the periodic tps rate printout capability, that is visible from other data.

So it would seem that what makes the bug tick is the redo log size, if my guess about Mark's environment is right. With a small redo log, the async and sync limits get pretty close in absolute numbers. When many threads call log_check_margins and decide on a sync checkpoint, they retry the check even if another thread is flushing. I can see how that could lead to some form of stall, but I can't see how this or anything else could lead to runaway dirtying of pages to provoke the huge flush we observe.

Other notes:
* The current knowledge is that we call log_free_check often enough; there was a bug fixed in that area relatively recently.
* I don't think we can avoid calling log_free_check in purge, because if the master thread does purge and there's no other db activity, there's no one else to call log_free_check().
* The logic in log_check_margins() of 'sync and someone else doing flushing? OK, retry the whole thing' is questionable, given that log_free_check is called often.
[13 Jan 2011 18:23]
Mark Callaghan
if by redo log file size you mean the innodb transaction log --> 2 x 1.9G
[13 Jan 2011 18:31]
Mark Callaghan
Please don't run mysqld across 24 cores during your reproduction. That is likely to suffer from so much mutex contention everywhere else that this result won't be reproduced. I use mysqld on an 8-core server.
[13 Jan 2011 20:17]
Inaam Rana
Looking closely at the stats collected by Ranger, where he reproduced the issue when binding mysqld to 8 cores, I am now not sure whether the purge lag is because of the huge flushing or the huge flushing is because of the purge lag. I am more inclined to believe that the purge lag is causing the other grief as well. The evidence in support (all of this will make much more sense if Ranger can post the InnoDB buffer pool usage charts):

1) The dataset (in terms of database pages in the buffer pool) is around 400K pages at startup.
2) With 5.1.47 the number of database pages increases to around 500K after the run.
3) With 5.5 (both with and without a separate purge thread) the number of database pages increases to above 800K by the end of the run.
4) With 5.1.47 the number of modified pages stays almost the same as the number of database pages, with periodic dips when the checkpoint flushing happens.
5) With 5.5 the dips at flushing are more pronounced and the number of dirty pages steadies at around 600K during the steady period. This means around 200K more dirty pages compared to 5.1.47.
6) Also, the difference of total minus dirty pages (800K - 600K = 200K pages) is significant. This means (making an educated guess here) that these 200K pages which were flushed were UNDO log pages. Had they been data pages they would have been dirtied again because of the uniform load distribution.

All of the above tells us a few things:
-- We are dealing with an almost double dataset size in the case of 5.5.
-- As purge lags behind more, the .ibd file probably gets more fragmented, potentially making write operations on it more expensive.
-- Both the ibdata and .ibd files have to extend, which is probably causing extra grief.
-- There are more than 200K UNDO log pages in the buffer pool. These pages are dirty to start with and hence they add to the flushing woes.

Bottom line: I think we can do an apples-to-apples comparison of flushing only after we are able to bring the dataset size down to what it is in 5.1.47. In other words, first fix the purge lag and then work on flushing.
[13 Jan 2011 20:32]
Mikhail Izioumtchenko
transaction log size is what I meant, thank you. So my theory on small redo log size triggering it didn't hold. Hopefully Sunny's fix in the works will explain this. Sunny, what is the table size in your reproducible test case?
[13 Jan 2011 20:47]
Mikhail Izioumtchenko
More on 'do we call log_free_check() often enough?' We think we call log_free_check() often enough in the normal course of operation. What about the purge thread? Looking at row0purge.c, we call log_free_check() before every mtr_start() with two exceptions:

1. In row_purge_upd_exist_or_extern() under skip_secondaries:

	/* Free possible externally stored fields */
	for (i = 0; i < upd_get_n_fields(node->update); i++) {

		const upd_field_t*	ufield
			= upd_get_nth_field(node->update, i);

		if (dfield_is_ext(&ufield->new_val)) {

This is actually suspicious - why don't we call log_free_check() here? But it should be irrelevant to any sysbench test case.

2. In row_purge_poss_sec(). First, we can't call log_free_check() here because we are holding a sync object. Second, this doesn't seem to be a write operation, judging from the function description.
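[Editor's note] For readers not familiar with the convention being audited here, the pattern looks roughly like the sketch below; the function names mirror InnoDB's API, but the stub declarations and the body are illustrative only:

/* Sketch of the "log_free_check() before mtr_start()" convention in the
   purge code. The prototypes are stubs standing in for the InnoDB API. */
typedef struct mtr_t	mtr_t;

void	log_free_check(void);		/* may block waiting for a checkpoint flush */
void	mtr_start(mtr_t *mtr);
void	mtr_commit(mtr_t *mtr);

static void
purge_one_step(mtr_t *mtr)
{
	/* Called while holding no page latches, because the thread can wait
	   here for a (possibly synchronous) checkpoint flush - the blocking
	   behaviour this bug report is about. */
	log_free_check();

	mtr_start(mtr);

	/* ... remove one delete-marked record or undo log entry, dirtying
	   pages and generating redo inside the mini-transaction ... */

	mtr_commit(mtr);
}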
[13 Jan 2011 20:58]
Inaam Rana
log_free_check() is a blocking call. The fix for bug#39168 introduced extra log_free_check() calls, one of which was in row0purge.c:row_purge_remove_clust_if_poss_low(). The fix went into 5.1.49. What this means is that it is possible that in 5.1.47 the purge thread doesn't get blocked by log_free_check() when a sync flush is happening and gets a kind of free ride (as other threads block on flushing) during this small period. We see a manifestation of this behaviour in Ranger's stats, where in 5.1.47 (first column below) the history list actually decreases in length, potentially during a checkpoint:

996647   8661950   7794486  | 2683284  10797333  9827396
931796   8855504   8305764  | 2808315  11192350  10330990
801311   9343227   8834988  | 2915054  11792906  10380368
790607   9820083   9307020  | 3029193  12398997  10526311
823107   10296457  9681316  | 3148762  13006574  11105807
698139   10793111  9681316  | 3269419  13610343  11692256
366234   11284027  9804518  | 3387346  14221039  12273521
104141   11784270  10336998 | 3505152  14824285  12857632
10134    12284047  10855595 | 3625502  15402246  13437536
2918     12761845  11376463 | 3745770  15930156  14012999
7341     12943961  11918201 | 3862317  16194402  14591772
413413   12957935  12440025 | 3983269  16228078  15115281
971979   13424751  12972485 | 4105675  16832049  15538626
1497021  13911070  13498213 | 4226988  17442612  15538626
2017967  14406625  13917190 | 4349110  18050606  15994001
1951350  14887142  14159856 | 4475181  18649876  16557206

The above still doesn't explain why the history list length is 10 times higher in 5.5, though.
[13 Jan 2011 21:05]
Alexey Stroganov
buffer pool usage for Ranger's results
Attachment: buffer_pool_usage.png (image/png, text), 8.75 KiB.
[13 Jan 2011 21:18]
Mark Callaghan
The initial description of my test setups was listed long ago at the start of this bug: https://www.facebook.com/notes/mysql-at-facebook/mysql-cpu-bound-performance-51-55-and-the... That has the my.cnf values used with two exceptions. I didn't list innodb_io_capacity and have been using 1000 or 2000. I also began using innodb_purge_threads=1.
[13 Jan 2011 21:47]
Mikhail Izioumtchenko
1. The difference in throughput between Mark's and our tests could be due to innodb_flush_log_at_trx_commit=1 in our tests. I always run with 1.
2. Calling log_free_check() in purge more often as the culprit is an interesting idea. Note that Mark's 5.1.52 results for readwrite are considerably worse than 5.1.47. But as Inaam says, this alone doesn't explain the 5.5 problem.
3. Ranger's graph: I see the 200k flushes, same size, same interval of 6-7 units of time (minutes?). I'd love to see a file with complete SHOW STATUS output every N seconds; the checkpoint lag (LSN - Last checkpoint at) would be particularly interesting to trace.
5. buf_flush_wait_batch_end,log_check_margins,que_run_threads,trx_purge - this would mean the flush is synchronous. The question still remains: why don't we stay at the async limit like what happens in my tests? Why do we flush 200k pages?
6. I consider it a long shot that 200k is the result of neighbour page explosion, but we may consider not flushing neighbours for sync checkpoint flushes.
7. More modified pages in 5.5: this should mean INSERTs proceed happily while UPDATEs/DELETEs have problems. Because INSERTs don't interact with the purge?
[13 Jan 2011 22:23]
Mikhail Izioumtchenko
Considering the stalls, here's a hypothesis that would explain the periodicity observed in Ranger's graph (1 and 4 are more empirical observations):

1. The flushing in a test of this kind is [a]sync checkpoint driven, through the codepath log_free_check() ... log_checkpoint_margin().
2. The above codepath is executed a lot, so in the end 'age' as calculated in log_checkpoint_margin() remains more or less a linear function over time.
3. In my tests this line is constant, so nothing bad happens.
4. In the case at hand, 'age' increases linearly until the sync checkpoint boundary is hit, resulting in a huge flush and a stall.

Why can't the async checkpointing keep up? Probably because of too small an 'advance' as calculated in log_checkpoint_margin(). Why adaptive flushing doesn't play much of a role is a separate question. The purge lag may well be a separate question with this approach, but it's worthwhile to try a quick-and-dirty patch to eliminate the sync checkpoints and see what happens.
[14 Jan 2011 11:36]
Sunny Bains
Problem finally found. The previous "found" was about the excessive flushing/stalls, not the purge lag. The problem is due to the extra rollback segments and the way the original code was written: it tries to truncate all segments after each purge. This results in excessive buffer pool traffic and extra mini-transactions. The fix is not to call truncate at each iteration but after some configurable period (or possibly to make it adaptive). I've tested my hypothesis and it fixes the problem.
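[Editor's note] The shape of that fix is roughly the following sketch; it is illustrative only, the placeholder names are not the actual InnoDB symbols, and the real change (quoted later in this report) keys the interval off TRX_SYS_N_RSEGS and innodb_purge_batch_size:

/* Illustrative throttling of history-list truncation across purge batches:
   walk all rollback segments only every N_RSEGS-th batch instead of after
   every batch. Function names are placeholders. */
#define N_RSEGS	128

void	process_undo_records(unsigned long batch_size);	/* placeholder */
void	truncate_history_all_rsegs(void);		/* placeholder */

static unsigned long	purge_batch_count = 0;

void
purge_run_one_batch(unsigned long batch_size)
{
	process_undo_records(batch_size);

	/* Truncating walks every rollback segment and touches many undo
	   pages, so do it only once per N_RSEGS batches. */
	if (!(purge_batch_count++ % N_RSEGS)) {
		truncate_history_all_rsegs();
	}
}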
[23 Jan 2011 22:04]
Sunny Bains
See also bug# 59625. Both need to be fixed at the same time. Bug# 59625 will need the back port of the min binary heap from 5.6 that is part of this bug fix.
[1 Feb 2011 20:33]
Mikhail Izioumtchenko
Mark, could you clarify the value of innodb_flush_log_at_trx_commit in your tests? The reason I'm asking is that while testing a potential fix I found that the history list length problem depends on the value of innodb_flush_log_at_trx_commit. Specifically, when I use the current 5.1 plugin code without any fix, just plain unpatched code, and the multitable lua sysbench test mentioned elsewhere in the bug, I get good behaviour with a history list length around 50K, for innodb_flush_log_at_trx_commit=1 only. With either of the two other values, the history list length grows to millions in 10 minutes.

Your blog post mentions innodb_flush_log_at_trx_commit=2; could you confirm that? And one other thing: are you sure that the dataset you used for a 5.1 test was created with 5.1? As you know, using a 5.5 dataset with a 5.1 executable means the 5.1 executable will use 128 rollback segments, as the relevant code is already there. For reference, here is what I have now for UNPATCHED code, no fixes at all; GOOD/BAD is an obvious function of the history list length:

                                  5.1-exec-5.1-dataset  5.1-exec-5.5-dataset  5.5
innodb_flush_log_at_trx_commit=1  GOOD                  GOOD                  BAD*
innodb_flush_log_at_trx_commit=2  BAD**                 BAD**                 BAD

*  this is the current bug as I understand it
** this is what I see in my tests, but it seems to contradict your observations that say you saw a short history list length with 5.1 and innodb_flush_log_at_trx_commit=2.

So, we know that everything will deteriorate with 5.5, but to reiterate my question: could you confirm that you have a test case with innodb_flush_log_at_trx_commit=2, the 5.1 plugin executable, a database created with the 5.1 plugin, and a history list length that is not long?
[2 Feb 2011 5:18]
Mark Callaghan
I did not share database files between 5.1 and 5.5. I used innodb_flush_log_at_trx_commit=2. I did not produce stalls in unmodified 5.1.47 as I did in 5.5.8.
[2 Feb 2011 6:13]
Mikhail Izioumtchenko
Thanks, Mark. What I see is that the configuration is about the same; the difference is in sysbench, or at least that is where I see it. There are three multitable sysbench versions I know about:

a- LUA sysbench 0.5. Here we see at least one problem.
b- Mark's sysbench 0.4, modified. Here it would appear Mark sees one more problem not seen with sysbench 0.5.
c- My own multitable sysbench based on 0.4. Untested in the context of this bug and only ever tested with secondary index search enabled. Whatever testing I did, it wasn't showing any surprises. This may mean that the lack of secondary index search triggers the bug(s).

Given that a- is different from c- in implementation, it doesn't surprise me that b- shows results different from a-. It means I have to get b- and retest. More importantly, when Sunny is back from vacation we have to revisit/refactor/fork the bugs in question, as there may be more than one.
[2 Feb 2011 6:28]
Mark Callaghan
I switched to lua/sysbench 0.5 early in the process of generating reproduction data.
[2 Apr 2011 15:39]
Mark Callaghan
When does the fix for this get committed? This problem was discussed in a recent Percona post, but I disagree with Dimitri that proper tuning solves the problem. http://www.mysqlperformanceblog.com/2011/03/31/innodb-flushing-a-lot-of-memory-and-slow-di...
[2 Apr 2011 21:32]
Sunny Bains
The fix that has been approved (and committed) is not for the "furious flushing" but for the purge lag due to 128 rollback segments.
[3 Apr 2011 11:22]
Dimitri KRAVTCHUK
Mark, if I understand correctly, your bug was about purge lagging, but the article you're mentioning now is about adaptive flushing. Could you clarify, please? And exactly which of the tuning suggestions I proposed do you disagree with? Rgds, -Dimitri
[3 Apr 2011 14:35]
Mark Callaghan
In the bug report there are details of cases where the server stalled for more than 60 seconds during a checkpoint and log_preflush_pool_modified_pages flushed more than 200,000 pages. That isn't a new problem specific to 5.5, but it is a serious problem. The problem occurs because when InnoDB must flush X pages to advance the minimum dirty page LSN, it can end up flushing X * 64 because dirty neighbors are flushed as well.
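[Editor's note] To put rough numbers on that amplification (an estimate based on figures already in this report, not a new measurement): with up to 63 dirty neighbors written alongside each target page, advancing the checkpoint past only ~3,200 dirty pages can balloon into roughly 3,200 * 64 = 204,800 page writes, which is consistent with the 200,000+ page flushes reported earlier in this bug.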
[23 May 2012 23:54]
James Day
A bit of follow-up on the neighbor flushing: there are substantial changes in the 5.6 labs release: http://blogs.innodb.com/wp/2012/04/optimizing-neighbor-flush-behavior/ .
[30 Nov 2012 8:50]
Ovais Tariq
Sunny, I would like to follow up here: when do you think the bug fix for multiple rollback segments is going to be released?
[2 Dec 2012 8:30]
Sunny Bains
The purge lag due to 128 rsegs has been fixed. This bug morphed into one about furious flushing. There have been several improvements made around flushing code in 5.6. The flushing is done in a separate thread now and has been moved out of the master thread. Ovais, are you experiencing any issues around this? Do you have a test case that is purge related and not about furious flushing?
[2 Dec 2012 18:02]
Ovais Tariq
Sunny, I am experiencing issues with very noticeable differences in purge lag after upgrading from 5.1 to 5.5. There has been no other change in the application, or anything else, apart from the upgrade from 5.1 to 5.5. This behaviour is repeatable on many servers. Having read all the comments in this bug report, I have a theory that this might be because of running with a separate purge thread, which if I understand correctly gets blocked by the master thread when it is doing a lot of flushing. I will try disabling the separate purge thread to see if that fixes the purge lag. I was wondering if the fix for this bug will go into a 5.5 release, because the 5.5 release introduced the concept of 128 rsegs.
[2 Dec 2012 20:40]
Sunny Bains
You must remember that making the purge thread separate from the master thread is meant to reduce lag. If it is part of the master thread, i.e. innodb_purge_threads=0, then purge is done via the master thread and you will have the master thread doing either purge or flushing. If purge is done separately in its own thread, then you can have flushing and purging in parallel, thereby reducing the purge lag.

The problem around truncating the UNDO log (for 128 rsegs) has been fixed in 5.5, as mentioned in this bug report; it is done in batches rather than at the end of each round. However, given that you are experiencing lag after upgrading from 5.1 to 5.5, it may be worthwhile investigating this. Can you post a few stack traces and some numbers around the history list? Also, please do try running with innodb-purge-threads=0. One potential advantage of doing it via the master thread is that the master thread will purge the full list before it does the next flush.

Furious flushing revolves around a lot of issues; purge only gets affected indirectly. The ongoing changes to 5.6+ will not be back ported and would be difficult to back port, too.
[3 Dec 2012 0:57]
James Day
Ovais, if you're using 5.5's default innodb_purge_batch_size of 20, please change to 5.6's default of 300 and let us know how that goes, while using innodb_purge_threads = 1. 20 turned out to be too small to be really efficient.

Still not resolved? What is innodb_io_capacity, and what is the real IO capacity of the system (100 = 1 spinning disk)? If it's the default of 200 it's probably too low; change it to the number of disks used for the main tablespace * 100 (and an additional * 5 if using SSD).

If that doesn't resolve it, check the percentage of dirty pages in your buffer pool at a time when there's trouble and set innodb_max_dirty_pages_pct to observed_percentage * 0.9. This is to cause flushing to happen in that part of the work rather than in other threads.

If that still doesn't work, look at your history list length, monitor it regularly at troubled times, then set innodb_max_purge_lag to that troubled value * 0.9, possibly lower. Caution: setting this too aggressively low can destroy foreground thread performance, particularly badly if you do a backup with a read view open that then finishes and closes the read view with a high history list length. You should probably set innodb_max_purge_lag to 0 or a few billion outside peak times. While this setting can work, and has in benchmarks, it's the least desirable one because of the risk and because it's abusing a setting really intended only to prevent the disk filling up. People who touch it tend to set it far too low.

The suggestions are roughly in least-potentially-disruptive to most-potentially-disruptive order; just try working through them one at a time until the problem is solved or remains after all of them.

James Day, MySQL Senior Principal Support Engineer, Oracle
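[Editor's note] As a quick worked example of the innodb_io_capacity sizing rule above (illustrative numbers only, not a recommendation for any particular system): a main tablespace spread over 8 spinning disks suggests 8 * 100 = 800, and the same 8 devices as SSDs suggests 8 * 100 * 5 = 4000.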
[4 Dec 2012 9:28]
Ovais Tariq
Sunny, I agree that separating the purge thread from the master thread was done to reduce lag. However, as far as I can see, it currently does not work as well when there is a good amount of mixed workload, because then the master thread is busy flushing pages and blocks the purge thread. Doing the purging from within the master thread makes sense in such a case, as the master thread would then give equal opportunity to both purging and flushing. Regarding the bug fix, can you tell me if the fix has been released? I understand that furious flushing can have a lot of unintended side-effects, such as the one where purge gets affected indirectly, which is why I am tempted to try disabling the purge thread. I agree with you that many of the changes going into 5.6 cannot be backported, and I am waiting impatiently for a 5.6 GA release :)
[7 Dec 2012 18:24]
Ovais Tariq
Sunny, so in my case disabling the purge thread has helped. The master thread now does more purging than the purge thread did, and the history list length has eased a lot. I think this was certainly a case of the master thread blocking the purge thread and producing an increase in purge lag.
[3 Apr 2013 18:52]
Mark Callaghan
Trying to understand the regression for a single-threaded & update-only sysbench result comparing 5.1 with 5.6 -- http://mysqlha.blogspot.com/2013/04/mysql-56-single-thread-update-only.html

This is from running "perf -p $( pidof mysqld ) -- sleep 180" during the test, and buf_page_get_gen uses almost 3X more CPU with 5.6. This is true whether I use 1 or 8 buffer pool instances. The table is cached in the InnoDB buffer pool before the test starts.

From 5.6.10:
5.46% mysqld mysqld.notrc.noprof.flushcond2 [.] buf_page_get_gen(unsigned long, unsigned long, unsigned long, unsigned long, buf_block_t*, unsigned long, char const*, unsigned long, mtr_t*)
3.72% mysqld mysqld.notrc.noprof.flushcond2 [.] MYSQLparse(void*)
1.77% mysqld mysqld.notrc.noprof.flushcond2 [.] _ZL14ut_crc32_sse42PKhm
1.58% mysqld mysqld.notrc.noprof.flushcond2 [.] _ZL21mtr_memo_slot_releaseP5mtr_tP15mtr_memo_slot_t.isra.5
1.50% mysqld mysqld.notrc.noprof.flushcond2 [.] rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned long*, unsigned long, mem_block_info_t**, char const*, unsigned long)
1.40% mysqld mysqld.notrc.noprof.flushcond2 [.] log_group_write_buf(log_group_t*, unsigned char*, unsigned long, unsigned long, unsigned long)
1.35% mysqld mysqld.notrc.noprof.flushcond2 [.] rw_lock_x_lock_func(rw_lock_t*, unsigned long, char const*, unsigned long)

From 5.1.63 with the facebook patch:
4.87% mysqld mysqld.noprof [.] MYSQLparse(void*)
2.87% mysqld libc-2.13.so [.] __memcpy_ssse3_back
1.87% mysqld mysqld.noprof [.] rec_get_offsets_func
1.83% mysqld mysqld.noprof [.] buf_page_get_gen
1.82% mysqld mysqld.noprof [.] log_group_write_buf
1.29% mysqld mysqld.noprof [.] my_fast_crc32
1.19% mysqld mysqld.noprof [.] os_aio_simulated_handle
1.14% mysqld mysqld.noprof [.] MYSQLlex(void*, void*)
[3 Apr 2013 18:54]
Mark Callaghan
These are the top callers to buf_page_get_gen in 5.6:

5.51% mysqld mysqld.notrc.noprof.flushcond2 [.] buf_page_get_gen(unsigned long, unsigned long, unsigned long, unsigned long, buf_block_t*, unsigned long, char const*, unsigned long, mtr_t*)
    |
    --- buf_page_get_gen(unsigned long, unsigned long, unsigned long, unsigned long, buf_block_t*, unsigned long, char const*, unsigned long, mtr_t*)
       |--66.43%-- _ZL31trx_purge_truncate_rseg_historyP10trx_rseg_tPK12purge_iter_t
       |           trx_purge(unsigned long, unsigned long, bool)
       |           srv_purge_coordinator_thread
       |           start_thread
       |           __clone
       |
       |--19.06%-- btr_cur_search_to_nth_level(dict_index_t*, unsigned long, dtuple_t const*, unsigned long, unsigned long, btr_cur_t*, unsigned long, char const*, unsigned long, mtr_t*)
       |           row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
       |           ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)
       |           handler::read_range_first(st_key_range const*, st_key_range const*, bool, bool)
       |           handler::multi_range_read_next(char**)
       |           QUICK_RANGE_SELECT::get_next()
       |           _ZL8rr_quickP11READ_RECORD
       |           mysql_update(THD*, TABLE_LIST*, List<Item>&, List<Item>&, Item*, unsigned int, st_order*, unsigned long long, enum_duplicates, bool, unsigned long long*, unsigned long long*)
       |           mysql_execute_command(THD*)
       |           mysql_parse(THD*, char*, unsigned int, Parser_state*)
       |           dispatch_command(enum_server_command, THD*, char*, unsigned int)
       |           do_handle_one_connection(THD*)
       |           handle_one_connection
       |           start_thread
       |           __clone
       |
       |--4.76%-- _ZL35trx_purge_rseg_get_next_history_logP10trx_rseg_tPm
       |           _ZL22trx_purge_get_next_recPmP16mem_block_info_t
       |           trx_purge(unsigned long, unsigned long, bool)
       |           srv_purge_coordinator_thread
       |           start_thread
       |           __clone
       |
       |--2.83%-- trx_undo_assign_undo(trx_t*, unsigned long)
       |           trx_undo_report_row_operation(unsigned long, unsigned long, que_thr_t*, dict_index_t*, dtuple_t const*, upd_t const*, unsigned long, unsigned char const*, unsigned long const*, unsigned long*)
       |           btr_cur_update_in_place(unsigned long, btr_cur_t*, unsigned long const*, upd_t const*, unsigned long, que_thr_t*, unsigned long, mtr_t*)
       |           btr_cur_optimistic_update(unsigned long, btr_cur_t*, unsigned long**, mem_block_info_t**, upd_t const*, unsigned long, que_thr_t*, unsigned long, mtr_t*)
       |           _ZL17row_upd_clust_recP10upd_node_tP12dict_index_tPmPP16mem_block_info_tP9que_thr_tP5mtr_t
       |           _ZL18row_upd_clust_stepP10upd_node_tP9que_thr_t
       |           row_upd_step(que_thr_t*)
       |           row_update_for_mysql(unsigned char*, row_prebuilt_t*)
       |           ha_innobase::update_row(unsigned char const*, unsigned char*)
       |           handler::ha_update_row(unsigned char const*, unsigned char*)
       |           mysql_update(THD*, TABLE_LIST*, List<Item>&, List<Item>&, Item*, unsigned int, st_order*, unsigned long long, enum_duplicates, bool, unsigned long long*, unsigned long long*)
       |           mysql_execute_command(THD*)
       |           mysql_parse(THD*, char*, unsigned int, Parser_state*)
       |           dispatch_command(enum_server_command, THD*, char*, unsigned int)
       |           do_handle_one_connection(THD*)
       |           handle_one_connection
       |           start_thread
       |           __clone
       |
       |--1.92%-- trx_purge_add_update_undo_to_history(trx_t*, unsigned char*, mtr_t*)
       |           trx_undo_update_cleanup(trx_t*, unsigned char*, mtr_t*)
       |           _ZL31trx_write_serialisation_historyP5trx_t
       |           trx_commit(trx_t*)
       |           trx_commit_for_mysql(trx_t*)
       |           _ZL15innobase_commitP10handlertonP3THDb
       |           ha_commit_low(THD*, bool)
       |           TC_LOG_DUMMY::commit(THD*, bool)
       |           ha_commit_trans(THD*, bool)
       |           trans_commit_stmt(THD*)
       |           mysql_execute_command(THD*)
       |           mysql_parse(THD*, char*, unsigned int, Parser_state*)
       |           dispatch_command(enum_server_command, THD*, char*, unsigned int)
       |           do_handle_one_connection(THD*)
       |           handle_one_connection
       |           start_thread
       |           __clone
[3 Apr 2013 19:19]
Mark Callaghan
The problem is probably that purge iterates over 128 rollback segments in 5.6 versus 1 in 5.1.

Code from 5.6:

/********************************************************************//**
Removes unnecessary history data from rollback segments. NOTE that when this
function is called, the caller must not have any latches on undo log pages! */
static
void
trx_purge_truncate_history(
/*========================*/
	purge_iter_t*		limit,	/*!< in: truncate limit */
	const read_view_t*	view)	/*!< in: purge view */
{
	ulint	i;

	/* We play safe and set the truncate limit at most to the purge view
	low_limit number, though this is not necessary */

	if (limit->trx_no >= view->low_limit_no) {
		limit->trx_no = view->low_limit_no;
		limit->undo_no = 0;
	}

	ut_ad(limit->trx_no <= purge_sys->view->low_limit_no);

	for (i = 0; i < TRX_SYS_N_RSEGS; ++i) {
		trx_rseg_t*	rseg = trx_sys->rseg_array[i];

		if (rseg != NULL) {
			ut_a(rseg->id == i);
			trx_purge_truncate_rseg_history(rseg, limit);
		}
	}
}

Code from 5.1 (the while loop does 1 iteration in practice):

/********************************************************************//**
Removes unnecessary history data from rollback segments. NOTE that when this
function is called, the caller must not have any latches on undo log pages! */
static
void
trx_purge_truncate_history(void)
/*============================*/
{
	trx_rseg_t*	rseg;
	trx_id_t	limit_trx_no;
	undo_no_t	limit_undo_no;

	ut_ad(mutex_own(&(purge_sys->mutex)));

	trx_purge_arr_get_biggest(purge_sys->arr, &limit_trx_no,
				  &limit_undo_no);

	if (ut_dulint_is_zero(limit_trx_no)) {

		limit_trx_no = purge_sys->purge_trx_no;
		limit_undo_no = purge_sys->purge_undo_no;
	}

	/* We play safe and set the truncate limit at most to the purge view
	low_limit number, though this is not necessary */

	if (ut_dulint_cmp(limit_trx_no, purge_sys->view->low_limit_no) >= 0) {
		limit_trx_no = purge_sys->view->low_limit_no;
		limit_undo_no = ut_dulint_zero;
	}

	ut_ad((ut_dulint_cmp(limit_trx_no,
			     purge_sys->view->low_limit_no) <= 0));

	rseg = UT_LIST_GET_FIRST(trx_sys->rseg_list);

	while (rseg) {
		trx_purge_truncate_rseg_history(rseg, limit_trx_no,
						limit_undo_no);
		rseg = UT_LIST_GET_NEXT(rseg_list, rseg);
	}
}
[3 Apr 2013 20:36]
Mark Callaghan
My other theory is that trx_purge(..., truncate=true) is getting called more frequently than the equivalent in 5.1. The caller would be this code from srv_do_purge:

	if (!(count++ % TRX_SYS_N_RSEGS)) {
		/* Force a truncate of the history list. */
		trx_purge(1, srv_purge_batch_size, true);
	}
[3 Apr 2013 20:40]
Sunny Bains
The second hypothesis is worth considering. The code was introduced to fix exactly this problem. The choice of magic number was arbitrary.
[3 Apr 2013 22:06]
Mark Callaghan
Watched purge lag during a run of the single-threaded, update-only, cached workload. For 5.6 the average lag is ~2000 and for 5.1 it was closer to 20000. So I continue to think that 5.6 is more aggressive about doing the truncate for purge and that this explains the regression. After hacking 5.6 to do the truncate less frequently, the CPU profile is better, but QPS was still a lot worse than 5.1 -- 8200 or 8300 versus 9000+ updates/second. So this helps, but there is still almost a 10% difference. I am not sure my hack was reasonable -- I added "rseg_history_len > 20000":

	if (!(count++ % TRX_SYS_N_RSEGS) && rseg_history_len > 20000) {
		/* Force a truncate of the history list. */
		trx_purge(1, srv_purge_batch_size, true);
	}

This is a CPU profile with my hack:

3.98% mysqld mysqld.flushcond2.noopttrc.purge [.] MYSQLparse(void*)
2.13% mysqld mysqld.flushcond2.noopttrc.purge [.] _ZL14ut_crc32_sse42PKhm
2.12% mysqld mysqld.flushcond2.noopttrc.purge [.] buf_page_get_gen(unsigned long, unsigned long, unsigned long, unsigned long, buf_block_t*, unsigned long, char const*, unsigned long, mtr_t*)
1.54% mysqld mysqld.flushcond2.noopttrc.purge [.] log_group_write_buf(log_group_t*, unsigned char*, unsigned long, unsigned long, unsigned long)
1.49% mysqld mysqld.flushcond2.noopttrc.purge [.] rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned long*, unsigned long, mem_block_info_t**, char const*, unsigned long)
1.14% mysqld libc-2.13.so [.] __memcpy_ssse3_back
1.04% mysqld mysqld.flushcond2.noopttrc.purge [.] _ZL13lex_one_tokenPvS_
0.69% mysqld mysqld.flushcond2.noopttrc.purge [.] mysql_update(THD*, TABLE_LIST*, List<Item>&, List<Item>&, Item*, unsigned int, st_order*, unsigned long long, enum_duplicates, bool, unsigned long long*, unsigned long long*)
0.65% mysqld mysqld.flushcond2.noopttrc.purge [.] page_cur_search_with_match(buf_block_t const*, dict_index_t const*, dtuple_t const*, unsigned long, unsigned long*, unsigned long*, unsigned long*, unsigned long*, page_cur_t*)
0.65% mysqld mysqld.flushcond2.noopttrc.purge [.] _ZL15get_hash_symbolPKcjb
[3 Apr 2013 22:54]
Mark Callaghan
Even with my hack, the 5.6 CPU profile shows more time spent in purge compared to 5.1:

0.42% mysqld mysqld.flushcond2.noopttrc.purge [.] trx_purge(unsigned long, unsigned long, bool)
0.32% mysqld mysqld.flushcond2.noopttrc.purge [.] srv_purge_coordinator_thread
0.17% mysqld mysqld.flushcond2.noopttrc.purge [.] _ZL34trx_purge_get_rseg_with_min_trx_idP11trx_purge_t
0.16% mysqld mysqld.flushcond2.noopttrc.purge [.] row_purge_step(que_thr_t*)
0.12% mysqld mysqld.flushcond2.noopttrc.purge [.] read_view_purge_open(mem_block_info_t*)
0.09% mysqld mysqld.flushcond2.noopttrc.purge [.] _ZL35trx_purge_rseg_get_next_history_logP10trx_rseg_tPm
0.09% mysqld mysqld.flushcond2.noopttrc.purge [.] trx_purge_add_update_undo_to_history(trx_t*, unsigned char*, mtr_t*)
0.07% mysqld mysqld.flushcond2.noopttrc.purge [.] srv_wake_purge_thread_if_not_active()
0.06% mysqld mysqld.flushcond2.noopttrc.purge [.] _ZL22trx_purge_get_next_recPmP16mem_block_info_t
0.04% mysqld mysqld.flushcond2.noopttrc.purge [.] _ZL31trx_purge_truncate_rseg_historyP10trx_rseg_tPK12purge_iter_t

... compared to 5.1.63 with the FB patch including purge threads:

0.14% mysqld mysqld.noprof [.] trx_purge_add_update_undo_to_history
0.05% mysqld mysqld.noprof [.] trx_purge_rseg_get_next_history_log
0.03% mysqld mysqld.noprof [.] trx_purge_truncate_rseg_history
0.03% mysqld mysqld.noprof [.] trx_purge_choose_next_log
0.02% mysqld mysqld.noprof [.] trx_purge_fetch_next_rec
0.01% mysqld mysqld.noprof [.] trx_purge_rec_release
0.01% mysqld mysqld.noprof [.] row_purge_step
0.01% mysqld mysqld.noprof [.] trx_purge_get_next_rec
0.01% mysqld mysqld.noprof [.] trx_purge
[4 Apr 2013 0:21]
Mark Callaghan
Expanded my hack to this, and QPS is up to about 8400. The goal was to reduce the frequency at which purge runs and hopefully reduce the CPU it uses. Purge is using a lot more CPU than it did in 5.1.

	if (!(count++ % TRX_SYS_N_RSEGS) && rseg_history_len > 20000) {
		/* Force a truncate of the history list. */
		trx_purge(1, srv_purge_batch_size, true);
	}

	os_thread_sleep(100000);
[18 Sep 2013 9:19]
Annamalai Gurusami
We are unable to reproduce this problem. We followed the original how-to-repeat steps as described at: https://www.facebook.com/notes/mysql-at-facebook/mysql-cpu-bound-performance-51-55-and-the... Closing this bug as not reproducible.
[30 Sep 2013 20:51]
Mark Callaghan
Why would this be closed as not a bug / not reproducible? There are many comments on this bug report from Oracle employees that mention at least one bug was fixed for purge when there are many rollback segments. For example:

[2 Dec 2012 8:30] Sunny Bains
The purge lag due to 128 rsegs has been fixed. This bug morphed into one about furious flushing. There have been several improvements made around flushing code in 5.6. The flushing is done in a separate thread now and has been moved out of the master thread. Ovais, are you experiencing any issues around this? Do you have a test case that is purge related and not about furious flushing?
[30 Sep 2013 21:01]
Sunny Bains
Mark, We've tried to reproduce this issue internally and have been unsuccessful so far. That's why we've decided to close it.
[2 Oct 2013 15:08]
Mark Callaghan
The bug was created for purge lag on 5.5.8. Comments from InnoDB developers indicate the bug was fixed. My request is that you include the bzr commit that fixed the purge lag bug for 128 rollback segments.
[3 Oct 2013 6:09]
Annamalai Gurusami
Please refer to the following commit for further details:

revno: 2995.37.155
revision-id: sunny.bains@oracle.com-20110222050408-73uxqkbkpm17emfu
parent: marko.makela@oracle.com-20110217202756-hr889aqmt152tfn0
committer: Sunny Bains <Sunny.Bains@Oracle.Com>
branch nick: 5.5
timestamp: Tue 2011-02-22 16:04:08 +1100
message:
  Bug #11766227: InnoDB purge lag much worse for 5.5.8 versus 5.1
  Bug #11766501: Multiple RBS break the get rseg with mininum trx_t::no code during purge

  Bug# 59291 changes:

  Main problem is that truncating the UNDO log at the completion of every
  trx_purge() call is expensive as the number of rollback segments is increased.
  We truncate after a configurable amount of pages. The innodb_purge_batch_size
  parameter is used to control when InnoDB does the actual truncate. The truncate
  is done once after 128 (or TRX_SYS_N_RSEGS iterations). In other words we
  truncate after purge 128 * innodb_purge_batch_size. The smaller the batch size
  the quicker we truncate.
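[Editor's note] Reading that commit message back-of-the-envelope (an interpretation, not an official statement): with the 5.5 default innodb_purge_batch_size=20, the truncate pass runs roughly every 128 * 20 = 2,560 purged undo records, while with the 5.6 default of 300 it runs roughly every 128 * 300 = 38,400, so a larger batch size amortizes the truncate over more purge work.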
[3 Oct 2013 18:59]
Ovais Tariq
Mark, unfortunately I do not have a test case to prove that the purge lag is greater on 5.5 compared to 5.1, because I wasn't able to reproduce it in a synthetic environment. I didn't get much time to do more benchmarks, though, I must confess. However, I can tell you that I have worked with one of our customers here at Percona who saw the purge lag increase many-fold once they upgraded from 5.1 to 5.5, with no change in infrastructure or workload, which is why I was also interested in this bug. Obviously the only changes were the code changes between the two major MySQL releases and the fact that they had the purge thread enabled in 5.5. The main characteristics for them are a write-intensive workload and a slow IO subsystem; the datastore is on NFS. I was able to resolve the purge lag issue by disabling the purge thread and increasing purge_batch_size to 100. There is not much more I can share, though. I may get to run some more benchmarks in the future around the theme that the purge lag on 5.5 is higher with a write-intensive workload and a slow IO subsystem.
[3 Oct 2013 20:29]
Sunny Bains
The purge batch size (--innodb-purge-batch-size) should be 300 or thereabouts. That is the default in 5.6+.