Bug #59291 InnoDB purge lag much worse for 5.5.8 versus 5.1
Submitted: 4 Jan 2011 20:32    Modified: 3 Oct 2013 6:10
Reporter: Mark Callaghan
Status: Closed
Category: MySQL Server: InnoDB storage engine    Severity: S5 (Performance)
Version: 5.5    OS: Any
Assigned to: Sunny Bains    CPU Architecture: Any
Tags: innodb, lag, multiple rollback segments, purge

[4 Jan 2011 20:32] Mark Callaghan
Description:
InnoDB purge lag grows much faster on my benchmarks using 5.5.8 than 5.1 with the InnoDB plugin. At high levels of concurrency the value of innodb_purge_threads does not make a difference.

How to repeat:
forthcoming

Suggested fix:
don't know yet
[4 Jan 2011 20:37] Mark Callaghan
Before realizing that purge lag was the problem, I created http://bugs.mysql.com/bug.php?id=57489 for significant performance regressions between MySQL 5.5 and 5.1

See https://www.facebook.com/notes/mysql-at-facebook/mysql-cpu-bound-performance-51-55-and-the... for details on my tests
[5 Jan 2011 1:48] James Day
Mark,

query_cache_type=OFF should be used for 5.5 when there's lots of concurrency. Setting it to OFF reduces query cache mutex contention even when query_cache_size is 0. I think we're changing the default from ON to OFF for 5.5, but I'm not sure when.

I always wonder about innodb_io_capacity when background threads get behind because Oracle InnoDB tries to respect that setting and the 200 default is often too low for high performance systems.
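(For reference, both of those settings are dynamic in 5.5, so they can be tried at runtime; the values below are only illustrative examples, not settings taken from this report:)

mysql -e "SET GLOBAL query_cache_type = OFF"      # avoid query cache mutex contention even with size=0
mysql -e "SET GLOBAL innodb_io_capacity = 1000"   # raise from the 200 default if the IO subsystem can take it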

Why not fast mutexes in 5.5 as well as 5.1?
[5 Jan 2011 20:45] Inaam Rana
Mark,

threads         8        16        32        64        128        256
5.5.8-0     15592    118953     91617   8017707   15197122   15214778
5.5.8-1       359       284       280   6376105   12897579   11738230
5.1.52      71260     80561    310605     17609      86113      52303
 
Is the 5.1.52 above the FB version of 5.1? If so, have you made any changes to the purge code, or to how often purge is scheduled, in the FB patch? Do you have numbers for stock 5.1.47?
[5 Jan 2011 20:53] Inaam Rana
I tried with stock sysbench on my modest 8 core box with 1 point_select and 1 index update in equal ratio and uniform distribution on a single 5-million-row table with 32 clients. 5.5 shines compared to 5.1. TPS is higher and the purge lag is slightly higher, as we are able to do more work in 5.5.
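(One plausible way to express that mix with stock sysbench 0.4 option names; this is an assumption about the invocation, not the exact command that was used:)

sysbench --test=oltp --mysql-host=<host> --mysql-user=<user> \
         --oltp-table-size=5000000 --oltp-dist-type=uniform \
         --oltp-test-mode=complex --oltp-point-selects=1 --oltp-index-updates=1 \
         --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 \
         --oltp-distinct-ranges=0 --oltp-non-index-updates=0 \
         --num-threads=32 --max-requests=0 --max-time=600 run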

5.5: TPS: 11541
History list length growth:
History list length 2941465
History list length 3219597
History list length 3499632
History list length 3779928
History list length 4055433
History list length 4333017
History list length 4598088
History list length 4880252
History list length 5160025

5.1: TPS: 9916
history list length growth:
History list length 2069883
History list length 2275163
History list length 2482399
History list length 2627458
History list length 2838499
History list length 3043952
History list length 3242746
History list length 3449381
History list length 3645951
History list length 3821100

Next I have to try multi-table sysbench or some variant of it to see if I can reproduce this
[5 Jan 2011 21:01] Inaam Rana
Mark,

Quick question about your test setup:

Do you restart the server between each run, or by 'giving time to recover' do you mean that you let the server idle for 20 minutes and then start the next run without a restart? I ask because if the purge lag is running in the millions then it is possible that, due to the large number of UNDO pages, your entire 16G buffer pool got filled up. After that point, if the server is not restarted before every run, then every subsequent run, strictly speaking, cannot be called an in-memory workload. I am not sure how much impact triggering LRU eviction will have on the results, but I suspect it won't be a lot given that most of the UNDO pages are 'cold pages'. In any case, it would be good to know if the buffer pool spilled over at any point.
[6 Jan 2011 5:32] Mark Callaghan
I will repeat the test and collect purge lag throughout the benchmark. I would have that data already if Innodb reported purge lag in SHOW STATUS (hint, hint).

I don't restart the server between runs for each concurrency level.
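(For anyone reproducing this: until purge lag shows up in SHOW STATUS, the history list length can be sampled from SHOW ENGINE INNODB STATUS during the run; a minimal sketch:)

# sample purge lag (history list length) once a minute while the benchmark runs
while true; do
    mysql -e 'SHOW ENGINE INNODB STATUS\G' | grep 'History list length'
    sleep 60
done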
[6 Jan 2011 15:31] Mark Callaghan
You get ~10,000 TPS for 32 concurrent clients. I get between 27,000 and 35,000. Are you sure your database is cached at the start of the test? It is in my case -- 16G buffer pool, ~8G of data, and I prime the buffer pool at test start.

If your test has IO stalls then it is more likely that purge can keep up.
[6 Jan 2011 16:13] Inaam Rana
Mark,

There were no IO stalls. The difference in TPS is perhaps because I am running stock sysbench on a single table while you were running your own flavor. The test was CPU bound throughout the run.

Also can you answer the questions by James about the query cache and the fast mutexes?
[6 Jan 2011 16:36] Mark Callaghan
query_cache_type=ON for all tests. I will repeat with it OFF, but if the problem is limited to 5.5.8 and ON, then that is a performance regression. Dump the query cache or make it off by default.

I didn't build 5.5.8 with fast mutexes because you don't build with them: I used the cmake scripts that MySQL claims to use for its official binaries. I get lots of different messages from people at MySQL about the future of fast mutexes. Is it staying or going? Why don't your builds use it?

I will try it with fast mutexes enabled, but a few changes on your side could make life easier for people like me.
[6 Jan 2011 16:37] Mark Callaghan
Inaam, 

You don't need my sysbench hacks. All of the magic is in the shell scripts. Just run 8 concurrent instances of it. Also, the problem appears to start with 64 concurrent clients.
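(A rough sketch of that setup, using stock sysbench 0.4 option names rather than Mark's actual shell scripts; the host/user placeholders, the sbtest1..sbtest8 table names and the table size are assumptions for illustration:)

# run 8 sysbench instances in parallel, one table each, 8 threads per instance
for i in $(seq 1 8); do
    sysbench --test=oltp --mysql-host=<host> --mysql-user=<user> \
             --oltp-table-name=sbtest$i --oltp-table-size=1000000 \
             --num-threads=8 --max-requests=0 --max-time=1200 run > sb$i.log &
done
wait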
[6 Jan 2011 16:39] Mark Callaghan
And now 5.5.9 will get --with-fast-mutexes, when it wasn't used for the 5.5.8 builds:
http://bugs.mysql.com/bug.php?id=58766
[6 Jan 2011 23:47] Alexey Stroganov
Mark,

To get the same workload I tried a slightly different way - I modified the
oltp_rw.lua workload from sysbench 0.5 and added a --tables parameter that
specifies how many tables to use for testing.

so prepare/run look like the following

prepare:

sysbench-0.5.0  --test=multi_table_rw.lua --oltp-table-size=4000000 --mysql-table-engine=innodb \
                --db-ps-mode=disable --mysql-engine-trx=yes --rand-type=uniform  --tables=8 \
                --mysql-host=<host> --mysql-user=<user> --mysql-port=<port> prepare

run:
   
sysbench-0.5.0  --test=multi_table_rw.lua --oltp-table-size=4000000 --mysql-table-engine=innodb \
                --db-ps-mode=disable --mysql-engine-trx=yes --rand-type=uniform  --tables=8 \
                --mysql-host=<host> --mysql-user=<user> --mysql-port=<port> \
                --max-requests=0 --max-time=600 --num-threads=<threads> run 

I've run this workload with 256 threads / 8 tables, which means 64 threads per table, and your exact configs for 5.5.8/5.1. I don't see any notable difference in purge lag:

5.1.50 TPS: 13729.27
5.5.8  TPS: 13885.19 

5.1.50: size of ibdata1/history length right after stop of sysbench
- /data0/data/ibdata1  1.679.818.752
- History length: 3212664 

5.5.8: size of ibdata1/history length right after stop of sysbench
- /data0/data/ibdata1  1.696.595.968
- History length: 3251717 

If you have a chance, can you please try my variant of the workload to check whether you can reproduce the same behavior with it.
[6 Jan 2011 23:48] Alexey Stroganov
Test case for the issue

Attachment: multi_table_rw.lua (text/x-lua), 3.51 KiB.

[7 Jan 2011 1:39] Mark Callaghan
I have never used the Lua version of sysbench. I know I should, but I have a few nice patches for sysbench 0.4 that I like. So I will do this, but in return can you repeat your test with 8 threads per table for 64 threads total?

Where did you get sysbench from? Can you send me a URL for that?

I repeated the test for 64 concurrent threads with 8 per table; the history list lengths at 1-minute intervals are listed below. The test ran for 20 minutes. The results for 558 are much worse than for the 51 releases. Results with query_cache_type=0 and builds using --with-fast-mutexes for 558 were no better than what is listed here.
     5147              5152            558
   27225           45827            8713
   51899          330980          356086
  110242          326102          410195
  576066           12713          572020
  132556           84705          656031
   37837           11501         1158614
 1272295          361269         1690426
     504           79095         1700081
  118360          156928         2072686
 1544348          244717         2665173
  152711          138543         2838506
  341397           30268         3166159
 1521737          249165         3820811
  441446          167466         4130098
  958994           32983         4231953
 1065207          240991         4838093
  313345           80180         5444631
  802280          354299         5501184
  544112          160487         5864877

My TPS numbers are still much higher than yours (25,000 to 35,000). You report ~10,000. For my tests begin/commit are not run. Each transaction is select/update and the update is auto-commit.
[7 Jan 2011 1:47] Mark Callaghan
The columns are mysql: 5.1.47, 5.1.52, 5.5.8
The last column (5.5.8) has by far the largest values for purge lag
[7 Jan 2011 14:01] Alexey Stroganov
Mark,

Sure I will do that testing on my side.

Lua version of sysbench - bzr branch lp:sysbench 

Actually this version is much simpler for adding any extension, query or flow modification. It is synced with 0.4 so it has all the fixes from there, like:

  --tx-rate=N                target transaction rate (tps) [0]
  --tx-jitter=N              target transaction variation, in microseconds [0]
  --report-interval=N        periodically report intermediate statistics with a specified interval in seconds. 0 disables intermediate reports 

Please note that in 0.5.0 dist type options have different names (--rand* instead of --oltp-dist*):

  --rand-init=[on|off]       initialize random number generator [off]
  --rand-type=STRING         random numbers distribution {uniform,gaussian,special} [special]
  --rand-spec-iter=N         number of iterations used for numbers generation [12]
  --rand-spec-pct=N          percentage of values to be treated as 'special' (for special distribution) [1]
  --rand-spec-res=N          percentage of 'special' values to use (for special distribution) [75]
  --rand-seed=N              seed for random number generator, ignored when 0 [0]
[7 Jan 2011 14:48] Inaam Rana
Mark,

I presume that you have run 'show innodb status' periodically to get history list length. If you have that output can you send it over. If you also have 'show engine innodb mutex' that would be fabulous.
[7 Jan 2011 21:57] Mark Callaghan
I provided history list length at 1 minute intervals from a 20-minute test in my previous message.

I have to repeat tests to get mutex status because MySQL continues to change the syntax for SHOW commands. InnoDB also changed the format ('os_waits=2' now, used to just list the number). 
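(Roughly how the aggregation below can be produced from the 5.5 'os_waits=N' format; a sketch, not necessarily the exact command that was used:)

# sum os_waits per mutex creation site (file:line) and sort by total waits
mysql -B -N -e 'SHOW ENGINE INNODB MUTEX' |
  awk -F'\t' '{ sub(/^os_waits=/, "", $3); waits[$2] += $3 }
              END { for (m in waits) printf "%10d  %s\n", waits[m], m }' |
  sort -rn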

aggregated mutex stats from 5.5.8
    121461	/s/bld/558orig/storage/innobase/trx/trx0rseg.c:207
     94302	/s/bld/558orig/storage/innobase/log/log0log.c:771
     40327	/s/bld/558orig/storage/innobase/srv/srv0srv.c:996
     19525	/s/bld/558orig/storage/innobase/btr/btr0sea.c:182
     10604	/s/bld/558orig/storage/innobase/buf/buf0buf.c:1208
      6269	/s/bld/558orig/storage/innobase/fil/fil0fil.c:1251
       428	/s/bld/558orig/storage/innobase/log/log0log.c:775
       101	/s/bld/558orig/storage/innobase/fil/fil0fil.c:1559
        52	/s/bld/558orig/storage/innobase/log/log0log.c:832
        11	/s/bld/558orig/storage/innobase/buf/buf0buf.c:1243
        10	/s/bld/558orig/storage/innobase/dict/dict0dict.c:695
         1	/s/bld/558orig/storage/innobase/trx/trx0sys.c:190
         1	/s/bld/558orig/storage/innobase/trx/trx0purge.c:238
         1	/s/bld/558orig/storage/innobase/srv/srv0srv.c:999

more for 5.5.8 from another 20-minute test run
    154189	/s/bld/558orig/storage/innobase/log/log0log.c:771
    112734	/s/bld/558orig/storage/innobase/trx/trx0rseg.c:207
     59097	/s/bld/558orig/storage/innobase/srv/srv0srv.c:996
     37855	/s/bld/558orig/storage/innobase/buf/buf0buf.c:1208
     25252	/s/bld/558orig/storage/innobase/btr/btr0sea.c:182
      6863	/s/bld/558orig/storage/innobase/fil/fil0fil.c:1251
      1225	/s/bld/558orig/storage/innobase/log/log0log.c:775
       341	/s/bld/558orig/storage/innobase/fil/fil0fil.c:1559
        94	/s/bld/558orig/storage/innobase/buf/buf0buf.c:1243
        64	/s/bld/558orig/storage/innobase/log/log0log.c:832
        10	/s/bld/558orig/storage/innobase/dict/dict0dict.c:695
         3	/s/bld/558orig/storage/innobase/trx/trx0sys.c:190

Top-N for 5.1.52 + facebook patch
   2957178	trx/trx0rseg.c:210
    182466	srv/srv0srv.c:1135
    110395	log/log0log.c:759
     30822	btr/btr0sea.c:183
      1952	buf/buf0buf.c:1050
       322	ha/hash0hash.c:192
       315	fil/fil0fil.c:1388
       112	log/log0log.c:826
         8	srv/srv0start.c:980
         2	dict/dict0dict.c:655

Top-N for 5.1.47
   5284161	trx/trx0rseg.c:210
    254172	srv/srv0srv.c:938
    109009	log/log0log.c:759
     46978	btr/btr0sea.c:170
     28260	buf/buf0buf.c:957
      1990	fil/fil0fil.c:1517
       199	log/log0log.c:815
        10	srv/srv0start.c:980
         1	dict/dict0dict.c:667
         1	dict/dict0dict.c:655
[7 Jan 2011 22:01] Mark Callaghan
The other thing I have is PMP (http://poormansprofiler.org) samples taken during the run. I grep the stacks for those with purge and the results from several samples are:

---

pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,buf_flush_wait_batch_end,log_check_margins,que_run_threads,trx_purge,srv_master_thread,start_thread,clone

--

      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,rw_lock_x_lock_func,buf_page_get_gen,flst_add_first,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,trans_commit_stmt,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,trx_purge_choose_next_log,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone
      1 __lll_mutex_unlock_wake,_L_mutex_unlock_59,__pthread_mutex_unlock_usercnt,os_event_wait_low,sync_array_wait_event,rw_lock_x_lock_func,buf_page_get_gen,flst_add_first,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,trans_commit_stmt,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone
      1 buf_page_get_gen,flst_add_first,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,trans_commit_stmt,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone

----

  1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,buf_flush_wait_batch_end,log_check_margins,que_run_threads,trx_purge,srv_master_thread,start_thread,clone

-----

     1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,trx_purge_choose_next_log,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone

-------------------

And then I have this from two samples for 5.1.52

   1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,log_preflush_pool_modified_pages,log_check_margins,que_run_threads,trx_purge,srv_master_thread,start_thread,clone

1 flst_add_first,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,ha_autocommit_or_rollback,dispatch_command,do_command,handle_one_connection,start_thread,clone
[7 Jan 2011 23:54] Mark Callaghan
I repeated the tests for 20 minutes on 5147 and 558 and collected PMP samples every 30 seconds. The results below are filtered to only list stack traces with "purge". There were many more for 558, and there appear to be more cases where 558 waits on a mutex or condition variable than 5147 does.

The format for this is #occurrences, stack
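(For reference, a poor-man's-profiler style pipeline along these lines produces output in that format; this is a sketch under the usual PMP assumptions, not the exact script that was used:)

# one PMP sample: collapse each mysqld thread's stack into a comma-separated frame list,
# count identical stacks, and keep only the ones that mention purge
gdb -ex "set pagination 0" -ex "thread apply all bt" --batch -p $(pidof mysqld) |
  awk 'BEGIN { s = "" }
       /^Thread/ { if (s != "") print s; s = "" }
       /^#/      { f = ($3 == "in") ? $4 : $2; s = (s == "" ? f : s "," f) }
       END       { if (s != "") print s }' |
  sort | uniq -c | sort -rn | grep purge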

Results for 558

     10 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,trx_purge_choose_next_log,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone
      8 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,rw_lock_x_lock_func,buf_page_get_gen,flst_add_first,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,trans_commit_stmt,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone
      8 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,buf_flush_wait_batch_end,log_check_margins,que_run_threads,trx_purge,srv_master_thread,start_thread,clone
      5 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,trx_purge_truncate_rseg_history,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone
      3 trx_purge_choose_next_log,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone
      3 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,mtr_commit,trx_purge_truncate_rseg_history,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone
      2 sched_yield,rw_lock_x_lock_func,buf_page_get_gen,flst_add_first,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,trans_commit_stmt,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone
      2 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,rw_lock_x_lock_func,buf_page_get_gen,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,trans_commit_stmt,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone
      2 buf_page_get_gen,trx_purge_truncate_rseg_history,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone
      2 buf_calc_page_new_checksum,buf_flush_init_for_writing,buf_flush_try_neighbors,buf_flush_page_and_try_neighbors,buf_flush_batch,buf_flush_list,log_check_margins,que_run_threads,trx_purge,srv_master_thread,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,rw_lock_x_lock_func,buf_page_get_gen,trx_purge_truncate_rseg_history,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,trans_commit_stmt,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,buf_flush_buffered_writes,buf_flush_try_neighbors,buf_flush_page_and_try_neighbors,buf_flush_batch,buf_flush_list,log_check_margins,que_run_threads,trx_purge,srv_master_thread,start_thread,clone
      1 log_write_low,mtr_commit,trx_purge_truncate_rseg_history,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone

Results for 5147

      3 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,rw_lock_x_lock_func,buf_page_get_gen,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,ha_autocommit_or_rollback,dispatch_command,do_command,handle_one_connection,start_thread,clone
      2 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,rw_lock_x_lock_func,buf_page_get_gen,flst_add_first,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,ha_autocommit_or_rollback,dispatch_command,do_command,handle_one_connection,start_thread,clone
      2 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,trx_purge_truncate_history,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone
      2 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,trx_purge_choose_next_log,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone
      1 ut_delay,rw_lock_x_lock_func,buf_page_get_gen,flst_add_first,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,ha_autocommit_or_rollback,dispatch_command,do_command,handle_one_connection,start_thread,clone
      1 trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,ha_autocommit_or_rollback,dispatch_command,do_command,handle_one_connection,start_thread,clone
      1 sched_yield,rw_lock_x_lock_func,buf_page_get_gen,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,ha_autocommit_or_rollback,dispatch_command,do_command,handle_one_connection,start_thread,clone
      1 rw_lock_x_lock_func,buf_page_get_gen,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,ha_autocommit_or_rollback,dispatch_command,do_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,rw_lock_x_lock_func,buf_page_get_gen,trx_purge_truncate_history,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,trx_purge_rseg_get_next_history_log,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,buf_page_get_gen,trx_purge_rseg_get_next_history_log,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone
      1 mtr_read_ulint,flst_add_first,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,ha_autocommit_or_rollback,dispatch_command,do_command,handle_one_connection,start_thread,clone
      1 mlog_write_ulint,trx_purge_truncate_history,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone
      1 mlog_write_ulint,flst_insert_before,trx_purge_add_update_undo_to_history,trx_undo_update_cleanup,trx_commit_off_kernel,trx_commit_for_mysql,innobase_commit,ha_commit_one_phase,ha_commit_trans,ha_autocommit_or_rollback,dispatch_command,do_command,handle_one_connection,start_thread,clone
      1 buf_page_get_gen,trx_purge_truncate_history,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone
[8 Jan 2011 1:19] Inaam Rana
Mark,

Thanks for the info. What I meant was: can you send the entire 'show innodb status' output taken periodically over the run? Maybe we'll get some hint from there.
[8 Jan 2011 1:55] Mark Callaghan
After patching 5.5.8 to use one rollback segment rather than 128 and recreating the database, the history list length growth problem appears to have stopped.

Can you ask Dimitri to run tests for this using a cached database for dbstress and measure history list length over time?
[8 Jan 2011 2:22] Sunny Bains
"
10 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,trx_purge_choose_next_log,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone
"

As the number of UNDO segments increases, the function trx_purge_choose_next_log() becomes
more expensive. This has been fixed in 5.6.
[8 Jan 2011 2:31] Mark Callaghan
The thread doing purge holds purge_sys->mutex while iterating on the list of rollback segments. Do any other threads need to lock purge_sys->mutex and thus get blocked?
[8 Jan 2011 4:57] Sunny Bains
No other thread needs to block on purge_sys->mutex except the thread
doing the purge. Strictly speaking I don't think the mutex is even required.
[8 Jan 2011 15:22] Mark Callaghan
Another problem we have had in production with 5.1 is purge lag on a few servers that doesn't decrease. Looking at the srv_master_thread makes it clear why that can happen when the purge delay option is not used -- trx_purge is called once per 10 seconds for up to 20 undo log pages.

5.5 with the innodb_purge_threads=1 should improve that
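(Note that innodb_purge_threads is a startup option in 5.5, not a dynamic variable; a minimal sketch of checking and enabling it:)

# 0 means purge still runs inside the master thread (the 5.5 default)
mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_purge_threads'"
# to enable the dedicated purge thread, restart mysqld with the option set,
# e.g. put this under [mysqld] in my.cnf:  innodb_purge_threads = 1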
[10 Jan 2011 17:43] Dimitri KRAVTCHUK
Mark,

any more details about the server you're using to reproduce the observed issue?
(I'm pretty sure it has more than 8 cores, no?)..

then, are you running your test locally on the same server or via network?
(I'm curious why the issue was not reproduced easily)

Then regarding the problem:

#1.) I think there are many reasons why purge may not keep up with the load, and the only possible solutions here are: purge harder OR use some kind of throttling on DML statements.

purge harder - if we have more purge threads (but it's not ready yet in 5.5, and will not be an easy task as it may yet again reduce performance (as I've observed before on XtraDB))

DML throttling - using innodb_max_purge_lag=N will help here, but the code should be fixed (the check for the consistent read view should be removed).
Mark, you may try it (see the example after this comment); it may well work and give a better result than having more purge threads (http://dimitrik.free.fr/blog/archives/2010/05/mysql-performance-improving-stability.html)

#2.) According to Mark, the issue comes from the bigger number of rollback segments available in 5.5 (it looks to me like contention on the rollback segment mutex previously kept InnoDB from reaching a higher load). Now, what if we simply add a configuration parameter like "innodb_rollback_segments" - I mean we would still create 128 segments as we do currently, but then use *only* the given number, that's all. So, in case other users meet the same problem, it can easily be fixed by using a lower number of RBS.

Rgds,
-Dimitri
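(For what it's worth, innodb_max_purge_lag is dynamic, so the DML-throttling suggestion above can be tried at runtime; the threshold below is only an example value:)

# throttle DML once the history list length exceeds ~1M undo records (0 = no throttling, the default)
mysql -e "SET GLOBAL innodb_max_purge_lag = 1000000"
mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_max_purge_lag'"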
[10 Jan 2011 17:47] Mark Callaghan
My setup is:
* server host has 8 cores
* client host has 16 cores with HT on (same HW as server)
* server and client are in the same rack

I don't understand why my peak TPS is much higher than yours. My guess is that your tests ran with the clients and server on the same host. But I am not sure if your tests used 8 instances of sysbench.
[10 Jan 2011 23:00] Mark Callaghan
When you ran your lua/sysbench test, did you confirm that QPS/TPS was relatively stable throughout the test? I think the latest version of sysbench or the lua script is unstable (sysbench 0.5 with Lua). I built it from launchpad and this is the rate of Questions from SHOW STATUS reported over 20 second intervals. There are repeated intervals of 1 to 2 minutes where sysbench+lua issues ~0 queries.
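(The per-interval rates can be produced with something like the following; a sketch of the approach, not necessarily the exact command used for the numbers below:)

# per-second rate of Questions over 20-second intervals
# (-r reports deltas after the first sample, so the first printed value is not a rate)
mysqladmin -r -i 20 extended-status | awk '/ Questions / { print $4 / 20 }'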

35013.1
67222.4
68018.45
72725.3
74538.75
74902.65
74107.75
75325.7
71333.8
10345.75
130.25
0.5
0.35
0.35
0.35
0.35
42404.45
73697.05
74900.55
74858.9
74921.3
73783.7
74735.95
73921.3
60418.95
0.35
0.35
0.35
510.15
0.35
0.35
14594.15
73156.15

All of the threads in sysbench get stuck on network read, and when hung there are queries displayed in SHOW PROCESSLIST but the Questions count stops growing, so it looks like sysbench+lua pauses and stops reading data from sockets that have data.

The thread stack is:

read,vio_read_buff,my_real_read,my_net_read,cli_safe_read,cli_read_query_result,mysql_real_query,mysql_drv_query,mysql_drv_execute,db_execute,sb_lua_db_execute,luaD_precall,luaV_execute,luaD_call,luaD_rawrunprotected,luaD_pcall,lua_pcall,sb_lua_op_execute_request,runner_thread,start_thread,clone

But the binary uses the thread safe mysql client libs:

gcc -W -Wall -Wextra -Wpointer-arith -Wbad-function-cast -Wstrict-prototypes -Wnested-externs -Winline -funroll-loops -Wundef -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wredundant-decls -Wcast-align -pthread -O2 -ggdb3 -o sysbench --all-static sysbench.o sb_timer.o sb_options.o sb_logger.o db_driver.o  tests/fileio/libsbfileio.a tests/threads/libsbthreads.a tests/memory/libsbmemory.a tests/cpu/libsbcpu.a tests/mutex/libsbmutex.a scripting/libsbscript.a drivers/mysql/libsbmysql.a -L/data/5152orig/lib/mysql /data/5152orig/lib/mysql/libmysqlclient_r.so -lcrypt -lnsl -lpthread -lz scripting/lua/src/liblua.a -ldl -lrt -lm   -Wl,--rpath -Wl,/data/5152orig/lib/mysql -Wl,--rpath -Wl,/data/5152orig/lib/mysql

final result from sysbench+lua --> ~21,000 TPS
The max per-request time is ~146 seconds. That is really bad.

OLTP test statistics:
    queries performed:
        read:                            24962864
        write:                           24962864
        other:                           0
        total:                           49925728
    transactions:                        0      (0.00 per sec.)
    deadlocks:                           0      (0.00 per sec.)
    read/write requests:                 49925728 (40102.80 per sec.)
    other operations:                    0      (0.00 per sec.)

Test execution summary:
    total time:                          1244.9436s
    total number of events:              24962864
    total time taken by event execution: 78830.5580s
    per-request statistics:
         min:                                  0.29ms
         avg:                                  3.16ms
         max:                             146571.19ms
         approx.  95 percentile:               7.17ms

Threads fairness:
    events (avg/stddev):           390044.7500/2477.07
    execution time (avg/stddev):   1231.7275/1.98

------------------

I didn't look at this before, but the max request times for my sysbench read-write tests are also bad in 5.5.8 (65 seconds) compared to ~1 second for 5.1.52+facebook and ~400ms for 5.1.47. This is only a problem for my sysbench read-write tests. It is not a problem for my read-only tests.

Here is more data from my use of 8 concurrent instances of sysbench with 8 threads each. The only runs that have high max per-request values are the read-write on 5.5.8

max per instance for 5.5.8 read-only
[1293100742]          max:                                 15.95ms
[1293101344]          max:                                 11.84ms
[1293101945]          max:                                102.63ms
[1293102546]          max:                                299.48ms
[1293103147]          max:                                397.00ms
[1293103749]          max:                                254.50ms
[1293104351]          max:                                416.42ms
[1293104954]          max:                                421.48ms

max per instance for 5.1.52 read-only
[1293076036]          max:                                103.12ms
[1293076637]          max:                                 27.39ms
[1293077238]          max:                                162.98ms
[1293077839]          max:                                998.04ms
[1293078440]          max:                               1506.41ms
[1293079042]          max:                               2302.87ms
[1293079648]          max:                               1832.13ms
[1293080268]          max:                                738.39ms

max per-instance for 5.1.47 read-only
[1293090886]          max:                                 86.99ms
[1293091487]          max:                                 10.64ms
[1293092088]          max:                                 78.61ms
[1293092689]          max:                                476.69ms
[1293093291]          max:                                362.65ms
[1293093893]          max:                                365.18ms
[1293094498]          max:                                512.64ms
[1293095105]          max:                                272.17ms

max per-instance for 5.1.47 read-write
[1293631822]          max:                                232.57ms
[1293633044]          max:                                201.62ms
[1293634266]          max:                                182.75ms
[1293635488]          max:                               1129.07ms
[1293636711]          max:                                195.82ms
[1293637913]          max:                                228.27ms
[1293640238]          max:                               1505.17ms
[1293641462]          max:                                239.67ms

max per-instance for 5.1.52 read-write
[1293582169]          max:                                121.25ms
[1293583392]          max:                                146.69ms
[1293584614]          max:                                348.14ms
[1293585836]          max:                                480.16ms
[1293587059]          max:                                705.35ms
[1293588281]          max:                                562.35ms
[1293589504]          max:                                743.61ms
[1293590728]          max:                                918.94ms

max per instance for 5.5.8 read-write
[1293664157]          max:                               1228.81ms
[1293665379]          max:                               9414.58ms
[1293666602]          max:                              14973.30ms
[1293667824]          max:                              20148.61ms
[1293669135]          max:                              45227.82ms
[1293670498]          max:                              31623.50ms
[1293671863]          max:                              36253.58ms
[1293673154]          max:                              24608.35ms
[10 Jan 2011 23:29] Mark Callaghan
These numbers make no sense to me right now so I will recompile binaries and then repeat the test on a different pair of servers.
[11 Jan 2011 0:22] Mark Callaghan
Repeated the test using sysbench+lua with a better 5.5.8 binary (compiled with fast mutexes, not hacked to use 1 rollback segment). It still has a bad max per-request time (120 seconds). Purge lag grew to 6M during the test. This does ~37,000 QPS on average, which is ~18,500 TPS on average. That is still much better than the 10,000 TPS that everyone but me has reported. But there are many stalls where TPS goes to zero for 1+ minute.

I think that sysbench+lua are fine given the 5.1.52 results below. I think the problem is either a bug in 5.5.8 or a problem with my 5.5.8 binary.

Results for 5.5.8

OLTP test statistics:
    queries performed:
        read:                            22605256
        write:                           22605256
        other:                           0
        total:                           45210512
    transactions:                        0      (0.00 per sec.)
    deadlocks:                           0      (0.00 per sec.)
    read/write requests:                 45210512 (37364.76 per sec.)
    other operations:                    0      (0.00 per sec.)

Test execution summary:
    total time:                          1209.9774s
    total number of events:              22605256
    total time taken by event execution: 77142.7893s
    per-request statistics:
         min:                                  0.29ms
         avg:                                  3.41ms
         max:                             121380.38ms
         approx.  95 percentile:               9.55ms

Threads fairness:
    events (avg/stddev):           353207.1250/2298.50
    execution time (avg/stddev):   1205.3561/1.17

QPS every 20 seconds has a lot of variance:
0.35
677.35
3411.9
3931.15
4762.5
5935.85
7527.85
11319.45
21171.8
55009.55
69177.25
73735.7
75611.2
75752.5
75581.95
71883.3
16848.05
200.3
0.35
0.35
0.5
21806.55
72472.8
74143.45
75860.05
76030.55
74360.2
73787.85
72129.95
22247.15
0.35
469.65
0.35
0.35
0.35
72072.8
69042.95
73679.35
72285.8
70123.35
73747.85
71243.05
61629.8
0.35
0.35
0.35
0.35
0.35
57218.65
71939.8
70774.65
70116.6
71240.5
72391.15
69894.85
59205.8
18088.65
0.65
0.35
0.35
-------------------

Results for 5.1.52 are much better: better average and max response times and fewer stalls. Peak purge lag was less than 400,000 versus 6M for 5.5.8

OLTP test statistics:
    queries performed:
        read:                            37414066
        write:                           37414066
        other:                           0
        total:                           74828132
    transactions:                        0      (0.00 per sec.)
    deadlocks:                           0      (0.00 per sec.)
    read/write requests:                 74828132 (62356.24 per sec.)
    other operations:                    0      (0.00 per sec.)

Test execution summary:
    total time:                          1200.0104s
    total number of events:              37414066
    total time taken by event execution: 76769.7165s
    per-request statistics:
         min:                                  0.28ms
         avg:                                  2.05ms
         max:                               1612.07ms
         approx.  95 percentile:              10.55ms

Threads fairness:
    events (avg/stddev):           584594.7812/2517.86
    execution time (avg/stddev):   1199.5268/0.01

QPS every 20 seconds:
56329.3
67086.5
68839.8
68499.8
68483.0
70003.8
72789.1
73814.6
74485.3
74132.3
75499.6
75058.5
74627.8
75160.9
75146.9
75262.3
74457.4
31643.5
726.3
1840.3
40750.1
60005.0
65113.0
72981.6
72749.3
71929.1
71321.7
74930.2
72290.9
76580.1
74689.8
73368.5
74931.6
74967.9
75683.0
75039.9
74697.4
47255.0
1252.0
2186.5
31514.9
58356.0
62233.5
74019.7
72598.3
73788.0
72976.5
74370.8
71570.8
74253.1
75616.2
75012.9
70646.4
75856.0
76486.6
74392.1
73610.9
62255.5
1928.3
2198.7
16901.4
57148.2
62231.2
70456.2
70519.3
70772.2
72650.6
75979.9
75779.7
72831.3
75493.9
75682.7
75455.6
74966.3
76342.6
74573.4
74517.9
74070.6
5822.1
2549.5
10215.7
58248.9
65300.1
66148.1
70590.6
70040.3
73672.6
72493.9
74786.4
73297.4
73530.1
76199.3
74989.2
75767.1
75089.1
73468.7
75305.0
75738.6
17530.9
2459.5
4430.2
51282.5
61999.1
62736.2
67090.9
73165.2
71576.5
72746.4
72631.5
73728.7
74813.7
74847.0
75925.4
73384.8
74791.2
73393.5
75102.1
73586.8
5731.1
[11 Jan 2011 0:54] Mark Callaghan
When 5.5.8 stalls, the thread stacks are listed below

     32 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,buf_flush_wait_batch_end,log_check_margins,row_upd_step,row_update_for_mysql,ha_innobase::update_row,handler::ha_update_row,mysql_update,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone
     20 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,ha_innobase::index_read,handler::index_read_idx_map,join_read_const,join_read_const_table,make_join_statistics,JOIN::optimize,mysql_select,handle_select,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone
     12 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,ha_innobase::index_read,handler::read_range_first,handler::read_multi_range_first,QUICK_RANGE_SELECT::get_next,rr_quick,mysql_update,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone
     10 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,os_aio_simulated_handle,fil_aio_wait,io_handler_thread,start_thread,clone
      1 pwrite64,os_file_write_func,os_aio_func,fil_io,buf_flush_buffered_writes,buf_flush_try_neighbors,buf_flush_page_and_try_neighbors,buf_flush_batch,buf_flush_list,log_check_margins,que_run_threads,trx_purge,srv_master_thread,start_thread,clone
      1 pthread_cond_timedwait@@GLIBC_2.3.2,os_event_wait_time_low,srv_monitor_thread,start_thread,clone
      1 pthread_cond_timedwait@@GLIBC_2.3.2,os_event_wait_time_low,srv_lock_timeout_thread,start_thread,clone
      1 pthread_cond_timedwait@@GLIBC_2.3.2,os_event_wait_time_low,srv_error_monitor_thread,start_thread,clone
      1 poll,handle_connections_sockets,mysqld_main,__libc_start_main,_start,poll
      1 do_sigwait,sigwait,signal_hand,start_thread,clone

And the problem is this thread:

 pwrite64,os_file_write_func,os_aio_func,fil_io,buf_flush_buffered_writes,buf_flush_try_neighbors,buf_flush_page_and_try_neighbors,buf_flush_batch,buf_flush_list,log_check_margins,que_run_threads,trx_purge,srv_master_thread,start_thread,clone

I used the mstat binary to plot these values at 20 second intervals
* qps - the rate for Questions from SHOW STATUS
* ups - the rate for Innodb_rows_updated from SHOW STATUS
* pdirty - the value of Innodb_buffer_pool_pages_dirty from SHOW STATUS
* pflush - the rate of Innodb_buffer_pool_pages_flushed from SHOW STATUS
* data_wr - the rate of Innodb_data_written from SHOW STATUS
* page_wr - the rate of Innodb_pages_written from SHOW STATUS

You can see the stalls in this data. Is 5.5.8 stalled by slow disks while 5.1.52 is not? Or is there a problem in the page flush code? SHOW ENGINE INNODB STATUS does not report latency for disk reads/writes/fsync, so this is harder than it should be to debug. From the 5.1.52+facebook run I know that disk writes by InnoDB take about 0.01ms each (thanks to the HW RAID card) and fsync takes ~3.5ms.

echo "   qps     ups  pdirty    pflush   data_wr  page_wr"; awk '{ if (NF > 500) {printf "%6d  %6d  %7d  %7d  %8d  %8d\n", $563, $477, $396, $399, $431, $433 } }' o.ms | head -100
   qps     ups  pdirty    pflush   data_wr  page_wr
     0       0   445208        0         0         0
 71991   35990   455369        0     33002  36411340
 73518   36757   461879        0     33683  36997990
 75361   37675   466692        0     34622  37943756
 75431   37711   470472        0     34701  38037350
 72669   36329   473990        0     33541  36629248
 74165   37081   477053        0     34121  37369190
 77005   38498   480777        0     35419  38738688
 75715   37852   483470        0     34894  38313574
 75997   37999   486483        0     34946  38331033
 76164   38077   489882        0     35042  38356275
 71613   35798   480698       25     32665  87789260
  7125    3560   457895        0      3321  81876889
     0       0   434599        0        83  76336332
     0       0   413287        0        83  69835161
     0       0   388775        0        95  80320921
   422     211   368582    24861       271  68220313
     0       0   345542        0        71  75287756
     0       0   319782        0        82  84724940
     0       0   294726        0        97  81998643
     0       0   271686        0        90  75497472
     0       0   247236        0        96  80320921
     0       0   223174        0        93  78643200
     0       0   201286        0        85  71722598
     0       0   178758        0        88  73819750
     0       0   154182        0        96  80530636
     0       0   135494        0        73  61236838
     0       0   111686        0        93  78014054
     0       0    84806        0       106  88080384
     0       0    61382        0        92  76755763
 49531   24769   233608    65934     22701  82348185
 74975   37490   377977      614     34397  47274752
 75992   37986   444059      307     34908  43205222
 76393   38191   473434      460     35075  45949593
 76547   38273   488931      588     35119  48029747
 75528   37758   496404      640     34703  48535705
 75320   37655   502052      614     34640  48117862
 75556   37776   507798      691     34669  49136742
 75034   37509   511400      563     34471  46937907
 75586   37790   512880      793     34761  50985420
 76346   38192   516677      486     35084  46345164
 75558   37752   519362      640     34662  48433766
 76361   38174   522625      640     35074  48904908
 72629   36313   518191     2585     33222  79934515
 71084   35537   515839     1894     32422  91468236
 24923   12458   499821        0     11401  84649881
     0       0   477165        0        88  74029465
     0       0   451565        0       100  84095795
     0       0   425069        0       103  86822092
     0       0   402413        0        88  74448896
     0       0   379437        0        89  75078041
     0       0   356845        0        88  74029465
     0       0   334317        0        88  73610035
     0       0   310381        0        94  78643200
     0       0   285677        0        96  80950067
     0       0   263021        0        88  74239180
     0       0   240621        0        87  73400320
     0       0   217957        0        88  74448896
     0       0   196281        0        84  70883737
     0       0   172397        0        93  78223769
     0       0   146029        0       103  86402662
     0       0   121325        0        97  80950067
 39601   19804   257156    85702     18178  86207283
 74225   37114   409716     1331     34103  59077734
 75162   37571   475519     1228     34497  58350028
 75007   37501   502091     1484     34444  61632972
 75723   37864   513106     1280     34737  59022592
 74600   37289   518156     1280     34247  58364313
 74448   37221   519504     1561     34159  63365068
 75774   37883   520525     1356     34784  59951769
 75881   37934   522953      972     34877  54294732
 75324   37663   523845     1203     34563  57319782
 72340   36162   524486     1203     33099  56194867
 74158   37075   528526     1228     34099  57367398
 74059   37028   528789     1996     33925  73507584
 72635   36311   527478     3456     33159  96810035
 69623   34808   530054     3302     31858  95451289
 46373   23181   525342        0     21195  88891084
[11 Jan 2011 1:15] Mark Callaghan
Repasting as some columns were wrong:
* qps - rate of Questions
* ups - rate of Innodb_rows_updated
* pdirty - value of Innodb_buffer_pool_pages_dirty
* pflush - rate of Innodb_buffer_pool_pages_flushed
* data_wr - rate of Innodb_data_writes
* page_wr - rate of Innodb_dblwr_pages_written
* dsync - rate of Innodb_data_fsyncs
* lsync - rate of Innodb_os_log_fsyncs

Here is one example of a stall. When the stall happens, 
* data_wr drops to ~100, which is about the speed of a single disk (odd)
* page_wr spikes - given the drop in data_wr, a lot of large IOs are being done

But why is InnoDB staying in this state?

   qps     ups  pdirty    pflush   data_wr  page_wr  dsync  lsync
     0       0   477165        0        88      2252     69      0
     0       0   451565        0       100      2572     81      0
     0       0   425069        0       103      2649     80      0
     0       0   402413        0        88      2265     68      0
     0       0   379437        0        89      2291     68      0
     0       0   356845        0        88      2265     68      0
     0       0   334317        0        88      2240     68      0
     0       0   310381        0        94      2406     73      0
     0       0   285677        0        96      2470     74      0
     0       0   263021        0        88      2265     68      0
     0       0   240621        0        87      2240     67      0

--- and all of the data

# echo "   qps     ups  pdirty    pflush   data_wr  page_wr  dsync  lsync"; awk '{ if (NF > 500) {printf "%6d  %6d  %7d  %7d  %8d  %8d  %5d  %5d\n", $563, $477, $396, $399, $431, $435, $419, $445 } }' o.ms | head -80
   qps     ups  pdirty    pflush   data_wr  page_wr  dsync  lsync
     0       0   445208        0         0         0      0      0
 71991   35990   455369        0     33002         0      1      1
 73518   36757   461879        0     33683         0      1      1
 75361   37675   466692        0     34622         0      1      1
 75431   37711   470472        0     34701         0      1      1
 72669   36329   473990        0     33541         0      1      1
 74165   37081   477053        0     34121         0      1      1
 77005   38498   480777        0     35419         0      1      1
 75715   37852   483470        0     34894         0      1      1
 75997   37999   486483        0     34946         0      1      1
 76164   38077   489882        0     35042         0      1      1
 71613   35798   480698       25     32665      1574     47      8
  7125    3560   457895        0      3321      2393     62      1
     0       0   434599        0        83      2329     68      0
     0       0   413287        0        83      2124     64      0
     0       0   388775        0        95      2457     74      0
   422     211   368582    24861       271      2075     60      0
     0       0   345542        0        71      2291     60      0
     0       0   319782        0        82      2585     67      0
     0       0   294726        0        97      2508     75      0
     0       0   271686        0        90      2304     70      0
     0       0   247236        0        96      2444     74      0
     0       0   223174        0        93      2406     75      0
     0       0   201286        0        85      2188     67      0
     0       0   178758        0        88      2252     67      0
     0       0   154182        0        96      2457     74      0
     0       0   135494        0        73      1868     56      0
     0       0   111686        0        93      2380     71      0
     0       0    84806        0       106      2688     81      0
     0       0    61382        0        92      2342     65      0
 49531   24769   233608    65934     22701      1747     46      3
 74975   37490   377977      614     34397       294      6      1
 75992   37986   444059      307     34908       153      3      1
 76393   38191   473434      460     35075       230      4      1
 76547   38273   488931      588     35119       294      6      1
 75528   37758   496404      640     34703       320      6      1
 75320   37655   502052      614     34640       307      6      1
 75556   37776   507798      691     34669       345      6      1
 75034   37509   511400      563     34471       281      5      1
 75586   37790   512880      793     34761       396      7      1
 76346   38192   516677      486     35084       243      5      1
 75558   37752   519362      640     34662       320      6      1
 76361   38174   522625      640     35074       320      6      1
 72629   36313   518191     2585     33222      1331     38      6
 71084   35537   515839     1894     32422      1689     53      9
 24923   12458   499821        0     11401      2201     61      3
     0       0   477165        0        88      2252     69      0
     0       0   451565        0       100      2572     81      0
     0       0   425069        0       103      2649     80      0
     0       0   402413        0        88      2265     68      0
     0       0   379437        0        89      2291     68      0
     0       0   356845        0        88      2265     68      0
     0       0   334317        0        88      2240     68      0
     0       0   310381        0        94      2406     73      0
     0       0   285677        0        96      2470     74      0
     0       0   263021        0        88      2265     68      0
     0       0   240621        0        87      2240     67      0
     0       0   217957        0        88      2265     67      0
     0       0   196281        0        84      2163     65      0
     0       0   172397        0        93      2393     71      0
     0       0   146029        0       103      2636     81      0
     0       0   121325        0        97      2470     73      0
[11 Jan 2011 1:24] Mark Callaghan
The stall happens when the main thread is in log_preflush_pool_modified_pages, on a line that changed between 5.5.8:
        n_pages = buf_flush_list(ULINT_MAX, new_oldest); 
...and 5.1.52
        n_pages = buf_flush_batch(BUF_FLUSH_LIST, ULINT_MAX, new_oldest);

The full stack trace for the stall:

Thread 67 (Thread 1215478080 (LWP 12505)):
#0  0x0000003f7dc0d838 in pwrite64 () from /lib64/libpthread.so.0
#1  0x0000000000896d38 in os_file_write_func (name=0x309b8198 "./ibdata1", file=3, buf=0x2aaf677e8000, offset=2097152, offset_high=0, n=1048576) at /s/bld/558orig/storage/innobase/os/os0
file.c:2298
#2  0x000000000089a450 in os_aio_func (type=11, mode=<value optimized out>, name=0x309b8198 "./ibdata1", file=3, buf=0x2aaf677e8000, offset=2097152, offset_high=0, n=1048576, message1=0x
309b8098, message2=0x0) at /s/bld/558orig/storage/innobase/include/os0file.ic:354
#3  0x000000000085c9b8 in fil_io (type=<value optimized out>, sync=<value optimized out>, space_id=<value optimized out>, zip_size=0, block_offset=2097152, byte_offset=0, len=1048576, bu
f=0x2aaf677e8000, message=0x0) at /s/bld/558orig/storage/innobase/include/os0file.ic:246
#4  0x000000000082eefe in buf_flush_buffered_writes () at /s/bld/558orig/storage/innobase/buf/buf0flu.c:800
#5  0x000000000082fc58 in buf_flush_try_neighbors (space=15, offset=11741, flush_type=BUF_FLUSH_LIST, n_flushed=230474, n_to_flush=18446744073709551614) at /s/bld/558orig/storage/innobas
e/buf/buf0flu.c:952
#6  0x00000000008303d2 in buf_flush_page_and_try_neighbors (bpage=<value optimized out>, flush_type=BUF_FLUSH_LIST, n_to_flush=18446744073709551614, count=0x4872ad88) at /s/bld/558orig/s
torage/innobase/buf/buf0flu.c:1499
#7  0x000000000083091e in buf_flush_batch (buf_pool=0x159b4708, flush_type=BUF_FLUSH_LRU, min_n=18446744073709551614, lsn_limit=69091037851) at /s/bld/558orig/storage/innobase/buf/buf0fl
u.c:1621
#8  0x0000000000832560 in buf_flush_list (min_n=18446744073709551614, lsn_limit=69091037851) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1929
#9  0x000000000088a7f5 in log_check_margins () at /s/bld/558orig/storage/innobase/log/log0log.c:1658
#10 0x00000000008af335 in que_run_threads (thr=<value optimized out>) at /s/bld/558orig/storage/innobase/include/log0log.ic:443
#11 0x00000000007ecf47 in trx_purge (limit=3757588) at /s/bld/558orig/storage/innobase/trx/trx0purge.c:1184
#12 0x00000000007e1692 in srv_master_thread (arg=<value optimized out>) at /s/bld/558orig/storage/innobase/srv/srv0srv.c:2584
#13 0x0000003f7dc062f7 in start_thread () from /lib64/libpthread.so.0
#14 0x0000003f7d4d1e3d in clone () from /lib64/libc.so.6
[11 Jan 2011 1:26] Sunny Bains
Looks like row0purge.c may be forcing extra flushing. Can you verify with the following patch?

=== modified file 'storage/innobase/row/row0purge.c'
--- storage/innobase/row/row0purge.c	revid:vasil.dimov@oracle.com-20110108145236-ffp48jfkfiq0n46r
+++ storage/innobase/row/row0purge.c	2011-01-11 01:24:11 +0000
@@ -282,7 +282,6 @@
 	mtr_t			mtr;
 	enum row_search_result	search_result;
 
-	log_free_check();
 	mtr_start(&mtr);
 
 	search_result = row_search_index_entry(index, entry, BTR_MODIFY_TREE,
@@ -362,8 +361,6 @@
 	btr_pcur_t		pcur;
 	enum row_search_result	search_result;
 
-	log_free_check();
-
 	mtr_start(&mtr);
 
 	/* Set the purge node for the call to row_purge_poss_sec(). */

This may possibly break correctness; I haven't checked, but I would like to see its
effect on flushing. The problem with the way it is currently written is that it will do
the log_free_check() multiple times if row_purge_remove_sec_if_poss_leaf() fails.
[11 Jan 2011 1:45] Mark Callaghan
The benchmark doesn't update indexed columns
[11 Jan 2011 3:00] Sunny Bains
#5  0x000000000082fc58 in buf_flush_try_neighbors (space=15, offset=11741, flush_type=BUF_FLUSH_LIST, n_flushed=230474, n_to_flush=18446744073709551614) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:952

230474 pages is a serious number of pages for one flush.
[11 Jan 2011 3:11] Mark Callaghan
log_preflush_pool_modified_pages calls buf_flush_list with no limit on the number of pages to flush. The limit is on max LSN, and when dirty neighbor pages are also flushed, that adds up.

        n_pages = buf_flush_list(ULINT_MAX, new_oldest);
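To make that concrete, here is a toy, self-contained model of the chain (an illustration only, not the InnoDB source; the _sketch names mirror the functions seen in the stack traces in this report, and the constants come from the instrumented output posted later). The point it illustrates is that the only bound handed to buf_flush_list() is an LSN target, so the page count of one preflush is effectively unbounded:

#include <stdio.h>
#include <limits.h>

typedef unsigned long      ulint;
typedef unsigned long long lsn_t;

/* Stand-in for buf_flush_list(): flush dirty pages until the oldest
   modification LSN reaches lsn_limit; min_n is ULINT_MAX, so there is
   no cap on the number of pages written (neighbors included). */
static ulint buf_flush_list_sketch(ulint min_n, lsn_t lsn_limit)
{
        printf("buf_flush_list(min_n=%lu, lsn_limit=%llu)\n",
               min_n, (unsigned long long) lsn_limit);
        return 99085;            /* one observed result: ~99k pages */
}

static void log_preflush_pool_modified_pages_sketch(lsn_t new_oldest)
{
        ulint n = buf_flush_list_sketch(ULONG_MAX, new_oldest);

        printf("preflush wrote %lu pages in a single batch\n", n);
}

static void log_checkpoint_margins_sketch(void)
{
        lsn_t oldest  = 32769015802ULL;  /* oldest dirty page LSN          */
        lsn_t advance = 962638ULL;       /* how far to push it forward     */

        /* In the instrumented output later in this report,
           new_oldest == oldest + advance. */
        log_preflush_pool_modified_pages_sketch(oldest + advance);
}

static void log_check_margins_sketch(void) { log_checkpoint_margins_sketch(); }

/* Reached from purge, the master thread and user threads alike. */
static void log_free_check_sketch(void)    { log_check_margins_sketch(); }

int main(void)
{
        log_free_check_sketch();
        return 0;
}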
[11 Jan 2011 3:17] Sunny Bains
Yes, I realize that, but 230474 pages flushed to LSN is still a lot of pages, 230474 * 16 * 1024 = ~3G.
[11 Jan 2011 3:21] Inaam Rana
This note is about stalls (not about purge lag)

* data_wr - rate of Innodb_data_writes
* page_wr - rate of Innodb_dblwr_pages_written

   qps     ups  pdirty    pflush   data_wr  page_wr  dsync  lsync
...
...
     0       0   434599        0        83      2329     68      0
     0       0   413287        0        83      2124     64      0
     0       0   388775        0        95      2457     74      0
   422     211   368582    24861       271      2075     60      0
     0       0   345542        0        71      2291     60      0
     0       0   319782        0        82      2585     67      0

The data_wr in this case is the number of write IOs, i.e. calls to os_file_pwrite(). As you are using simulated AIO it is possible, indeed very likely, that IO requests are coalesced, which gives us this rather small number; i.e. each IO operation was much bigger than a single page (for example, in the first row above 2329 pages were written with only 83 write calls, roughly 28 pages per IO). What also points in this direction is that the page_wr values, which represent the number of pages written from the doublewrite buffer, are much higher.

The pattern above looks like a classic case where background flushing was unable to keep pace with the dirtying activity. The question to answer is why we lag behind so woefully with 5.5.8. There are two possible explanations:

1) In 5.5 we enforce innodb_io_capacity more strictly. If we flush innodb_io_capacity pages in one batch we sleep for the rest of the second; in 5.1 we skip that sleep. In cases where innodb_io_capacity is set very low (like the default of 200) the effect in 5.1 is that we can flush much more than 200 pages per second because of the skipped sleep. Since we fixed this subtle bug in 5.5, the result is that we flush at most 200 pages in one second. If you want more pages flushed because you think your IO subsystem has more capacity, you should set innodb_io_capacity to a higher value. I see that in your benchmark you have not set this parameter, which means the default of 200 will be used. If you can retry with a higher value of, say, 4000, maybe that will resolve the stall problem. (A short sketch of this difference follows after point 2.)

2) Please retry with native AIO enabled, just to see the difference. We have mostly tested with native AIO on Linux, and it may be that when writing the native AIO code I introduced a regression in the simulated AIO code.
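A minimal, self-contained sketch of the difference described in point 1) (an assumption about the shape of the loop, not the shipped srv_master_thread code; how many batches fit into one second is made up purely for illustration):

#include <stdio.h>

static unsigned long srv_io_capacity = 200;   /* the innodb_io_capacity default */

/* Pretend there are always enough dirty pages to fill a batch. */
static unsigned long flush_batch(unsigned long max_pages)
{
        return max_pages;
}

static void sleep_rest_of_second(void)
{
        /* os_thread_sleep() in the real code; a no-op in this model. */
}

/* 5.1-style second: skip the sleep whenever a full batch was flushed. */
static unsigned long pages_per_second_5_1(void)
{
        unsigned long total = 0;
        int           batches;

        for (batches = 0; batches < 5; batches++) {   /* say 5 batches fit */
                unsigned long n = flush_batch(srv_io_capacity);

                total += n;
                if (n < srv_io_capacity) {            /* nothing left: sleep */
                        sleep_rest_of_second();
                        break;
                }
        }
        return total;
}

/* 5.5-style second: always sleep after the batch, hard-capping the rate. */
static unsigned long pages_per_second_5_5(void)
{
        unsigned long total = flush_batch(srv_io_capacity);

        sleep_rest_of_second();
        return total;
}

int main(void)
{
        printf("5.1-style: ~%lu pages/s, 5.5-style: ~%lu pages/s\n",
               pages_per_second_5_1(), pages_per_second_5_5());
        return 0;
}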

#7  0x000000000083091e in buf_flush_batch (buf_pool=0x159b4708, flush_type=BUF_FLUSH_LRU, min_n=18446744073709551614, lsn_limit=69091037851) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1621

This is strange. flush_type should be BUF_FLUSH_LIST, not BUF_FLUSH_LRU. Maybe compiler optimization has mangled the value shown in the backtrace. Further up in the stack we see BUF_FLUSH_LIST, which is the right value.
[11 Jan 2011 3:27] Mark Callaghan
I didn't document it, but I have been using innodb_io_capacity=1000. The real capacity is closer to 1600. There is also a HW RAID card with a 256MB write cache which can absorb write bursts at a higher rate. I will try with innodb_io_capacity=2000
[11 Jan 2011 4:12] Sunny Bains
Going by the n_flushed=230474 value, it is impossible for purge to catch up, assuming it is the only one doing these massive flushes. Why isn't the master thread doing the flushing? Purge shouldn't have to do such massive flushes.
[11 Jan 2011 4:12] Mark Callaghan
I am not sure if innodb_io_capacity is the reason. 

QPS per 20-second interval for 5.5.8 with innodb_io_capacity=2000, still lots of stalls

0
0.7
0.3
1825.8
3267.5
3568.2
3798.3
4165.7
4593.4
5031.2
5649.7
6037.9
7239.8
8424.5
10527.2
13817.4
18446.3
29599.5
51319.3
66186.0
70229.9
72659.1
74321.8
74712.6
75891.7
77234.2
77156.7
76265.0
75825.5
75748.8
74873.9
59240.6
0.3
481.6
0.3
0.7
0.3
38550.2
71336.0
73959.1
76251.5
76752.0
75159.6
76384.6
76460.2
75394.1
76457.2
73363.8
70129.8
76130.6
73811.6
70250.4
49935.2
0.7
0.3
0.7
0.3
0.7

---------

and 5.1.47 with innodb_io_capacity=1000, no stalls

1484.0
1451.5
4135.1
4672.7
4863.4
5289.3
5392.4
6019.1
6448.3
7269.2
8204.6
9099.6
10779.1
13292.0
17081.2
24265.7
41039.9
58226.6
62334.3
64135.1
65280.2
67507.3
67796.1
67166.9
67981.7
69349.0
70122.1
68304.7
69416.4
69188.9
71185.7
69237.3
67757.3
68835.3
67961.7
66663.1
66443.3
69977.4
69987.3
67996.5
68416.1
67625.7
69960.2
69583.8
69037.4
68291.8
69989.4
67236.0
[11 Jan 2011 4:28] Inaam Rana
Mark,

Can you try setting innodb_io_capacity to effectively unlimited (something like 999999)? If that doesn't work then try enabling native AIO. Thanks for your effort to nail this down.

Sunny,

It is in fact the master thread that is doing the purge, and during purge it hit the log_free_check() call. When we cross this limit, one thread does the flushing and all other threads wait for the huge batch to finish. Therefore, I don't think it really matters which thread is doing the batch. (A small sketch of this wait-for-the-batch behaviour is below.)
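For illustration, a minimal self-contained sketch of that wait (my own simplified stand-ins; the real buf_flush_list() and buf_flush_wait_batch_end() calls appear in the stack traces elsewhere in this report, and the signatures here are not the real ones):

#include <stdio.h>

static int batch_already_running = 1;   /* pretend a user thread owns it */

/* Stand-in for buf_flush_list(): returns 0 here when another batch is
   already in progress, instead of starting a new one. */
static unsigned long buf_flush_list_sketch(void)
{
        return batch_already_running ? 0 : 99085;
}

/* Stand-in for buf_flush_wait_batch_end(): block until the running batch
   finishes.  If that batch is flushing 100k+ pages, the caller waits a
   long time. */
static void buf_flush_wait_batch_end_sketch(void)
{
        puts("waiting for the running flush batch to end ...");
        batch_already_running = 0;      /* os_event_wait() in the real code */
}

/* The checkpoint-margin path reached from log_free_check(). */
static void log_check_margins_sketch(void)
{
        if (buf_flush_list_sketch() == 0) {
                buf_flush_wait_batch_end_sketch();
        }
}

int main(void)
{
        /* Purge, the master thread and user threads all funnel through this
           check, so one huge batch stalls all of them equally. */
        log_check_margins_sketch();
        puts("batch finished; the caller can resume purge / its UPDATE");
        return 0;
}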
[11 Jan 2011 4:35] Mark Callaghan
Anything calling log_free_check can trigger this as log_free_check -> log_check_margins -> log_checkpoint_margins -> log_preflush_pool_modified_pages -> buf_flush_list(ULINT_MAX, ...)

------

After starting 5.5.8, this message is in the db error log

110110 20:33:39  InnoDB: Warning: purge reached the head of the history list,
InnoDB: but its length is still reported as 3819087! Make a detailed bug
[11 Jan 2011 5:32] Sunny Bains
Inaam, even if it is the master thread the problem is that such a huge flush by purge will prevent it from catching up with user threads.
[11 Jan 2011 6:46] Mark Callaghan
The comment on the return value is wrong. This returns FALSE when buf_pool_mutex was not released

/********************************************************************//**
Check if the block is modified and ready for flushing. If the the block
is ready to flush then flush the page and try o flush its neighbors.

@return TRUE if buf_pool mutex was not released during this function.
This does not guarantee that some pages were written as well.
Number of pages written are incremented to the count. */
static
ibool
buf_flush_page_and_try_neighbors(
...

                buf_pool_mutex_enter(buf_pool);
                flushed = TRUE;
...

               return(flushed)

--------

buf_flush_flush_list_batch does more work in the inner loop as it gets/releases the flush list mutex on each iteration after a block has been flushed.

--------

I don't think neighbor pages should be flushed in this case as there is no limit on the number of pages that can be flushed. The limit is on the oldest_modification value.

--------

buf_flush_flush_list_batch has an optimization to not start over from the end of the flush list after finding a block to flush. When the optimization is used, there is no check to confirm that the lsn of the current block is large enough that the function can return. This is different from what 5.1 did and I have confirmed that this code continues past the block where it could have stopped.
[11 Jan 2011 8:02] Mark Callaghan
What I see under load is:
1) srv_master_thread does its normal work with many calls to buf_flush_list from adaptive flushing
2) then it gets stuck in purge, I have been running with innodb_purge_threads=0
pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,trx_purge_choose_next_log,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone

It stays in purge for a long time, so it stops flushing dirty pages, leading to the eventual problem: a huge amount of checkpoint IO that must eventually get done.
[11 Jan 2011 8:08] Sunny Bains
We are certainly flushing more than we need to. That is most likely the bug. The 3G flush that I noted earlier is just too much for a single flush. Now onto the why:

1. The comment is wrong.

2. There is an additional flush list mutex enter/exit in 5.5, we can ignore that.

3. 5.5 optimisation is the same as 5.1

4. In 5.1 we have:

                        mutex_enter(block_mutex);
                        ready = buf_flush_ready_for_flush(bpage, flush_type);
                        mutex_exit(block_mutex);

                        if (ready) {
                                space = buf_page_get_space(bpage);
                                offset = buf_page_get_page_no(bpage);

                                buf_pool_mutex_exit();

                                /* Try to flush also all the neighbors */
                                page_count += buf_flush_try_neighbors(
                                        space, offset, flush_type);

                                buf_pool_mutex_enter();
                                goto flush_next;

In 5.5 we have:

               /* The list may change during the flushing and we cannot
                safely preserve within this function a pointer to a
                block in the list! */
                while (bpage != NULL
                       && len > 0
                       && !buf_flush_page_and_try_neighbors(
                                bpage, BUF_FLUSH_LIST, min_n, &count)) {

                        buf_flush_list_mutex_enter(buf_pool);

                        /* If we are here that means that buf_pool->mutex
                        was not released in buf_flush_page_and_try_neighbors()
                        above and this guarantees that bpage didn't get
                        relocated since we released the flush_list
                        mutex above. There is a chance, however, that
                        the bpage got removed from flush_list (not
                        currently possible because flush_list_remove()
                        also obtains buf_pool mutex but that may change
                        in future). To avoid this scenario we check
                        the oldest_modification and if it is zero
                        we start all over again. */
                        if (bpage->oldest_modification == 0) {
                                buf_flush_list_mutex_exit(buf_pool);
                                break;
                        }

                        bpage = UT_LIST_GET_PREV(list, bpage);

                        ut_ad(!bpage || bpage->in_flush_list);

                        buf_flush_list_mutex_exit(buf_pool);

                        --len;
                }

We are doing the same thing as 5.1; the check for "ready" in 5.1 is the
return value of buf_flush_page_and_try_neighbors().
[11 Jan 2011 8:12] Sunny Bains
Where exactly in purge is it stuck? Is purge just slower with 128 rollback segments due to trx_purge_choose_next_log()?
[11 Jan 2011 8:38] Dimitri KRAVTCHUK
It seems that Mark did not use the purge thread in his tests. The purge thread feature exists exactly for this situation, and honestly I think the purge thread should *always* be enabled (and probably should not even be optional).

But that still does not explain to me why the same issue was not observed on 5.1.

Mark, could you please post your my.cnf settings here too?

Rgds,
-Dimitri
[11 Jan 2011 16:28] Mark Callaghan
I still think 5.5 does more work as lock/unlock on the flush list mutex should encounter more contention compared to lock/unlock on block->mutex as was done in 5.1.

In 5.5, after a page and its neighbors have been flushed, there is an optimization to not start over from the end of the flush list. In 5.1, the search always goes back to the end of the flush list. While I think this is a good change, it also skips the check to halt the search by comparing oldest_modification with lsn_limit and callers can stay in longer than they need to. I don't think that is good.

Dimitri -- this problem was less of an issue in 5.1 because with only 1 rollback segment to iterate over there is less work to do in trx_purge.

Inaam suggested trying innodb_io_capacity=99999. I don't think that will help as in some cases it will call buf_flush_batch with much larger values and those calls will take more time and disrupt the balance of the main background thread.

I supplied several thread stacks above to show where purge gets stuck, but I will save the next round of data for my next update.

my.cnf settings that I use:
innodb_buffer_pool_size=16G
innodb_log_file_size=1900M
innodb_flush_log_at_trx_commit=2
innodb_doublewrite=1
innodb_flush_method=O_DIRECT
innodb_thread_concurrency=0
innodb_max_dirty_pages_pct=80
innodb_file_format=barracuda
innodb_file_per_table
max_connections=2000
table_cache=2000
key_buffer_size=2000M
innodb_io_capacity=1000
innodb_thread_concurrency=32
[11 Jan 2011 16:53] Mark Callaghan
I started with a new database (dropped all files, ran sysbench prepare) and a warm buffer pool. Then I waited for the sysbench clients to become idle from the stall and ran PMP in a loop to get thread stacks. I also added printfs to some of the buffer flush related routines.

In this case I see srv_master_thread -> trx_purge -> buf_flush_list taking a very long time because on consecutive calls to trx_purge, buf_flush_list flushes 140,000 and then 280,000 pages.

A typical thread stack:

Thread 67 (Thread 1232353600 (LWP 15022)):
#0  0x0000003f7dc0a496 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000089ace9 in os_event_wait_low (event=0x20886b70, reset_sig_count=9493) at /s/bld/558orig/storage/innobase/os/os0sync.c:207
#2  0x000000000082f29d in buf_flush_buffered_writes () at /s/bld/558orig/storage/innobase/buf/buf0flu.c:655
#3  0x00000000008310b8 in buf_flush_try_neighbors (space=1, offset=20602, flush_type=BUF_FLUSH_LIST, n_flushed=84441, n_to_flush=18446744073709551614) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:952
#4  0x0000000000831832 in buf_flush_page_and_try_neighbors (bpage=<value optimized out>, flush_type=BUF_FLUSH_LIST, n_to_flush=18446744073709551614, count=0x49742d88) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1499
#5  0x0000000000831e3b in buf_flush_batch (buf_pool=0x208ba708, flush_type=BUF_FLUSH_LIST, min_n=18446744073709551614, lsn_limit=11704623228) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1621
#6  0x0000000000832703 in buf_flush_list (min_n=18446744073709551614, lsn_limit=11704623228) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1943
#7  0x000000000088a9e5 in log_check_margins () at /s/bld/558orig/storage/innobase/log/log0log.c:1658
#8  0x00000000008af525 in que_run_threads (thr=<value optimized out>) at /s/bld/558orig/storage/innobase/include/log0log.ic:443
#9  0x00000000007ecfd7 in trx_purge (limit=8906108) at /s/bld/558orig/storage/innobase/trx/trx0purge.c:1184
#10 0x00000000007e16d0 in srv_master_thread (arg=<value optimized out>) at /s/bld/558orig/storage/innobase/srv/srv0srv.c:2586
#11 0x0000003f7dc062f7 in start_thread () from /lib64/libpthread.so.0
#12 0x0000003f7d4d1e3d in clone () from /lib64/libc.so.6

But I don't think the 5.5 optimization is the problem. With more printfs, this is the work done in buf_flush_flush_list batch when 70,000 pages are flushed in one call. I added printfs to display the page flush count each time the search starts over and each time the optimization is used. Almost all of the work is done from starting the search over at the end of the list.

buf_flush_start for LIST
buf_flush_batch LIST enter 18446744073709551614 min, 23896715046 lsn
buf_flush_flush_list_batch start at end 64 count
buf_flush_flush_list_batch start at end 128 count
buf_flush_flush_list_batch start at end 192 count
buf_flush_flush_list_batch start at end 256 count
buf_flush_flush_list_batch start at end 320 count
buf_flush_flush_list_batch start at end 384 count
buf_flush_flush_list_batch start at end 448 count
buf_flush_flush_list_batch start at end 512 count
buf_flush_flush_list_batch start at end 576 count
buf_flush_flush_list_batch start at end 640 count
...
buf_flush_flush_list_batch start at end 73259 count
buf_flush_flush_list_batch start at end 73323 count
buf_flush_flush_list_batch start at end 73387 count
buf_flush_flush_list_batch start at end 73451 count
buf_flush_flush_list_batch start at end 73515 count
buf_flush_flush_list_batch optimization when lsn limit reached, 73515 count
buf_flush_flush_list_batch start at end 73579 count
buf_flush_flush_list_batch optimization when lsn limit reached, 73579 count
buf_flush_flush_list_batch optimization when lsn limit reached, 73579 count
buf_flush_flush_list_batch start at end 73643 count
buf_flush_batch flush writes
buf_flush_batch LIST flushed 73643
buf_flush_end for LIST
[11 Jan 2011 17:05] Mark Callaghan
I still see stalls in pure purge code first, occurring prior to the stalls from trx_purge -> buf_flush_list. This is what stalls srv_master_thread in trx_purge and prevents it from doing anything else, which can explain why the next call to buf_flush_list after this stall requires a huge amount of IO.

Examples are:

 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,trx_purge_choose_next_log,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone

pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,trx_purge_truncate_rseg_history,trx_purge_fetch_next_rec,row_purge_step,que_run_threads,trx_purge,srv_master_thread,start_thread,clone

#0  0x0000003f7dc0a496 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000089ad99 in os_event_wait_low (event=0x2d55f060, reset_sig_count=33) at /s/bld/558orig/storage/innobase/os/os0sync.c:207
#2  0x00000000007e8742 in sync_array_wait_event (arr=0x10feddb0, index=1) at /s/bld/558orig/storage/innobase/sync/sync0arr.c:457
#3  0x00000000007e966f in mutex_spin_wait (mutex=0x2d55ef50, file_name=0xa33108 "/s/bld/558orig/storage/innobase/trx/trx0purge.c", line=521) at /s/bld/558orig/storage/innobase/sync/sync0sync.c:621
#4  0x00000000007ef181 in trx_purge_truncate_rseg_history (rseg=0x2d55ef48, limit_trx_no=138501018, limit_undo_no=0) at /s/bld/558orig/storage/innobase/include/sync0sync.ic:222
#5  0x00000000007ef8ac in trx_purge_fetch_next_rec (roll_ptr=<value optimized out>, cell=0x2d56ae10, heap=0x2d56b2c0) at /s/bld/558orig/storage/innobase/trx/trx0purge.c:636
#6  0x00000000008b9bf1 in row_purge_step (thr=0x2d56ac10) at /s/bld/558orig/storage/innobase/row/row0purge.c:725
#7  0x00000000008af8c7 in que_run_threads (thr=<value optimized out>) at /s/bld/558orig/storage/innobase/que/que0que.c:1247
#8  0x00000000007ecfd7 in trx_purge (limit=1696160) at /s/bld/558orig/storage/innobase/trx/trx0purge.c:1184
#9  0x00000000007e16d0 in srv_master_thread (arg=<value optimized out>) at /s/bld/558orig/storage/innobase/srv/srv0srv.c:2586
#10 0x0000003f7dc062f7 in start_thread () from /lib64/libpthread.so.0
#11 0x0000003f7d4d1e3d in clone () from /lib64/libc.so.6

#0  0x0000003f7dc0a496 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000089ad99 in os_event_wait_low (event=0x2d549c90, reset_sig_count=43) at /s/bld/558orig/storage/innobase/os/os0sync.c:207
#2  0x00000000007e8742 in sync_array_wait_event (arr=0x10feddb0, index=1) at /s/bld/558orig/storage/innobase/sync/sync0arr.c:457
#3  0x00000000007e966f in mutex_spin_wait (mutex=0x2d549b80, file_name=0xa33108 "/s/bld/558orig/storage/innobase/trx/trx0purge.c", line=792) at /s/bld/558orig/storage/innobase/sync/sync0sync.c:621
#4  0x00000000007edb09 in trx_purge_choose_next_log () at /s/bld/558orig/storage/innobase/include/sync0sync.ic:222
#5  0x00000000007ef572 in trx_purge_fetch_next_rec (roll_ptr=<value optimized out>, cell=<value optimized out>, heap=0x2d56b2c0) at /s/bld/558orig/storage/innobase/trx/trx0purge.c:899
#6  0x00000000008b9bf1 in row_purge_step (thr=0x2d56ac10) at /s/bld/558orig/storage/innobase/row/row0purge.c:725
#7  0x00000000008af8c7 in que_run_threads (thr=<value optimized out>) at /s/bld/558orig/storage/innobase/que/que0que.c:1247
#8  0x00000000007ecfd7 in trx_purge (limit=2447520) at /s/bld/558orig/storage/innobase/trx/trx0purge.c:1184
#9  0x00000000007e16d0 in srv_master_thread (arg=<value optimized out>) at /s/bld/558orig/storage/innobase/srv/srv0srv.c:2586
#10 0x0000003f7dc062f7 in start_thread () from /lib64/libpthread.so.0
#11 0x0000003f7d4d1e3d in clone () from /lib64/libc.so.6
[11 Jan 2011 17:11] Mark Callaghan
I added more instrumentation and print out the values used by log_checkpoint_margin when computing arguments for log_preflush_pool_modified_pages.

In this case, the async checkpoint limits are used ("0 sync"). Note that advance has a large value (almost 1M) and buf_flush_list ends up flushing almost 100,000 pages (new = oldest + advance: 32769015802 + 962638 = 32769978440, which is the lsn_limit reported by buf_flush_list below). But the huge flush done by buf_flush_list is a symptom of the problem. The cause is the earlier stalls in trx_purge from the overhead of 128 rollback segments. I will try innodb_purge_threads=1 next.

log_preflush 0 sync, 35792609753 lsn, 32769015802 oldest, 2822766225 age, 962638 advance, 32769978440 new

buf_flush_list 99085 return, 32769978440 lsn_limit, 0 skipped, 99085 total_page_count for 1 instances 0x413e2858 ptr
[11 Jan 2011 17:32] Dimitri KRAVTCHUK
Mark, 

you are simply hitting the issue I have described many times in the past (and which nobody believed was possible) - have a look at:
http://dimitrik.free.fr/db_STRESS_MySQL_InnoDB_dirty_pages_and_log_size_impact_May2009.htm...

and then the final fix:
http://dimitrik.free.fr/blog/archives/2009/08/mysql-performance-final-fix-for-ahead-flushi...

Again: you should always keep the purge thread enabled to keep InnoDB working "as designed" - otherwise, during heavy purge, you freeze the master thread and block it from doing anything else (and that is why I think there should be no option within InnoDB to disable the purge thread).

Rgds,
-Dimitri
[11 Jan 2011 17:37] Mark Callaghan
With innodb_purge_threads=1 I still get stalls where QPS drops to almost 0 for 60+ seconds. And huge flushes (more than 80,000 pages) still occur from buf_flush_list. Here are thread stacks when that occurs:

os_event_wait_low,buf_flush_wait_batch_end,srv_master_thread,start_thread,clone

os_event_wait_low,buf_flush_wait_batch_end,log_check_margins,que_run_threads,trx_purge,srv_purge_thread,start_thread,clone

os_event_wait_low,buf_flush_buffered_writes,buf_flush_try_neighbors,buf_flush_page_and_try_neighbors,buf_flush_batch,buf_flush_list,log_check_margins,row_upd_step,row_update_for_mysql,ha_innobase::update_row,handler::ha_update_row,mysql_update,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone

And the full stack for the flushing thread
#0  0x0000003f7dc0a496 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000089ad99 in os_event_wait_low (event=0x422eb70, reset_sig_count=6078) at /s/bld/558orig/storage/innobase/os/os0sync.c:207
#2  0x000000000082f29d in buf_flush_buffered_writes () at /s/bld/558orig/storage/innobase/buf/buf0flu.c:655
#3  0x00000000008310b8 in buf_flush_try_neighbors (space=6, offset=27020, flush_type=BUF_FLUSH_LIST, n_flushed=98933, n_to_flush=18446744073709551614) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:952
#4  0x0000000000831832 in buf_flush_page_and_try_neighbors (bpage=<value optimized out>, flush_type=BUF_FLUSH_LIST, n_to_flush=18446744073709551614, count=0x4a322788) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1499
#5  0x0000000000831e48 in buf_flush_batch (buf_pool=0x4262708, flush_type=BUF_FLUSH_LIST, min_n=18446744073709551614, lsn_limit=45321000437) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1622
#6  0x0000000000832733 in buf_flush_list (min_n=18446744073709551614, lsn_limit=45321000437) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1944
#7  0x000000000088aaff in log_check_margins () at /s/bld/558orig/storage/innobase/log/log0log.c:1658
#8  0x00000000007dc93e in row_upd_step (thr=0x2aaf5815fce8) at /s/bld/558orig/storage/innobase/include/log0log.ic:443
#9  0x00000000007c86d7 in row_update_for_mysql (mysql_rec=<value optimized out>, prebuilt=0x2aaf580da6f8) at /s/bld/558orig/storage/innobase/row/row0mysql.c:1453
#10 0x00000000007b622e in ha_innobase::update_row (this=0x2aaf6000d4c0, old_row=0x2aaf6000d780 "<FF>2\233\037", new_row=0x2aaf6000d6c0 "<FF>2\233\037") at /s/bld/558orig/storage/innobase/handler/ha_innodb.cc:5262
#11 0x0000000000682f77 in handler::ha_update_row (this=0x2aaf6000d4c0, old_data=0x2aaf6000d780 "<FF>2\233\037", new_data=0x2aaf6000d6c0 "<FF>2\233\037") at /s/bld/558orig/sql/handler.cc:4782
#12 0x00000000005e2ef7 in mysql_update (thd=0x1f3b9400, table_list=0x2aaf60007868, fields=@0x1f3bb518, values=@0x1f3bb8f0, conds=0x2aaf60008228, order_num=<value optimized out>, order=) at /s/bld/558orig/sql/sql_update.cc:707
#13 0x000000000057390a in mysql_execute_command (thd=0x1f3b9400) at /s/bld/558orig/sql/sql_parse.cc:2669
#14 0x000000000057835b in mysql_parse (thd=0x1f3b9400, rawbuf=<value optimized out>, length=<value optimized out>, parser_state=0x4a324de0) at /s/bld/558orig/sql/sql_parse.cc:5496
#15 0x00000000005794ea in dispatch_command (command=COM_QUERY, thd=0x1f3b9400, packet=<value optimized out>, packet_length=159) at /s/bld/558orig/sql/sql_parse.cc:1032
#16 0x0000000000579854 in do_command (thd=0x1f3b9400) at /s/bld/558orig/sql/sql_parse.cc:772
#17 0x000000000060e294 in do_handle_one_connection (thd_arg=0x1f3b9400) at /s/bld/558orig/sql/sql_connect.cc:745
#18 0x000000000060e384 in handle_one_connection (arg=<value optimized out>) at /s/bld/558orig/sql/sql_connect.cc:684

Also, loop counts from srv_master_thread don't increase as they should. These are taken at 5 second intervals:

srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush
srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush
srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush
srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush
srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush
srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush
srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush
srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush
srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush
srv_master_thread loops: 606 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush

In one case, it got stuck behind a huge flush. Thread stacks in the next post.
[11 Jan 2011 17:44] Mark Callaghan
Per my last update, innodb_purge_threads=1 still has the problem. I still get stalls (60+ seconds with near zero QPS). I still get huge flushes even with it enabled. That is the current mystery.

Thread stacks in one case where srv_master_thread is stuck waiting for a long (100,000) page flush to end:

the flusher:

#0  0x0000003f7dc0a496 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000089ad99 in os_event_wait_low (event=0x422eb70, reset_sig_count=14464) at /s/bld/558orig/storage/innobase/os/os0sync.c:207
#2  0x000000000082f29d in buf_flush_buffered_writes () at /s/bld/558orig/storage/innobase/buf/buf0flu.c:655
#3  0x00000000008310b8 in buf_flush_try_neighbors (space=2, offset=46947, flush_type=BUF_FLUSH_LIST, n_flushed=185174, n_to_flush=18446744073709551614) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:952
#4  0x0000000000831832 in buf_flush_page_and_try_neighbors (bpage=<value optimized out>, flush_type=BUF_FLUSH_LIST, n_to_flush=18446744073709551614, count=0x4a2a0788) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1499
#5  0x0000000000831e48 in buf_flush_batch (buf_pool=0x4262708, flush_type=BUF_FLUSH_LIST, min_n=18446744073709551614, lsn_limit=51175735319) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1622
#6  0x0000000000832733 in buf_flush_list (min_n=18446744073709551614, lsn_limit=51175735319) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1944
#7  0x000000000088aaff in log_check_margins () at /s/bld/558orig/storage/innobase/log/log0log.c:1658
#8  0x00000000007dc93e in row_upd_step (thr=0x1f4cfad8) at /s/bld/558orig/storage/innobase/include/log0log.ic:443
#9  0x00000000007c86d7 in row_update_for_mysql (mysql_rec=<value optimized out>, prebuilt=0x1f491de8) at /s/bld/558orig/storage/innobase/row/row0mysql.c:1453
#10 0x00000000007b622e in ha_innobase::update_row (this=0x1f499090, old_row=0x1f499350 "<FF>\032<E8>\034", new_row=0x1f499290 "<FF>\032<E8>\034") at /s/bld/558orig/storage/innobase/handler/ha_innodb.cc:5262
#11 0x0000000000682f77 in handler::ha_update_row (this=0x1f499090, old_data=0x1f499350 "<FF>\032<E8>\034", new_data=0x1f499290 "<FF>\032<E8>\034") at /s/bld/558orig/sql/handler.cc:4782
#12 0x00000000005e2ef7 in mysql_update (thd=0x2aaf581137d0, table_list=0x1f3bc988, fields=@0x2aaf581158e8, values=@0x2aaf58115cc0, conds=0x1f3bd348, order_num=<value optimized out>, order=) at /s/bld/558orig/sql/sql_update.cc:707
#13 0x000000000057390a in mysql_execute_command (thd=0x2aaf581137d0) at /s/bld/558orig/sql/sql_parse.cc:2669
#14 0x000000000057835b in mysql_parse (thd=0x2aaf581137d0, rawbuf=<value optimized out>, length=<value optimized out>, parser_state=0x4a2a2de0) at /s/bld/558orig/sql/sql_parse.cc:5496
#15 0x00000000005794ea in dispatch_command (command=COM_QUERY, thd=0x2aaf581137d0, packet=<value optimized out>, packet_length=159) at /s/bld/558orig/sql/sql_parse.cc:1032
#16 0x0000000000579854 in do_command (thd=0x2aaf581137d0) at /s/bld/558orig/sql/sql_parse.cc:772
#17 0x000000000060e294 in do_handle_one_connection (thd_arg=0x2aaf581137d0) at /s/bld/558orig/sql/sql_connect.cc:745
#18 0x000000000060e384 in handle_one_connection (arg=<value optimized out>) at /s/bld/558orig/sql/sql_connect.cc:684

srv_master_thread

#0  0x0000003f7dc0a496 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000089ad99 in os_event_wait_low (event=0x1f262ea0, reset_sig_count=2238) at /s/bld/558orig/storage/innobase/os/os0sync.c:207
#2  0x000000000082e5dc in buf_flush_wait_batch_end (buf_pool=<value optimized out>, type=BUF_FLUSH_LIST) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1845
#3  0x00000000007e1a6a in srv_master_thread (arg=<value optimized out>) at /s/bld/558orig/storage/innobase/srv/srv0srv.c:2956

#0  0x0000003f7dc0a496 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000089ad99 in os_event_wait_low (event=0x1f262ea0, reset_sig_count=2238) at /s/bld/558orig/storage/innobase/os/os0sync.c:207
#2  0x000000000082e5dc in buf_flush_wait_batch_end (buf_pool=<value optimized out>, type=BUF_FLUSH_LIST) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1845
#3  0x000000000088adfc in log_check_margins () at /s/bld/558orig/storage/innobase/log/log0log.c:1661
#4  0x00000000008af5d5 in que_run_threads (thr=<value optimized out>) at /s/bld/558orig/storage/innobase/include/log0log.ic:443
#5  0x00000000007ecfd7 in trx_purge (limit=18623221) at /s/bld/558orig/storage/innobase/trx/trx0purge.c:1184
#6  0x00000000007e445c in srv_purge_thread (arg=<value optimized out>) at /s/bld/558orig/storage/innobase/srv/srv0srv.c:3125
[11 Jan 2011 18:24] Inaam Rana
Mark,

I still think adaptive_flushing is the problem. The following also points to that:

srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush
srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush
srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush
srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush
srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush
srv_master_thread loops: 605 1_second, 584 sleeps, 60 10_second, 5 background, 5 flush

So out of 605 1_second loops we end up sleeping in 584 iterations. That is when most of the work piles up. How many sleeps do you see in 5.1? If you can resurrect the 'skip_sleep' code of 5.1, my guess is that even with a lower io_capacity of 1000 or 2000 you will see the problem mitigated or delayed (yes, I suspect that even in 5.1 we would probably hit this issue eventually, but it is delayed because of the more aggressive flushing and the overall higher contention in 5.1).

If you try the above please make sure that we are running a separate purge thread.
[11 Jan 2011 19:39] Mark Callaghan
I restored the old behavior where skip_sleep is set to true when dirty page flushing is done. I also made it skip sleep when the suggested adaptive flush rate was 10% or more of innodb_io_capacity. The stalls are not as bad with this change. But they are still bad (20 seconds of zero QPS rather than 60+ seconds).
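For reference, a minimal self-contained sketch of the kind of change described above (my reading of the description, not the actual patch; the names skip_sleep, buf_flush_get_desired_rate() and innodb_io_capacity come from this report, everything else is made up for illustration):

#include <stdio.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the InnoDB names used in this report. */
static unsigned long srv_io_capacity = 2000;          /* innodb_io_capacity */
static unsigned long n_pages_flushed = 64;            /* flushed this pass  */

static unsigned long buf_flush_get_desired_rate(void) /* adaptive flushing  */
{
        return 2100;
}

static void sleep_one_second(void)
{
        puts("sleeping 1 second");
}

/* One iteration of the master thread's 1-second loop with the restored
   skip_sleep behaviour plus the 10%-of-io_capacity condition. */
static void one_second_iteration(void)
{
        bool skip_sleep = false;

        if (n_pages_flushed > 0) {
                /* 5.1-style: flushing was done, assume we are behind. */
                skip_sleep = true;
        }

        if (buf_flush_get_desired_rate() >= srv_io_capacity / 10) {
                /* The extra hack used here: adaptive flushing is asking
                   for at least 10% of innodb_io_capacity, so keep working. */
                skip_sleep = true;
        }

        if (!skip_sleep) {
                sleep_one_second();
        }
}

int main(void)
{
        one_second_iteration();
        return 0;
}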

Here is an example of output scraped from SHOW ENGINE INNODB STATUS. The age column is the checkpoint age divided by 1M. The data is scraped every 5 seconds and immediately prior to the stall (where 0 rows updated/second) there are two rounds where srv_master_thread sleeps 5 times in the one second loop. This test will never exceed the max dirty pages pct as that is set to 80%, the buffer pool has ~1M pages and the data fits in less than half of that.

So it appears that in the middle of the test, when the server is very busy, buf_flush_get_desired_rate() returns a value that is less than 10% of innodb_io_capacity.

History list length 277068
srv_master_thread loops: 910 1_second, 178 sleeps, 90 10_second, 194 background, 194 flush
0.00 inserts/s, 37861.23 updates/s, 0.00 deletes/s, 75705.46 reads/s
Database pages     489256
Modified db pages  426194
Age 2420

History list length 312354
srv_master_thread loops: 916 1_second, 178 sleeps, 91 10_second, 194 background, 194 flush
0.00 inserts/s, 38147.77 updates/s, 0.00 deletes/s, 76261.35 reads/s
Database pages     490254
Modified db pages  428179
Age 2518

History list length 348031
srv_master_thread loops: 920 1_second, 179 sleeps, 91 10_second, 194 background, 194 flush
0.00 inserts/s, 31539.91 updates/s, 0.00 deletes/s, 63060.82 reads/s
Database pages     491304
Modified db pages  430507
Age 2615

History list length 381109
srv_master_thread loops: 936 1_second, 180 sleeps, 93 10_second, 194 background, 194 flush
0.00 inserts/s, 36643.27 updates/s, 0.00 deletes/s, 73267.75 reads/s
Database pages     492264
Modified db pages  429442
Age 2709

History list length 359585
srv_master_thread loops: 986 1_second, 185 sleeps, 98 10_second, 194 background, 194 flush
0.00 inserts/s, 35649.67 updates/s, 0.00 deletes/s, 71268.75 reads/s
Database pages     492328
Modified db pages  429126
Age 2802

History list length 367433
srv_master_thread loops: 1036 1_second, 190 sleeps, 103 10_second, 194 background, 194 flush
0.00 inserts/s, 31083.78 updates/s, 0.00 deletes/s, 62150.97 reads/s
Database pages     492584
Modified db pages  424842
Age 2883

History list length 367433
srv_master_thread loops: 1046 1_second, 192 sleeps, 104 10_second, 195 background, 195 flush
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
Database pages     492584
Modified db pages  391818
Age 2883

History list length 367433
srv_master_thread loops: 1046 1_second, 192 sleeps, 104 10_second, 195 background, 195 flush
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
Database pages     492584
Modified db pages  355850
Age 2883

History list length 367433
srv_master_thread loops: 1046 1_second, 192 sleeps, 104 10_second, 195 background, 195 flush
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
Database pages     492584
Modified db pages  321851
Age 2883

History list length 367433
srv_master_thread loops: 1046 1_second, 192 sleeps, 104 10_second, 195 background, 195 flush
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
Database pages     492584
Modified db pages  284992
Age 2883
[11 Jan 2011 20:29] Mark Callaghan
More data from a stall with old style (skip_sleep) management of sleep in 5.5. The data below is scraped at 5 second intervals. The pattern is:
* initially the 1_second counter increases ~5 times per sample
* initially there are 0, 1 or 2 sleeps per sample

When the checkpoint age approaches the async checkpoint limit, then the 1_second counter is incremented many more times than once per second. This is because I changed the code to set skip_sleep=TRUE when buf_flush_get_desired_rate() > 10% of innodb_io_capacity, but buf_flush_list returns immediately because another thread is doing a flush.

At that time, buf_flush_get_desired_rate() was returning a value greater than 2000, and it was reduced to 2000 because innodb_io_capacity=2000. Eventually, there were ~7 consecutive calls where it returned a small value (~30) and the sleep was not skipped.

After that the main background loop stopped for 10+ seconds and was probably stuck waiting for a flush to end. Extra printfs show that one flush was for more than 130,000 pages.

History list length 134694
srv_master_thread loops: 221 1_second, 97 sleeps, 21 10_second, 2 background, 2 flush
0.00 inserts/s, 35941.81 updates/s, 0.00 deletes/s, 71858.83 reads/s
Database pages     448181
Modified db pages  419196
Age 2552
0.00 reads/s, 0 avg bytes/read, 32887.82 writes/s, 84.38 fsyncs/s
0.00 reads/s, 89.58 creates/s, 2669.07 writes/s

History list length 132789
srv_master_thread loops: 227 1_second, 98 sleeps, 22 10_second, 2 background, 2 flush
0.00 inserts/s, 34695.66 updates/s, 0.00 deletes/s, 69371.13 reads/s
Database pages     448245
Modified db pages  419222
Age 2642
0.00 reads/s, 0 avg bytes/read, 32019.40 writes/s, 74.19 fsyncs/s
0.00 reads/s, 12.80 creates/s, 2171.37 writes/s

History list length 185957
srv_master_thread loops: 242 1_second, 99 sleeps, 24 10_second, 2 background, 2 flush
0.00 inserts/s, 35411.92 updates/s, 0.00 deletes/s, 70791.64 reads/s
Database pages     450164
Modified db pages  420233
Age 2732
0.00 reads/s, 0 avg bytes/read, 32371.33 writes/s, 83.18 fsyncs/s
0.00 reads/s, 383.72 creates/s, 2521.70 writes/s

History list length 376732
srv_master_thread loops: 302 1_second, 105 sleeps, 30 10_second, 2 background, 2 flush
0.00 inserts/s, 31787.70 updates/s, 0.00 deletes/s, 63552.07 reads/s
Database pages     455384
Modified db pages  425393
Age 2822
0.00 reads/s, 0 avg bytes/read, 28886.35 writes/s, 80.15 fsyncs/s
0.00 reads/s, 869.86 creates/s, 2132.98 writes/s

History list length 505579
srv_master_thread loops: 342 1_second, 110 sleeps, 34 10_second, 3 background, 3 flush
0.00 inserts/s, 25760.65 updates/s, 0.00 deletes/s, 51512.50 reads/s
Database pages     458901
Modified db pages  417026
Age 2883
0.00 reads/s, 0 avg bytes/read, 23482.30 writes/s, 134.17 fsyncs/s
0.00 reads/s, 703.26 creates/s, 4082.38 writes/s

History list length 505579
srv_master_thread loops: 342 1_second, 110 sleeps, 34 10_second, 3 background, 3 flush
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
Database pages     458901
Modified db pages  394235
Age 2883
0.00 reads/s, 0 avg bytes/read, 181.16 writes/s, 139.57 fsyncs/s
0.00 reads/s, 0.00 creates/s, 4557.29 writes/s

History list length 505579
srv_master_thread loops: 342 1_second, 110 sleeps, 34 10_second, 3 background, 3 flush
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
Database pages     458901
Modified db pages  370103
Age 2883
0.00 reads/s, 0 avg bytes/read, 191.36 writes/s, 146.77 fsyncs/s
0.00 reads/s, 0.00 creates/s, 4825.43 writes/s

History list length 505579
srv_master_thread loops: 342 1_second, 110 sleeps, 34 10_second, 3 background, 3 flush
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
Database pages     458901
Modified db pages  347842
Age 2883
0.00 reads/s, 0 avg bytes/read, 175.56 writes/s, 132.77 fsyncs/s
0.00 reads/s, 0.00 creates/s, 4451.31 writes/s
[11 Jan 2011 20:44] Inaam Rana
I can explain the numbers and build a hypothesis about what is happening in the master thread. But please don't yell at me for the convoluted logic of the master thread (we are working on it in 5.6).

So here is the story:

-- During the first fifteen seconds of the readings the master thread is trying hard to keep up with the page dirtying pace. Adaptive flushing is working; we can say that because there are hardly any sleeps for the master thread. We can also say that the master thread is not able to keep pace with the dirtying because the modified page count and the checkpoint age keep increasing.

History list length 277068
srv_master_thread loops: 910 1_second, 178 sleeps, 90 10_second, 194 background, 194
flush
0.00 inserts/s, 37861.23 updates/s, 0.00 deletes/s, 75705.46 reads/s
Database pages     489256
Modified db pages  426194
Age 2420

History list length 312354
srv_master_thread loops: 916 1_second, 178 sleeps, 91 10_second, 194 background, 194
flush
0.00 inserts/s, 38147.77 updates/s, 0.00 deletes/s, 76261.35 reads/s
Database pages     490254
Modified db pages  428179
Age 2518

History list length 348031
srv_master_thread loops: 920 1_second, 179 sleeps, 91 10_second, 194 background, 194
flush
0.00 inserts/s, 31539.91 updates/s, 0.00 deletes/s, 63060.82 reads/s
Database pages     491304
Modified db pages  430507
Age 2615

-- Somewhere in the next 5 second interval the rot sets in. A huge flush batch is kicked off from some user thread via log_preflush_pool_modified_pages(). We can deduce that from the fact that dirty pages are now decreasing and that during this five second period the master thread goes through the 1_second loop 16 times (more on this below).

History list length 381109
srv_master_thread loops: 936 1_second, 180 sleeps, 93 10_second, 194 background, 194
flush
0.00 inserts/s, 36643.27 updates/s, 0.00 deletes/s, 73267.75 reads/s
Database pages     492264
Modified db pages  429442
Age 2709

Note the difference above in 1_second loops: 936 - 920 = 16.

-- In the next two readings the master thread is fully choked by the huge flush batch that is happening in some user thread. Here is what is happening:
  * the master thread gets some value from adaptive_flushing
  * the master thread tries to start a batch by calling buf_flush_list()
  * because another batch is already running, the master thread returns with failure immediately
  * because adaptive_flushing has asked for a value > 10% of io_capacity, we skip the sleep
  * the master thread immediately starts the next iteration
  * after going through 10 iterations the master thread goes to the 10 second loop
  * after doing the 10 second work the master thread restarts the 1 second loop; the restart sets skip_sleep = FALSE, so the first iteration sleeps
  * If the above assumptions are true then in each 5 second interval the master thread should go through the 1 second loop 50 times, the 10 second loop 5 times, and sleep 5 times. The next two entries show exactly this (a small simulation of this accounting follows the two readings below).

History list length 359585
srv_master_thread loops: 986 1_second, 185 sleeps, 98 10_second, 194 background, 194
flush
0.00 inserts/s, 35649.67 updates/s, 0.00 deletes/s, 71268.75 reads/s
Database pages     492328
Modified db pages  429126
Age 2802

History list length 367433
srv_master_thread loops: 1036 1_second, 190 sleeps, 103 10_second, 194 background, 194
flush
0.00 inserts/s, 31083.78 updates/s, 0.00 deletes/s, 62150.97 reads/s
Database pages     492584
Modified db pages  424842
Age 2883
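A tiny self-contained simulation of this accounting (an assumption of the loop structure exactly as described in the list above, not the real srv_master_thread) reproduces the 50/5/5 deltas seen between the two readings above (986->1036 1_second, 98->103 10_second, 185->190 sleeps):

#include <stdio.h>

int main(void)
{
        int    n_1_second = 0, n_10_second = 0, n_sleeps = 0;
        double elapsed = 0.0;

        while (elapsed < 5.0) {                  /* one 5-second sample     */
                int skip_sleep = 0;              /* reset on loop restart   */
                int i;

                for (i = 0; i < 10; i++) {       /* the "1 second" loop     */
                        if (!skip_sleep) {
                                n_sleeps++;
                                elapsed += 1.0;  /* the only real time cost */
                        }
                        /* buf_flush_list() returns immediately because a
                           user thread owns the batch; adaptive flushing
                           still asks for > 10% of io_capacity, so the
                           sleep keeps getting skipped. */
                        skip_sleep = 1;
                        n_1_second++;
                }
                n_10_second++;                   /* the "10 second" work    */
        }

        /* Prints: 1_second=50 10_second=5 sleeps=5 */
        printf("1_second=%d 10_second=%d sleeps=%d\n",
               n_1_second, n_10_second, n_sleeps);
        return 0;
}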

-- Although in the above ten seconds things are coming to a screeching halt, there are still some threads (maybe the purge thread) working; this can be deduced from the changes in the history list. That residual activity keeps the master thread in the 1 and 10 second loops. But by now all threads are waiting for the huge flush batch to finish and there is no server activity, so the master thread assumes the server is 'idle', leaves the 10 second loop and goes to the background and flush loops. In the flush loop the master thread hits buf_flush_wait_batch_end() and blocks there. If this is true then there should be no activity reported (no updates or changes in history length), no increment to the 1 and 10 second loops, and an increment of 1 over the previous reading for the background and flush loops. This is exactly what we see in the next readings:

History list length 367433
srv_master_thread loops: 1046 1_second, 192 sleeps, 104 10_second, 195 background, 195
flush
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
Database pages     492584
Modified db pages  391818
Age 2883

History list length 367433
srv_master_thread loops: 1046 1_second, 192 sleeps, 104 10_second, 195 background, 195
flush
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
Database pages     492584
Modified db pages  355850
Age 2883

History list length 367433
srv_master_thread loops: 1046 1_second, 192 sleeps, 104 10_second, 195 background, 195
flush
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
Database pages     492584
Modified db pages  321851
Age 2883

So much for the explanation. I hope it is clear as mud. Now let's think about why we don't see this in 5.1 and what we can do to avoid it.
[11 Jan 2011 20:47] Sunny Bains
Mark Callaghan said:
"In 5.5, after a page and its neighbors have been flushed, there is an
optimization to not start over from the end of the flush list. In 5.1,
the search always goes back to the end of the flush list. While I think
this is a good change, it also skips the check to halt the search by
comparing oldest_modification with lsn_limit and callers can stay in
longer than they need to. I don't think that is good."

The claim about the oldest_modification check is not correct. If you
look at the code snippet I posted yesterday, the check is done in
exactly the same way. If we flush the neighbour pages then
buf_flush_page_and_try_neighbors() will return TRUE, and this will
force the thread to break out of the while loop. It will start
from the tail of the list again. The check for oldest modification in
both 5.1 and 5.5 is in the outer loop.

I need to reproduce this to trace it now. By adding artificial delays in the
purge code it should in theory reproduce the symptoms.
[11 Jan 2011 20:51] Mark Callaghan
I thought perhaps that the first checkpoint after restart was the worst. But that is not the case. Here is an interesting example. I am using the purge thread and hacked srv_master_thread to not sleep when work was done.

The async limit has been reached and advance is ~84M when log_preflush_pool_modified_pages is called (advance = age - async_limit = 2905956188 - 2821803587 = 84152601, and new = oldest + advance = 110346036697)

log_preflush 0 sync, 113358852695 lsn, 110261884096 oldest, 2905956188 age, 2913351011 checkpoint_age, 84152601 advance, 110346036697 new, 3023360986 sync_limit, 2821803587 async_limit

buf_flush_list then flushes ~140k pages

buf_flush_list 138757 return, 110346036697 lsn_limit, 0 skipped, 138757 total_page_count for 1 instances 0x4a7b0838 ptr

purge thread blocks waiting for the flush above to end

#1  0x000000000089aeb9 in os_event_wait_low (event=0x21f56ea0, reset_sig_count=893) at /s/bld/558orig/storage/innobase/os/os0sync.c:207
#2  0x000000000082e6bc in buf_flush_wait_batch_end (buf_pool=<value optimized out>, type=BUF_FLUSH_LIST) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1846
#3  0x000000000088af1c in log_check_margins () at /s/bld/558orig/storage/innobase/log/log0log.c:1661
#4  0x00000000008af6f5 in que_run_threads (thr=<value optimized out>) at /s/bld/558orig/storage/innobase/include/log0log.ic:443
#5  0x00000000007ed0b7 in trx_purge (limit=15164416) at /s/bld/558orig/storage/innobase/trx/trx0purge.c:1184
#6  0x00000000007e409e in srv_purge_thread (arg=<value optimized out>) at /s/bld/558orig/storage/innobase/srv/srv0srv.c:3140

srv_master_thread goes into the flush_loop code because it does not detect server activity

#1  0x000000000089aeb9 in os_event_wait_low (event=0x21f56ea0, reset_sig_count=893) at /s/bld/558orig/storage/innobase/os/os0sync.c:207
#2  0x000000000082e6bc in buf_flush_wait_batch_end (buf_pool=<value optimized out>, type=BUF_FLUSH_LIST) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1846
#3  0x00000000007e192a in srv_master_thread (arg=<value optimized out>) at /s/bld/558orig/storage/innobase/srv/srv0srv.c:2969

and a user connection has flushed more than 80,000 pages

#0  buf_calc_page_new_checksum (page=<value optimized out>) at /s/bld/558orig/storage/innobase/include/ut0rnd.ic:222
#1  0x000000000082eafb in buf_flush_init_for_writing (page=0x2aaceded4000 "\022\233,<EE>", page_zip_=<value optimized out>, newest_lsn=<value optimized out>) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1023
#2  0x000000000083155d in buf_flush_try_neighbors (space=3, offset=50450, flush_type=BUF_FLUSH_LIST, n_flushed=78703, n_to_flush=18446744073709551614) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1117
#3  0x0000000000831912 in buf_flush_page_and_try_neighbors (bpage=<value optimized out>, flush_type=BUF_FLUSH_LIST, n_to_flush=18446744073709551614, count=0x4a7b0768) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1499
#4  0x0000000000831f1b in buf_flush_batch (buf_pool=0x6f56708, flush_type=BUF_FLUSH_LIST, min_n=18446744073709551614, lsn_limit=110346036697) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1622
#5  0x0000000000832803 in buf_flush_list (min_n=18446744073709551614, lsn_limit=110346036697) at /s/bld/558orig/storage/innobase/buf/buf0flu.c:1945
#6  0x000000000088abe7 in log_check_margins () at /s/bld/558orig/storage/innobase/log/log0log.c:1658
#7  0x00000000007dc93e in row_upd_step (thr=0x221ba058) at /s/bld/558orig/storage/innobase/include/log0log.ic:443
#8  0x00000000007c86d7 in row_update_for_mysql (mysql_rec=<value optimized out>, prebuilt=0x221bebe8) at /s/bld/558orig/storage/innobase/row/row0mysql.c:1453
#9  0x00000000007b622e in ha_innobase::update_row (this=0x220c2980, old_row=0x220c2c40 "<FF><BC>x/", new_row=0x220c2b80 "<FF><BC>x/") at /s/bld/558orig/storage/innobase/handler/ha_innodb.cc:5262
#10 0x0000000000682f77 in handler::ha_update_row (this=0x220c2980, old_data=0x220c2c40 "<FF><BC>x/", new_data=0x220c2b80 "<FF><BC>x/") at /s/bld/558orig/sql/handler.cc:4782
#11 0x00000000005e2ef7 in mysql_update (thd=0x2aaf581346e0, table_list=0x2aaf581a2d38, fields=@0x2aaf581367f8, values=@0x2aaf58136bd0, conds=0x2aaf581a36f8, order_num=<value optimized out>, order=0x0, limit=) at /s/bld/558orig/sql/sq
l_update.cc:707
#12 0x000000000057390a in mysql_execute_command (thd=0x2aaf581346e0) at /s/bld/558orig/sql/sql_parse.cc:2669
#13 0x000000000057835b in mysql_parse (thd=0x2aaf581346e0, rawbuf=<value optimized out>, length=<value optimized out>, parser_state=0x4a7b2de0) at /s/bld/558orig/sql/sql_parse.cc:5496
#14 0x00000000005794ea in dispatch_command (command=COM_QUERY, thd=0x2aaf581346e0, packet=<value optimized out>, packet_length=159) at /s/bld/558orig/sql/sql_parse.cc:1032
#15 0x0000000000579854 in do_command (thd=0x2aaf581346e0) at /s/bld/558orig/sql/sql_parse.cc:772
#16 0x000000000060e294 in do_handle_one_connection (thd_arg=0x2aaf581346e0) at /s/bld/558orig/sql/sql_connect.cc:745
#17 0x000000000060e384 in handle_one_connection (arg=<value optimized out>) at /s/bld/558orig/sql/sql_connect.cc:684

History list length 2336785
srv_master_thread loops: 3887 1_second, 480 sleeps, 388 10_second, 7 background, 7 flush
0.00 inserts/s, 36703.46 updates/s, 0.00 deletes/s, 73375.52 reads/s
Database pages     527393
Modified db pages  472350
Age 2826

History list length 2358018
srv_master_thread loops: 3927 1_second, 485 sleeps, 392 10_second, 8 background, 8 flush
0.00 inserts/s, 18607.07 updates/s, 0.00 deletes/s, 37206.30 reads/s
Database pages     528033
Modified db pages  464622
Age 2883

History list length 2358018
srv_master_thread loops: 3927 1_second, 485 sleeps, 392 10_second, 8 background, 8 flush
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
Database pages     528033
Modified db pages  442542
Age 2883

History list length 2358018
srv_master_thread loops: 3927 1_second, 485 sleeps, 392 10_second, 8 background, 8 flush
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
Database pages     528033
Modified db pages  416803
Age 2883

History list length 2358018
srv_master_thread loops: 3927 1_second, 485 sleeps, 392 10_second, 8 background, 8 flush
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
Database pages     528033
Modified db pages  389678
Age 2883

History list length 2358018
srv_master_thread loops: 3927 1_second, 485 sleeps, 392 10_second, 8 background, 8 flush
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
Database pages     528033
Modified db pages  357742
Age 2883

History list length 2344766
srv_master_thread loops: 3958 1_second, 488 sleeps, 395 10_second, 8 background, 8 flush
0.00 inserts/s, 23056.39 updates/s, 0.00 deletes/s, 46082.78 reads/s
Database pages     528097
Modified db pages  378078
Age 2865
[11 Jan 2011 20:52] Inaam Rana
Mark,

I posted the last comment before looking at your latest note. I think we have both reached more or less the same conclusion. Are we on the same page about what is happening here?
[11 Jan 2011 21:00] Mark Callaghan
Yes, we are. And I even agree with Sunny about the 'start from flush list end' behavior not changing.

I have been using sysbench+lua with the lua script provided by your side. Do you have hardware to reproduce this and compare 5.5 with 5.1? While people on your side have claimed they could not reproduce this, all of the TPS numbers are far lower than what I get with 5.1 or 5.5, so I wonder what is different about your repro HW. I have 2 hosts (a 16-core client and an 8-core server; the server has HW RAID 10 + 8 SAS disks).
[11 Jan 2011 21:49] Inaam Rana
As even with the skip_sleep hack we are unable to catch up on flushing, we can safely assume that there is more to it than just the adaptive_flushing heuristic and the enforced sleep.

Some plausible ideas are:

1) In 5.5 we are doing more work because of less contention and therefore we hit the wall.
 -- Not very likely to have such a huge impact. But then reverting the rbs change removed the stalls, which points in that direction.
 -- Sometimes small scalability improvements can have a cumulative effect. For example, if we are working more then purge falls behind a little more, which means more dirty UNDO pages in the buffer pool. 
2) In 5.5 we are generating more redo compared to 5.1
 -- Possibly because of change buffering (unlikely as the whole dataset is in memory)
 -- Maybe multiple rbs somehow changed the amount of redo
 -- Any changes in purge code that can cause more redo?
3) We are dirtying more pages in 5.5
 -- Again, change buffering should not be it because of the memory-resident dataset
 -- UNDO log page stuff. Not sure about this, but we have seen a bug in 5.6 where more pages than required are allocated for UNDO, meaning more dirty pages to be flushed.
[11 Jan 2011 21:57] Inaam Rana
Missed another likely culprit:

4) buf_flush_list() is somehow less efficient in 5.5.
[11 Jan 2011 22:13] Mark Callaghan
Here is the prefix of the rate for Innodb_os_log_written for innodb_purge_threads=0, 64 concurrent threads and my sysbench (no lua). The rates for 5.1.47 and 5.5.8 are about the same.

from 5.5.8, note the stalls

36406272.0
36681779.2
34856243.2
35550668.8
36225689.6
36595200.0
38077798.4
38436300.8
38426316.8
27824896.0
26976307.2
33760972.8
33155584.0
21431552.0
26237388.8
29718169.6
30882764.8
56217.6
0.0
0.0
0.0
14287616.0
36970240.0
37363046.4
37629952.0
37632460.8
37352857.6
38316032.0
38275891.2
37844224.0
38198988.8
38642995.2
38260121.6
38405427.2
37520435.2
35800320.0
21494630.4
0.0
0.0
0.0

from 5.1.47, no stalls

33570508.8
33567232.0
33452697.6
32866048.0
33323776.0
32619827.2
34948403.2
33984614.4
34770124.8
34721945.6
35243980.8
35481702.4
35339673.6
36613529.6
35518617.6
35636736.0
34742425.6
34868275.2
35366400.0
36071014.4
35994982.4
35894169.6
35740979.2
36201062.4
35822950.4
36118067.2
36815052.8
35276185.6
35970201.6
35788236.8
37049702.4
36561766.4
36245708.8
35397222.4
35373414.4
35375206.4
35613440.0
35516364.8
36310425.6
36300646.4
35225804.8
36576614.4
36192204.8
36178790.4
33087795.2
35442380.8
37365862.4
36126003.2
35294259.2
35714150.4
35303936.0
[11 Jan 2011 23:18] Mark Callaghan
5.1.47 also experiences the huge flushes, but for whatever reason, they don't cause stalls as they do for 5.5.8

buf_flush_batch LIST has 1024 flushes for 1000 min and 18446744073709551615 lsn
buf_flush_batch LIST has 1024 flushes for 1000 min and 18446744073709551615 lsn
buf_flush_batch LIST has 128 flushes for 100 min and 18446744073709551615 lsn

log_preflush 0 sync, 1679265076702 lsn, 1676267915463 oldest, 2829357857 age, 7554270 advance, 1676275469733 new, 3023360986 max_sync, 2821803587 max_async
buf_flush_batch LIST has 43268 flushes for 18446744073709551614 min and 1676275469733 lsn
log_preflush flushed 43268 for lsn 1676275469733 and 0 sync

log_preflush 0 sync, 1679390981442 lsn, 1676275629391 oldest, 2989447675 age, 167644088 advance, 1676443273479 new, 3023360986 max_sync, 2821803587 max_async
log_preflush flushed 30908 for lsn 1676443273479 and 0 sync
buf_flush_batch LIST has 30908 flushes for 18446744073709551614 min and 1676443273479 lsn
[12 Jan 2011 0:03] Mark Callaghan
It seems like 5.5.8 takes longer to do the first call to log_preflush_pool_modified_pages with sync=0, so when it does make that call there is much more work to do, as the value of the "advance" variable in log_checkpoint_margin is much larger.

Is log_free_check being called less frequently in 5.5.8?
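
To make the mechanism concrete, here is a minimal sketch in C of the margin check being described (names and structure are illustrative, not the actual log0log.c code):

typedef unsigned long long lsn_t;

/* Illustrative thresholds; in this setup they are roughly 2.8GB (async)
   and 3GB (sync). */
struct margins {
        lsn_t   max_modified_age_async;
        lsn_t   max_modified_age_sync;
};

/* Returns how far the oldest dirty-page LSN needs to advance (0 = no
   preflush needed).  *sync is set when callers must wait for the flush. */
static lsn_t
checkpoint_margin_sketch(
        lsn_t                   lsn,
        lsn_t                   oldest_modification,
        const struct margins*   m,
        int*                    sync)
{
        lsn_t   age = lsn - oldest_modification;        /* checkpoint age */

        *sync = (age >= m->max_modified_age_sync);

        if (age <= m->max_modified_age_async) {

                return(0);
        }

        /* The later this path is first reached, the bigger the backlog:
           this is the "advance" value compared below (~3M for 5.1.47
           versus ~24M for 5.5.8). */
        return(age - m->max_modified_age_async);
}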

5.1.47 with a huge flush -- 4.5 seconds

buf_flush_batch LIST has 24840 flushes for 18446744073709551614 min and 1694941425550 lsn, 4.58 seconds
log_preflush flushed 24840 for lsn 1694941425550 and 0 sync
log_preflush 0 sync, 1697840633932 lsn, 1694937772605 oldest, 2825456532 age, 3652945 advance, 
1694941425550 new, 3023360986 max_sync, 2821803587 max_async

The large flush for 5.5.8 was much larger (160k+ pages) and took much longer (34.31 seconds).
The value of "advance" was also much larger (24M versus 3M), thus more pages were flushed.

buf_flush_flush_list_batch done with 162598 count, 2548 times in 34.31 seconds
buf_flush_list 162598 return, 163852193457 lsn_limit, 0 skipped, 162598 total_page_count for 1 instances 0x4aa64838 ptr
log_preflush 0 sync, 166850905821 lsn, 163827046822 oldest, 2846950222 age, 2847512440 checkpoint_age, 25146635 advance, 163852193457 new, 3023360986 sync_limit, 2821803587 async_limit

I also think that the values for max_modified_age_async and max_modified_age_sync should be much larger given that I am using 3.8GB total for the innodb transaction logs. The current values are ~2.8GB for async and ~3GB for sync.
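
As a rough cross-check (taking the 3.8GB of configured log space at face value): ~2.8GB is about 74% of it and ~3GB is about 79%, so the async and sync limits sit only about 5% of the log apart, which helps explain why the sync limit is hit so soon after the async one.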

The current behavior is 'stop the world' when the async limit is reached. I don't think it should be when called by user connections. In that case it might be better to have them do at most a fixed amount of IO when the async limit is exceeded. If there is a large flush to be done, it should be done by background threads. I added srv_background_checkpoint to the Facebook patch for that.

Finally, this has a huge cliff in performance. As soon as the async limit is reached, the next caller is forced to do way too much IO.
[12 Jan 2011 1:43] Mark Callaghan
The other thing I noticed is that 5.5.8 takes longer to flush a similar number of pages from buf_flush_batch and ends up calling log_preflush_pool_modified_pages earlier in the test (probably because it takes too long to flush pages).

This is work done from the 1 second loop in srv_master_thread and 5.5.8 takes ~0.75 seconds per call to flush 2000 pages versus ~0.5 seconds per call for 5.1.47. From imprecise samples, after ~250 seconds the checkpoint age for 5.5.8 is ~2B versus ~1.7B for 5.1.47
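
In other words, roughly 2000 / 0.5 = 4000 pages/s flushed per call for 5.1.47 versus about 2000 / 0.75 = 2667 pages/s for 5.5.8 in these samples, i.e. the 5.5.8 batches run about a third slower.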

From 5.1.47, the last buf_flush_batch times before the first call to log_preflush_pool_modified_pages. It takes ~0.5 seconds per 2000 pages.

buf_flush_batch LIST has 2016 flushes for 2000 min and 18446744073709551615 lsn, 0.51 seconds
buf_flush_batch LIST has 2048 flushes for 2000 min and 18446744073709551615 lsn, 0.46 seconds
buf_flush_batch LIST has 2048 flushes for 2000 min and 18446744073709551615 lsn, 0.53 seconds
buf_flush_batch LIST has 2048 flushes for 2000 min and 18446744073709551615 lsn, 0.39 seconds
buf_flush_batch LIST has 2047 flushes for 2000 min and 18446744073709551615 lsn, 0.61 seconds
buf_flush_batch LIST has 2047 flushes for 2000 min and 18446744073709551615 lsn, 0.64 seconds
buf_flush_batch LIST has 2048 flushes for 2000 min and 18446744073709551615 lsn, 0.50 seconds
buf_flush_batch LIST has 2048 flushes for 2000 min and 18446744073709551615 lsn, 0.45 seconds
buf_flush_batch LIST has 2048 flushes for 2000 min and 18446744073709551615 lsn, 0.48 seconds
buf_flush_batch LIST has 2048 flushes for 2000 min and 18446744073709551615 lsn, 0.66 seconds
buf_flush_batch LIST has 256 flushes for 200 min and 18446744073709551615 lsn, 0.05 seconds
buf_flush_batch LIST has 2048 flushes for 2000 min and 18446744073709551615 lsn, 0.61 seconds
buf_flush_batch LIST has 2048 flushes for 2000 min and 18446744073709551615 lsn, 0.69 seconds
buf_flush_batch LIST has 2048 flushes for 2000 min and 18446744073709551615 lsn, 0.47 seconds
buf_flush_batch LIST has 2048 flushes for 2000 min and 18446744073709551615 lsn, 0.65 seconds

from 5.5.8, it takes 0.7 to 0.8 seconds per 2000 pages

buf_flush_flush_list_batch done with 2001 count, 32 times in 0.88 seconds
buf_flush_flush_list_batch done with 2000 count, 32 times in 0.86 seconds
buf_flush_flush_list_batch done with 2000 count, 32 times in 0.77 seconds
buf_flush_flush_list_batch done with 2001 count, 32 times in 0.74 seconds
buf_flush_flush_list_batch done with 2000 count, 32 times in 0.79 seconds
buf_flush_flush_list_batch done with 2001 count, 32 times in 0.62 seconds
buf_flush_flush_list_batch done with 2001 count, 32 times in 0.62 seconds
buf_flush_flush_list_batch done with 2001 count, 32 times in 0.68 seconds
buf_flush_flush_list_batch done with 2001 count, 32 times in 0.72 seconds
buf_flush_flush_list_batch done with 2000 count, 32 times in 0.74 seconds
[12 Jan 2011 4:31] Sunny Bains
I can reproduce the symptoms with the following settings:

  REDO log size 32M
  Buffer pool size 16G
  Sysbench OLTP RW 128 threads

I can see that when all the user threads simultaneously discover that a
flush is required they stall and the master thread goes into the background
loop. I have a separate purge thread and that thread ends up doing more purge
than the user threads because it calls log_free_check() more frequently than
the user threads. This I think is something we just have to live with.

What we can and should fix is that the wait for flush should wait for free space
and not for the flush to end. Waiting for the flush to end is too coarse. Also, I think 5.5
does more flushing than 5.1 (or did); Inaam can comment on
that.

One thing that the master thread can do is to check activity count and also the
number of threads active within InnoDB. A quick hack is to add the check
for srv_conc_n_threads == 0 before suspending or going to the flush loop.
Currently this will only work for innodb_thread_concurrency > 0.
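
To make that concrete, a minimal sketch of the proposed check (illustrative only, not a tested patch; as noted, srv_conc_n_threads is only maintained when innodb_thread_concurrency > 0):

/* Sketch of the "quick hack": the master thread only suspends or enters
   the background/flush loop when nothing has happened since the last check
   and no user threads are active inside InnoDB. */
static int
master_thread_may_go_background(
        long            srv_conc_n_threads,
        unsigned long   activity_count_now,
        unsigned long   activity_count_before)
{
        return(activity_count_now == activity_count_before
               && srv_conc_n_threads == 0);
}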
[12 Jan 2011 4:37] Sunny Bains
Should have been:

I have a separate purge thread and that thread ends up doing more flushing
than the user threads because it calls log_free_check() more frequently than
the user threads.
[12 Jan 2011 12:44] Alexey Stroganov
Mark, 

See below for my findings/observations:

I've run the sysbench_lua scenario with 64 threads for 20 minutes, with a 5-minute warmup, using your config files/parameters:

scenario:
- start the server with a fresh datadir with 8 tables (4M rows in each table)
- run POINT_SELECT_ALL_COLUMNS for 5 minutes to warm up 
- run POINT_SELECT/UPDATE_NO_KEY for 20 minutes
- after the test, wait while the history list length decreases

I ran tests on a box with RAID0 and 24 cores, and the result pattern for 5.1 was not really different from 5.5. Then I bound mysqld to 8 cores and sysbench to 16 cores and got exactly the results that you have observed.

See below for tps, sizes of ibdata1 and history list length 

# Server 1 - 5.1.47                        
# Server 2 - 5.5.8-purge0                               
# Server 3 - 5.5.8-purge1                               
#
#             Server 1  Server 2  Server 3
# Thread        INNODB    INNODB    INNODB
         64   10160.00  37151.68  34525.43

# Server 1 - 5.1.47-taskset:mysqld8.sysbench16                           
# Server 2 - 5.5.8-taskset:mysqld8.sysbench16.purge0                                
# Server 3 - 5.5.8-taskset:mysqld8.sysbench16.purge1                               
#
#             Server 1  Server 2  Server 3  
# Thread        INNODB    INNODB    INNODB  
         64   26154.53  30784.68  36419.39  

5.1.47  5.5.8   5.5.8
        purge0  purge1
 2.2GB     9GB   8.1GB      no taskset - 24 Cores
0.97GB   6.6GB   7.1GB      taskset: mysqld:8 sysbench:16

History list length (30sec interval)
-------------------------------------------------------
task set mysqld:8cores     |    no task set 24 Cores
         sysbench:16 cores |
--------------------------------------------------------
5.1.47   5.5.8     5.5.8   | 5.1.47   5.5.8     5.5.8
         purge0    purge1  |          purge0    purge1
1842     12970     995     | 448      5894      1957
123087   506472    386988  |  128810   634217    402325
165539   935820    794198  |  250617   1111405   948795
184994   1341715   1200092 |  370444   1607104   1459331
133886   1792488   1558891 |  495441   2204015   2014718
135782   2260436   1941357 |  618949   2804505   2584142
110466   2737207   2301905 |  740097   3393790   3161552
111057   3216861   2676161 |  862050   4000312   3760839
130766   3697016   3051781 |  982639   4609270   4354120
130553   4150425   3441934 |  1104660  5177329   4936754
156050   4346137   3831758 |  1226815  5381039   5255374
163308   4348558   4196715 |  1350083  5381039   5255374
191986   4669859   4588007 |  1472785  5768203   5521208
209534   5075578   4891004 |  1597586  6316360   6090445
228700   5435242   4981820 |  1721242  6906792   6427128
267935   5878905   4981820 |  1844370  7505044   6771906
288848   6384443   5220351 |  1965695  8111471   7235152
335026   6884865   5606313 |  2087810  8724499   7581476
325376   7392254   5948588 |  2206316  9338437   7924890
417389   7876254   6337510 |  2325080  9940157   8264284
612274   8304442   6774621 |  2444471  10487672  8692110
806752   8661950   7276047 |  2563422  10794299  9270781
996647   8661950   7794486 |  2683284  10797333  9827396
931796   8855504   8305764 |  2808315  11192350  10330990
801311   9343227   8834988 |  2915054  11792906  10380368
790607   9820083   9307020 |  3029193  12398997  10526311
823107   10296457  9681316 |  3148762  13006574  11105807
698139   10793111  9681316 |  3269419  13610343  11692256
366234   11284027  9804518 |  3387346  14221039  12273521
104141   11784270  10336998|  3505152  14824285  12857632
10134    12284047  10855595|  3625502  15402246  13437536
2918     12761845  11376463|  3745770  15930156  14012999
7341     12943961  11918201|  3862317  16194402  14591772
413413   12957935  12440025|  3983269  16228078  15115281
971979   13424751  12972485|  4105675  16832049  15538626
1497021  13911070  13498213|  4226988  17442612  15538626
2017967  14406625  13917190|  4349110  18050606  15994001
1951350  14887142  14159856|  4475181  18649876  16557206
1638291  15346455  14159856|  4601556  19260133  17126295
1338088  15839828  14484727|  4727873  19873176  17683785
[12 Jan 2011 12:46] Alexey Stroganov
chart for purge lag observed on 24 core box with and without taskset

Attachment: img_13685_3.png (image/png, text), 4.76 KiB.

[12 Jan 2011 13:55] Alexey Stroganov
In my tests I've also observed stalls for 5.5.8:

with taskset and purge0
[ 375s] threads: 64, tps: 0.00, read_ops: 14623.13, write_ops: 14622.93                                                       
[ 390s] threads: 64, tps: 0.00, read_ops: 11753.74, write_ops: 11753.34                                                       
[ 405s] threads: 64, tps: 0.00, read_ops: 11851.14, write_ops: 11851.54                                                       
[ 420s] threads: 64, tps: 0.00, read_ops: 1066.73, write_ops: 1065.87                                                         
[ 435s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00                                                               
[ 450s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00                                                               
[ 465s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00                                                               
[ 480s] threads: 64, tps: 0.00, read_ops: 12105.95, write_ops: 12106.75                                                       
[ 495s] threads: 64, tps: 0.00, read_ops: 14639.41, write_ops: 14639.01           
...
[ 780s] threads: 64, tps: 0.00, read_ops: 20566.65, write_ops: 20567.05                                                       
[ 795s] threads: 64, tps: 0.00, read_ops: 10874.67, write_ops: 10872.40                                                       
[ 810s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00                                                               
[ 825s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00                                                               
[ 840s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00                                                               
[ 855s] threads: 64, tps: 0.00, read_ops: 12976.46, write_ops: 12977.66                                                       
[ 870s] threads: 64, tps: 0.00, read_ops: 22077.18, write_ops: 22077.38                  

with taskset and purge1
[ 240s] threads: 64, tps: 0.00, read_ops: 20639.13, write_ops: 20639.66                                                       
[ 255s] threads: 64, tps: 0.00, read_ops: 17347.44, write_ops: 17346.91                                                       
[ 270s] threads: 64, tps: 0.00, read_ops: 18699.77, write_ops: 18701.57                                                       
[ 285s] threads: 64, tps: 0.00, read_ops: 19427.00, write_ops: 19426.20                                                       
[ 300s] threads: 64, tps: 0.00, read_ops: 9540.32, write_ops: 9539.32                                                         
[ 315s] threads: 64, tps: 0.00, read_ops: 176.07, write_ops: 176.07                                                           
[ 330s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00                                                               
[ 345s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00                                                               
[ 360s] threads: 64, tps: 0.00, read_ops: 17429.05, write_ops: 17429.05                                                       
[ 375s] threads: 64, tps: 0.00, read_ops: 20981.24, write_ops: 20983.04                       
...
[ 630s] threads: 64, tps: 0.00, read_ops: 18966.92, write_ops: 18968.06                                                       
[ 645s] threads: 64, tps: 0.00, read_ops: 13793.60, write_ops: 13791.60                                                       
[ 660s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00                                                               
[ 675s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00                                                               
[ 690s] threads: 64, tps: 0.00, read_ops: 1477.07, write_ops: 1477.07                                                         
[ 705s] threads: 64, tps: 0.00, read_ops: 16986.43, write_ops: 16986.76                                                       
[ 720s] threads: 64, tps: 0.00, read_ops: 21377.22, write_ops: 21378.22                   
...
[ 960s] threads: 64, tps: 0.00, read_ops: 21117.10, write_ops: 21117.76                                                       
[ 975s] threads: 64, tps: 0.00, read_ops: 3871.53, write_ops: 3870.86                                                         
[ 990s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00                                                               
[1005s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00                                                               
[1020s] threads: 64, tps: 0.00, read_ops: 14073.07, write_ops: 14074.07                                                       
[1035s] threads: 64, tps: 0.00, read_ops: 21732.61, write_ops: 21733.61
[12 Jan 2011 13:56] Alexey Stroganov
without taskset and purge0
[ 285s] threads: 64, tps: 0.00, read_ops: 20316.78, write_ops: 20316.85                                                       
[ 300s] threads: 64, tps: 0.00, read_ops: 1281.13, write_ops: 1279.53                                                         
[ 315s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00                                                               
[ 330s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00                                                               
[ 345s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00                                                               
[ 360s] threads: 64, tps: 0.00, read_ops: 21001.61, write_ops: 21003.01                 
...
[ 615s] threads: 64, tps: 0.00, read_ops: 20174.48, write_ops: 20175.81                                                       
[ 630s] threads: 64, tps: 0.00, read_ops: 16316.18, write_ops: 16314.44                                                       
[ 645s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00                                                               
[ 660s] threads: 64, tps: 0.00, read_ops: 216.93, write_ops: 216.93                                                           
[ 675s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00                                                               
[ 690s] threads: 64, tps: 0.00, read_ops: 16075.96, write_ops: 16077.49                                                       
[ 705s] threads: 64, tps: 0.00, read_ops: 21081.37, write_ops: 21081.24                  
...
[ 945s] threads: 64, tps: 0.00, read_ops: 20676.27, write_ops: 20677.07                                                       
[ 960s] threads: 64, tps: 0.00, read_ops: 17416.77, write_ops: 17414.83                                                       
[ 975s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00                                                               
[ 990s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00                                                               
[1005s] threads: 64, tps: 0.00, read_ops: 5125.33, write_ops: 5125.33                                                         
[1020s] threads: 64, tps: 0.00, read_ops: 21622.08, write_ops: 21622.28                

without taskset and purge1 
[ 285s] threads: 64, tps: 0.00, read_ops: 18991.20, write_ops: 18991.87                                                       
[ 300s] threads: 64, tps: 0.00, read_ops: 12301.98, write_ops: 12300.65                                                       
[ 315s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00                                                               
[ 330s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00                                                               
[ 345s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00                                                               
[ 360s] threads: 64, tps: 0.00, read_ops: 13395.09, write_ops: 13396.56                                                       
[ 375s] threads: 64, tps: 0.00, read_ops: 21595.69, write_ops: 21594.82               
...
[ 690s] threads: 64, tps: 0.00, read_ops: 20325.56, write_ops: 20327.29                                                       
[ 705s] threads: 64, tps: 0.00, read_ops: 16040.44, write_ops: 16038.71                                                       
[ 720s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00                                                               
[ 735s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00                                                               
[ 750s] threads: 64, tps: 0.00, read_ops: 1872.34, write_ops: 1872.34                                                         
[ 765s] threads: 64, tps: 0.00, read_ops: 17628.53, write_ops: 17629.73                                                       
[ 780s] threads: 64, tps: 0.00, read_ops: 21576.42, write_ops: 21576.62         
...
[1020s] threads: 64, tps: 0.00, read_ops: 20592.50, write_ops: 20591.50                                                       
[1035s] threads: 64, tps: 0.00, read_ops: 10460.08, write_ops: 10460.08                                                       
[1050s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00                                                               
[1065s] threads: 64, tps: 0.00, read_ops: 0.00, write_ops: 0.00                                                               
[1080s] threads: 64, tps: 0.00, read_ops: 15247.00, write_ops: 15247.93                                                       
[1095s] threads: 64, tps: 0.00, read_ops: 21673.43, write_ops: 21672.70
[12 Jan 2011 18:17] Mark Callaghan
Sunny,

I think that flushing is done faster with 5.5, but I need to update my stats collection tool to also report checkpoint age per interval before I repeat my tests and confirm.

I also think that the buf_flush_batch equivalent in 5.5 has more overhead than in 5.1. I have instrumentation in the 5.1 facebook patch to time all tasks done by srv_master_thread and buf_flush_batch. Having similar timers in official mysql would make it easier to notice and debug this.
[12 Jan 2011 18:20] Mark Callaghan
Alexey - thanks for the details and if I learn lua then I can upgrade to the new sysbench. Were your 5.1.47 numbers for built-in or plug-in innodb?
[12 Jan 2011 19:22] Sunny Bains
Mark, I think I've found the bug, I need to verify the fix and run it
by the other devs. Once it passes muster then I can send you a patch
to confirm that it fixes what you observe too. The problem is in the
flush code and the changes made to handle multiple buffer pools.

Thanks for your help in nailing this down.
[12 Jan 2011 22:40] Sunny Bains
Mark,

The overhead in buf_flush_batch() is quite possibly real with the removal of the
global buffer pool mutex. It should, however, be offset by benefits in the
mtr commit code.

Once I confirm the fix for this I will take a look.
[13 Jan 2011 0:54] Alexey Stroganov
Mark, 

my results for 5.1.47 - for InnoDB plugin
[13 Jan 2011 6:08] Inaam Rana
Mark,

"The current behavior is 'stop the world' when the async limit is reached. I don't think it should be when called by user connections."

Do you mean the 'sync' limit or the 'async' limit? All user threads are blocked when the 'sync' limit is reached. The calling user thread is blocked when the 'async' limit is reached.

"In that case it might be better to have them do at most a fixed amount of IO when the async limit is exceeded. If there is a large flush to be done, it should be done by background threads. I added srv_background_checkpoint to the Facebook patch for that."

We have done something similar in 5.6. The async stuff is totally off loaded to a new dedicated flushing thread called page_cleaner.
[13 Jan 2011 8:21] Mark Callaghan
I have thread stacks somewhere from poor man's profiler that show no progress. I also have benchmark results showing 30 second stalls where the async limit is reached, a huge flush occurs, srv_master_thread then gets stuck (maybe waiting for the flush to end) and then all other user connections get stuck.

I can't tell you exactly where all other threads get stuck. Perhaps checkpoint age grows quickly at that point and then they hit the sync limit. They certainly will run slower regardless because of the furious flushing.
[13 Jan 2011 14:45] Mikhail Izioumtchenko
Mark, it would be interesting to see those stacks. A sync checkpoint
should be visible (log_free_check() -> ... ). 
A periodic output of SHOW STATUS through the stall would also be useful,
specifically the LSNs it reports.
The idea of async checkpoint processing triggering the sync checkpoint 
somehow is sound, however it wasn't showing in my tests
which means there may be something interesting about your configuration
parameters. Could you tell us:

1. redo log size
2. buffer pool size
3. active database size and/or average buffer pool hit ratio
4. max_dirty_pages_pct (sp?)
[13 Jan 2011 15:03] Mark Callaghan
Michael - some people at Oracle can reproduce my 5.5.8 results. So you can get the thread stacks by running poormansprofiler (http://poormansprofiler.org) on their setup. I posted stacks from two stalls earlier in this thread. In one case everything was stuck in purge code and in the other case on buf_flush_wait_batch_end. I suspect the second case was from sync checkpoints. 5.5.8 is not able to manage checkpoint age on this test, so the sync checkpoint limit is hit soon after the async checkpoint limit and it is difficult to get stacks in between the two.

You also have to guess at which limit is being enforced because SHOW ENGINE INNODB STATUS doesn't tell you: checkpoint age, async limit, sync limit.

The async checkpoint limit is definitely 'stop the world' for the one thread that gets stuck flushing 100,000 pages in buf_flush_batch.
[13 Jan 2011 16:20] Mark Callaghan
I ran 5.5.8 until the first stall occurred. The stacks are below, and the stall comes quickly from the sync checkpoint limit. I missed this at first because you can't compute it from SHOW ENGINE INNODB STATUS, since the value of buf_pool_get_oldest_modification() is not provided. 
    oldest_lsn = log_buf_pool_get_oldest_modification();
    age = log->lsn - oldest_lsn;

     32 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,buf_flush_wait_batch_end,log_check_margins,row_upd_step,row_update_for_mysql,ha_innobase::update_row,handler::ha_update_row,mysql_update,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone
     20 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,ha_innobase::index_read,handler::read_range_first,handler::read_multi_range_first,QUICK_RANGE_SELECT::get_next,rr_quick,mysql_update,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone
     12 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,ha_innobase::index_read,handler::index_read_idx_map,join_read_const,join_read_const_table,make_join_statistics,JOIN::optimize,mysql_select,handle_select,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone
     10 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,os_aio_simulated_handle,fil_aio_wait,io_handler_thread,start_thread,clone
      2 
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,buf_flush_wait_batch_end,srv_master_thread,start_thread,clone
      1 pthread_cond_timedwait@@GLIBC_2.3.2,os_event_wait_time_low,srv_monitor_thread,start_thread,clone
      1 pthread_cond_timedwait@@GLIBC_2.3.2,os_event_wait_time_low,srv_lock_timeout_thread,start_thread,clone
      1 pthread_cond_timedwait@@GLIBC_2.3.2,os_event_wait_time_low,srv_error_monitor_thread,start_thread,clone
      1 poll,handle_connections_sockets,mysqld_main,__libc_start_main,_start,poll
      1 do_sigwait,sigwait,signal_hand,start_thread,clone
      1 buf_calc_page_new_checksum,buf_flush_init_for_writing,buf_flush_try_neighbors,buf_flush_page_and_try_neighbors,buf_flush_batch,buf_flush_list,log_check_margins,que_run_threads,trx_purge,srv_purge_thread,start_thread,clone
[13 Jan 2011 17:06] Mikhail Izioumtchenko
Sorry, I failed to read the entire bug; I'll post some comments separately.
Mark, irrespective of how people at Oracle know how to reproduce this,
could you still post your entire sysbench invocation, buffer pool size 
and redo log size? Time before you get the first stall, too.
I'm trying to collect all variants in order to create the simplest possible test case
to use in the future.
As for SHOW STATUS output I find the following [derived] value useful:

LSN: 555555555
Last checkpoint at: 444444444

LSN - Last checkpoint at

is then what the sync/async/none checkpoint decision is made on;
the exact values for the margin can be easily calculated from the code.
Not too easily, unfortunately, given various extra margins added here and there.
I found a graph of this value especially useful.

For the max_dirty_pages_pct a similar metric should be
Modified DB pages / Database pages * 100
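
For anyone scripting this, a tiny illustrative helper (names are mine; the raw numbers are assumed to have been parsed from SHOW ENGINE INNODB STATUS / SHOW STATUS beforehand):

#include <stdio.h>

/* Print the two derived metrics suggested above: checkpoint lag in bytes
   and the dirty-page percentage. */
static void
report_derived(
        unsigned long long      lsn,
        unsigned long long      last_checkpoint_lsn,
        unsigned long           database_pages,
        unsigned long           modified_db_pages)
{
        unsigned long long      checkpoint_age = lsn - last_checkpoint_lsn;
        double                  dirty_pct = 100.0 * (double) modified_db_pages
                                        / (double) database_pages;

        printf("checkpoint age: %llu bytes, dirty pages: %.1f%%\n",
               checkpoint_age, dirty_pct);
}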
[13 Jan 2011 17:51] Mikhail Izioumtchenko
here's what I see by rereading the entire bug (I was on vacation
when most of the action happened); some of this is just a shorthand
version of the above contents:

* it is hard to tell from the stacks whether the flush is sync,
because of inlining
* irrespective, the immediate problem is the size of the flush, 80k to 230k pages. 
This is quite amazing given how we calculate 'advance' in log_check_margins().
When this is fixed, everything may follow suit.

* the simplest test will be Sunny's:
  REDO log size 32M
  Buffer pool size 16G
  Sysbench OLTP RW 128 threads
missing values are the number of rows and max_dirty_pages_pct (default?),
but the important thing is that
it is not a multitable test. I would also guess the dataset fits in the buffer pool a few times over.

Mark's test is like this:

* multitable sysbench, uniform distribution
* 16G buffer pool ('about 1M pages')
* 32M rows or so altogether ('data fits in less than half the buffer pool')
* max_dirty_pages_pct=80
missing information: redo log size.

my guess is it's not very large based on the data from my tests.

What I run:

* 10G buffer pool
* single table sysbench with 25M rows, just over half the buffer pool
* max_dirty_pages_pct=default of 

the observed flushing behaviour is quite OK:
* flushing is driven by async flushing limit, with some help from adaptive flushing
* somewhat unexpectedly, the async flushing is called mostly from user 
threads. Somewhat modified code showed there are literally millions of those,
mostly hitting the situation 'someone else doing the flush', a minority
advancing the checkpoint little by little. 
* No sync checkpoints, definitely no huge flushes.
 Everything goes quite smoothly; even though my sysbench lacks the periodic
tps rate printout capability, this is visible from other data.

So it would seem that what makes the bug tick is the redo log size,
if my guess about Mark's environment is right. 
With a small redo log, async and sync limits get pretty close in absolute numbers. When many threads call log_check_margins and decide on a sync checkpoint
they retry the check even if another thread is flushing.
I can see how it could lead to some form of stall but I can't see how 
this or anything could lead to runaway dirtying of pages to provoke
the huge flush we observe.

Other notes:

* the current knowledge is that we call log_free_check often enough;
there was a bug that was fixed in that area relatively recently
* I don't think we can avoid calling log_free_check in purge because
if the master thread does purge and there's no other db activity,
there's no one else to call log_free_check()
* the logic in log_check_margins() of 'sync and someone else doing flushing?
OK, retry the whole thing' is questionable, given that log_free_check
is called often.
[13 Jan 2011 18:23] Mark Callaghan
if by redo log file size you mean the innodb transaction log --> 2 x 1.9G
[13 Jan 2011 18:31] Mark Callaghan
Please don't run mysqld across 24 cores during your reproduction. That is likely to suffer from so much mutex contention everywhere that this result won't be reproduced. I run mysqld on an 8-core server.
[13 Jan 2011 20:17] Inaam Rana
Looking closely at the stats collected by Ranger, who reproduced the issue when binding mysqld to 8 cores, I am now not sure whether the purge lag is because of the huge flushing or the huge flushing is because of the purge lag.

I am more inclined to believe that purge lag is causing other grief as well. The evidence in support:

(all that is written below will make much more sense if Ranger can post the InnoDB buffer pool usage charts)

1) the dataset (in terms of database pages in the buffer pool) is around 400K pages at startup.
2) with 5.1.47 the number of database pages increases to around 500K after the run.
3) with 5.5 (both with and without a separate purge thread) the number of database pages increases to above 800K by the end of the run.
4) with 5.1.47 the number of modified pages stays almost the same as the number of database pages, with periodic dips when the checkpoint flushing happens
5) with 5.5 the dips at flushing are more pronounced and the number of dirty pages steadies at around 600K during the steady period. This means around 200K more dirty pages when compared to 5.1.47.
6) Also, the difference of total - dirty pages == 800 - 600 = 200K pages is significant. This means (making an educated guess here) that these 200K pages which were flushed were UNDO log pages. Had they been data pages they would have been dirtied again because of the uniform load distribution.

The above rant tells us a few things:

 -- We are dealing with almost double the dataset size in the case of 5.5.
 -- As purge lags further behind, the .ibd file will probably get more fragmented, potentially making write operations on it more expensive.
 -- Both the ibdata and .ibd files have to extend, which is probably causing extra grief.
 -- There are more than 200K UNDO log pages in the buffer pool. These pages are dirty to start with and hence they add to the flushing woes.

Bottom line, I think we can do an apples-to-apples comparison of flushing only after we are able to bring the dataset size down to what it is in 5.1.47. In other words, first fix the purge lag and then work on flushing.
[13 Jan 2011 20:32] Mikhail Izioumtchenko
transaction log size is what I meant, thank you. So my theory on small redo log size
triggering it didn't hold. Hopefully Sunny's fix in the works will explain this.
Sunny, what is the table size in your reproducible test case?
[13 Jan 2011 20:47] Mikhail Izioumtchenko
more on 'do we call log_free_check often enough?'
We think we call log_free_check in the normal course of operation often enough.
What about the purge thread? Looking at row0purge.c, we call log_free_check()
before every mtr_start() with two exceptions:

1. in row_purge_upd_exist_or_extern() under 

skip_secondaries:
        /* Free possible externally stored fields */
        for (i = 0; i < upd_get_n_fields(node->update); i++) {

                const upd_field_t*      ufield
                        = upd_get_nth_field(node->update, i);

                if (dfield_is_ext(&ufield->new_val)) {

this is actually suspicious: why don't we call log_free_check() here?
But it should be irrelevant to any sysbench testcase.

2. in row_purge_poss_sec().

First we can't call log_free_check() here because we are holding a sync object.
Second, this doesn't seem to be a write operation, from the function description.
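
For reference, the calling convention being discussed looks roughly like this (an illustrative sketch assuming InnoDB's internal headers, not the actual row0purge.c code):

#include "mtr0mtr.h"    /* mtr_t, mtr_start(), mtr_commit() */
#include "log0log.h"    /* log_free_check() */

/* One purge step, following the convention above: reserve redo-log
   headroom before starting a mini-transaction that will dirty pages.
   log_free_check() is where the purge thread can block. */
static void
purge_step_sketch(void)
{
        mtr_t   mtr;

        log_free_check();

        mtr_start(&mtr);
        /* ... remove one delete-marked record, dirtying index/UNDO pages ... */
        mtr_commit(&mtr);
}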
[13 Jan 2011 20:58] Inaam Rana
log_free_check() is a blocking call. The fix for bug#39168 introduced extra log_free_check() calls, one of which was in row0purge.c:row_purge_remove_clust_if_poss_low(). The fix went into 5.1.49. What this means is that it is possible that in 5.1.47 the purge thread doesn't get blocked by log_free_check() when a sync flush is happening and gets a kind of free ride (as other threads block on flushing) during this small period. We see a manifestation of this behaviour in Ranger's stats, where in 5.1.47 (first column below) the history list actually decreases in length, potentially during a checkpoint:

996647   8661950   7794486 |  2683284  10797333  9827396
931796   8855504   8305764 |  2808315  11192350  10330990
801311   9343227   8834988 |  2915054  11792906  10380368
790607   9820083   9307020 |  3029193  12398997  10526311
823107   10296457  9681316 |  3148762  13006574  11105807
698139   10793111  9681316 |  3269419  13610343  11692256
366234   11284027  9804518 |  3387346  14221039  12273521
104141   11784270  10336998|  3505152  14824285  12857632
10134    12284047  10855595|  3625502  15402246  13437536
2918     12761845  11376463|  3745770  15930156  14012999
7341     12943961  11918201|  3862317  16194402  14591772
413413   12957935  12440025|  3983269  16228078  15115281
971979   13424751  12972485|  4105675  16832049  15538626
1497021  13911070  13498213|  4226988  17442612  15538626
2017967  14406625  13917190|  4349110  18050606  15994001
1951350  14887142  14159856|  4475181  18649876  16557206

The above still doesn't explain why history list length is 10 times higher in 5.5 though.
[13 Jan 2011 21:05] Alexey Stroganov
buffer pool usage for Ranger's results

Attachment: buffer_pool_usage.png (image/png, text), 8.75 KiB.

[13 Jan 2011 21:18] Mark Callaghan
The initial description of my test setups was listed long ago at the start of this bug:
https://www.facebook.com/notes/mysql-at-facebook/mysql-cpu-bound-performance-51-55-and-the...

That has the my.cnf values used with two exceptions. I didn't list innodb_io_capacity and have been using 1000 or 2000. I also began using innodb_purge_threads=1.
[13 Jan 2011 21:47] Mikhail Izioumtchenko
1. the difference in throughput between Mark's and our tests could be
due to innodb_flush_log_at_trx_commit=1 in our tests. I always run with 1.
2. calling log_free_check() in purge more often as a culprit 
is an interesting idea. Note that Mark's 5.1.52 results for readwrite
are considerably worse than 5.1.47. But as Inaam says this alone doesn't explain the 5.5 problem.
3. Ranger's graph: I see the 200k flushes, same size, same interval of 6-7 
units of time (minutes)? I'd love to see a file with complete SHOW STATUS
output every N seconds; the checkpoint lag (LSN - Last checkpoint at) would be particularly interesting to trace
4. buf_flush_wait_batch_end,log_check_margins,que_run_threads,trx_purge
This would mean the flush is synchronous. The question still remains:
why don't we stay at the async limit like what happens in my tests?
Why do we flush 200k pages? 
5. I consider it a long shot that 200k is the result of neighbour page
explosion but we may consider not flushing neighbours for sync checkpoint flushes.
6. More modified pages in 5.5: this should mean INSERTs proceed happily
while UPDATEs/DELETEs have problems. Because INSERTs don't interact with the purge?
[13 Jan 2011 22:23] Mikhail Izioumtchenko
considering the stalls here's a hypothesis that would explain 
the periodicity observed in Ranger's graph:
(1 and 4 are more of empirical observations)

1. the flushing in the test of this kind is [a]sync checkpoint driven
with the codepath of log_free_check...log_checkpoint_margin
2. the above codepath is executed a lot so in the end 'age' as calculated
in log_checkpoint_margin remains more or less a linear function over time
3. in my tests this line is constant so nothing bad happens
4. in the case at hand the 'age' is linearly increasing until the sync checkpoint
boundary is hit, resulting in a huge flush and a stall.

Why can't the async checkpointing keep up? Probably because of too small an
'advance' as calculated in log_checkpoint_margin().
Why the adaptive flushing doesn't play much of a role is a separate question.
The purge lag may well be a separate question with this approach
but it's worthwhile to try a q&d patch to eliminate the sync checkpoints and see what happens.
[14 Jan 2011 11:36] Sunny Bains
Problem finally found. The previous "found" was about the excessive flushing/stalls
not purge lag. The problem is due to extra rollback segments and the way the original
code was written. It tries to truncate all segments after each purge. This results
in excessive buffer pool traffic and extra mini-transactions. The fix is
not to call truncate at each iteration but after some configurable period (or
possibly make it adaptive).

I've tested my hypothesis and it fixes the problem.
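
In code terms the idea is roughly the following (a sketch of the approach described, not the committed patch; the knob name and value are made up):

/* Hypothetical knob: how many purge batches between history truncations.
   (Could also be made adaptive, as suggested above.) */
#define RSEG_TRUNCATE_FREQUENCY 128

static void
purge_round_sketch(void)
{
        static unsigned long    n_batches = 0;

        /* ... run one purge batch over the history list ... */

        if ((++n_batches % RSEG_TRUNCATE_FREQUENCY) == 0) {
                /* ... only now truncate the UNDO history across all
                   rollback segments, instead of after every batch ... */
        }
}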
[23 Jan 2011 22:04] Sunny Bains
See also bug# 59625. Both need to be fixed at the same time. Bug# 59625 will need
the backport of the min binary heap from 5.6 that is part of this bug fix.
[1 Feb 2011 20:33] Mikhail Izioumtchenko
Mark, could you clarify the value of innodb_flush_log_at_trx_commit
in your tests? The reason I'm asking is that while testing a potential fix
I found that the history list length problem depends on the value
of innodb_flush_log_at_trx_commit. Specifically, when I use the current
5.1 plugin code without any fix, just plain unpatched code,
and the multitable lua sysbench test mentioned elsewhere in the bug,
I have good behaviour with history list length around 50K, for
innodb_flush_log_at_trx_commit=1 only. With either of the two other values,
history list length grows to millions in 10 minute.
Your blog post mentions innodb_flush_log_at_trx_commit=2, could you confirm that,
and one other thing, are you sure that the dataset you use for a 5.1 test
was created with 5.1? As you know using a 5.5 dataset with a 5.1 executable
means the 5.1 executable will use 128 rollback segments as the relevant code
is already there. 
For the reference, here is what I have now for UNPATCHED code, no fixes at all,
GOOD/BAD is an obvious function of the history list length:

                                 5.1-exec-5.1-dataset 5.1-exec-5.5-dataset 5.5

innodb_flush_log_at_trx_commit=1 GOOD                 GOOD                 BAD*
innodb_flush_log_at_trx_commit=2 BAD**                BAD**                BAD

* this is the current bug as I understand it
** this is what I see in my tests but it seems to contradict your observations
that say you saw a short history list length with 5.1 and innodb_flush_log_at_trx_commit=2. 

So, we know that everything will deteriorate with 5.5, but to reiterate my question:
could you confirm that you have a test case with innodb_flush_log_at_trx_commit=2, a 5.1 plugin executable, a database created with
the 5.1 plugin, and a history list length that stays short?
[2 Feb 2011 5:18] Mark Callaghan
I did not share database files between 5.1 and 5.5.
I used innodb_flush_log_at_trx_commit=2
I did not produce stalls in unmodified 5.1.47 as I did in 5.5.8
[2 Feb 2011 6:13] Mikhail Izioumtchenko
Thanks, Mark.
What I see is that the configuration is about the same; the difference
is in sysbench, or at least that is where I see it.
There are three multitable sysbench versions I know about:

a-  LUA sysbench 0.5. Here we see at least one problem.
b- Mark's sysbench 0.4, modified. Here it would appear Mark sees one more problem
not seen with sb 0.5.
c- my own multitable sysbench based on 0.4. Untested in the context of this bug and only ever tested with secondary index search enabled. Whatever testing
I did, it wasn't showing any surprises. This may mean the lack of secondary index search triggers the bug(s).

Given that a- is different from c- in implementation, it doesn't surprise
me that b- shows results different from a-. It means I have to get
b- and retest. More importantly, when Sunny is back from vacation we have
to revisit/refactor/fork the bugs in question, as there may be more than one.
[2 Feb 2011 6:28] Mark Callaghan
I switched to lua/sysbench 0.5 early in the process of generating reproduction data.
[2 Apr 2011 15:39] Mark Callaghan
When does the fix for this get committed?

This problem was discussed on a recent Percona post. But I disagree with Dimitri that proper tuning solves the problem.

http://www.mysqlperformanceblog.com/2011/03/31/innodb-flushing-a-lot-of-memory-and-slow-di...
[2 Apr 2011 21:32] Sunny Bains
The fix that has been approved (and committed) is not for "furious flushing" but for the purge lag
due to 128 rollback segments.
[3 Apr 2011 11:22] Dimitri KRAVTCHUK
Mark,

if I understand correctly, your bug was about "purge lagging", but the article you're mentioning now is about "adaptive flushing"; could you clarify, please?

And exactly which part of the tuning I've proposed do you disagree with?

Rgds,
-Dimitri
[3 Apr 2011 14:35] Mark Callaghan
In the bug report there are details of cases where the server stalled for more than 60 seconds during checkpoint and log_preflush_pool_modified_pages flushes more than 200,000 pages. That isn't a new problem specific to 5.5 but it is a serious problem.

The problem occurs because when InnoDB must flush X pages to advance the min dirty page LSN, it can end up flushing X * 64 because dirty neighbors are flushed.
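
As a rough illustration of that multiplier: a preflush that only needs ~3,000 specific pages to advance the oldest-modification LSN can, in the worst case, drag in up to 64 dirty neighbors per page and balloon to roughly 3,000 * 64 = 192,000 writes, the same order of magnitude as the 160k+ page batches reported earlier in this bug.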
[23 May 2012 23:54] James Day
A bit of followup on the neighbor flushing here: there are substantial changes in the 5.6 labs release: http://blogs.innodb.com/wp/2012/04/optimizing-neighbor-flush-behavior/ .
[30 Nov 2012 8:50] Ovais Tariq
Sunny,

I would like to follow up here: when do you think the bug fix for multiple rollback segments is going to be released?
[2 Dec 2012 8:30] Sunny Bains
The purge lag due to 128 rsegs has been fixed. This bug morphed into one about furious flushing. There have been several improvements made around flushing code in 5.6. The flushing is done in a separate thread now and has been moved out of the master thread.

Ovais, are you experiencing any issues around this? Do you have a test case that is purge related and not about furious flushing?
[2 Dec 2012 18:02] Ovais Tariq
Sunny,

I am experiencing issues with very noticeable differences in purge lag after upgrading from 5.1 to 5.5.  There has been no other change in the application, or anything else, apart from this upgrade from 5.1 to 5.5. This behaviour is repeatable on many servers.

Having read all the comments in this bug report, I have a theory that this might be because of running with a separate purge_thread, which, if I understand correctly, gets blocked by the master thread when it is doing a lot of flushing.

I will try disabling the separate purge_thread to see if that fixes the purge lag.

I was wondering if the fix for this bug will go into a 5.5 release, because it was the 5.5 release that introduced the concept of 128 rsegs.
[2 Dec 2012 20:40] Sunny Bains
You must remember that making the purge thread separate from the master thread is meant to reduce lag. If it is part of the master thread (i.e. innodb_purge_threads=0) then purge is done via the master thread and you will have the master thread doing either purge or flushing. If purge is done separately in its own thread then you can have flushing and purging in parallel, thereby reducing the purge lag. 

The problem around truncating the UNDO log (for 128 rsegs) has been fixed in 5.5 as mentioned in this bug report. It is done in batches rather than at the end of each round.

However, given that you are experiencing lag after upgrading from 5.1 to 5.5, it may be worthwhile investigating this. Can you post a few stack traces and some numbers around the history list length? Also, please do try running with innodb-purge-threads=0. One potential advantage of doing it via the master thread is that the master thread will purge the full list before it does the next flush.

Furious flushing revolves around a lot of issues; purge only gets affected indirectly. The ongoing changes in 5.6+ will not be backported and would be difficult to backport anyway.
[3 Dec 2012 0:57] James Day
Ovais, if you're using 5.5's default innodb_purge_batch_size of 20, please change to 5.6's default of 300 and let us know how that goes, while using innodb_purge_threads = 1. 20 turned out to be too small to be really efficient.

Still not resolved? What is innodb_io_capacity and what is the real IO capacity of the system (100 = 1 spinning disk)? If it's the default of 200 it's probably too low; change it to the number of disks used for the main tablespace * 100 (and an additional * 5 if using SSD).

If that doesn't resolve it, check the percentage of dirty pages in your buffer pool at a time when there's trouble and set innodb_max_dirty_pages_pct to observed_percentage * 0.9. This is to cause flushing to happen in that part of the work rather than in other threads.

If that still doesn't work, look at your history list length and monitor it regularly at troubled times, then set innodb_max_purge_lag = that troubled value * 0.9. Possibly lower. Caution: setting this too aggressively low can destroy foreground thread performance, particularly badly if you do a backup with a read view open that then finishes and closes the read view with a high history list length. You should probably set innodb_max_purge_lag to 0 or a few billion outside peak times. While this setting can work and has in benchmarks, it's the least desirable one because of the risk and because it's abusing a setting really intended only to prevent the disk from filling. People who touch it tend to set it far too low.

The suggestions are in roughly least potentially disruptive to most potentially disruptive order, just try working through them one at a time until the problem is solved or remains after all of them.
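
To pull those rules of thumb together, here is a hypothetical my.cnf fragment for a host like Mark's (8 spinning disks in HW RAID 10); the numbers are placeholders derived from the guidance above, not tested recommendations:

[mysqld]
innodb_purge_threads     = 1
innodb_purge_batch_size  = 300   # 5.6 default; the 5.5 default of 20 is too small
innodb_io_capacity       = 800   # 8 spinning disks * 100 (roughly * 5 more for SSD)
# innodb_max_dirty_pages_pct = (observed dirty percentage at trouble time) * 0.9
# innodb_max_purge_lag       = (troubled history list length) * 0.9, peak times only; see caveats above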

James Day, MySQL Senior Principal Support Engineer, Oracle
[4 Dec 2012 9:28] Ovais Tariq
Sunny,

I agree that separating the purge_thread from the master_thread was done to reduce lag; however, as far as I can see, it does not really work as well when there is a good amount of mixed workload, because then the master_thread is busy flushing pages and blocks the purge_thread. Doing the purging from within the master_thread makes sense in such a case, as then the master_thread would give equal opportunity to both purging and flushing.

Regarding the bug fix, can you tell me if the fix has been released?

I understand that furious flushing can have a lot of unintended side-effects, such as the one where purge gets affected indirectly; that is why I am tempted to try disabling the purge thread. I agree with you that many of the changes going into 5.6 cannot be backported, and I am waiting impatiently for a 5.6 GA release :)
[7 Dec 2012 18:24] Ovais Tariq
Sunny,

So in my case disabling the purge_thread has helped. The master_thread does more purging compared to the purge_thread now, and the history_list_length has eased a lot. I think this was certainly a case of the master_thread blocking the purge_thread and producing an increase in purge lag.
[3 Apr 2013 18:52] Mark Callaghan
Trying to understand the regression for a single-threaded & update-only sysbench result comparing 5.1 with 5.6 -- http://mysqlha.blogspot.com/2013/04/mysql-56-single-thread-update-only.html

This is from running "perf -p $( pidof mysqld ) -- sleep 180" during the test and buf_page_get_gen uses almost 3X more CPU with 5.6. This is true whether I use 1 or 8 buffer pool instances. The table is cached in the InnoDB buffer pool before the test starts. 

From 5.6.10

     5.46%   mysqld  mysqld.notrc.noprof.flushcond2  [.] buf_page_get_gen(unsigned long, unsigned long, unsigned long, unsigned long, buf_block_t*, unsigned long, char const*, unsigned long, mtr_t*)                                                                      
     3.72%   mysqld  mysqld.notrc.noprof.flushcond2  [.] MYSQLparse(void*)                                                                                                                                                                                                  
     1.77%   mysqld  mysqld.notrc.noprof.flushcond2  [.] _ZL14ut_crc32_sse42PKhm                                                                                                                                                                                            
     1.58%   mysqld  mysqld.notrc.noprof.flushcond2  [.] _ZL21mtr_memo_slot_releaseP5mtr_tP15mtr_memo_slot_t.isra.5                                                                                                                                                         
     1.50%   mysqld  mysqld.notrc.noprof.flushcond2  [.] rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned long*, unsigned long, mem_block_info_t**, char const*, unsigned long)                                                                     
     1.40%   mysqld  mysqld.notrc.noprof.flushcond2  [.] log_group_write_buf(log_group_t*, unsigned char*, unsigned long, unsigned long, unsigned long)                                                                                                                     
     1.35%   mysqld  mysqld.notrc.noprof.flushcond2  [.] rw_lock_x_lock_func(rw_lock_t*, unsigned long, char const*, unsigned long)                                               

From 5.1.63 with the facebook patch

     4.87%   mysqld  mysqld.noprof        [.] MYSQLparse(void*)                                                                                                                                                  
     2.87%   mysqld  libc-2.13.so         [.] __memcpy_ssse3_back                                                                                                                                                
     1.87%   mysqld  mysqld.noprof        [.] rec_get_offsets_func                                                                                                                                               
     1.83%   mysqld  mysqld.noprof        [.] buf_page_get_gen                                                                                                                                                   
     1.82%   mysqld  mysqld.noprof        [.] log_group_write_buf                                                                                                                                                
     1.29%   mysqld  mysqld.noprof        [.] my_fast_crc32                                                                                                                                                      
     1.19%   mysqld  mysqld.noprof        [.] os_aio_simulated_handle                                                                                                                                            
     1.14%   mysqld  mysqld.noprof        [.] MYSQLlex(void*, void*)
[3 Apr 2013 18:54] Mark Callaghan
These are the top callers to buf_page_get_gen in 5.6

     5.51%   mysqld  mysqld.notrc.noprof.flushcond2  [.] buf_page_get_gen(unsigned long, unsigned long, unsigned long, unsigned long, buf_block_t*, unsigned long, char const*, unsigned long, mtr_t*)                                                                      
             |
             --- buf_page_get_gen(unsigned long, unsigned long, unsigned long, unsigned long, buf_block_t*, unsigned long, char const*, unsigned long, mtr_t*)
                |          
                |--66.43%-- _ZL31trx_purge_truncate_rseg_historyP10trx_rseg_tPK12purge_iter_t
                |          trx_purge(unsigned long, unsigned long, bool)
                |          srv_purge_coordinator_thread
                |          start_thread
                |          __clone
                |          
                |--19.06%-- btr_cur_search_to_nth_level(dict_index_t*, unsigned long, dtuple_t const*, unsigned long, unsigned long, btr_cur_t*, unsigned long, char const*, unsigned long, mtr_t*)
                |          row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
                |          ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)
                |          handler::read_range_first(st_key_range const*, st_key_range const*, bool, bool)
                |          handler::multi_range_read_next(char**)
                |          QUICK_RANGE_SELECT::get_next()
                |          _ZL8rr_quickP11READ_RECORD
                |          mysql_update(THD*, TABLE_LIST*, List<Item>&, List<Item>&, Item*, unsigned int, st_order*, unsigned long long, enum_duplicates, bool, unsigned long long*, unsigned long long*)
                |          mysql_execute_command(THD*)
                |          mysql_parse(THD*, char*, unsigned int, Parser_state*)
                |          dispatch_command(enum_server_command, THD*, char*, unsigned int)
                |          do_handle_one_connection(THD*)
                |          handle_one_connection
                |          start_thread
                |          __clone
                |          
                |--4.76%-- _ZL35trx_purge_rseg_get_next_history_logP10trx_rseg_tPm
                |          _ZL22trx_purge_get_next_recPmP16mem_block_info_t
                |          trx_purge(unsigned long, unsigned long, bool)
                |          srv_purge_coordinator_thread
                |          start_thread
                |          __clone
                |          
                |--2.83%-- trx_undo_assign_undo(trx_t*, unsigned long)
                |          trx_undo_report_row_operation(unsigned long, unsigned long, que_thr_t*, dict_index_t*, dtuple_t const*, upd_t const*, unsigned long, unsigned char const*, unsigned long const*, unsigned long*)
                |          btr_cur_update_in_place(unsigned long, btr_cur_t*, unsigned long const*, upd_t const*, unsigned long, que_thr_t*, unsigned long, mtr_t*)
                |          btr_cur_optimistic_update(unsigned long, btr_cur_t*, unsigned long**, mem_block_info_t**, upd_t const*, unsigned long, que_thr_t*, unsigned long, mtr_t*)
                |          _ZL17row_upd_clust_recP10upd_node_tP12dict_index_tPmPP16mem_block_info_tP9que_thr_tP5mtr_t
                |          _ZL18row_upd_clust_stepP10upd_node_tP9que_thr_t
                |          row_upd_step(que_thr_t*)
                |          row_update_for_mysql(unsigned char*, row_prebuilt_t*)
                |          ha_innobase::update_row(unsigned char const*, unsigned char*)
                |          handler::ha_update_row(unsigned char const*, unsigned char*)
                |          mysql_update(THD*, TABLE_LIST*, List<Item>&, List<Item>&, Item*, unsigned int, st_order*, unsigned long long, enum_duplicates, bool, unsigned long long*, unsigned long long*)
                |          mysql_execute_command(THD*)
                |          mysql_parse(THD*, char*, unsigned int, Parser_state*)
                |          dispatch_command(enum_server_command, THD*, char*, unsigned int)
                |          do_handle_one_connection(THD*)
                |          handle_one_connection
                |          start_thread
                |          __clone
                |          
                |--1.92%-- trx_purge_add_update_undo_to_history(trx_t*, unsigned char*, mtr_t*)
                |          trx_undo_update_cleanup(trx_t*, unsigned char*, mtr_t*)
                |          _ZL31trx_write_serialisation_historyP5trx_t
                |          trx_commit(trx_t*)
                |          trx_commit_for_mysql(trx_t*)
                |          _ZL15innobase_commitP10handlertonP3THDb
                |          ha_commit_low(THD*, bool)
                |          TC_LOG_DUMMY::commit(THD*, bool)
                |          ha_commit_trans(THD*, bool)
                |          trans_commit_stmt(THD*)
                |          mysql_execute_command(THD*)
                |          mysql_parse(THD*, char*, unsigned int, Parser_state*)
                |          dispatch_command(enum_server_command, THD*, char*, unsigned int)
                |          do_handle_one_connection(THD*)
                |          handle_one_connection
                |          start_thread
                |          __clone
[3 Apr 2013 19:19] Mark Callaghan
The problem is probably that purge iterates over 128 rollback segments in 5.6 versus 1 in 5.1.

Code from 5.6:

/********************************************************************//**
Removes unnecessary history data from rollback segments. NOTE that when this
function is called, the caller must not have any latches on undo log pages! */
static
void
trx_purge_truncate_history(
/*========================*/
        purge_iter_t*           limit,          /*!< in: truncate limit */
        const read_view_t*      view)           /*!< in: purge view */
{
        ulint           i;

        /* We play safe and set the truncate limit at most to the purge view
        low_limit number, though this is not necessary */

        if (limit->trx_no >= view->low_limit_no) {
                limit->trx_no = view->low_limit_no;
                limit->undo_no = 0;
        }

        ut_ad(limit->trx_no <= purge_sys->view->low_limit_no);

        for (i = 0; i < TRX_SYS_N_RSEGS; ++i) {
                trx_rseg_t*     rseg = trx_sys->rseg_array[i];

                if (rseg != NULL) {
                        ut_a(rseg->id == i);
                        trx_purge_truncate_rseg_history(rseg, limit);
                }
        }
}

Code from 5.1; in practice the while loop does only one iteration because 5.1 has a single rollback segment:

/********************************************************************//**
Removes unnecessary history data from rollback segments. NOTE that when this
function is called, the caller must not have any latches on undo log pages! */
static
void
trx_purge_truncate_history(void)
/*============================*/
{
        trx_rseg_t*     rseg;
        trx_id_t        limit_trx_no;
        undo_no_t       limit_undo_no;

        ut_ad(mutex_own(&(purge_sys->mutex)));

        trx_purge_arr_get_biggest(purge_sys->arr, &limit_trx_no,
                                  &limit_undo_no);

        if (ut_dulint_is_zero(limit_trx_no)) {

                limit_trx_no = purge_sys->purge_trx_no;
                limit_undo_no = purge_sys->purge_undo_no;
        }

        /* We play safe and set the truncate limit at most to the purge view
        low_limit number, though this is not necessary */

        if (ut_dulint_cmp(limit_trx_no, purge_sys->view->low_limit_no) >= 0) {
                limit_trx_no = purge_sys->view->low_limit_no;
                limit_undo_no = ut_dulint_zero;
        }

        ut_ad((ut_dulint_cmp(limit_trx_no,
                             purge_sys->view->low_limit_no) <= 0));

        rseg = UT_LIST_GET_FIRST(trx_sys->rseg_list);

        while (rseg) {
                trx_purge_truncate_rseg_history(rseg, limit_trx_no,
                                                limit_undo_no);
                rseg = UT_LIST_GET_NEXT(rseg_list, rseg);
        }
}
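
A rough way to see why this shows up in buf_page_get_gen: below is a minimal standalone sketch (not InnoDB code) of just the two loop shapes above. The assumption that each visited rollback segment costs at least one buffer pool page fetch for its header is mine, not something taken from the profiles.

/* Cost sketch only: 5.1 walks a one-element rseg list, 5.6 walks all
TRX_SYS_N_RSEGS slots, so each truncate pass touches ~128x more rseg
header pages. */
#include <stdio.h>

#define TRX_SYS_N_RSEGS 128

int main(void)
{
        unsigned long truncate_calls = 1000;    /* arbitrary number of truncate passes */
        unsigned long fetches_51 = truncate_calls * 1;               /* one rseg in 5.1 */
        unsigned long fetches_56 = truncate_calls * TRX_SYS_N_RSEGS; /* 128 rsegs in 5.6 */

        printf("rseg header fetches, 5.1: %lu\n", fetches_51);
        printf("rseg header fetches, 5.6: %lu\n", fetches_56);

        return 0;
}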
[3 Apr 2013 20:36] Mark Callaghan
My other theory is that trx_purge(..., truncate=TRUE) is being called more frequently than the equivalent in 5.1. The caller would be this code from srv_do_purge:

                if (!(count++ % TRX_SYS_N_RSEGS)) {
                        /* Force a truncate of the history list. */
                        trx_purge(1, srv_purge_batch_size, true);
                }
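
For reference, here is a standalone sketch of just that scheduling condition (the loop and the pass count are made up, not InnoDB code); it shows that the forced truncate fires on every 128th pass through the purge loop.

/* Standalone sketch of the gate above. */
#include <stdio.h>

#define TRX_SYS_N_RSEGS 128

int main(void)
{
        unsigned long count = 0;
        int pass;

        for (pass = 0; pass < 400; pass++) {
                if (!(count++ % TRX_SYS_N_RSEGS)) {
                        /* Prints for passes 0, 128, 256 and 384. */
                        printf("forced truncate on pass %d\n", pass);
                }
        }

        return 0;
}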
[3 Apr 2013 20:40] Sunny Bains
The second hypothesis is worth considering. The code was introduced to fix exactly this problem. The choice of magic number was arbitrary.
[3 Apr 2013 22:06] Mark Callaghan
I watched the purge lag during a run of the single-threaded, update-only, cached workload. For 5.6 the average lag was ~2000 and for 5.1 it was closer to 20000. So I continue to think that 5.6 is more aggressive about truncating for purge, and that explains the regression.

After hacking 5.6 to do the truncate less frequently, the CPU profile improved, but QPS was still a lot worse than 5.1 -- 8200 or 8300 versus 9000+ updates/second. So this helps, but there is still almost a 10% difference.

I am not sure my hack was reasonable -- I added "rseg_history_len > 20000"

                if (!(count++ % TRX_SYS_N_RSEGS) && rseg_history_len > 20000) {
                        /* Force a truncate of the history list. */
                        trx_purge(1, srv_purge_batch_size, true);
                }

This is a CPU profile with my hack:

     3.98%   mysqld  mysqld.flushcond2.noopttrc.purge  [.] MYSQLparse(void*)
     2.13%   mysqld  mysqld.flushcond2.noopttrc.purge  [.] _ZL14ut_crc32_sse42PKhm                                                                                                                                                                                            
     2.12%   mysqld  mysqld.flushcond2.noopttrc.purge  [.] buf_page_get_gen(unsigned long, unsigned long, unsigned long, unsigned long, buf_block_t*, unsigned long, char const*, unsigned long, mtr_t*)                                                                      
     1.54%   mysqld  mysqld.flushcond2.noopttrc.purge  [.] log_group_write_buf(log_group_t*, unsigned char*, unsigned long, unsigned long, unsigned long)                                                                                                                     
     1.49%   mysqld  mysqld.flushcond2.noopttrc.purge  [.] rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned long*, unsigned long, mem_block_info_t**, char const*, unsigned long)                                                                     
     1.14%   mysqld  libc-2.13.so                      [.] __memcpy_ssse3_back                                                                                                                                                                                                
     1.04%   mysqld  mysqld.flushcond2.noopttrc.purge  [.] _ZL13lex_one_tokenPvS_                                                                                                                                                                                             
     0.69%   mysqld  mysqld.flushcond2.noopttrc.purge  [.] mysql_update(THD*, TABLE_LIST*, List<Item>&, List<Item>&, Item*, unsigned int, st_order*, unsigned long long, enum_duplicates, bool, unsigned long long*, unsigned long long*)                                     
     0.65%   mysqld  mysqld.flushcond2.noopttrc.purge  [.] page_cur_search_with_match(buf_block_t const*, dict_index_t const*, dtuple_t const*, unsigned long, unsigned long*, unsigned long*, unsigned long*, unsigned long*, page_cur_t*)                                   
     0.65%   mysqld  mysqld.flushcond2.noopttrc.purge  [.] _ZL15get_hash_symbolPKcjb
[3 Apr 2013 22:54] Mark Callaghan
Even with my hack, the 5.6 CPU profile shows more time spent in purge than 5.1 -- the purge-related samples below sum to roughly 1.5% for 5.6 versus about 0.3% for 5.1:
     0.42%   mysqld  mysqld.flushcond2.noopttrc.purge  [.] trx_purge(unsigned long, unsigned long, bool)
     0.32%   mysqld  mysqld.flushcond2.noopttrc.purge  [.] srv_purge_coordinator_thread
     0.17%   mysqld  mysqld.flushcond2.noopttrc.purge  [.] _ZL34trx_purge_get_rseg_with_min_trx_idP11trx_purge_t
     0.16%   mysqld  mysqld.flushcond2.noopttrc.purge  [.] row_purge_step(que_thr_t*)
     0.12%   mysqld  mysqld.flushcond2.noopttrc.purge  [.] read_view_purge_open(mem_block_info_t*)
     0.09%   mysqld  mysqld.flushcond2.noopttrc.purge  [.] _ZL35trx_purge_rseg_get_next_history_logP10trx_rseg_tPm
     0.09%   mysqld  mysqld.flushcond2.noopttrc.purge  [.] trx_purge_add_update_undo_to_history(trx_t*, unsigned char*, mtr_t*)
     0.07%   mysqld  mysqld.flushcond2.noopttrc.purge  [.] srv_wake_purge_thread_if_not_active()
     0.06%   mysqld  mysqld.flushcond2.noopttrc.purge  [.] _ZL22trx_purge_get_next_recPmP16mem_block_info_t
     0.04%   mysqld  mysqld.flushcond2.noopttrc.purge  [.] _ZL31trx_purge_truncate_rseg_historyP10trx_rseg_tPK12purge_iter_t

... compared to 5.1.63 with the FB patch including purge threads

     0.14%   mysqld  mysqld.noprof        [.] trx_purge_add_update_undo_to_history    
     0.05%   mysqld  mysqld.noprof        [.] trx_purge_rseg_get_next_history_log 
     0.03%   mysqld  mysqld.noprof        [.] trx_purge_truncate_rseg_history                                                                                                                                    
     0.03%   mysqld  mysqld.noprof        [.] trx_purge_choose_next_log               
     0.02%   mysqld  mysqld.noprof        [.] trx_purge_fetch_next_rec    
     0.01%   mysqld  mysqld.noprof        [.] trx_purge_rec_release  
     0.01%   mysqld  mysqld.noprof        [.] row_purge_step 
     0.01%   mysqld  mysqld.noprof        [.] trx_purge_get_next_rec 
     0.01%   mysqld  mysqld.noprof        [.] trx_purge
[4 Apr 2013 0:21] Mark Callaghan
I expanded my hack to the code below and QPS is up to about 8400. The goal was to reduce how frequently purge runs and, with that, the CPU it consumes; purge uses a lot more CPU in 5.6 than it did in 5.1.

                if (!(count++ % TRX_SYS_N_RSEGS) && rseg_history_len > 20000) {
                        /* Force a truncate of the history list. */
                        trx_purge(1, srv_purge_batch_size, true);
                }

                os_thread_sleep(100000);
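
Back-of-the-envelope for that sleep, as a standalone sketch (not InnoDB code): a 100000-microsecond pause per pass caps the purge loop at roughly 10 passes per second, so on the order of 10 * innodb_purge_batch_size pages of purge work per second, ignoring the time spent inside trx_purge() itself. The batch size of 300 below is an assumption (it is mentioned later in this thread as the 5.6 default).

#include <stdio.h>

int main(void)
{
        const double sleep_us = 100000.0;       /* os_thread_sleep(100000) */
        const double passes_per_sec = 1000000.0 / sleep_us;
        const double batch_size = 300.0;        /* assumed innodb_purge_batch_size */

        printf("max purge passes/sec: %.0f\n", passes_per_sec);
        printf("rough cap on pages purged/sec: %.0f\n", passes_per_sec * batch_size);

        return 0;
}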
[18 Sep 2013 9:19] Annamalai Gurusami
We are unable to reproduce this problem. We followed the original how-to-repeat steps as described at:

https://www.facebook.com/notes/mysql-at-facebook/mysql-cpu-bound-performance-51-55-and-the...

Closing this bug as not reproducible.
[30 Sep 2013 20:51] Mark Callaghan
Why would this be closed as not a bug / not reproducible? There are many comments on this bug report from Oracle employees stating that at least one purge bug was fixed for the case of many rollback segments. For example:

[2 Dec 2012 8:30] Sunny Bains
The purge lag due to 128 rsegs has been fixed. This bug morphed into one about furious flushing. There have been several improvements made around flushing code in 5.6. The flushing is done in a separate thread now and has been moved out of the master thread.

Ovais, are you experiencing any issues around this? Do you have a test case that is purge related and not about furious flushing?
[30 Sep 2013 21:01] Sunny Bains
Mark, We've tried to reproduce this issue internally and have been unsuccessful so far. That's why we've decided to close it.
[2 Oct 2013 15:08] Mark Callaghan
The bug was created for purge lag on 5.5.8. Comments from InnoDB developers indicate the bug was fixed. My request is that you include the bzr commit that fixed the purge lag bug for 128 rollback segments.
[3 Oct 2013 6:09] Annamalai Gurusami
Please refer to the following commit for further details:

    revno: 2995.37.155
    revision-id: sunny.bains@oracle.com-20110222050408-73uxqkbkpm17emfu
    parent: marko.makela@oracle.com-20110217202756-hr889aqmt152tfn0
    committer: Sunny Bains <Sunny.Bains@Oracle.Com>
    branch nick: 5.5
    timestamp: Tue 2011-02-22 16:04:08 +1100
    message:
      Bug #11766227: InnoDB purge lag much worse for 5.5.8 versus 5.1
      Bug #11766501: Multiple RBS break the get rseg with mininum trx_t::no code during purge

      Bug# 59291 changes:

      The main problem is that truncating the UNDO log at the completion of every
      trx_purge() call becomes expensive as the number of rollback segments is
      increased. We now truncate after a configurable amount of pages. The
      innodb_purge_batch_size parameter is used to control when InnoDB does the
      actual truncate: the truncate is done once every 128 (TRX_SYS_N_RSEGS)
      iterations, i.e. after purging 128 * innodb_purge_batch_size pages. The
      smaller the batch size, the sooner we truncate.
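
As a quick check on that arithmetic (standalone sketch, not InnoDB code): with the truncate running once per TRX_SYS_N_RSEGS purge iterations, roughly 128 * innodb_purge_batch_size pages are purged between truncates. The batch sizes below are the two values mentioned in this thread (100 by Ovais in the next comment, 300 as the 5.6 default in the last one).

#include <stdio.h>

#define TRX_SYS_N_RSEGS 128

int main(void)
{
        const unsigned batch_sizes[] = { 100, 300 };
        unsigned i;

        for (i = 0; i < sizeof(batch_sizes) / sizeof(batch_sizes[0]); i++) {
                printf("innodb_purge_batch_size=%u -> truncate once per ~%u purged pages\n",
                       batch_sizes[i], TRX_SYS_N_RSEGS * batch_sizes[i]);
        }

        return 0;
}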
[3 Oct 2013 18:59] Ovais Tariq
Mark,

Unfortunately I do not have a test case to prove that the purge lag is greater on 5.5 as compared to 5.1, because I wasn't able to reproduce it in a synthetic environment. I didn't get much time to do more benchmarks though, I must confess.

However, I can tell you that I have worked with one of our customers here at Percona who saw the purge lag increase many-fold once they upgraded from 5.1 to 5.5, with no change in infrastructure or workload, which is why I was also interested in this bug. Obviously the only changes were the code changes between the two major MySQL releases and the fact that they had the purge thread enabled in 5.5. The main characteristics for them are a write-intensive workload and a slow IO subsystem; the datastore is on NFS.

I was able to resolve the purge lag issue by disabling the purge_thread and increasing the purge_batch_size to 100.

There is not much more I can share, though. I may get to run more benchmarks in the future around the theme that purge lag on 5.5 is higher with a write-intensive workload and a slow IO subsystem.
[3 Oct 2013 20:29] Sunny Bains
The purge batch size (--innodb-purge-batch-size) should be 300 or thereabouts. That is the default in 5.6+.