MySQL Bugs: #68555: thread convoys from log_checkpoint_margin with innodb_buffer_pool

Bug #68555	thread convoys from log_checkpoint_margin with innodb_buffer_pool_instances > 1
Submitted:	3 Mar 2013 21:34	Modified:	11 Nov 2013 15:53
Reporter:	Mark Callaghan	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S5 (Performance)
Version:	5.6.10	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Thread convoys can form in log_preflush_pool_modified_pages when check_flush_or_checkpoint is set to TRUE after buf_flush_list returns FALSE. The problem is that buf_flush_list returns FALSE when a flush is in progress for any of the buffer pool instances -- which happens when buf_flush_start returns FALSE.

On the return of FALSE, a thread in log_checkpoint_margin resets check_flush_or_checkpoint to TRUE, which has global visibility, and then starts the loop over. Setting that to TRUE forces other threads to call into log_checkpoint_margin which is a side-effect I don't like as all of those threads end up waiting for no good reason. Given the thread that set check_flush_or_checkpoint to TRUE is about to repeat the loop why invite other threads to do the same?

The problem is worse when innodb_buffer_pool_instances > 1. I would see many threads with this stack:
os_thread_sleep
fil_flush
fil_flush_file_spaces
buf_flush_common
buf_flush_list
log_preflush_pool_modified_pages
log_checkpoint_margin
log_check_margins
log_free_check
row_upd
row_upd_step
row_update_for_mysql
ha_innobase::update_row
handler::ha_update_row
mysql_update
mysql_execute_command
mysql_parse
mysql_parse

How to repeat:
Run an update-intensive workload for a cached database and set innodb_io_capacity to a smaller than optimal value so that foreground threads have to do fuzzy checkpoint page writes.

This shows the performance regression from using multiple buffer pool instances. The numbers are QPS where each query is update 1 row by PK. The blog post has more details.

updates/second for innodb_buffer_pool_instances=1
    8      16      32      64    128      256   concurrent clients
36691   54741   62789   62262   61208   60884   orig5610, iocap=1000
34860   55060   61938   62270   61661   61107   orig5610, iocap=2048
40105   56780   64174   64257   62996   62208   orig5610, iocap=4096
43293   61882   66132   64147   63124   62533   orig5610, iocap=8192

updates/second for innodb_buffer_pool_instances=8
    8      16      32      64     128     256   concurrent clients
30076   30535   39266   33280   37135   31415   orig5610, iocap=1000
30151   32422   38110   39712   37551   34242   orig5610, iocap=2048
32092   38397   43014   44260   42148   41299   orig5610, iocap=4096
38302   51486   55676   55044   54883   54032   orig5610, iocap=8192

----

This is QPS after apply the diff provided in this bug report. QPS is much better.

updates/second for innodb_buffer_pool_instances=8
    8      16      32      64     128     256   concurrent clients
46726   70465   57360   49329   49120   48730   orig5610, iocap=1024
42710   70487   65408   57632   57587   57067   orig5610, iocap=2048
42464   66687   70299   63091   62940   61831   orig5610, iocap=4096
39295   69216   72740   66955   65382   65283   orig5610, iocap=8192
41798   68033   71868   66918   64496   64865   orig5610, iocap=16384

And QPS improves a bit more after switching to innodb_flush_method=O_DIRECT_NO_FSYNC.

updates/second for innodb_buffer_pool_instances=8
    8      16      32      64     128     256   concurrent clients
44092   69082   73446   69393   68118   67118   orig5610, iocap=1024
42445   67282   74463   69547   68324   67443   orig5610, iocap=2048
40601   69185   75044   69907   68406   67640   orig5610, iocap=4096
35835   68327   74777   69765   68069   67286   orig5610, iocap=8192
41997   66772   75226   70166   68531   67716   orig5610, iocap=16384

Suggested fix:
--- orig5610/storage/innobase/log/log0log.cc	2013-02-26 14:29:03.000000000 -0800
+++ mysql-5.6.10/storage/innobase/log/log0log.cc	2013-01-22 08:54:49.000000000 -0800
@@ -2101,7 +2101,6 @@
 	ibool		checkpoint_sync;
 	ibool		do_checkpoint;
 	bool		success;
-	bool		force = FALSE;
 loop:
 	checkpoint_sync = FALSE;
 	do_checkpoint = FALSE;
@@ -2110,12 +2109,11 @@
 	mutex_enter(&(log->mutex));
 	ut_ad(!recv_no_log_write);
 
-	if (!force && log->check_flush_or_checkpoint == FALSE) {
+	if (log->check_flush_or_checkpoint == FALSE) {
 		mutex_exit(&(log->mutex));
 
 		return;
 	}
-	force = FALSE;
 
 	oldest_lsn = log_buf_pool_get_oldest_modification();
 
@@ -2159,8 +2157,7 @@
 		if (!success) {
 			mutex_enter(&(log->mutex));
 
-			/* log->check_flush_or_checkpoint = TRUE; */
-			force = TRUE;
+			log->check_flush_or_checkpoint = TRUE;
 
 			mutex_exit(&(log->mutex));
 			goto loop;

More results at http://mysqlha.blogspot.com/2013/03/mysql-56-cached-update-only-workload.html

Mark,

log_free_check() is our guard against overwriting redo log. check_flush_or_checkpoint flag tells a thread that it cannot safely proceed without first making sure if there is enough room in the redo logs. In log_checkpoint_margin() we'll only go to preflush if we have crossed sync limit.

What is important to note here is that redo log is a system wide construct. Even with multiple buffer pools if the overall oldest modification tells us that there is not enough redo space usable then we have no choice but to wait for LSN to move forward.

So the wait in this case seems reasonable choice but there is an issue with the current design. Dimitri nailed it down that if we have multiple buffer pools the page_cleaner thread which is responsible for flushing in background will do flush batches for each buffer pool instance serially. This means that if the oldest modification is in, say, the buffer pool instance every thread will remain stuck unless the page cleaner is able to clear that.

Also the stack you mentioned above should not be for more threads then buffer pool instances. This stack means a thread doing flush batch but there is only one batch allowed per instance. If you see it in more threads then there is a bug which needs to be fixed.

Note that I doubled QPS by changing the code to prevent all threads from getting stuck in this code.

I don't think a foreground thread in log_checkpoint_margin() should set check_flush_or_checkpoint when log_preflush_pool_modified_pages returned FALSE. That thread is about to do "goto loop" and will attempt to do flushing if needed. Why force the other threads to do the same only because FALSE was returned in this case? Note that FALSE is returned because flush was in progress by another thread for at least one of buffer pool instances, so FALSE doesn't mean a flush is more urgent and all threads should stop.

Mark,

How can we safely let other threads do the work?
* Suppose T1 is doing the flush. 
* T2 goes to log_preflush_pool_modified_pages() and returns FALSE. 
* Lets assume check_flush_or_checkpoint is not set by T2. 
* Now T3 comes in log_checkpoint_margin() and seeing check_flush_or_checkpoint == FALSE it continues to do its work
* T3 can now generate redo and if for some reason T1 hasn't flushed pages yet we'll have a situation where redo can potentially be overwritten before the corresponding dirty pages have made it to the disk.

Lets consider a 4G combined log file size. 85% of that is 600 megabytes from using all of the available log space. Even with our new default log file size the 15% remaining is 14.4 megabytes, 1.5 times the old total amount of redo log space.

Is it really necessary to block when so much redo space is available?

Inaam - T3 should decide in log_checkpoint_margin whether or not to flush based on the comparisons between checkpoint_age, max_checkpoint_age and max_checkpoint_age_async. The problem is that the current code can force T3 into log_checkpoint_margin when it sets check_flush_or_checkpoint for no good reason. T3 can also decide elsewhere, like in log_free_check whether or not to call log_checkpoint_margin.

By "for no good reason" I mean that log_preflush_pool_modified_pages returned FALSE. There is code elsewhere that sets check_flush_or_checkpoint. That code is fine. It is only the set in log_checkpoint_margin that is the problem.

I don't think that a return of FALSE from log_preflush_pool_modified_pages, which is really a return of FALSE from buf_flush_start for one of the buf pool instances, which is really an indicator that a flush was in progress for that buf pool instance...

... I don't think that anything in the previous paragraph is a reason to force all threads into log_checkpoint_margin. But that is what the code does today and it is much more of a problem when innodb_buffer_pool_instances > 1

James - may be it is a conservative threshold. In any case, we need to have a HWM beyond which the threads are forced to wait. What is a good value is separate issue and opinions can differ on that. May be its my background as civil engineer which makes me conservative when choosing factor of safety :)

Mark - I am bit confused now. Your objection is the extra call to log_checkpoint_margin() or the wait that is forced on T3?

If, in our example, threads T3...Tn are not forced to wait on flush list batch then what will stop them from generating more redo and in the process reuse the redo that is still required for recovery?
Once we cross the threshold everyone must wait. I understand why this is more painful in case of multiple instances. Dimitri suggested something like parallel flushing to mitigate that. We also have some ideas about probably a wait for the whole batch to complete is not required and we can let threads proceed once enough reusable space is available. But none of this changes the fact that at certain point we have to have the logic to force a wait on all threads.

Inaam, I strongly agree that we do need to have a HWM. Just not where. :) Chemistry degree here and I managed not to kill myself. :)

Lets see how we can do discussing possible thresholds. How do these fit with your mental model of how it should work?

1. 30 seconds of estimated usage remaining, hard block on other activity - the freeze Mark is writing about.
2. 60 seconds of estimated usage remaining, still let other threads flush so we don't block and limit the flushing rate, which makes the problem worse and is why Mark is discussing this with us. I'd think of this as the sync flushing point, just not quite so strict. A good time to boost flushing priority also.
3. 120 seconds of estimated usage remaining, the async flushing threshold.

The reason we need to try to come up with better thresholds compared to space is the log file sizing issues and how they interact with other things:

A. adaptive flushing adjusts work rate based on usable space remaining, so setting a threshold too low causes extra writes. That hurts foreground performance (less write combining) and SSD drive life.
B. the log files end up cached by the OS, so there's an incentive to try to make them smaller. Having over-large margins for safety shifts the risk to swapping and out of memory failures instead.

So the safety margin isn't free, in some ways it shifts the problem to a different place. We do have to have a safety margin, just not the ones we get now when we're recommending good sizes for current systems. The old limits have just become out of date as the hardware and server have got better.

Mark, do you have any thoughts on how we might do this, ideally so it just works without much chance of end users having to tune, part of why I like time remaining - it's potentially nicely adaptive to workload? Outside opinions are always interesting...

James Day, MySQL Senior Principal Support Engineer, Oracle

My objection is that when T2 is in log_checkpoint_margin() and its call to log_preflush_pool_modified_pages() returns FALSE, then it sets check_flush_or_checkpoint to TRUE. Setting it there forces other threads to call log_checkpoint_margin when they don't need to call it and then they stall.

check_flush_or_checkpoint is set elsewhere. I am not asking for those sets to be changed.

Just got off the phone with Inaam and now I understand why my patch doesn't do what I think it does. I will repeat the tests. However, one root cause is the sleep in fil_flush. Just looked at the thread stacks again and all threads were stuck on a 20 millisecond sleep. We need something better in that case as my hosts fsync in less than 1 millisecond.

Filed http://bugs.mysql.com/bug.php?id=68588 for the sleep. Will keep this open until I get results with that sleep removed.

---

retry:
                        if (node->n_pending_flushes > 0) {
                                /* We want to avoid calling os_file_flush() on
                                the file twice at the same time, because we do
                                not know what bugs OS's may contain in file
                                i/o; sleep for a while */

                                mutex_exit(&fil_system->mutex);

                                os_thread_sleep(20000);

Mark,

How did testing go with 5.6.12 that has the fix for Bug #68588, removing most of the OS_THREAD_SLEEP calls from InnoDB? Did that resolve the issue so we can close this as a duplicate of that fixed bug?

James Day, MySQL Senior Principal Support Engineer, Oracle

I think it was better but I am taking a break from MySQL perf testing.