Bug #70453 Add hard timeouts to page cleaner flushes
Submitted: 27 Sep 2013 15:51 Modified: 1 Oct 2013 12:51
Reporter: Laurynas Biveinis (OCA) Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S4 (Feature request)
Version:5.6 OS:Any
Assigned to: CPU Architecture:Any
Tags: cleaner, flushing, innodb

[27 Sep 2013 15:51] Laurynas Biveinis
Description:
Currently the page cleaner thread is designed as follows, if I understand correctly:
- in a loop:
- do LRU tail flushing, controlled by max LRU scan depth;
- do flush list flushing, controlled by I/O capacity;
- sleep for the remainder of the 1 second period.
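The loop above can be sketched roughly as follows (a minimal illustration with invented names, not the actual InnoDB code; the phase durations are passed in so the arithmetic is easy to check):

```cpp
#include <algorithm>

// Hypothetical sketch of the page cleaner's per-iteration budget.
// PAGE_CLEANER_PERIOD_MS is an invented constant standing for the
// designed 1-second cadence.
const long PAGE_CLEANER_PERIOD_MS = 1000;

// Returns how long the cleaner should sleep after an iteration whose
// LRU tail flush and flush list flush took the given times (ms).
long remaining_sleep_ms(long lru_flush_ms, long flush_list_ms) {
        long elapsed = lru_flush_ms + flush_list_ms;
        // If flushing overran the 1 s budget there is no sleep at all --
        // and nothing bounds how far past the budget flushing can run,
        // which is exactly the situation this report describes.
        return std::max(0L, PAGE_CLEANER_PERIOD_MS - elapsed);
}
```

With well-tuned I/O settings the two phases finish well under the budget and the sleep absorbs the rest; under mutex contention `elapsed` can grow unboundedly and the sleep simply vanishes.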

Now proper tuning (correct me if I'm wrong) would strive to minimize the sleep time and utilize the storage as close to capacity as possible.

But the flushing time is determined not only by the I/O variable settings, but also by how quickly the cleaner can get access to shared resources such as the buffer pool mutexes.  This means that a page cleaner iteration, even with properly tuned I/O settings, can exceed 1 second.  Under very heavy load (sysbench, I/O-bound, 512 threads on 32 cores) one page cleaner iteration might take as long as minutes.  This is bad for several reasons: if LRU flushing takes that long, flush list flushing is not happening and the query threads go to sync preflush; if flush list flushing takes that long, LRU tail flushing is not happening and the query threads fall back to single-page LRU flushes.  Moreover, the page cleaner and adaptive flushing heuristics are designed to be updated roughly every second, and will probably not get a very exact idea of what's going on if invoked once in several minutes.

How to repeat:
Benchmarks, code analysis

Suggested fix:
A partial solution would be to add a hard timeout, i.e. limit the LRU tail flush to 1 second, checked after each batch, and limit flush list flushing to 1 second too (the exact value of the constant is debatable).  It looks like it's more important to re-check the heuristics and to alternate periodically between the two than to allow one of them to complete fully.

This is not a complete solution for the case of multiple buffer pool instances, because with a timeout, some instances may receive no flushing at all.  This will be reported separately.
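The suggested hard timeout could look roughly like this (a sketch with invented names; real code would read a monotonic clock after each batch rather than take a simulated per-chunk cost):

```cpp
// Hypothetical cap on one flushing phase, per the suggestion above.
const long FLUSH_TIMEOUT_MS = 1000;

// Flush in chunks, stopping once the (simulated) elapsed time reaches
// the cap.  chunk_cost_ms stands in for the measured duration of one
// batch chunk; returns the number of chunks actually completed.
int flush_with_timeout(int n_chunks, long chunk_cost_ms) {
        long elapsed = 0;
        int done = 0;
        for (int i = 0; i < n_chunks; i++) {
                if (elapsed >= FLUSH_TIMEOUT_MS) {
                        break;  // hard timeout: yield to the other
                                // flusher and re-check heuristics
                }
                elapsed += chunk_cost_ms;  // do one batch chunk
                done++;
        }
        return done;
}
```

The point of checking after each chunk, rather than once per phase, is that a single chunk is the smallest unit the flusher can abandon without leaving a batch half-issued.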
[27 Sep 2013 16:23] Laurynas Biveinis
Credit goes to Alexey Stroganov.
[27 Sep 2013 17:20] Sinisa Milivojevic
Hi Laurynas,

First of all, give my best regards to Alexey. Have not heard from him for a long time. Now to the subject of this bug.

The page cleaner thread is designed to run once per second and NEVER for longer than one second. Let me be more verbose on the subject ...

Currently, the entire adaptive flushing heuristic is designed around this assumption.  This means that the page cleaner thread behaves as follows:

flushing + sleeping = 1 second (approximately)

The flushing activity of the page cleaner thread can be controlled by adjusting innodb_io_capacity.  This indirectly affects the sleeping time of the page cleaner thread: with a higher innodb_io_capacity, the page cleaner thread will do more flushing per second and hence the time available for it to sleep will be reduced.

It is, indeed, designed in such a manner that even the sleeping time is not configurable, as that would make flushing occur more than once per second. Also, sleeping always occurs unless the server is idle. The adaptive flushing heuristics are designed so that the page cleaner thread is allowed to run only once per second and never for longer than one second. There are plans to reconsider this algorithm, but it will not happen soon.

Let me know if you observe different behavior.
[28 Sep 2013 8:39] Laurynas Biveinis
Sinisa -

Best regards passed!

You are right that the page cleaner thread is designed to run once per second and never more.  But the design is not necessarily fully implemented, nor necessarily bug-free, and we have indeed seen the behavior described above happen under the described conditions.  Neither buf_flush_LRU_tail() nor page_cleaner_flush_pages_if_needed() actually tracks its running time.

Say you properly tune your server and set innodb_io_capacity and innodb_lru_scan_depth.  Each iteration takes under 1 second, as designed.  Now let's say 500 extra new connections arrive.  The cleaner thread still tries to flush according to these two parameters, but it gets less CPU time to run and has to wait longer for mutexes.  So what previously took less than 1 second now takes significantly longer (we have seen up to 1 minute), and there is nothing in the code to prevent this.
[30 Sep 2013 17:26] Inaam Rana
Laurynas,

Theoretically, what you are saying is correct. IIRC it is possible that the page_cleaner ends up taking much more than 1 second (and there are cases where you really can't tune it to tame it down; for example, when doing async flushing of dirty pages from the flush_list, it is totally governed by the state of the redo logs). I think Dimitri has put in stuff so that the heuristics don't go completely whacked when this happens, but it is very likely that adaptive flushing recommendations can be off the mark for some time after such a huge batch.

Having said this, my feeling is that we won't be able to get remarkably better performance by fine-tuning the page_cleaner. I think if we are seeing multi-second batches then we are already in a truly IO-bound state. And when a workload becomes IO-bound in the true sense, it doesn't make much difference whether the IO happens in the foreground or the background. So if the page_cleaner is stuck in a flush_list batch then we'll have single page flushing in the LRU, and if the page_cleaner is stuck in an LRU batch then we'll have sync flushing in user threads. In either case, toggling between the two is not going to help us in any big way (unless the DBA has set disproportionately higher values for some parameters).

The way I look at it, the whole async background IO is meant to take advantage of idle IO capacity to avoid such bursts. If, despite that, bursts happen, then such fine details become irrelevant while they last.

Multiple page_cleaners is one feature that I'd like to see, as it overcomes one potential design limitation of the current implementation.
[30 Sep 2013 18:54] Alexey Stroganov
As highlighted before, the page_cleaner 'rhythm' consists of two parts: flushing of the LRU list and flushing of the flush list. Indeed, the flush list flusher is not something that can be tuned much to match the 1 sec requirement. But it's actually not the major violator of this 'rhythm'; the LRU flusher is. I believe we will file another issue describing what is wrong there. In short, the most important aspect is that it often does _very_ excessive scanning in order to fulfill the flushing requirements. The LRU flusher may perform flushing/eviction for more than 30-60-120 seconds under a high RW load. This not only notably affects the uniformity of the buffer pool instances (as the LRU flusher processes them one by one) and causes sporadic stalls, but also disruptively affects the flush list flusher and the page cleaner AF heuristics. Blocking the flush list flusher for a long time prevents the page cleaner from controlling the checkpoint age, and as a result, if the redo log size is not big enough, we will be subject to massive sync flushing that could and should be avoided if both flushers tried to maintain the required 'rhythm'.

I would note that adding such timeouts is not possible without changing the logic in both flushers - from sequential scanning of buffer pool instances to pseudo-parallel.
[30 Sep 2013 22:11] Inaam Rana
Alexey,

Your points are valid. I believe some of these are already addressed in 5.7. I'd like to request InnoDB team to consider backporting some of this stuff to 5.6 as it prevents serious issues with flushing.

The excessive scanning was spotted by Dimitri in the context of flush list batches. In 5.6 a fix went in using the 'hazard pointer' concept. But the same issue can rear its head in four different locations:
1) flush list batch scanning (fixed in 5.6)
2) In LRU list batch scanning
3) In LRU list search for free page
4) In single page flush scanning

In 5.7 all of the above were fixed using the same concept of hazard pointers.

Another piece of code that was of concern was in buf_flush_LRU_tail():

/* We divide LRU flush into smaller chunks because
there may be user threads waiting for the flush to
end in buf_LRU_get_free_block(). */
for (ulint j = 0;
     j < scan_depth;
     j += PAGE_CLEANER_LRU_BATCH_CHUNK_SIZE) {

        ulint   n_flushed = 0;

        /* Currently page_cleaner is the only thread
        that can trigger an LRU flush. It is possible
        that a batch triggered during last iteration is
        still running, */
        if (buf_flush_LRU(buf_pool,
                          PAGE_CLEANER_LRU_BATCH_CHUNK_SIZE,
                          &n_flushed)) {

                /* Allowed only one batch per
                buffer pool instance. */
                buf_flush_wait_batch_end(
                        buf_pool, BUF_FLUSH_LRU);
        }

What the above meant is that we divided the LRU batch into smaller chunks and then waited for each chunk to end before issuing the next mini-batch. However, in 5.7 this is fixed as well, and now the LRU flusher can post one single big batch and won't wait for it to finish before moving on to the next buffer pool instance.

I think all of this put together should alleviate many woes of LRU flushing, and the following commit is worth seriously considering for backporting to 5.6:

revno: 6228
committer: Sunny Bains <Sunny.Bains@Oracle.Com>
branch nick: trunk
timestamp: Fri 2013-08-09 08:37:45 +1000
message:
  WL#7047 - Optimize buffer pool list scans and related batch processing code
  
  Reduce excessive scanning of pages when doing flush list batches. The
  fix is to introduce the concept of "Hazard Pointer", this reduces the
  time complexity of the scan from O(n*n) to O(n).
  
  The concept of hazard pointer is reversed in this work.  Academically a
  hazard pointer is a pointer that the thread working on it will declare as
  such and as long as that thread is not done no other thread is allowed to
  do anything with it.
  
  In this WL we declare the pointer as a hazard pointer and then if any other
  thread attempts to work on it, it is allowed to do so but it has to adjust
  the hazard pointer to the next valid value. We use hazard pointer solely for
  reverse traversal of lists within a buffer pool instance.
  
  Add an event to control the background flush thread. The background flush
  thread wait has been converted to an os event timed wait so that it can be
  signalled by threads that want to kick start a background flush when the
  buffer pool is running low on free/dirty pages.
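A much-simplified, single-threaded illustration of the "reversed" hazard pointer idea described in the commit message (invented names, not the WL#7047 code): the scanner publishes its current position, and whoever removes that exact node must first advance the pointer to the next valid node, so the scan never has to restart from the tail:

```cpp
#include <initializer_list>
#include <iterator>
#include <list>

// Toy stand-in for a flush list plus the scanner's published position.
// Real InnoDB code is multi-threaded and mutex-protected; this sketch
// only shows why the scan keeps its O(n) cost.
struct FlushList {
        std::list<int> pages;
        std::list<int>::iterator hazard;  // scanner's current position

        FlushList(std::initializer_list<int> p) : pages(p) {
                hazard = pages.begin();
        }

        // Remove a page; if it is the hazard-pointed node, advance the
        // hazard pointer first instead of invalidating the scan.
        void remove(int page) {
                for (auto it = pages.begin(); it != pages.end(); ++it) {
                        if (*it == page) {
                                if (it == hazard) {
                                        hazard = std::next(it);
                                }
                                pages.erase(it);
                                return;
                        }
                }
        }
};
```

Without the pointer adjustment, a concurrent removal at the scanner's position would force a restart from the end of the list, which is what made the flush list scan O(n*n) before the fix.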
[30 Sep 2013 22:43] Alexey Stroganov
Inaam,

Yes, changing the LRU flusher to perform flushing in 1 pass is really nice (and we are going to evaluate that), but several aspects are still not addressed. There is still no explicit limit on how deep we are going to scan, and also no limit on the total number of scans, so it is still possible to spend a lot of time in buf_flush_LRU_list_batch(). The lru_depth_scan variable has a very confusing meaning there - it actually limits the number of free pages we need to obtain for the buffer pool instance, not how deep we are going to scan.

Another important aspect related to this bug report is that if we have N buffer pool instances we process them one by one, sequentially. Imagine that we issue a large batch for one instance while all the other instances have no free pages (which is almost always the case in an IO-bound scenario): they will be forced to wait until the started batch finishes. That leads to bad uniformity, stalls, etc. So processing the instances in batches will help to smooth uniformity and will also allow a timeout to keep the page_cleaner 'rhythm' stable regardless of any complexities that may arise during flushing.
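The pseudo-parallel scheme can be sketched as follows (hypothetical names and simulated chunk costs, not InnoDB code): issue one small chunk per instance per round, so a timeout can cut the pass short while every instance still gets some service before any instance gets a second chunk:

```cpp
#include <cstddef>
#include <vector>

// Invented per-pass time budget for the sketch.
const long BUDGET_MS = 1000;

// chunk_cost[i] is the simulated cost (ms) of flushing one chunk of
// buffer pool instance i; returns how many chunks each instance got.
std::vector<int> flush_round_robin(const std::vector<long>& chunk_cost,
                                   int rounds) {
        std::vector<int> issued(chunk_cost.size(), 0);
        long elapsed = 0;
        for (int r = 0; r < rounds; r++) {
                for (std::size_t i = 0; i < chunk_cost.size(); i++) {
                        if (elapsed >= BUDGET_MS) {
                                return issued;  // timeout for the pass
                        }
                        elapsed += chunk_cost[i];  // flush one chunk
                        issued[i]++;
                }
        }
        return issued;
}
```

Contrast with the sequential scheme, where the whole budget could be consumed inside instance 0's batch and the remaining instances would receive nothing at all.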
[30 Sep 2013 23:25] Inaam Rana
Alexey,

Why do you think that the page_cleaner ends up scanning very deep when doing LRU flushing? Scanning from the tail, a block will be either clean (a candidate for eviction) or dirty (a candidate for flushing). The exceptions should be buffer-fixed and IO-fixed blocks, which should be rare. If the page_cleaner is, on average, scanning much deeper than innodb_lru_scan_depth to process an LRU batch, then there must be a lingering bug somewhere.

About stalls in the case of multiple instances, you are right. I guess the most elegant solution is to have multiple page_cleaner threads, one per instance.
[1 Oct 2013 0:02] Alexey Stroganov
Inaam,

> Why do you think that page_cleaner ends up scanning very deep when doing LRU flushing?

OK, whether it is very deep or not is a separate question. From my observations in 5.6 (I tracked the avg/max depth we reach in the LRU flusher), it is actually not that deep. The problem in 5.6 is scan restarts and the total number of scans: it is not limited in any way, and we may perform quite a notable number of scans because of that. I believe that in 5.7, with flushing/eviction in one pass, this should work much better.
[1 Oct 2013 7:17] Laurynas Biveinis
Inaam -

Hazard pointers aside (they are nice and I expect they will help; I agree that a 5.6 backport should be considered), I slightly disagree that, by the time we get to query threads doing flushes themselves, it doesn't matter who is doing them because it's I/O-bound anyway.

We found that with a tuned page cleaner (timeouts, AF, furious flushing, pseudo-parallel across buffer pool instances - even incompletely tuned, i.e. no parallel threads), allowing query threads to do any flushing work (sync preflush or LRU single page flush) is redundant and even harmful. Redundant, because the cleaner, already operating at I/O capacity with the above tuning, will soon do what's needed for the query thread. Harmful, because the query threads, in their attempts to do the cleaner's job, will spend CPU, attempt to grab mutexes, and issue I/O - and since we are already in the sync preflush or empty free list zone, those resources are probably not in surplus.

Thus we found that making query threads simply wait for the required resource (a free page, log space) to become available, instead of attempting to obtain it themselves, performs better. But that requires the timeout fix and the other cleaner fixes.
[1 Oct 2013 12:51] Sinisa Milivojevic
Based on the entire discussion, I conclude that, after all, this is a feature request, but a fully verified one. That means that this FR would be beneficial for the entire page cleaning process and would improve performance.
[2 Oct 2013 0:16] Sunny Bains
In 5.7 I changed the sleep() to a timed event. I think we can use that to trigger the background flush from user threads. Currently this is triggered from buf_LRU_get_free_block(). Inaam et al., let me know all the places you think this should be triggered from.

Backporting the flush changes is a separate problem; let's try to solve this in 5.7 first.
[2 Oct 2013 19:10] Inaam Rana
Sunny,

The current code is very conservative in triggering the event.

        /* If we have scanned the whole LRU and still are unable to
        find a free block then we should sleep here to let the
        page_cleaner do an LRU batch for us. */

        if (n_iterations > 1) {
                os_event_set(buf_flush_event);

                os_thread_sleep(10000);
        }
1346 

I have never seen a single_page_flush failure, ever. In short, I have never observed n_iterations > 1.

I think a more appropriate place to trigger can be:

                if (!freed && n_iterations == 0) {
                        /* Tell other threads that there is no point
                        in scanning the LRU list. This flag is set to
                        TRUE again when we flush a batch from this
                        buffer pool. */
                        buf_pool->try_LRU_scan = FALSE;
                }

That is, when a user thread first discovers that the LRU tail is full of dirty pages.
[2 Oct 2013 20:05] Inaam Rana
Sunny,

While Laurynas and Ranger seem to have done benchmarks that can point out potential issues, here is my POV about user thread waits (based on benchmarking done while writing this code):

In terms of underlying IO subsystem we can categorize LRU issues in two different types:

When running on a high-end IO subsystem (SSD), I have seen that it really doesn't matter much whether we do single page flushing or batch flushing. Typically it gives better performance to do single page flushing instead of waiting for a currently running LRU batch to end and then looking for a free page in the free list.

When running on a regular IO subsystem, the problem can be further divided into two sub-categories. When the doublewrite buffer is enabled, we are constrained by its size. We can only push a 120-page batch to the InnoDB IO layer (8 slots are reserved for single page flushes). What that means is that even if we wake up the page_cleaner and request it to trigger a large batch, the batch will still be chopped into smaller pieces at the dblwr level, i.e. we'll flush 120 pages, wait for them to make it to disk, and then start working on the next 120 pages. This should change if and when we have multiple dblwr buffers. Currently, in this scenario, it may make sense to actually wait for the batch to end and try our luck again.
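The chopping effect amounts to simple arithmetic (a sketch assuming the 120-page-per-pass limit quoted above; the constant name is invented):

```cpp
// Assumed per-pass dblwr capacity for batch flushes, per the
// description above (8 of the slots are reserved for single page
// flushes and are not available to batches).
const int DBLWR_BATCH_PAGES = 120;

// Number of synchronous dblwr passes needed to push an n-page batch
// to disk: ceil(n_pages / 120).
int dblwr_passes(int n_pages) {
        return (n_pages + DBLWR_BATCH_PAGES - 1) / DBLWR_BATCH_PAGES;
}
```

So a 1000-page batch still serializes into 9 wait-for-disk passes regardless of how large a batch the page_cleaner posts, which is why a bigger requested batch buys little on a regular IO subsystem with dblwr enabled.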

When running without dblwr on a regular IO subsystem, I am not sure which strategy is better. I have seen mixed results.

What is said above concerns whether or not to wait for an LRU batch instead of doing single page flushing. I believe signalling the page_cleaner to trigger a batch whenever the tail is dirty is a good option in all cases.
[8 Oct 2013 9:20] Sunny Bains
Inaam, ok, I will revisit the dblwr buffer patch so that the bottleneck is fixed.