Bug #69170 buf_flush_LRU is lazy
Submitted: 7 May 2013 22:50 Modified: 13 May 2013 15:30
Reporter: Mark Callaghan
Status: Verified
Category:MySQL Server: InnoDB storage engine Severity:S3 (Non-critical)
Version:5.6.11 OS:Any
Assigned to: Assigned Account CPU Architecture:Any

[7 May 2013 22:50] Mark Callaghan
This might be more serious than S3. From benchmarks and then instrumenting the code I see that when buf_flush_page_cleaner_thread calls buf_flush_LRU_tail, it flushes ~100 pages per buffer pool instance per call despite there being thousands of dirty pages in the LRU and no free pages. 

I don't have a benchmark result to show that the same problem exists for the flush list.

With the changes suggested here, the updates/second rate for my benchmark is tripled.

How to repeat:
Run sysbench where each transaction updates 1 row by PK and the table is much larger than the InnoDB buffer pool. I set innodb_io_capacity and innodb_lru_scan_depth to a value >> 1000. My storage is fast. I also use 4 or 8 buffer pool instances.

I instrumented code to print how many pages are flushed from the LRU each time buf_flush_LRU_tail runs and the value was about 100 per iteration.

There might be a few problems here:
1) buf_flush_common always calls buf_dblwr_flush_buffered_writes. Shouldn't this only be done when page_count > 0?

2) buf_flush_start returns FALSE when buf_pool->n_flush > 0 and when that happens, buf_flush_LRU returns early (does nothing). This is part of my problem.

3) buf_dblwr_flush_buffered_writes doesn't call buf_flush_sync_datafiles. So after the first call to buf_flush_LRU that really finds pages to flush, the remaining calls for that buffer pool instance won't do anything because n_flush is > 0 (or is likely to be > 0).

Suggested fix:
1) change buf_flush_common to only call buf_dblwr_flush_buffered_writes when page_count > 0. Not sure this helps for this benchmark though.

2) change buf_flush_start to not check that buf_pool->n_flush[flush_type] > 0

3) change buf_flush_LRU_tail to call buf_flush_sync_datafiles if total_flushed > 0

Note that setting srv_LRU_scan_depth to a large value can burn a lot of CPU. This variable has two uses:
1) Foreground threads use it to determine how far into the LRU they can search for a free page.
2) The buf_flush_page_cleaner_thread uses it to determine how far into each LRU to search.

It is good for this to be large for #2 above, but making it large for #1 will cause problems unless the page cleaner thread stays ahead. And with the bug described here it definitely does not.
[8 May 2013 4:09] Inaam Rana

Before going into the details of the bug (yes, there is indeed a bug) let me describe the background a bit:
* Why do we do LRU scan in smaller chunks?
- This is because when an LRU batch is happening we are doing two things in the same scan.
  (a) If the page is replaceable we evict it and put it in free list
  (b) If the page is flushable we flush it and leave it in the LRU for the next iteration to put it on the free list
  We have this rather 'not so obvious' behavior because whenever we evict or flush a page we release the buffer pool mutex. That means we have to start the scan again from the tail of the LRU. 
For example, consider that we are working with an HDD and the lru_scan_depth is, say, 20,000. If we try to do a batch in a single pass, what will typically happen is:
* Assume all pages are dirty and flushable, i.e. not IO or buffer fixed.
* When we deal with the first page we IO fix it, dispatch the IO request and restart the scan.
* In the new scan we see the page at the tail IO fixed, we skip it, IO fix the next one, dispatch the IO request and restart the scan.
* For the third page we have to skip the first two. So this whole thing becomes O(n*n).
Now if we consider very fast storage, it is quite likely that by the time we are starting our new scan the page at the tail has already been flushed. This changes the whole picture. Now we'll simply evict the page and start the scan again.

Hence with larger LRU batches, scans can turn O(n*n) for a slower IO subsystem but will likely be O(n) for a very fast storage system. Given this, your solution to "change buf_flush_start to not check that buf_pool->n_flush[flush_type] > 0" may work for high end systems but not for HDD driven systems. In fact, in the case of SSD, instead of removing that check it may just be better to do the whole batch in one single sweep instead of doing it in smaller chunks.

I hope the above explanation gives you some insight into why the smaller chunk batch was introduced in the first place. Having said that, I fully agree there is a bug here. Because the chunk size is 100 (this value is driven by the available space in the dblwr buffer), all of the subsequent calls for the same instance will return immediately without doing work.

Bug Fix:

I have already pushed a fix for this. It'll be available in 5.6.12. However, I am not sure if you'd like the fix or not. The fix is to wait for the small chunk LRU batch to end before starting a new one. The idea is that it will work on both slow and fast systems. This is, however, a short term fix. What I am working on currently is to have the ability to do a scan of the LRU in O(n) even when buf_pool::mutex is released. If that goes through then the smaller chunk LRU batch splitting won't be needed at all and the page_cleaner won't have to wait for the batch to end.

Calling buf_flush_sync_datafiles():

It is not required because syncing is now done in the IO helper thread.
[8 May 2013 13:13] Mark Callaghan
Would this change be useful?

> 1) change buf_flush_common to only call buf_dblwr_flush_buffered_writes when page_count > 0. Not sure this helps for this benchmark though.
[9 May 2013 2:59] Inaam Rana

Yes, I think the change you mentioned is correct but I am not sure it can help in any big way. This gets called once per batch (or a chunk of it). Still, it won't hurt either. In fact, I think it will be better to have a boolean parameter which tells whether or not to wait if there is a currently running batch. In the above case we can pass it as false.
[9 May 2013 14:14] Inaam Rana
Assigning to Vasil to look into buf_dblwr_flush_buffered_writes() changes.

The core issue reported, i.e. buf_flush_LRU_tail() not working properly, has already been taken care of.
[27 Sep 2013 10:15] Laurynas Biveinis
Re. Mark's problem 1), merely skipping buf_dblwr_flush_buffered_writes() if page_count == 0 will be only a partial fix in the presence of compressed pages for LRU flushes, because this value will originate from buf_do_LRU_batch():

	if (buf_LRU_evict_from_unzip_LRU(buf_pool)) {
		count += buf_free_from_unzip_LRU_list_batch(buf_pool, max);
	}

	if (max > count) {
		count += buf_flush_LRU_list_batch(buf_pool, max - count);
	}

Notice the mixup of evicted and flushed pages.  In the case where we satisfy this LRU flush batch purely by unzip_LRU eviction with no flushing, we will still get page_count > 0 in buf_flush_common().

We are exploring a fix here and elsewhere of replacing single value counters with two-field structs for flushed and evicted pages.
[23 Jan 2014 9:31] Laurynas Biveinis
Related bug 71411.