Bug #69174: The page cleaner thread may sleep for too long.
Submitted: 8 May 2013 5:55    Modified: 24 Sep 2013 8:07
Reporter: zhai weixiang (OCA)    Status: Not a Bug
Category: MySQL Server: InnoDB storage engine    Severity: S3 (Non-critical)
Version: 5.6.11    OS: Linux
Assigned to: Vasil Dimov    CPU Architecture: Any

[8 May 2013 5:55] zhai weixiang
Description:

After stopping the benchmark and setting innodb_max_dirty_pages_pct to 0 to force all dirty pages to be flushed before shutting down the instance, I noticed that the number of pages flushed per second is not stable:

 time      flushed    Innodb_data_written
11:49:57  3692        117.3m
11:49:58  1125        33.5m
11:49:59  4187        134.6m
11:50:00  1241         35.0m
11:50:01  4076        127.4m
11:50:02  1250        39.3m
……

I sampled the output of pstack and frequently saw this backtrace:

buf_flush_page_cleaner_thread->page_cleaner_sleep_if_needed->os_thread_sleep

So I added two status variables to monitor the sleep time of the page cleaner thread:

After stopping the benchmark and setting innodb_max_dirty_pages_pct = 0:

mysql> show status like '%pc%';
+---------------------+-----------+
| Variable_name       | Value     |
+---------------------+-----------+
| Innodb_pc_sleep_num | 675       |
| Innodb_pc_sleep_sum | 556946000 |
+---------------------+-----------+
2 rows in set (0.00 sec)

A few minutes later (there were still a lot of dirty pages):

mysql> show status like '%pc%';
+---------------------+-----------+
| Variable_name       | Value     |
+---------------------+-----------+
| Innodb_pc_sleep_num | 757       |
| Innodb_pc_sleep_sum | 616655000 |
+---------------------+-----------+
2 rows in set (0.00 sec)

mysql> select (616655000-556946000)/(757-675) as avg_sleep_time;
+----------------+
| avg_sleep_time |
+----------------+
|    728158.5366 |
+----------------+
1 row in set (0.00 sec)

That is about 728 ms of sleep per iteration, i.e. the thread spends most of each one-second loop sleeping. Quoted code:

static
void
page_cleaner_sleep_if_needed(
/*=========================*/
    ulint   next_loop_time) /*!< in: time when next loop iteration
                should start */
{
    ulint   cur_time = ut_time_ms();

    if (next_loop_time > cur_time) {
        /* Get sleep interval in micro seconds. We use
        ut_min() to avoid long sleep in case of
        wrap around. */
        os_thread_sleep(ut_min(1000000,
                (next_loop_time - cur_time)
                * 1000));
    }
}

I then made the max sleep time (default 1000000 microseconds) configurable:

set global innodb_page_cleaner_sleep_max = 10000;

Then I get a stable page flush rate:

 time      flushed    Innodb_data_written

11:51:43   3971         120.6m
11:51:44   3987         124.6m
11:51:45   3859         124.6m
11:51:46   3992         121.0m
11:51:47   3855         120.7m
 

How to repeat:
1. Execute DML to generate enough dirty pages.
2. Stop the workload and set global innodb_max_dirty_pages_pct = 0.
3. Observe the number of pages flushed every second.

Suggested fix:
Make the max sleep time configurable, or reduce it.
[8 May 2013 5:57] zhai weixiang
a rough patch

Attachment: patch.diff (application/octet-stream, text), 3.07 KiB.

[8 May 2013 13:25] Mark Callaghan
Is it sleeping more than requested per iteration or not flushing innodb_io_capacity pages (per buffer pool instance) per iteration of the loop?
- What is innodb_buffer_pool_instances?
- Do the variables you added measure the actual or requested sleep time?

My assumptions are:
1) you set innodb_io_capacity = 4000
2) the "else" branch executes in buf_flush_page_cleaner_thread and calls page_cleaner_do_flush_batch

                } else {
                        n_flushed = page_cleaner_do_flush_batch(
                                                        PCT_IO(100),
                                                        LSN_MAX);
[8 May 2013 15:16] zhai weixiang
1.
innodb_buffer_pool_instances=8
innodb_io_capacity=1000

2. The variables are updated in page_cleaner_sleep_if_needed and measure the actual sleep time:

Quoted code:

    if (next_loop_time > cur_time) {
        /* Get sleep interval in micro seconds. We use
        ut_min() to avoid long sleep in case of
        wrap around. */
        ulint min_time = ut_min(srv_page_cleaner_sleep_max,
                                (next_loop_time - cur_time) * 1000);

        export_vars.innodb_pc_sleep_sum += min_time;
        export_vars.innodb_pc_sleep_num++;

        os_thread_sleep(min_time);
    }

We have a heartbeat script that updates one row (and only one row, in the heartbeat table) every two seconds, and that causes srv_check_activity(last_activity) != 0.
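
The check itself is just a counter comparison, so a single UPDATE per interval is enough to count as activity. A minimal sketch of how I understand it (simplified stand-in names, not the exact 5.6 code):

/* Simplified sketch (not the actual InnoDB source): the server keeps a
   global activity counter that user transactions bump, and the page
   cleaner only asks whether it has changed since the previous iteration. */
static ulint activity_count = 0;             /* stands in for srv_sys->activity_count */

static void note_user_activity(void)         /* hypothetical hook on the DML/commit path */
{
    activity_count++;                        /* the 2-second heartbeat UPDATE lands here */
}

static int server_is_active(ulint last_seen) /* analogous to srv_check_activity() */
{
    return(activity_count != last_seen);     /* one heartbeat row is enough to look "active" */
}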

I don't think it is reasonable for the page cleaner thread to sleep for so long.

I get a stable number of flushed pages immediately after setting srv_page_cleaner_sleep_max to a very small value (such as 10000).
[8 May 2013 15:39] zhai weixiang
Or maybe a more reasonable way of checking for activity is needed?
[9 May 2013 1:42] Inaam Rana
It is true that the page_cleaner won't go full speed if there is server activity happening. In fact, the page_cleaner thread will only try to maintain innodb_lru_scan_depth clean pages in the free list. Beyond this point it won't attempt to flush any pages. The variation that you are seeing is probably because in one second the page_cleaner, after seeing server activity, does only flush_list flushing, while in the next second, not seeing any activity, it does background flushing, which is full-speed flushing.
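
Condensed, the per-second decision looks roughly like this (a simplified sketch based on the snippet Mark quoted and my reading of the 5.6 code, not the verbatim implementation):

/* Simplified sketch of one iteration of buf_flush_page_cleaner_thread().
   Only one of the two flushing paths runs in a given second, which is
   what produces the alternating throughput shown in the report. */
page_cleaner_sleep_if_needed(next_loop_time);   /* may sleep up to 1 second */
next_loop_time = ut_time_ms() + 1000;

if (srv_check_activity(last_activity)) {
    last_activity = srv_get_activity_count();
    /* server looks active: flush_list/adaptive flushing only,
       throttled by the adaptive heuristics */
    n_flushed = page_cleaner_flush_pages_if_needed();
} else {
    /* server looks idle: background flushing at 100% of innodb_io_capacity */
    n_flushed = page_cleaner_do_flush_batch(PCT_IO(100), LSN_MAX);
}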

I am not sure what a better way to measure the server activity count would be. Ideally we would track only IO, since the page_cleaner is concerned only with IO, but internal background activities like purge and ibuf merge also generate IO, and counting those would unnecessarily throttle the page_cleaner thread.

Making the page_cleaner sleep time configurable is not going to be as simple as exposing a new variable. The adaptive flushing heuristics depend on the assumption that they are called once every second. If they are called 'n' times each second, then the adaptive flushing recommendations have to be adjusted for that.

I think in the short term we can think about making the server activity count more realistic. But I am afraid this is unlikely to be at the top of my TODO list.
[9 May 2013 14:12] Inaam Rana
Assigning this to Vasil.
[28 Jun 2013 20:55] James Day
This:

1. Looks like a regression bug from 5.5 to 5.6.
2. Seems to have broken a setting that's supposed to force high rate flushing if it's exceeded, innodb_max_dirty_pages_pct.
3. Also breaking common usage of innodb_max_dirty_pages_pct to reduce outage times during server restarts, by flushing the buffer pool while letting the server continue to accept connections, unlike a server shutdown request. Gradual reductions in innodb_max_dirty_pages_pct are the standard way to do this.
4. Seems also to break use of innodb_max_dirty_pages_pct to manage crash recovery time, by capping how much work can be involved based on the number of dirty pages, as a more firm cap than normal flushing or adaptive flushing but a less firm one than using small log files to force async flushing.

It's good to see that the page cleaner thread in normal circumstances with the server having some foreground work is trying not to keep more than innodb_lru_scan_depth free pages. Background threads doing too much work is a problem, particularly when that rate suddenly changes from slow to fast, and monitoring activity and responding accordingly for discretionary work is good.

But exceeding innodb_max_dirty_pages_pct isn't discretionary work. InnoDB is supposed to be doing high rate flushing to get down to that limit regardless of whether the server is active or not.

I suppose that merely cutting the flushing rate, instead of sleeping, when the server is both active and innodb_max_dirty_pages_pct is exceeded would help. Prudent DBAs are likely to try to keep at least a 20% margin to handle load surges, so cutting to 20% of the usual rate would be a reasonable sort of reduction.

If monitoring of i/o gets improved, then using any i/o capacity not needed by the foreground to get below innodb_max_dirty_pages_pct could be ideal, though with purge getting most of any allocation if both want it.

Improved monitoring of foreground activity, so that background threads can do a better job of adjusting their work rates, could be a very good thing and would help to reduce production outages (where 'outage' is customer-defined as serious production disruption, not necessarily being completely down).

So, suggested possible quick partial fix:

Cut the flushing rate from 100 to 20 instead of sleeping when the server is both active and innodb_max_dirty_pages_pct is exceeded.
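
As a rough illustration only (hypothetical and untested; names are taken from the snippets quoted earlier and from my reading of the 5.6 code, so treat this as pseudocode):

/* Hypothetical sketch of the quick partial fix: when the server is active
   but the dirty-page percentage exceeds innodb_max_dirty_pages_pct, keep
   flushing at a reduced rate instead of leaving it entirely to the
   adaptive path. Not a tested patch. */
if (srv_check_activity(last_activity)) {
    last_activity = srv_get_activity_count();

    if (buf_get_modified_ratio_pct() > srv_max_buf_pool_modified_pct) {
        /* over the configured cap: flush at 20% of innodb_io_capacity */
        n_flushed = page_cleaner_do_flush_batch(PCT_IO(20), LSN_MAX);
    } else {
        n_flushed = page_cleaner_flush_pages_if_needed();
    }
} else {
    n_flushed = page_cleaner_do_flush_batch(PCT_IO(100), LSN_MAX);
}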
[30 Aug 2013 19:33] James Day
Thinking more about this, 20% is not enough.

The problem comes when you've done a proper tuning job, so the InnoDB log files are big enough and flushing based on the amount of log space used is not going to happen much. That's standard tuning: make the log files so big that you never get anywhere near 75%/async flushing, and so you maximise the amount of write combining that can be done.

Once you've got that right, the dirty page percentage will grow, constrained by innodb_io_capacity and innodb_max_dirty_pages_pct, so you set innodb_max_dirty_pages_pct to the desirable cap and adjust innodb_io_capacity to avoid getting there. The cap will be high for systems with lots of writes and high write combining potential, much lower for systems with low write loads. And the system with lots of writes will have higher innodb_io_capacity as well.

Then you might get a batch of inserts or updates, and you rely on innodb_max_dirty_pages_pct to make flushing keep up with the burst load rate. Log-based flushing won't, because you're not going to deliberately make the log files too small and waste io (unless you're working around this bug, in which case you might have to tolerate the risk of disruption by deliberately using log files that are too small).

But that normal tuning relies on innodb_max_dirty_pages_pct working. If it's capped at a 20% flushing rate, then that forces innodb_io_capacity to be set five times as high as it would otherwise need to be.

Yet it is also desirable to have a more gentle start to innodb_io_capacity based flushing, so starting at 20% is a good thing. Just not enough if it can't keep up.

So that leads to a revised suggested fix:

Check dirty page percentage.
Is innodb_max_dirty_pages_pct set to 1 or less?
   Yes: write at 100% (assume shutdown case)
   No: is the dirty page percentage over innodb_max_dirty_pages_pct by more than 0.5%?
     Yes: write at 100%
     No: write at 20%

And of course this should not be limited in any way by the free list count that the page cleaner thread looks at, because that's irrelevant to the target for the dirty page count, which is more about ensuring the writes don't take too much of the buffer pool away from read caching.

Why plus 0.5%? Because production systems almost never get to even 1% above innodb_max_dirty_pages_pct, because they tend to be well adjusted on disk throughput. So when you get to even 0.5% above it you know it's a failure to keep up. The rare exceptions are systems with grossly inadequate disk that just need better disks (or bigger log files, so they can do more write combining and waste less of their limited io). And if set below 1% it's a safe assumption that that is the shutdown case and even if not, there isn't much margin available to allow for a more gentle start.

Skipping the 0.5% test is fine if that turns out to be too slow. 100% is in practice gentle enough to avoid reports of disruption. Not perfect but it gets the job done, particularly when flushing smoothing is in effect.
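
Restated as code, purely to pin down the decision tree above (a hypothetical helper, not an actual patch; the returned value is the percentage of innodb_io_capacity to flush at while the server is active):

/* Hypothetical helper restating the suggested decision tree. */
static ulint
suggested_flush_pct(double dirty_pct, double max_dirty_pct)
{
    if (max_dirty_pct <= 1.0) {
        return(100);    /* innodb_max_dirty_pages_pct <= 1: assume shutdown case */
    }

    if (dirty_pct > max_dirty_pct + 0.5) {
        return(100);    /* more than 0.5% over the cap: failing to keep up */
    }

    return(20);         /* at or barely over the cap: gentler 20% start */
}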

James Day, MySQL Senior Principal Support Engineer, Oracle
[30 Aug 2013 20:11] James Day
Here are some of the workarounds available, limited though they sometimes are:

1. In past versions the limit on innodb_log_file_size sometimes made it necessary to use this setting to avoid hitting 75% of log space use and having a production disruption  incident due to hitting async flushing at 75%. The much more gentle flushing batches from innodb_max_dirty_pages_pct were normally acceptable and it wasn't uncommon for systems with large buffer pools and high needs to have innodb_max_dirty_pages_pct set to values in the 2-5% range just for this reason. In 5.6 you have two possibilities that should work better:

1a. You can use larger values for innodb_log_file_size. That will let you use more of your buffer pool for write combining and reduce total io operations, instead of being forced to do lots of avoidable ones just to avoid reaching 75% of the log file use. Be sure you allow for the RAM your OS will use for buffering the log files, assume as much RAM use as the total log file space you set. This should greatly increase the value of larger buffer pools for high write load workloads.

1b. You can set innodb_adaptive_flushing_lwm to avoid reaching 75% of log space use. The lowest permitted value is 70%, so adaptive flushing will start to increase flushing rate before the server gets to 75% of the log file use. 70% is a good setting for systems with low write rates or very fast disk systems that can easily handle a burst of writes. For others you should adjust to whatever lower value it takes to produce a nice and smooth transition from innodb_io_capacity based level flushing to adaptive flushing. 10% is the default but that is probably too low for most production systems, just what we need for a default that has to handle a wide range of possible cases.

These aren't so much workarounds as significant improvements: the ability to use larger log files and make better use of bigger buffer pools, and the way adaptive flushing can be used to avoid the sudden bursts of flushing activity at 75% log space use.

2. You can't effectively use the normal practice of gradually reducing innodb_max_dirty_pages_pct before a shutdown, to reduce outage duration. The best workaround at present is to set innodb_io_capacity to high values so it will cause more flushing.

3. You can't use innodb_max_dirty_pages_pct to manage crash recovery time, something it could do with less disruptive writing than the alternative of letting the server hit async flushing at 75% of log file space use, after deliberately setting innodb_log_file_size too low. The workarounds are to use higher than desirable innodb_io_capacity and smaller than desirable innodb_log_file_size. Both cause unnecessary flushing compared to using innodb_max_dirty_pages_pct for this task. Before using a too small innodb_log_file_size, experiment with innodb_io_capacity and innodb_adaptive_flushing_lwm. Also ensure that innodb_io_capacity_max is set to around twice innodb_io_capacity, rarely up to four or more times. This may eliminate the issue with less redundant io than very constrained log file sizes because adaptive flushing will increase the writing rate as the percentage of log space used increases. So you should be able to reach almost any target recovery time limit, though still at the cost of more io than using innodb_max_dirty_pages_pct to do it only when a hard cap is reached. If you do find that you need to use a smaller than desirable log file size, you should be able to use a larger one than in 5.5, since the objective ends up changing to making it only small enough so that adaptive flushing will do enough writing, not to letting it get to 75% to force async flushing.
[18 Sep 2013 8:37] Annamalai Gurusami
The innodb_max_dirty_pages_pct setting just establishes a target for the
flushing activity.  It will not increase the flushing rate.  To increase
the flushing rate we need to tune innodb_io_capacity and
innodb_io_capacity_max.  The size of the redo log (innodb_log_file_size)
will also play a role in this.  Can you please provide the values for
innodb_io_capacity_max and innodb_log_file_size?  Does increasing
innodb_io_capacity help to reach the target (zero dirty pages) soon?

Also, how are you calculating the flush rate?
[19 Sep 2013 5:26] Annamalai Gurusami
The page cleaner thread is designed to run once every second.  And this 1 second is divided as follows: flushing time + sleep time = 1 second.

If the page cleaner thread finishes its flushing activity within a second, then it will sleep. This happens when innodb_io_capacity is set lower than the actual throughput possible with the given hardware, or if the hardware is slower. To reduce the sleep time of the page cleaner thread, the correct approach will be to increase innodb_io_capacity and innodb_io_capacity_max (as long as the hardware is capable of delivering it). We shouldn't reduce the sleep time in such a way that the page cleaner thread will run more than once per second.

I'll be closing this bug as 'Not a Bug'.
[24 Sep 2013 8:07] Erlend Dahl
Closing based on Annamalai's remarks above.
[29 Oct 2014 1:01] James Day
Annamalai, you have correctly described the broken way this now works in 5.6. Now look at how it works in 5.5 to see why this bug report was filed.

In 5.5 and earlier, innodb_max_dirty_pages_pct was a hard limit that increased flushing as much as required to do the job. innodb_io_capacity and innodb_io_capacity_max did not affect that flushing rate.

You should also consider the logic: dirty pages are made by an active server. The new flushing only does things if the server is idle. So the innodb_max_dirty_pages_pct flushing no longer does anything at the times it is needed. In 5.5 and earlier it did exactly what was intended and needed: keeping within that limit on a loaded server. Reliably, without causing production disruption.

The formerly reliable innodb_max_dirty_pages_pct is completely broken in 5.6 compared to 5.5 and earlier.
[3 Jul 2015 19:08] Sveta Smirnova
James,

If you think innodb_max_dirty_pages_pct is broken, why didn't you re-open the report?