| Bug #68555 | thread convoys from log_checkpoint_margin with innodb_buffer_pool_instances > 1 | ||
|---|---|---|---|
| Submitted: | 3 Mar 2013 21:34 | Modified: | 11 Nov 2013 15:53 |
| Reporter: | Mark Callaghan | Email Updates: | |
| Status: | Verified | Impact on me: | |
| Category: | MySQL Server: InnoDB storage engine | Severity: | S5 (Performance) |
| Version: | 5.6.10 | OS: | Any |
| Assigned to: | | CPU Architecture: | Any |
[3 Mar 2013 21:55]
Mark Callaghan
More results at http://mysqlha.blogspot.com/2013/03/mysql-56-cached-update-only-workload.html
[4 Mar 2013 1:55]
Inaam Rana
Mark, log_free_check() is our guard against overwriting the redo log. The check_flush_or_checkpoint flag tells a thread that it cannot safely proceed without first making sure there is enough room in the redo logs. In log_checkpoint_margin() we'll only go to preflush if we have crossed the sync limit. What is important to note here is that the redo log is a system-wide construct. Even with multiple buffer pools, if the overall oldest modification tells us that there is not enough usable redo space then we have no choice but to wait for the LSN to move forward. So the wait in this case seems a reasonable choice, but there is an issue with the current design. Dimitri nailed it down: if we have multiple buffer pools, the page_cleaner thread, which is responsible for flushing in the background, will do flush batches for each buffer pool instance serially. This means that if the oldest modification is in, say, one particular buffer pool instance, every thread will remain stuck until the page cleaner is able to clear that instance. Also, the stack you mentioned above should not appear in more threads than there are buffer pool instances. That stack means a thread is doing a flush batch, but only one batch is allowed per instance. If you see it in more threads than that, there is a bug which needs to be fixed.
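A minimal sketch of the guard described above, with simplified and partly hypothetical names (the real log_free_check() and log_preflush_pool_modified_pages() do considerably more than this): user threads consult a shared check_flush_or_checkpoint flag before generating redo, and the preflush walks the buffer pool instances one at a time, which is the serial behaviour that hurts when innodb_buffer_pool_instances > 1.

```c
/*
 * Illustrative sketch only, not the real InnoDB code: the shared
 * check_flush_or_checkpoint flag, the fast path in log_free_check(),
 * and the serial per-instance preflush loop.
 */
#include <stdbool.h>
#include <stdio.h>

#define N_BP_INSTANCES 8

/* system-wide redo bookkeeping (simplified) */
static unsigned long current_lsn            = 1000;
static unsigned long oldest_modification    = 100;  /* min over all instances */
static unsigned long max_modified_age_sync  = 700;  /* "sync" preflush limit   */
static bool          check_flush_or_checkpoint = false;

/* pretend to flush one buffer pool instance up to new_oldest */
static bool buf_flush_list_instance(int instance, unsigned long new_oldest)
{
    printf("flushing instance %d up to LSN %lu\n", instance, new_oldest);
    return true; /* the real code can fail if a batch is already running */
}

/* Serial preflush across instances: instance i+1 waits for instance i. */
static void log_preflush_pool_modified_pages(unsigned long new_oldest)
{
    for (int i = 0; i < N_BP_INSTANCES; i++) {
        buf_flush_list_instance(i, new_oldest);
    }
}

/* Guard called by every thread that is about to generate redo. */
static void log_free_check(void)
{
    if (!check_flush_or_checkpoint) {
        return; /* fast path: enough reusable redo space */
    }
    unsigned long age = current_lsn - oldest_modification;
    if (age > max_modified_age_sync) {
        /* not enough room: preflush before this thread may proceed */
        log_preflush_pool_modified_pages(current_lsn - max_modified_age_sync);
    }
    check_flush_or_checkpoint = false;
}

int main(void)
{
    check_flush_or_checkpoint = true; /* simulate a tight redo log */
    log_free_check();
    return 0;
}
```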
[4 Mar 2013 2:22]
Mark Callaghan
Note that I doubled QPS by changing the code to prevent all threads from getting stuck in this code. I don't think a foreground thread in log_checkpoint_margin() should set check_flush_or_checkpoint when log_preflush_pool_modified_pages returned FALSE. That thread is about to do "goto loop" and will attempt to do the flushing if needed. Why force the other threads to do the same only because FALSE was returned in this case? Note that FALSE is returned because a flush was in progress by another thread for at least one of the buffer pool instances, so FALSE doesn't mean a flush is more urgent and all threads should stop.
[4 Mar 2013 3:46]
Inaam Rana
Mark, how can we safely let other threads do the work?

* Suppose T1 is doing the flush.
* T2 goes to log_preflush_pool_modified_pages() and it returns FALSE.
* Let's assume check_flush_or_checkpoint is not set by T2.
* Now T3 comes into log_checkpoint_margin() and, seeing check_flush_or_checkpoint == FALSE, it continues with its work.
* T3 can now generate redo, and if for some reason T1 hasn't flushed the pages yet, we'll have a situation where redo can potentially be overwritten before the corresponding dirty pages have made it to disk.
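The safety property behind this objection can be stated as a small invariant: redo written since the last checkpoint must never exceed the usable log capacity, otherwise redo still needed for crash recovery gets overwritten. A tiny illustrative sketch of that check, with hypothetical names (log_state, would_overwrite_redo), follows; it is not InnoDB code.

```c
/*
 * Sketch of the invariant the forced wait protects (illustrative only).
 * If a thread keeps generating redo while the preflush that would let
 * the checkpoint advance is still running, the invariant can be
 * violated and redo needed for recovery would be overwritten.
 */
#include <stdbool.h>
#include <stdio.h>

struct log_state {
    unsigned long current_lsn;
    unsigned long last_checkpoint_lsn;
    unsigned long usable_capacity;   /* bytes of redo the log files can hold */
};

/* true if writing n more bytes of redo would wrap over unflushed redo */
static bool would_overwrite_redo(const struct log_state *log, unsigned long n)
{
    unsigned long age_after = (log->current_lsn + n) - log->last_checkpoint_lsn;
    return age_after > log->usable_capacity;
}

int main(void)
{
    struct log_state log = {
        .current_lsn = 950, .last_checkpoint_lsn = 100, .usable_capacity = 900,
    };
    /* T1's flush has not yet let the checkpoint advance past LSN 100.
       If T3 skips the wait and writes 100 more bytes of redo ... */
    printf("overwrite risk: %s\n",
           would_overwrite_redo(&log, 100) ? "yes" : "no");
    return 0;
}
```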
[4 Mar 2013 16:45]
James Day
Let's consider a 4G combined log file size. With 85% of that as the blocking threshold, the 15% remaining is about 600 megabytes before all of the available log space is used. Even with our new default log file size the 15% remaining is 14.4 megabytes, 1.5 times the old total amount of redo log space. Is it really necessary to block when so much redo space is available?
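The arithmetic behind those figures, as a quick illustrative check (not InnoDB code), assuming the 5.6 default redo log of two 48 MB files and the old default of two 5 MB files:

```c
/* Quick check of the 15% margins mentioned above. */
#include <stdio.h>

int main(void)
{
    const double MB = 1024.0 * 1024.0;
    double big_log   = 4.0 * 1024 * MB;  /* 4 GB combined log files */
    double new_deflt = 2 * 48.0 * MB;    /* 5.6 default: 2 x 48 MB  */
    double old_deflt = 2 * 5.0 * MB;     /* old default: 2 x 5 MB   */

    printf("15%% of 4 GB            = %.0f MB\n", 0.15 * big_log / MB);   /* ~614 MB */
    printf("15%% of 5.6 default     = %.1f MB\n", 0.15 * new_deflt / MB); /* 14.4 MB */
    printf("ratio to old default   = %.2fx\n",
           0.15 * new_deflt / old_deflt);                                 /* ~1.44x  */
    return 0;
}
```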
[4 Mar 2013 17:01]
Mark Callaghan
Inaam - T3 should decide in log_checkpoint_margin whether or not to flush based on the comparisons between checkpoint_age, max_checkpoint_age and max_checkpoint_age_async. The problem is that the current code can force T3 into log_checkpoint_margin when it sets check_flush_or_checkpoint for no good reason. T3 can also decide elsewhere, like in log_free_check, whether or not to call log_checkpoint_margin. By "for no good reason" I mean that log_preflush_pool_modified_pages returned FALSE.

There is code elsewhere that sets check_flush_or_checkpoint. That code is fine. It is only the set in log_checkpoint_margin that is the problem.

A return of FALSE from log_preflush_pool_modified_pages is really a return of FALSE from buf_flush_start for one of the buffer pool instances, which is really an indicator that a flush was already in progress for that instance. I don't think that is a reason to force all threads into log_checkpoint_margin. But that is what the code does today, and it is much more of a problem when innodb_buffer_pool_instances > 1.
[5 Mar 2013 15:35]
Inaam Rana
James - maybe it is a conservative threshold. In any case, we need to have a HWM beyond which threads are forced to wait. What a good value is is a separate issue, and opinions can differ on that. Maybe it's my background as a civil engineer which makes me conservative when choosing a factor of safety :)
[5 Mar 2013 15:42]
Inaam Rana
Mark - I am a bit confused now. Is your objection the extra call to log_checkpoint_margin(), or the wait that is forced on T3? If, in our example, threads T3...Tn are not forced to wait on the flush list batch, then what will stop them from generating more redo and, in the process, reusing redo that is still required for recovery? Once we cross the threshold everyone must wait. I understand why this is more painful in the case of multiple instances. Dimitri suggested something like parallel flushing to mitigate that. We also have some ideas, for example that a wait for the whole batch to complete is probably not required and we can let threads proceed once enough reusable space is available. But none of this changes the fact that at a certain point we have to have the logic that forces a wait on all threads.
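A simplified sketch of the decision ladder being discussed, using the names that appear earlier in this thread (checkpoint_age, max_checkpoint_age_async, max_checkpoint_age). The thresholds in main() are made up for the demo, and the real log_checkpoint_margin() does considerably more than this.

```c
/* Illustrative only: the three-way decision around the redo HWM. */
#include <stdio.h>

enum margin_action {
    PROCEED,     /* plenty of reusable redo space                    */
    ASYNC_FLUSH, /* start background flushing/checkpointing, go on   */
    SYNC_WAIT    /* HWM crossed: every redo-generating thread waits  */
};

static enum margin_action log_margin_action(unsigned long long checkpoint_age,
                                            unsigned long long max_checkpoint_age_async,
                                            unsigned long long max_checkpoint_age)
{
    if (checkpoint_age >= max_checkpoint_age)
        return SYNC_WAIT;
    if (checkpoint_age >= max_checkpoint_age_async)
        return ASYNC_FLUSH;
    return PROCEED;
}

int main(void)
{
    /* hypothetical numbers: 4 GB of redo, async limit at 78%, sync at 85% */
    unsigned long long cap = 4096ULL << 20;
    enum margin_action a = log_margin_action(cap * 80 / 100,   /* current age */
                                             cap * 78 / 100,   /* async limit */
                                             cap * 85 / 100);  /* sync limit  */
    printf("action = %d (0=proceed, 1=async flush, 2=sync wait)\n", a);
    return 0;
}
```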
[6 Mar 2013 13:35]
James Day
Inaam, I strongly agree that we do need to have a HWM. Just not where. :) Chemistry degree here and I managed not to kill myself. :) Let's see how we do discussing possible thresholds. How do these fit with your mental model of how it should work?

1. 30 seconds of estimated usage remaining: hard block on other activity - the freeze Mark is writing about.
2. 60 seconds of estimated usage remaining: still let other threads flush so we don't block and limit the flushing rate, which makes the problem worse and is why Mark is discussing this with us. I'd think of this as the sync flushing point, just not quite so strict. A good time to boost flushing priority also.
3. 120 seconds of estimated usage remaining: the async flushing threshold.

The reason we need to try to come up with better thresholds than space remaining is the log file sizing issues and how they interact with other things:

A. Adaptive flushing adjusts its work rate based on usable space remaining, so setting a threshold too low causes extra writes. That hurts foreground performance (less write combining) and SSD drive life.
B. The log files end up cached by the OS, so there's an incentive to make them smaller. Having over-large margins for safety shifts the risk to swapping and out-of-memory failures instead.

So the safety margin isn't free; in some ways it shifts the problem to a different place. We do have to have a safety margin, just not the ones we get now when we're recommending good sizes for current systems. The old limits have just become out of date as the hardware and server have got better.

Mark, do you have any thoughts on how we might do this, ideally so it just works without much chance of end users having to tune? That is part of why I like time remaining - it's potentially nicely adaptive to workload. Outside opinions are always interesting...

James Day, MySQL Senior Principal Support Engineer, Oracle
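A rough sketch of the time-based idea above, purely illustrative: this is not an implemented InnoDB feature, and every name and number in it is hypothetical. It estimates seconds of redo space remaining from the recent redo generation rate and maps that onto the 120/60/30 second tiers.

```c
/* Illustration of time-remaining thresholds instead of space thresholds. */
#include <stdio.h>

enum tier { NO_PRESSURE, ASYNC_FLUSH_TIER, PRIORITY_FLUSH_TIER, HARD_BLOCK_TIER };

static enum tier pick_tier(double free_redo_bytes, double redo_bytes_per_sec)
{
    double seconds_left = free_redo_bytes / redo_bytes_per_sec;

    if (seconds_left <= 30.0)  return HARD_BLOCK_TIER;     /* the freeze          */
    if (seconds_left <= 60.0)  return PRIORITY_FLUSH_TIER; /* soft "sync" point   */
    if (seconds_left <= 120.0) return ASYNC_FLUSH_TIER;    /* async threshold     */
    return NO_PRESSURE;
}

int main(void)
{
    /* e.g. 600 MB of reusable redo left while writing 8 MB/s of redo */
    printf("tier = %d (0=none, 1=async, 2=priority, 3=block)\n",
           pick_tier(600.0 * 1024 * 1024, 8.0 * 1024 * 1024));
    return 0;
}
```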
[6 Mar 2013 15:06]
Mark Callaghan
My objection is that when T2 is in log_checkpoint_margin() and its call to log_preflush_pool_modified_pages() returns FALSE, it then sets check_flush_or_checkpoint to TRUE. Setting it there forces other threads to call log_checkpoint_margin when they don't need to call it, and then they stall. check_flush_or_checkpoint is also set elsewhere; I am not asking for those sets to be changed.
[6 Mar 2013 22:00]
Mark Callaghan
Just got off the phone with Inaam and now I understand why my patch doesn't do what I think it does. I will repeat the tests. However, one root cause is the sleep in fil_flush. I just looked at the thread stacks again and all threads were stuck on a 20 millisecond sleep. We need something better in that case, as my hosts can fsync in less than 1 millisecond. Filed http://bugs.mysql.com/bug.php?id=68588 for the sleep. Will keep this open until I get results with that sleep removed.

---

```
retry:
	if (node->n_pending_flushes > 0) {
		/* We want to avoid calling os_file_flush() on
		the file twice at the same time, because we do
		not know what bugs OS's may contain in file
		i/o; sleep for a while */

		mutex_exit(&fil_system->mutex);

		os_thread_sleep(20000);
```
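One generic alternative to a fixed 20 ms sleep, shown only as an illustration (this is not the fix that went into Bug #68588): wait on a condition variable that is signalled when the pending flush completes, so waiters wake as soon as the sub-millisecond fsync finishes. All names here (flush_sync, wait_for_pending_flush, flush_finished) are hypothetical.

```c
/* Sketch of an event-based wait instead of a fixed-length sleep. */
#include <pthread.h>

struct flush_sync {
    pthread_mutex_t mutex;
    pthread_cond_t  done;             /* broadcast when pending count hits 0 */
    int             n_pending_flushes;
};

/* called instead of "sleep 20 ms and retry" */
static void wait_for_pending_flush(struct flush_sync *fs)
{
    pthread_mutex_lock(&fs->mutex);
    while (fs->n_pending_flushes > 0) {
        pthread_cond_wait(&fs->done, &fs->mutex);
    }
    pthread_mutex_unlock(&fs->mutex);
}

/* called by the thread that finishes an os_file_flush() equivalent */
static void flush_finished(struct flush_sync *fs)
{
    pthread_mutex_lock(&fs->mutex);
    if (--fs->n_pending_flushes == 0) {
        pthread_cond_broadcast(&fs->done);
    }
    pthread_mutex_unlock(&fs->mutex);
}

int main(void)
{
    struct flush_sync fs = {
        PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0
    };
    wait_for_pending_flush(&fs); /* returns immediately: nothing pending */
    (void)flush_finished;        /* used by the flushing thread in practice */
    return 0;
}
```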
[11 Nov 2013 13:39]
James Day
Mark, How did testing go with 5.6.12 that has the fix for Bug #68588, removing most of the OS_THREAD_SLEEP calls from InnoDB? Did that resolve the issue so we can close this as a duplicate of that fixed bug? James Day, MySQL Senior Principal Support Engineer, Oracle
[11 Nov 2013 15:53]
Mark Callaghan
I think it was better but I am taking a break from MySQL perf testing.

Description:

Thread convoys can form in log_preflush_pool_modified_pages when check_flush_or_checkpoint is set to TRUE after buf_flush_list returns FALSE. The problem is that buf_flush_list returns FALSE when a flush is in progress for any of the buffer pool instances -- which happens when buf_flush_start returns FALSE. On the return of FALSE, a thread in log_checkpoint_margin resets check_flush_or_checkpoint to TRUE, which has global visibility, and then starts the loop over. Setting it to TRUE forces other threads to call into log_checkpoint_margin, which is a side effect I don't like, as all of those threads end up waiting for no good reason. Given that the thread that set check_flush_or_checkpoint to TRUE is about to repeat the loop, why invite other threads to do the same? The problem is worse when innodb_buffer_pool_instances > 1.

I would see many threads with this stack:

```
os_thread_sleep
fil_flush
fil_flush_file_spaces
buf_flush_common
buf_flush_list
log_preflush_pool_modified_pages
log_checkpoint_margin
log_check_margins
log_free_check
row_upd
row_upd_step
row_update_for_mysql
ha_innobase::update_row
handler::ha_update_row
mysql_update
mysql_execute_command
mysql_parse
mysql_parse
```

How to repeat:

Run an update-intensive workload for a cached database and set innodb_io_capacity to a smaller than optimal value so that foreground threads have to do fuzzy checkpoint page writes. This shows the performance regression from using multiple buffer pool instances. The numbers are QPS where each query updates 1 row by PK. The blog post has more details.

updates/second for innodb_buffer_pool_instances=1 (columns are concurrent clients)

| 8 | 16 | 32 | 64 | 128 | 256 | configuration |
|---|---|---|---|---|---|---|
| 36691 | 54741 | 62789 | 62262 | 61208 | 60884 | orig5610, iocap=1000 |
| 34860 | 55060 | 61938 | 62270 | 61661 | 61107 | orig5610, iocap=2048 |
| 40105 | 56780 | 64174 | 64257 | 62996 | 62208 | orig5610, iocap=4096 |
| 43293 | 61882 | 66132 | 64147 | 63124 | 62533 | orig5610, iocap=8192 |

updates/second for innodb_buffer_pool_instances=8 (columns are concurrent clients)

| 8 | 16 | 32 | 64 | 128 | 256 | configuration |
|---|---|---|---|---|---|---|
| 30076 | 30535 | 39266 | 33280 | 37135 | 31415 | orig5610, iocap=1000 |
| 30151 | 32422 | 38110 | 39712 | 37551 | 34242 | orig5610, iocap=2048 |
| 32092 | 38397 | 43014 | 44260 | 42148 | 41299 | orig5610, iocap=4096 |
| 38302 | 51486 | 55676 | 55044 | 54883 | 54032 | orig5610, iocap=8192 |

----

This is QPS after applying the diff provided in this bug report. QPS is much better.

updates/second for innodb_buffer_pool_instances=8 (columns are concurrent clients)

| 8 | 16 | 32 | 64 | 128 | 256 | configuration |
|---|---|---|---|---|---|---|
| 46726 | 70465 | 57360 | 49329 | 49120 | 48730 | orig5610, iocap=1024 |
| 42710 | 70487 | 65408 | 57632 | 57587 | 57067 | orig5610, iocap=2048 |
| 42464 | 66687 | 70299 | 63091 | 62940 | 61831 | orig5610, iocap=4096 |
| 39295 | 69216 | 72740 | 66955 | 65382 | 65283 | orig5610, iocap=8192 |
| 41798 | 68033 | 71868 | 66918 | 64496 | 64865 | orig5610, iocap=16384 |

And QPS improves a bit more after switching to innodb_flush_method=O_DIRECT_NO_FSYNC.
updates/second for innodb_buffer_pool_instances=8 (columns are concurrent clients)

| 8 | 16 | 32 | 64 | 128 | 256 | configuration |
|---|---|---|---|---|---|---|
| 44092 | 69082 | 73446 | 69393 | 68118 | 67118 | orig5610, iocap=1024 |
| 42445 | 67282 | 74463 | 69547 | 68324 | 67443 | orig5610, iocap=2048 |
| 40601 | 69185 | 75044 | 69907 | 68406 | 67640 | orig5610, iocap=4096 |
| 35835 | 68327 | 74777 | 69765 | 68069 | 67286 | orig5610, iocap=8192 |
| 41997 | 66772 | 75226 | 70166 | 68531 | 67716 | orig5610, iocap=16384 |

Suggested fix:

```diff
--- orig5610/storage/innobase/log/log0log.cc	2013-02-26 14:29:03.000000000 -0800
+++ mysql-5.6.10/storage/innobase/log/log0log.cc	2013-01-22 08:54:49.000000000 -0800
@@ -2101,7 +2101,6 @@
 	ibool	checkpoint_sync;
 	ibool	do_checkpoint;
 	bool	success;
-	bool	force = FALSE;
 loop:
 	checkpoint_sync = FALSE;
 	do_checkpoint = FALSE;
@@ -2110,12 +2109,11 @@
 	mutex_enter(&(log->mutex));
 	ut_ad(!recv_no_log_write);
 
-	if (!force && log->check_flush_or_checkpoint == FALSE) {
+	if (log->check_flush_or_checkpoint == FALSE) {
 		mutex_exit(&(log->mutex));
 		return;
 	}
-	force = FALSE;
 
 	oldest_lsn = log_buf_pool_get_oldest_modification();
@@ -2159,8 +2157,7 @@
 	if (!success) {
 		mutex_enter(&(log->mutex));
-		/* log->check_flush_or_checkpoint = TRUE; */
-		force = TRUE;
+		log->check_flush_or_checkpoint = TRUE;
 		mutex_exit(&(log->mutex));
 		goto loop;
```
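For clarity, here is a minimal, stubbed sketch of the control flow on the side of the diff that carries the local force flag; it is not the actual InnoDB function. The point is only that a failed preflush makes the same thread retry via its thread-local flag instead of re-setting the shared check_flush_or_checkpoint and pulling every other thread into log_checkpoint_margin.

```c
/* Illustrative sketch of the patched retry behaviour, heavily simplified. */
#include <stdbool.h>
#include <stdio.h>

static bool check_flush_or_checkpoint = true;  /* shared flag, simplified   */
static int  busy_retries = 1;                  /* pretend one batch is busy */

/* stub: fails once to simulate "a flush batch is already running" */
static bool log_preflush_pool_modified_pages(void)
{
    if (busy_retries-- > 0)
        return false;
    return true;
}

static void log_checkpoint_margin_patched(void)
{
    bool force = false;

loop:
    if (!force && !check_flush_or_checkpoint)
        return;                 /* nothing to do and no local retry pending */
    force = false;

    if (!log_preflush_pool_modified_pages()) {
        /* A batch was already running in some buffer pool instance.
           Patched behaviour: only this thread retries via its local flag;
           the shared check_flush_or_checkpoint flag is left alone. */
        force = true;
        goto loop;
    }

    /* ... checkpoint decision elided ... */
    check_flush_or_checkpoint = false;
}

int main(void)
{
    log_checkpoint_margin_patched();
    printf("shared flag after retry: %d\n", check_flush_or_checkpoint);
    return 0;
}
```

Running the sketch, the retrying thread clears the shared flag itself once its preflush succeeds; no other thread was ever invited into the margin check.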