Bug #29847 Large CPU usage of InnoDB crash recovery with a big buf pool
Submitted: 17 Jul 2007 15:38 Modified: 13 May 2010 14:34
Reporter: Heikki Tuuri
Status: Closed
Category:MySQL Server: InnoDB storage engine Severity:S2 (Serious)
Version:All OS:Any
Assigned to: Inaam Rana CPU Architecture:Any
Tags: Contribution

[17 Jul 2007 15:38] Heikki Tuuri
Description:
This was spotted by Peter Zaitsev.

buf_flush_insert_sorted_into_flush_list() unnecessarily inserts the blocks into the flush list sorted by the lsn, though that ordering is only needed if InnoDB does redo log application in the background, concurrently with other processing.

How to repeat:
See above.

Suggested fix:
Remove the sort in current code. Remove possible assertions that fail.
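
To illustrate why the sorted insertion is expensive with a large buffer pool, here is a minimal Python sketch (not InnoDB code). It assumes the flush list is kept in descending lsn order and is scanned from the head on every insert; the function names are hypothetical. In the worst case every insert walks the whole list, so n inserts cost on the order of n^2 steps.

```python
def insert_sorted_linear(flush_list, lsn):
    """Insert lsn into a list kept in descending lsn order (assumed layout).

    Scans from the head and returns how many entries were passed
    before the insertion slot was found.
    """
    i = 0
    while i < len(flush_list) and flush_list[i] > lsn:
        i += 1
    flush_list.insert(i, lsn)
    return i


def total_scanned(lsns):
    """Insert every lsn in the given order; return total entries scanned."""
    flush_list = []
    return sum(insert_sorted_linear(flush_list, lsn) for lsn in lsns)
```

With 100 inserts, ascending input lands at the head each time and scans nothing, while descending input scans 0 + 1 + ... + 99 entries in total.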
[1 Sep 2007 17:18] Heikki Tuuri
An observation from Inaam:

We *cannot* simply remove the sorting logic: although we do redo log application before bringing the server online, we don't flush the list at the end of the log application.

A new suggested fix:

In crash recovery, using a red-black tree or some other sorted data structure to insert into the flush list in O(log(n)) time is the best solution. This involves some extra code, but we preserve the nice simple feature that the flush list is always sorted on the 'oldest modification' lsn.
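
As a rough illustration of the O(log(n)) lookup, here is a small Python sketch. Python's standard library has no red-black tree, so this swaps in a binary search over a sorted list; it shows only the logarithmic search for the insertion slot, not a tree implementation (the list insert itself still shifts elements). The names find_slot and insert_sorted are hypothetical.

```python
def find_slot(sorted_lsns, lsn):
    """Binary search an ascending list for the slot where lsn belongs.

    Returns (index, probes); probes counts the comparisons made,
    which grows with log2(len(sorted_lsns)).
    """
    lo, hi, probes = 0, len(sorted_lsns), 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_lsns[mid] <= lsn:
            lo = mid + 1   # slot is to the right of mid
        else:
            hi = mid       # slot is at mid or to its left
    return lo, probes


def insert_sorted(sorted_lsns, lsn):
    """Insert lsn in order and return the number of probes used."""
    slot, probes = find_slot(sorted_lsns, lsn)
    sorted_lsns.insert(slot, lsn)
    return probes
```

For a list of 1024 entries the search needs at most 11 probes, compared with up to 1024 for a scan from the head.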
[4 Mar 2008 17:25] Heikki Tuuri
Reclassifying as a feature request, because this requires substantial new code.
[27 Oct 2008 7:46] Domas Mituzas
Feature request, 'make crash recovery work' ?
[27 Oct 2008 11:53] Žilvinas Šaltys
You should really fix this. No one wants to wait for their servers to recover.
[28 Oct 2008 20:40] James Day
One reported effect of this performance limitation is that a system with a 24GB buffer pool could only recover 10% after 2 hours. With a 4G buffer pool and innodb_flush_method=O_DIRECT removed, the system recovered completely in 30 minutes.

Partial workarounds.

1. During recovery, temporarily reduce innodb_buffer_pool_size to force InnoDB to flush pages from the flush list. A value of 4G is likely to be reasonable.

2. During recovery, temporarily remove O_DIRECT so that the operating system can cache changes during recovery.

Users who have RAID setups with many drives (at least 4-6) should investigate this for normal use because O_DIRECT on Linux serialises writes and can sometimes reduce performance in RAID setups. Those using innodb_file_per_table are less likely to be affected because the many different files increase the chance of multiple writes being possible. Experiment to determine the best setting for your system.

3. At a performance cost during normal operations, decrease the maximum number of dirty pages with changes to apply by reducing one or more of these settings, listed in approximate order from least to greatest performance impact:

 innodb_max_dirty_pages_pct
 innodb_log_file_size
 innodb_buffer_pool_size
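
As a hedged illustration of the workarounds above, a my.cnf fragment might look like the following. The 4G value comes from the report above; the innodb_max_dirty_pages_pct value is a made-up example, not a recommendation, and the right numbers depend on your system.

```ini
[mysqld]
# Temporary, for the recovery run only: a smaller buffer pool forces
# InnoDB to flush pages from the flush list sooner.
innodb_buffer_pool_size = 4G

# Temporary, for the recovery run only: leave O_DIRECT disabled so the
# operating system can cache writes.
# innodb_flush_method = O_DIRECT

# Permanent option with a cost during normal operation: allow fewer
# dirty pages. The value 30 is a hypothetical example.
# innodb_max_dirty_pages_pct = 30
```

Restore the original settings once recovery has completed.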

This performance improvement request is discussed at:

http://www.mysqlperformanceblog.com/2007/07/17/innodb-recovery-is-large-buffer-pool-always...

http://www.mysqlperformanceblog.com/2008/09/04/how-quickly-you-should-expect-to-see-bugs-f...

http://dammit.lt/2008/10/26/innodb-crash-recovery/
[7 May 2009 16:42] harry wang
I think I have found a solution to this problem. Maintain an array list that contains some of the pages in the flush list, and update it periodically. When inserting a block into the flush list, compare the block with the pages in the array list and return a page whose oldest_modification is older than the block's. We can then start searching the flush_list from the page just found, instead of always starting from the beginning.
I assume the pages in the array list will not be flushed and moved to another list (LRU, FREE) during recovery, in which case my solution works. If they can be, then the value of in_flush_list must be maintained (which needs further changes), and the block compared only with pages that are still in the flush list.
I have tested it. I set the buffer pool to 32G and the recovery time decreased from 3 hours to 3 minutes.
Is there something I have misunderstood? My code is only about 50 lines and should be very easy to understand. ^_^
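
The following Python sketch shows one possible reading of this idea; it is not the attached patch. It assumes a flush list in descending oldest_modification order and keeps a periodically refreshed array of sample positions, so each insert starts its scan at the nearest usable sample rather than at the head. The class name, the sample_every parameter and the refresh policy are all hypothetical.

```python
class SampledFlushList:
    """Descending-order list with sampled entry points for the scan."""

    def __init__(self, sample_every=64):
        self.lsns = []              # descending order (assumed layout)
        self.sample_every = sample_every
        self.samples = []           # positions in self.lsns, ascending

    def insert(self, lsn):
        """Insert lsn in order; return the number of entries scanned."""
        # Pick the furthest sampled position whose lsn is still greater
        # than the new one; the slot cannot lie before it.
        start = 0
        for pos in self.samples:
            if self.lsns[pos] > lsn:
                start = pos
            else:
                break
        # Short linear scan from the chosen sample instead of the head.
        i = start
        while i < len(self.lsns) and self.lsns[i] > lsn:
            i += 1
        self.lsns.insert(i, lsn)
        # Periodic refresh keeps the samples evenly spaced.
        if len(self.lsns) % self.sample_every == 0:
            self.samples = list(range(0, len(self.lsns), self.sample_every))
        return i - start
```

Because the list only grows in this sketch, stale sample positions remain valid indexes; that mirrors the assumption above that sampled pages are not flushed during recovery. Each scan from a sample is bounded by about twice sample_every, although walking the sample array is itself linear in the number of samples.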
[7 May 2009 16:44] harry wang
add a function and modify buf_flush_insert_sorted_into_flush_list

Attachment: fast_insert_into_flush_list.txt (text/plain), 3.55 KiB.

[8 May 2009 13:38] Heikki Tuuri
Thank you for your suggested patch! Inaam has worked on this problem on our side and can comment further on this.
[9 Jul 2009 12:13] Lenz Grimmer
Please also see
http://www.mysqlperformanceblog.com/2009/07/07/improving-innodb-recovery-time/ for a patch that dramatically speeds up InnoDB recovery. It would be great if it could be incorporated into InnoDB shortly.
[19 Jul 2009 0:37] MySQL Verification Team
Hi,

Why don't we flush pages when the flush list gets large? Flushing every 1000 pages or similar may be a reasonable workaround, although it's not a true fix.
[19 Jul 2009 0:42] MySQL Verification Team
Another temporary workaround is to add a new parameter like innodb_max_dirty_pages_pct_recovery to cap the number of dirty pages only during recovery. Both may be easy to implement. Setting a low innodb_max_dirty_pages_pct may improve recovery performance, but may hurt normal operating performance. So we need a parameter that takes effect only during recovery.
[11 Dec 2009 17:10] Ken Jacobs
Valerie, 

As you know, recovery is a very critical function, and improvements in this area will require significant design and testing.  Users don't typically "test" recovery.  Instead, they RELY on it.

Innobase has done a great deal of work in the past several years to improve testing focused on various failure scenarios (for example, crashes during recovery).   We've found and fixed bugs that users would never have found or been able to reproduce, but that could have been disastrous had they encountered them.  So, we do understand the importance of very diligent work in this area.

Note that with the InnoDB Plugin, it is now possible to improve the way checkpoints are performed, both to smooth out performance (and avoid bursts of redo log write activity) and to keep the buffer pool "cleaner".   Users with very large buffer pools should experiment with adaptive flushing and appropriate settings of innodb_io_capacity.  With appropriate tuning, these parameters (and the ability to have smaller redo logs) can improve recovery time in many situations.

With all of the uncertainties surrounding our companies at the moment, and the state of product planning for future releases of InnoDB and MySQL, we are not prepared to make a projection of when we might be able to address this feature request ("improve performance of recovery").

Thanks for your understanding.
[14 Apr 2010 3:12] James Day
See http://blogs.innodb.com/wp/2010/04/innodb-performance-recovery/ for the planned fix for this and bug #49535 for some related work that together are expected to make crash recovery as much as thirty times as fast for large recovery jobs.
[13 May 2010 12:40] Inaam Rana
fixed in plugin 1.0.7
[13 May 2010 14:34] Paul DuBois
Noted in 5.1.46, 5.5.4 changelogs.

The redo scan during InnoDB recovery used excessive CPU. The efficiency of this scan was
improved for InnoDB Plugin, significantly speeding up crash recovery.