MySQL Bugs: #29847: Large CPU usage of InnoDB crash recovery with a big buf pool

Bug #29847	Large CPU usage of InnoDB crash recovery with a big buf pool
Submitted:	17 Jul 2007 15:38	Modified:	13 May 2010 14:34
Reporter:	Heikki Tuuri
Status:	Closed
Category:	MySQL Server: InnoDB storage engine	Severity:	S2 (Serious)
Version:	All	OS:	Any
Assigned to:	Inaam Rana	CPU Architecture:	Any
Tags:	Contribution

Description:
This was spotted by Peter Zaitsev.

buf_flush_insert_sorted_into_flush_list() unnecessarily puts the blocks into the flush list sorted by lsn, though that is only needed if InnoDB does redo log application in the background, in connection with other processing.

How to repeat:
See above.

Suggested fix:
Remove the sort in current code. Remove possible assertions that fail.

An observation from Inaam:

We *cannot* simply remove the sorting logic because, though we do redo log application before bringing the server online, we don't flush the list at the end of the log application.

A new suggested fix:

In crash recovery, using a red-black tree or some other sorted data structure to insert into the flush list in O(log(n)) time is the best solution. This involves some extra code, but we preserve the nice simple feature that the flush list is always sorted on the 'oldest modification' lsn.
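To illustrate why an O(log(n)) lookup matters here, the following is a minimal sketch, not InnoDB code. It counts how many list entries are examined to find the insertion slot with a linear scan versus a binary search over the same sorted sequence. A plain Python list stands in for the flush list and integers stand in for 'oldest modification' lsn values; the function names are made up for this example. The list insert itself still shifts elements, which a red-black tree would avoid, so only the search cost is modelled.

```python
import random


def find_slot_linear(sorted_lsns, lsn):
    """Scan from the front; return (position, entries examined)."""
    pos = 0
    while pos < len(sorted_lsns) and sorted_lsns[pos] < lsn:
        pos += 1
    return pos, pos


def find_slot_binary(sorted_lsns, lsn):
    """Binary search; return (position, entries examined)."""
    lo, hi, probes = 0, len(sorted_lsns), 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_lsns[mid] < lsn:
            lo = mid + 1
        else:
            hi = mid
    return lo, probes


def compare_probe_counts(n, seed=0):
    """Insert n distinct LSNs in arbitrary order; total the work per strategy."""
    rng = random.Random(seed)
    lsns = rng.sample(range(1, 10 * n), n)
    flush_list = []
    linear_total = binary_total = 0
    for lsn in lsns:
        pos_linear, probes_linear = find_slot_linear(flush_list, lsn)
        pos_binary, probes_binary = find_slot_binary(flush_list, lsn)
        assert pos_linear == pos_binary  # both strategies agree on the slot
        linear_total += probes_linear
        binary_total += probes_binary
        flush_list.insert(pos_binary, lsn)
    assert flush_list == sorted(lsns)  # the list stays sorted throughout
    return linear_total, binary_total


if __name__ == "__main__":
    for n in (500, 1000, 2000):
        print(n, compare_probe_counts(n))
```

In this model the linear total grows roughly with the square of the number of pages, while the binary total grows roughly as n log(n), which is the gap the proposal aims to close.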

Reclassifying as a feature request, because this requires substantial new code.

Feature request, 'make crash recovery work' ?

You should really fix this. No one wants to wait for their servers to recover.

One reported effect of this performance limitation is that a system with a 24GB buffer pool could only recover 10% after 2 hours. With a 4G buffer pool and innodb_flush_method=O_DIRECT removed, the system recovered completely in 30 minutes.

Partial workarounds.

1. During recovery, temporarily reduce innodb_buffer_pool_size to force InnoDB to flush pages from the flush list. A value of 4G is likely to be reasonable.

2. During recovery, temporarily remove O_DIRECT so that the operating system can cache changes during recovery.

Users who have RAID setups with many drives (at least 4-6) should investigate this for normal use because O_DIRECT on Linux serialises writes and can sometimes reduce performance in RAID setups. Those using innodb_file_per_table are less likely to be affected because the many different files increase the chance of multiple writes being possible. Experiment to determine the best setting for your system.

3. At a performance cost during normal operations, decrease the maximum number of dirty pages with changes to apply by reducing one or more of these settings, in approximate best-to-worst order of performance effect:

innodb_max_dirty_pages_pct
innodb_log_file_size
innodb_buffer_pool_size

This performance improvement request is discussed at:

http://www.mysqlperformanceblog.com/2007/07/17/innodb-recovery-is-large-buffer-pool-always...

http://www.mysqlperformanceblog.com/2008/09/04/how-quickly-you-should-expect-to-see-bugs-f...

http://dammit.lt/2008/10/26/innodb-crash-recovery/

I think I have found a solution to this problem. Maintain an array that contains some of the pages in the flush list, and update the array periodically. When inserting a block into the flush list, compare the block with the pages in the array and return a page whose oldest_modification is older than the block's. The search through the flush list can then start from that page instead of always starting from the beginning.
I assume that the pages in the array will not be flushed and moved to another list (LRU or free list) during recovery; if so, my solution works. If not, the value of in_flush_list would have to be maintained (which needs further changes), and the block compared only with pages that are still in the flush list.
I have tested it. With the buffer pool set to 32G, recovery time decreased from 3 hours to 3 minutes.
Is there something I have misunderstood? My code is only about 50 lines and should be very easy to understand. ^_^

Add a function and modify buf_flush_insert_sorted_into_flush_list().

Attachment: fast_insert_into_flush_list.txt (text/plain), 3.55 KiB.
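The attached patch is not reproduced here. As a rough illustration of the idea described in the comment, the sketch below keeps a small array of sampled pages and starts each sorted insert from the nearest sampled page that is older than the new block. It is one interpretation under stated assumptions: the class and method names, the sampling stride, the resampling interval, and the choice to walk the list from oldest to newest are all made up for this example, and pages never leave the list (the case the commenter assumes holds during recovery).

```python
import random


class Page:
    def __init__(self, lsn):
        self.lsn = lsn      # stands in for oldest_modification
        self.next = None


class SampledFlushList:
    """Singly linked list kept in ascending LSN order (assumed direction)."""

    def __init__(self, stride=64, resample_every=256):
        self.head = None
        self.size = 0
        self.samples = []   # every stride-th page, in ascending LSN order
        self.stride = stride
        self.resample_every = resample_every
        self.steps = 0      # pages visited, including sampling overhead

    def _resample(self):
        """Periodically rebuild the sample array by walking the whole list."""
        self.samples = []
        node, i = self.head, 0
        while node is not None:
            if i % self.stride == 0:
                self.samples.append(node)
            node, i = node.next, i + 1
            self.steps += 1

    def _hint(self, lsn):
        """Return the last sampled page that is older than lsn, if any."""
        hint = None
        for page in self.samples:
            self.steps += 1
            if page.lsn >= lsn:
                break
            hint = page
        return hint

    def insert(self, lsn, use_samples=True):
        prev = self._hint(lsn) if use_samples else None
        node = prev.next if prev is not None else self.head
        while node is not None and node.lsn < lsn:
            prev, node = node, node.next
            self.steps += 1
        page = Page(lsn)
        page.next = node
        if prev is None:
            self.head = page
        else:
            prev.next = page
        self.size += 1
        if use_samples and self.size % self.resample_every == 0:
            self._resample()

    def lsns(self):
        out, node = [], self.head
        while node is not None:
            out.append(node.lsn)
            node = node.next
        return out


def run(n=3000, use_samples=True, seed=0):
    """Insert n distinct LSNs in arbitrary order; return pages visited."""
    rng = random.Random(seed)
    order = rng.sample(range(1, 10 * n), n)
    flush_list = SampledFlushList()
    for lsn in order:
        flush_list.insert(lsn, use_samples)
    assert flush_list.lsns() == sorted(order)  # ordering is preserved
    return flush_list.steps
```

Comparing `run(use_samples=True)` with `run(use_samples=False)` gives a feel for how much scanning the sampled starting points save in this toy model. Stale samples remain valid starting points only because nothing is ever removed from the list, which is why the commenter's caveat about pages leaving the flush list matters.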

Thank you for your suggested patch! Inaam has worked on this problem on our side and can comment further on this.

Please also see
http://www.mysqlperformanceblog.com/2009/07/07/improving-innodb-recovery-time/ for a patch that dramatically speeds up InnoDB recovery. It would be great if it could be incorporated into InnoDB shortly.

Hi,

Why don't we flush pages when the flush list gets large? Flushing every 1000 pages or so may be a reasonable workaround, although it's not a true fix.

Another temporary workaround is to add a new parameter like innodb_max_dirty_pages_pct_recovery to cap the amount of dirty pages only during recovery. Both may be easy to implement. Setting a low innodb_max_dirty_pages_pct may improve recovery performance, but may spoil performance during normal operation. So we need a parameter that only takes effect during recovery.
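A toy model of the first workaround above, assuming the cost being bounded is the linear scan of the flush list on each sorted insert: if the list is emptied every fixed number of pages, no scan can be longer than that cap. In the sketch, clearing a Python list stands in for writing the pages to disk. I/O cost and pages that are dirtied again after being flushed are ignored, and all names are made up for this example.

```python
import random


def recovery_scan_steps(lsns, flush_every=None):
    """Insert LSNs with a linear sorted scan.

    If flush_every is set, empty the list (a stand-in for flushing the
    pages) each time it reaches that many entries.
    Returns (total entries examined, number of flushes).
    """
    flush_list = []
    steps = flushes = 0
    for lsn in lsns:
        pos = 0
        while pos < len(flush_list) and flush_list[pos] < lsn:
            pos += 1
            steps += 1
        flush_list.insert(pos, lsn)
        if flush_every is not None and len(flush_list) >= flush_every:
            flush_list.clear()
            flushes += 1
    return steps, flushes


def demo(n=4000, flush_every=500, seed=0):
    """Compare an uncapped flush list with one flushed every flush_every pages."""
    rng = random.Random(seed)
    lsns = rng.sample(range(1, 10 * n), n)
    return recovery_scan_steps(lsns), recovery_scan_steps(lsns, flush_every)


if __name__ == "__main__":
    uncapped, capped = demo()
    print("uncapped:", uncapped)
    print("capped:  ", capped)
```

The model only shows that capping the list length bounds the scan cost; it says nothing about the extra write traffic the flushes would cause.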

Valerie, 

As you know, recovery is a very critical function, and improvements in this area will require significant design and testing.  Users don't typically "test" recovery.  Instead, they RELY on it.

Innobase has done a great deal of work in the past several years to improve testing focused on various failure scenarios (for example, crashes during recovery).   We've found and fixed bugs that users would never have found or been able to reproduce, but that could have been disastrous had they encountered them.  So, we do understand the importance of very diligent work in this area.

Note that with the InnoDB Plugin, it is now possible to improve the way checkpoints are performed, both to smooth out performance (and avoid bursts of redo log write activity), while also keeping the buffer pool "cleaner".   Users with very large buffer pools should experiment with adaptive flushing and appropriate settings of innodb_io_capacity.  With appropriate tuning, these parameters (and the ability to have smaller redo logs) can improve recovery time in many situations.

With all of the uncertainties surrounding our companies at the moment, and the state of product planning for future releases of InnoDB and MySQL, we are not prepared to make a projection of when we might be able to address this feature request ("improve performance of recovery").

Thanks for your understanding.

See http://blogs.innodb.com/wp/2010/04/innodb-performance-recovery/ for the planned fix for this and bug #49535 for some related work that together are expected to make crash recovery as much as thirty times as fast for large recovery jobs.

Fixed in plugin 1.0.7.

Noted in 5.1.46, 5.5.4 changelogs.

The redo scan during InnoDB recovery used excessive CPU. The efficiency of this scan was improved for InnoDB Plugin, significantly speeding up crash recovery.