Bug #45015 InnoDB buffer pool can be severely affected by table scans
Submitted: 21 May 20:10 Modified: 3 Nov 22:31
Reporter: Harrison Fisk
Status: Closed
Category:Server: InnoDB Severity:S4 (Feature request)
Version:5.0/5.1 OS:Any
Assigned to: Marko Mäkelä Target Version:
Tags: buffer pool, LRU, table scan, innodb
Triage: Needs Triage: D5 (Feature request)

[21 May 20:10] Harrison Fisk
Description:
Large full table scans can have the affect of evicting a large portion of the useful
pages out of the InnoDB buffer pool.  This lowers your buffer pool hit rate and can
severely reduce performance.

mysqldump (or similar tools) are very bad at doing this, since it will do many full table
scans and blow out your cache.

How to repeat:
1.  Have a large busy well buffered database in InnoDB.
2.  Run mysqldump
3.  Watch your buffer pool hit rate decrease a great deal and spike disk I/O

Suggested fix:
There are a few potential ways to resolve this:

1.  Allow for some hint in a table scan query to pass to the storage engine to not cache
data.

2.  Change the buffer pool algorithm to not use pure LRU, but something that is table
scan resistant.  Options would be to use the mid-point insertion strategy that MyISAM can
do or something like the ARC algorithm.

3.  Allow for multiple buffer pools where you can assign certain tables to different
caches to better control this.
[21 May 20:21] Harrison Fisk
Some URLs which are relevant:

Midpoint insertion strategy used by the MyISAM key cache:

http://dev.mysql.com/doc/refman/5.0/en/midpoint-insertion.html

ARC systems:

http://en.wikipedia.org/wiki/Adaptive_Replacement_Cache

Q&A with Heikki saying there is no table scan resistance in InnoDB:

http://www.mysqlperformanceblog.com/2007/10/26/heikki-tuuri-innodb-answers-part-i/
[16 Jun 10:35] Marko Mäkelä
InnoDB already implements an LRU cache that is divided into two parts, like MyISAM.

Some more pointers:

http://en.wikipedia.org/wiki/Page_replacement_algorithm
http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-05-11.pdf

The latter introduces LIRS (Low Inter-reference Recency Set), an improvement over LRU.
There would be a few challenges in an InnoDB implementation:

1. How to store the non-resident set of blocks (and how much of it to store)? 

Many advanced algorithms (LIRS, LRU-K, ARC, …) keep track of some non-resident blocks,
to have access to more history.

2. How to efficiently compute the IRR (Inter-Reference Recency, number of other
*distinct* blocks accessed between two consecutive references to the block)?

3. How to modify table and index scans so that they only invoke buf_page_get_gen() once
per B-tree page, as required by the IRR computation?

Without this change, table scans would get IRR=0 instead of the desired IRR=∞.
Currently, InnoDB will store the persisent cursor position, commit the mini-transaction
(releasing the page) and restore the cursor (re-latching the page) each time it returns a
record to the user. The prebuilt->fetch_cache[] alleviates this a little, but it is not
tightly coupled with B-tree pages.
[27 Aug 9:22] Marko Mäkelä
InnoDB Plugin 1.0.5 will address this by introducing two settable global variables,
innodb_old_blocks_pct and innodb_old_blocks_time.

The parameter innodb_old_blocks_pct (5..95) controls the desired amount of "old" blocks
in the LRU list.  The default is 37, corresponding to the old fixed ratio of 3/8.

Each time a block is accessed, it will be moved to the "new" blocks if its first access
was at least innodb_old_blocks_time milliseconds ago (default 0, meaning every block). 
The idea is that in index scans, blocks will be accessed a few times within
innodb_old_blocks_time, and they will remain in the "old" section of the LRU list.  Thus,
when innodb_old_blocks_time is nonzero, blocks retrieved for one-time index scans will be
more likely candidates for eviction than blocks that are accessed in random patterns.
[8 Sep 13:20] James Day
Marko, am I right in thinking that innodb_old_blocks_time = 1 would often be reasonably
close in meaning to "if table scans are uncommon, only move non-leaf nodes to the old
block list"?

If so, 1 seems like a better default than 0 because it would reduce the surprise effect
that unexpected table scans produce for end users, by greatly decreasing the number of
pages that become old during a scan. Maybe better to suggest it in documentation than
change the default behavior at this time.
[8 Sep 14:15] Marko Mäkelä
James, you are probably right. When the adaptive hash index works efficiently, read-heavy
workloads should access mostly the leaf pages. However, we are somewhat reluctant to
change the default behaviour in the built-in InnoDB of MySQL 5.1. A change of the default
setting could be considered for the InnoDB Plugin.
[14 Oct 16:39] Bugs System
Pushed into 5.1.41 (revid:joro@sun.com-20091014143611-cphb0enjlx6lpat1) (version source
revid:satya.bn@sun.com-20091008130559-4b6sduldhka30zuz) (merge vers: 5.1.40) (pib:13)
[22 Oct 8:35] Bugs System
Pushed into 6.0.14-alpha (revid:alik@sun.com-20091022063126-l0qzirh9xyhp0bpc) (version
source revid:alik@sun.com-20091019135554-s1pvptt6i750lfhv) (merge vers: 6.0.14-alpha)
(pib:13)
[22 Oct 9:08] Bugs System
Pushed into 5.5.0-beta (revid:alik@sun.com-20091022060553-znkmxm0g0gm6ckvw) (version
source revid:alik@sun.com-20091019131022-2o2ymjfjjoraq833) (merge vers: 5.5.0-beta)
(pib:13)
[3 Nov 22:31] Paul DuBois
Noted in 5.1.41, 5.5.0 changelogs. (Feature is not in 6.0.14)

In the default operation of the InnoDB buffer pool, a block is loaded
at the midpoint for the first access and then moved immediately to
the head of the list as soon as another access occurs. In the case of
a table scan (such as performed for a mysqldump operation), each
block read by the scan ends up moving to the head of the list because
multiple rows are accessed from each block. This occurs even for a
one-time scan, where the blocks are not otherwise used by other 
queries. Blocks may also be loaded by the read-ahead background
thread and then moved to the head of the new sublist by a single 
access. These effects are disadvantageous because they push blocks
that are in heavy use by other queries out of the new sublist to the
old sublist where they become subject to eviction.

InnoDB Plugin now provides two system variables that enable LRU
algorithm tuning: 

innodb_old_blocks_pct: Specifies the approximate percentage of the
buffer pool used for the old block sublist. The range of values is 5
to 95. The default value is 37 (that is, 3/8 of the pool).

innodb_old_blocks_time: Specifies how long in milliseconds (ms) a
block inserted into the old sublist must stay there before it can be
moved to the new sublist. The default value is 0, so after a block is
inserted into the old sublist, it moves immediately to the new
sublist the next time it is accessed, no matter how soon after 
insertion the next access occurs. If the value is greater than 0,
blocks remain in the old sublist until an access occurs at least that
many ms after initial insertion. For example, a value of 1000 causes
blocks to stay in the old sublist for 1 second before moving to the
new sublist.
[4 Nov 16:56] Paul DuBois
For additional information about the new system variables, see:

http://dev.mysql.com/doc/refman/5.1/en/innodb-buffer-pool.html