MySQL Bugs: #45015: InnoDB buffer pool can be severely affected by table scans

Bug #45015	InnoDB buffer pool can be severely affected by table scans
Submitted:	21 May 2009 18:10	Modified:	20 Jun 2010 17:18
Reporter:	Harrison Fisk	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S4 (Feature request)
Version:	5.0/5.1	OS:	Any
Assigned to:	Marko Mäkelä	CPU Architecture:	Any
Tags:	buffer pool, innodb, LRU, table scan

Description:
Large full table scans can have the affect of evicting a large portion of the useful pages out of the InnoDB buffer pool.  This lowers your buffer pool hit rate and can severely reduce performance.

mysqldump (or similar tools) are very bad at doing this, since it will do many full table scans and blow out your cache.

How to repeat:
1.  Have a large busy well buffered database in InnoDB.
2.  Run mysqldump
3.  Watch your buffer pool hit rate decrease a great deal and spike disk I/O

Suggested fix:
There are a few potential ways to resolve this:

1.  Allow for some hint in a table scan query to pass to the storage engine to not cache data.

2.  Change the buffer pool algorithm to not use pure LRU, but something that is table scan resistant.  Options would be to use the mid-point insertion strategy that MyISAM can do or something like the ARC algorithm.

3.  Allow for multiple buffer pools where you can assign certain tables to different caches to better control this.

Some URLs which are relevant:

Midpoint insertion strategy used by the MyISAM key cache:

http://dev.mysql.com/doc/refman/5.0/en/midpoint-insertion.html

ARC systems:

http://en.wikipedia.org/wiki/Adaptive_Replacement_Cache

Q&A with Heikki saying there is no table scan resistance in InnoDB:

http://www.mysqlperformanceblog.com/2007/10/26/heikki-tuuri-innodb-answers-part-i/

InnoDB already implements an LRU cache that is divided into two parts, like MyISAM.

Some more pointers:

http://en.wikipedia.org/wiki/Page_replacement_algorithm
http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-05-11.pdf

The latter introduces LIRS (Low Inter-reference Recency Set), an improvement over LRU. There would be a few challenges in an InnoDB implementation:

1. How to store the non-resident set of blocks (and how much of it to store)? 

Many advanced algorithms (LIRS, LRU-K, ARC, …) keep track of some non-resident blocks, to have access to more history.

2. How to efficiently compute the IRR (Inter-Reference Recency, number of other *distinct* blocks accessed between two consecutive references to the block)?

3. How to modify table and index scans so that they only invoke buf_page_get_gen() once per B-tree page, as required by the IRR computation?

Without this change, table scans would get IRR=0 instead of the desired IRR=∞. Currently, InnoDB will store the persisent cursor position, commit the mini-transaction (releasing the page) and restore the cursor (re-latching the page) each time it returns a record to the user. The prebuilt->fetch_cache[] alleviates this a little, but it is not tightly coupled with B-tree pages.

InnoDB Plugin 1.0.5 will address this by introducing two settable global variables, innodb_old_blocks_pct and innodb_old_blocks_time.

The parameter innodb_old_blocks_pct (5..95) controls the desired amount of "old" blocks in the LRU list.  The default is 37, corresponding to the old fixed ratio of 3/8.

Each time a block is accessed, it will be moved to the "new" blocks if its first access was at least innodb_old_blocks_time milliseconds ago (default 0, meaning every block).  The idea is that in index scans, blocks will be accessed a few times within innodb_old_blocks_time, and they will remain in the "old" section of the LRU list.  Thus, when innodb_old_blocks_time is nonzero, blocks retrieved for one-time index scans will be more likely candidates for eviction than blocks that are accessed in random patterns.

Marko, am I right in thinking that innodb_old_blocks_time = 1 would often be reasonably close in meaning to "if table scans are uncommon, only move non-leaf nodes to the old block list"?

If so, 1 seems like a better default than 0 because it would reduce the surprise effect that unexpected table scans produce for end users, by greatly decreasing the number of pages that become old during a scan. Maybe better to suggest it in documentation than change the default behavior at this time.

James, you are probably right. When the adaptive hash index works efficiently, read-heavy workloads should access mostly the leaf pages. However, we are somewhat reluctant to change the default behaviour in the built-in InnoDB of MySQL 5.1. A change of the default setting could be considered for the InnoDB Plugin.

Pushed into 5.1.41 (revid:joro@sun.com-20091014143611-cphb0enjlx6lpat1) (version source revid:satya.bn@sun.com-20091008130559-4b6sduldhka30zuz) (merge vers: 5.1.40) (pib:13)

Pushed into 6.0.14-alpha (revid:alik@sun.com-20091022063126-l0qzirh9xyhp0bpc) (version source revid:alik@sun.com-20091019135554-s1pvptt6i750lfhv) (merge vers: 6.0.14-alpha) (pib:13)

Pushed into 5.5.0-beta (revid:alik@sun.com-20091022060553-znkmxm0g0gm6ckvw) (version source revid:alik@sun.com-20091019131022-2o2ymjfjjoraq833) (merge vers: 5.5.0-beta) (pib:13)

Noted in 5.1.41, 5.5.0 changelogs. (Feature is not in 6.0.14)

In the default operation of the InnoDB buffer pool, a block is loaded
at the midpoint for the first access and then moved immediately to
the head of the list as soon as another access occurs. In the case of
a table scan (such as performed for a mysqldump operation), each
block read by the scan ends up moving to the head of the list because
multiple rows are accessed from each block. This occurs even for a
one-time scan, where the blocks are not otherwise used by other 
queries. Blocks may also be loaded by the read-ahead background
thread and then moved to the head of the new sublist by a single 
access. These effects are disadvantageous because they push blocks
that are in heavy use by other queries out of the new sublist to the
old sublist where they become subject to eviction.

InnoDB Plugin now provides two system variables that enable LRU
algorithm tuning: 

innodb_old_blocks_pct: Specifies the approximate percentage of the
buffer pool used for the old block sublist. The range of values is 5
to 95. The default value is 37 (that is, 3/8 of the pool).

innodb_old_blocks_time: Specifies how long in milliseconds (ms) a
block inserted into the old sublist must stay there before it can be
moved to the new sublist. The default value is 0, so after a block is
inserted into the old sublist, it moves immediately to the new
sublist the next time it is accessed, no matter how soon after 
insertion the next access occurs. If the value is greater than 0,
blocks remain in the old sublist until an access occurs at least that
many ms after initial insertion. For example, a value of 1000 causes
blocks to stay in the old sublist for 1 second before moving to the
new sublist.

For additional information about the new system variables, see:

http://dev.mysql.com/doc/refman/5.1/en/innodb-buffer-pool.html

Pushed into 5.1.41-ndb-7.1.0 (revid:jonas@mysql.com-20091218102229-64tk47xonu3dv6r6) (version source revid:jonas@mysql.com-20091218095730-26gwjidfsdw45dto) (merge vers: 5.1.41-ndb-7.1.0) (pib:15)

Pushed into 5.1.41-ndb-6.2.19 (revid:jonas@mysql.com-20091218100224-vtzr0fahhsuhjsmt) (version source revid:jonas@mysql.com-20091217101452-qwzyaig50w74xmye) (merge vers: 5.1.41-ndb-6.2.19) (pib:15)

Pushed into 5.1.41-ndb-6.3.31 (revid:jonas@mysql.com-20091218100616-75d9tek96o6ob6k0) (version source revid:jonas@mysql.com-20091217154335-290no45qdins5bwo) (merge vers: 5.1.41-ndb-6.3.31) (pib:15)

Pushed into 5.1.41-ndb-7.0.11 (revid:jonas@mysql.com-20091218101303-ga32mrnr15jsa606) (version source revid:jonas@mysql.com-20091218064304-ezreonykd9f4kelk) (merge vers: 5.1.41-ndb-7.0.11) (pib:15)

Pushed into 5.1.47 (revid:joro@sun.com-20100505145753-ivlt4hclbrjy8eye) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)

Push resulted from incorporation of InnoDB tree. No changes pertinent to this bug.
Re-closing.

Pushed into mysql-next-mr (revid:alik@sun.com-20100524190136-egaq7e8zgkwb9aqi) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (pib:16)

Pushed into 6.0.14-alpha (revid:alik@sun.com-20100524190941-nuudpx60if25wsvx) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)

Pushed into 5.5.5-m3 (revid:alik@sun.com-20100524185725-c8k5q7v60i5nix3t) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)

Push resulted from incorporation of InnoDB tree. No changes pertinent to this bug.
Re-closing.

Pushed into 5.5.5-m3 (revid:alik@sun.com-20100615080459-smuswd9ooeywcxuc) (version source revid:mmakela@bk-internal.mysql.com-20100415070122-1nxji8ym4mao13ao) (merge vers: 5.1.47) (pib:16)

Pushed into mysql-next-mr (revid:alik@sun.com-20100615080558-cw01bzdqr1bdmmec) (version source revid:mmakela@bk-internal.mysql.com-20100415070122-1nxji8ym4mao13ao) (pib:16)

Pushed into 5.1.47-ndb-7.0.16 (revid:martin.skold@mysql.com-20100617114014-bva0dy24yyd67697) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)

Pushed into 5.1.47-ndb-6.2.19 (revid:martin.skold@mysql.com-20100617115448-idrbic6gbki37h1c) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)

Pushed into 5.1.47-ndb-6.3.35 (revid:martin.skold@mysql.com-20100617114611-61aqbb52j752y116) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)

Take a look at this:
http://www.mysqlperformanceblog.com/2011/11/13/side-load-may-massively-impact-your-mysql-p...

Yes, setting innodb_old_blocks_time to 1000 to 5000 is a good thing and likely to be helpful for quite a lot of workloads.