Bug #45015 InnoDB buffer pool can be severely affected by table scans
Submitted: 21 May 2009 18:10 Modified: 20 Jun 2010 17:18
Reporter: Harrison Fisk Email Updates:
Status: Closed Impact on me:
Category:MySQL Server: InnoDB storage engine Severity:S4 (Feature request)
Version:5.0/5.1 OS:Any
Assigned to: Marko Mäkelä CPU Architecture:Any
Tags: buffer pool, innodb, LRU, table scan

[21 May 2009 18:10] Harrison Fisk
Large full table scans can have the affect of evicting a large portion of the useful pages out of the InnoDB buffer pool.  This lowers your buffer pool hit rate and can severely reduce performance.

mysqldump (or similar tools) are very bad at doing this, since it will do many full table scans and blow out your cache.

How to repeat:
1.  Have a large busy well buffered database in InnoDB.
2.  Run mysqldump
3.  Watch your buffer pool hit rate decrease a great deal and spike disk I/O

Suggested fix:
There are a few potential ways to resolve this:

1.  Allow for some hint in a table scan query to pass to the storage engine to not cache data.

2.  Change the buffer pool algorithm to not use pure LRU, but something that is table scan resistant.  Options would be to use the mid-point insertion strategy that MyISAM can do or something like the ARC algorithm.

3.  Allow for multiple buffer pools where you can assign certain tables to different caches to better control this.
[21 May 2009 18:21] Harrison Fisk
Some URLs which are relevant:

Midpoint insertion strategy used by the MyISAM key cache:


ARC systems:


Q&A with Heikki saying there is no table scan resistance in InnoDB:

[16 Jun 2009 8:35] Marko Mäkelä
InnoDB already implements an LRU cache that is divided into two parts, like MyISAM.

Some more pointers:


The latter introduces LIRS (Low Inter-reference Recency Set), an improvement over LRU. There would be a few challenges in an InnoDB implementation:

1. How to store the non-resident set of blocks (and how much of it to store)? 

Many advanced algorithms (LIRS, LRU-K, ARC, …) keep track of some non-resident blocks, to have access to more history.

2. How to efficiently compute the IRR (Inter-Reference Recency, number of other *distinct* blocks accessed between two consecutive references to the block)?

3. How to modify table and index scans so that they only invoke buf_page_get_gen() once per B-tree page, as required by the IRR computation?

Without this change, table scans would get IRR=0 instead of the desired IRR=∞. Currently, InnoDB will store the persisent cursor position, commit the mini-transaction (releasing the page) and restore the cursor (re-latching the page) each time it returns a record to the user. The prebuilt->fetch_cache[] alleviates this a little, but it is not tightly coupled with B-tree pages.
[27 Aug 2009 7:22] Marko Mäkelä
InnoDB Plugin 1.0.5 will address this by introducing two settable global variables, innodb_old_blocks_pct and innodb_old_blocks_time.

The parameter innodb_old_blocks_pct (5..95) controls the desired amount of "old" blocks in the LRU list.  The default is 37, corresponding to the old fixed ratio of 3/8.

Each time a block is accessed, it will be moved to the "new" blocks if its first access was at least innodb_old_blocks_time milliseconds ago (default 0, meaning every block).  The idea is that in index scans, blocks will be accessed a few times within innodb_old_blocks_time, and they will remain in the "old" section of the LRU list.  Thus, when innodb_old_blocks_time is nonzero, blocks retrieved for one-time index scans will be more likely candidates for eviction than blocks that are accessed in random patterns.
[8 Sep 2009 11:20] James Day
Marko, am I right in thinking that innodb_old_blocks_time = 1 would often be reasonably close in meaning to "if table scans are uncommon, only move non-leaf nodes to the old block list"?

If so, 1 seems like a better default than 0 because it would reduce the surprise effect that unexpected table scans produce for end users, by greatly decreasing the number of pages that become old during a scan. Maybe better to suggest it in documentation than change the default behavior at this time.
[8 Sep 2009 12:15] Marko Mäkelä
James, you are probably right. When the adaptive hash index works efficiently, read-heavy workloads should access mostly the leaf pages. However, we are somewhat reluctant to change the default behaviour in the built-in InnoDB of MySQL 5.1. A change of the default setting could be considered for the InnoDB Plugin.
[14 Oct 2009 14:39] Bugs System
Pushed into 5.1.41 (revid:joro@sun.com-20091014143611-cphb0enjlx6lpat1) (version source revid:satya.bn@sun.com-20091008130559-4b6sduldhka30zuz) (merge vers: 5.1.40) (pib:13)
[22 Oct 2009 6:35] Bugs System
Pushed into 6.0.14-alpha (revid:alik@sun.com-20091022063126-l0qzirh9xyhp0bpc) (version source revid:alik@sun.com-20091019135554-s1pvptt6i750lfhv) (merge vers: 6.0.14-alpha) (pib:13)
[22 Oct 2009 7:08] Bugs System
Pushed into 5.5.0-beta (revid:alik@sun.com-20091022060553-znkmxm0g0gm6ckvw) (version source revid:alik@sun.com-20091019131022-2o2ymjfjjoraq833) (merge vers: 5.5.0-beta) (pib:13)
[3 Nov 2009 21:31] Paul DuBois
Noted in 5.1.41, 5.5.0 changelogs. (Feature is not in 6.0.14)

In the default operation of the InnoDB buffer pool, a block is loaded
at the midpoint for the first access and then moved immediately to
the head of the list as soon as another access occurs. In the case of
a table scan (such as performed for a mysqldump operation), each
block read by the scan ends up moving to the head of the list because
multiple rows are accessed from each block. This occurs even for a
one-time scan, where the blocks are not otherwise used by other 
queries. Blocks may also be loaded by the read-ahead background
thread and then moved to the head of the new sublist by a single 
access. These effects are disadvantageous because they push blocks
that are in heavy use by other queries out of the new sublist to the
old sublist where they become subject to eviction.

InnoDB Plugin now provides two system variables that enable LRU
algorithm tuning: 

innodb_old_blocks_pct: Specifies the approximate percentage of the
buffer pool used for the old block sublist. The range of values is 5
to 95. The default value is 37 (that is, 3/8 of the pool).

innodb_old_blocks_time: Specifies how long in milliseconds (ms) a
block inserted into the old sublist must stay there before it can be
moved to the new sublist. The default value is 0, so after a block is
inserted into the old sublist, it moves immediately to the new
sublist the next time it is accessed, no matter how soon after 
insertion the next access occurs. If the value is greater than 0,
blocks remain in the old sublist until an access occurs at least that
many ms after initial insertion. For example, a value of 1000 causes
blocks to stay in the old sublist for 1 second before moving to the
new sublist.
[4 Nov 2009 15:56] Paul DuBois
For additional information about the new system variables, see:

[18 Dec 2009 10:35] Bugs System
Pushed into 5.1.41-ndb-7.1.0 (revid:jonas@mysql.com-20091218102229-64tk47xonu3dv6r6) (version source revid:jonas@mysql.com-20091218095730-26gwjidfsdw45dto) (merge vers: 5.1.41-ndb-7.1.0) (pib:15)
[18 Dec 2009 10:51] Bugs System
Pushed into 5.1.41-ndb-6.2.19 (revid:jonas@mysql.com-20091218100224-vtzr0fahhsuhjsmt) (version source revid:jonas@mysql.com-20091217101452-qwzyaig50w74xmye) (merge vers: 5.1.41-ndb-6.2.19) (pib:15)
[18 Dec 2009 11:06] Bugs System
Pushed into 5.1.41-ndb-6.3.31 (revid:jonas@mysql.com-20091218100616-75d9tek96o6ob6k0) (version source revid:jonas@mysql.com-20091217154335-290no45qdins5bwo) (merge vers: 5.1.41-ndb-6.3.31) (pib:15)
[18 Dec 2009 11:20] Bugs System
Pushed into 5.1.41-ndb-7.0.11 (revid:jonas@mysql.com-20091218101303-ga32mrnr15jsa606) (version source revid:jonas@mysql.com-20091218064304-ezreonykd9f4kelk) (merge vers: 5.1.41-ndb-7.0.11) (pib:15)
[5 May 2010 15:17] Bugs System
Pushed into 5.1.47 (revid:joro@sun.com-20100505145753-ivlt4hclbrjy8eye) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)
[6 May 2010 17:50] Paul DuBois
Push resulted from incorporation of InnoDB tree. No changes pertinent to this bug.
[28 May 2010 5:49] Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100524190136-egaq7e8zgkwb9aqi) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (pib:16)
[28 May 2010 6:19] Bugs System
Pushed into 6.0.14-alpha (revid:alik@sun.com-20100524190941-nuudpx60if25wsvx) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)
[28 May 2010 6:46] Bugs System
Pushed into 5.5.5-m3 (revid:alik@sun.com-20100524185725-c8k5q7v60i5nix3t) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)
[29 May 2010 23:24] Paul DuBois
Push resulted from incorporation of InnoDB tree. No changes pertinent to this bug.
[15 Jun 2010 8:16] Bugs System
Pushed into 5.5.5-m3 (revid:alik@sun.com-20100615080459-smuswd9ooeywcxuc) (version source revid:mmakela@bk-internal.mysql.com-20100415070122-1nxji8ym4mao13ao) (merge vers: 5.1.47) (pib:16)
[15 Jun 2010 8:32] Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100615080558-cw01bzdqr1bdmmec) (version source revid:mmakela@bk-internal.mysql.com-20100415070122-1nxji8ym4mao13ao) (pib:16)
[17 Jun 2010 11:50] Bugs System
Pushed into 5.1.47-ndb-7.0.16 (revid:martin.skold@mysql.com-20100617114014-bva0dy24yyd67697) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)
[17 Jun 2010 12:28] Bugs System
Pushed into 5.1.47-ndb-6.2.19 (revid:martin.skold@mysql.com-20100617115448-idrbic6gbki37h1c) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)
[17 Jun 2010 13:15] Bugs System
Pushed into 5.1.47-ndb-6.3.35 (revid:martin.skold@mysql.com-20100617114611-61aqbb52j752y116) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)
[2 Mar 2012 8:52] MySQL Verification Team
Take a look at this:
[6 Mar 2012 23:53] James Day
Yes, setting innodb_old_blocks_time to 1000 to 5000 is a good thing and likely to be helpful for quite a lot of workloads.