Bug #53122 InnoDB recovery uses too big a hash table for redo log records
Submitted: 23 Apr 2010 17:15
Modified: 19 Jun 2010 17:56
Reporter: Mikhail Izioumtchenko
Status: Closed
Category: MySQL Server: InnoDB storage engine
Severity: S2 (Serious)
Version: 5.1+
OS: Any
Assigned to: Marko Mäkelä
CPU Architecture: Any
Triage: Triaged: D3 (Medium)

[23 Apr 2010 17:15] Mikhail Izioumtchenko
InnoDB uses a hash table to store redo log records during recovery.
It is currently allocated as follows:

recv_sys->addr_hash = hash_create(available_memory / 64);

in recv_sys_init()

where available_memory is the InnoDB buffer pool size in bytes.
The hash table is allocated outside of the buffer pool.
Since each hash table cell is a pointer (8 bytes on a 64-bit system), this
amounts to 1G of hash table per 8G of buffer pool, which is a lot.
In one case I could not recover from a crash with the same buffer pool size
because of this, and had to reduce innodb-buffer-pool-size for the recovery
to succeed.
The hash table is oversized. Since redo log records are hashed on space id
and page offset, we need at most as many cells as there were dirty pages
in the buffer pool at the moment of the crash.
For the common case of the same buffer pool size before and after the crash,
this is bounded by the total number of pages in the buffer pool.
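
To put these numbers side by side, here is a minimal sketch of the arithmetic. It assumes 16K uncompressed pages and ignores any rounding that hash_create() may apply to the requested cell count. The helper names are hypothetical and are not InnoDB functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 16 KiB uncompressed pages. */
#define SKETCH_PAGE_BYTES UINT64_C(16384)

/* Cells requested from hash_create() for a pool of pool_bytes. */
static uint64_t recv_hash_cells(uint64_t pool_bytes, uint64_t divisor)
{
    assert(divisor > 0);
    return pool_bytes / divisor;
}

/* Memory taken by the cell array alone; each cell is one pointer. */
static uint64_t recv_hash_array_bytes(uint64_t pool_bytes, uint64_t divisor,
                                      uint64_t ptr_bytes)
{
    return recv_hash_cells(pool_bytes, divisor) * ptr_bytes;
}

/* Upper bound on distinct (space id, page offset) keys: pages in the pool. */
static uint64_t pool_pages(uint64_t pool_bytes)
{
    return pool_bytes / SKETCH_PAGE_BYTES;
}
```

For an 8G pool with 8-byte pointers, recv_hash_array_bytes(8G, 64, 8) comes to 1G, while pool_pages(8G) is 524288, so the array holds 256 cells for every page that could possibly need one.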

How to repeat:
See the code. The case where I could not recover the dataset with the same
buffer pool size had a 29G buffer pool, 32G of RAM in total, and 2G of swap.
This is not a very realistic customer configuration; I hit it accidentally
while trying to create a dataset with a lot of dirty pages for recovery tests.

Suggested fix:
The 64 in the formula above can be replaced with (16*1024) for the built-in
InnoDB, and with 1024 for the plugin, to account for the case where all data
pages are 1K (compressed) pages. A divisor of 1024 uses one sixteenth of the
memory of the current one and works fine in my tests.
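
As a rough illustration of what the suggested divisors would mean for the 29G case, the sketch below computes the size of the cell array alone. It assumes 8-byte cells (a 64-bit build) and ignores any rounding inside hash_create(). The function name is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define GIB (UINT64_C(1) << 30)
#define MIB (UINT64_C(1) << 20)

/* Size in bytes of the recovery hash cell array for a given divisor.
   Assumption: 64-bit build, so each cell is an 8-byte pointer. */
static uint64_t hash_array_bytes(uint64_t pool_bytes, uint64_t divisor)
{
    assert(divisor > 0);
    return (pool_bytes / divisor) * 8;
}
```

With a 29G pool this gives 3712M for the current divisor of 64, 232M for 1024, and 14.5M for 16*1024.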
[28 Apr 2010 8:54] Marko Mäkelä
The recv_sys->addr_hash is allocated in two places of log0recv.c, in both the InnoDB Plugin and the built-in InnoDB in MySQL 5.1:

recv_sys_init(available_memory = buf_pool_get_curr_size()): available_memory / 64
recv_sys_empty_hash(): buf_pool_get_curr_size() / 256

Curiously, recv_sys_empty_hash() will create a smaller hash table. In InnoDB, recv_sys_init() is always invoked with available_memory = buf_pool_get_curr_size(). I believe that the parameter available_memory is passed differently in InnoDB Hot Backup.
[28 Apr 2010 12:34] Marko Mäkelä
If we make recv_addr_t:space,page_no bit-fields, then sizeof(recv_addr_t)==48 on 64-bit systems. On 32-bit systems, it is 28 bytes.

The reasonable divider would have to lie somewhere between 1024/48 == 21 and 16384/28 == 585 (both rounded down). I would suggest replacing the /64 and /256 with /512.
[28 Apr 2010 12:55] Marko Mäkelä
Sorry, sizeof(hash_cell_t) == sizeof(void*), that is, 4 or 8 bytes. The reasonable range for the divider would be between 1024/8 == 128 and 16384/4 == 4096. I hope that 512 provides a reasonable middle ground.
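
The bounds quoted above can be reproduced with the sketch below. It only restates the arithmetic in this comment (page size divided by cell size at the two extremes) and adds the number of cells per page that a given divider yields. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Divider bound as computed in the comment: page size / cell size. */
static uint64_t divider_bound(uint64_t page_bytes, uint64_t cell_bytes)
{
    assert(cell_bytes > 0);
    return page_bytes / cell_bytes;
}

/* Hash cells allocated per buffer pool page for a given divider,
   since the cell count is pool size in bytes / divider. */
static uint64_t cells_per_page(uint64_t page_bytes, uint64_t divider)
{
    assert(divider > 0);
    return page_bytes / divider;
}
```

With a divider of 512, a pool made up entirely of 1K compressed pages still gets 2 cells per page, and a pool of 16K pages gets 32 cells per page.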
[28 Apr 2010 13:02] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

[29 Apr 2010 13:04] Heikki Tuuri
I agree that InnoDB used excessively large hash table arrays, and it makes sense to make them smaller.
[5 May 2010 15:19] Bugs System
Pushed into 5.1.47 (revid:joro@sun.com-20100505145753-ivlt4hclbrjy8eye) (version source revid:kristofer.pettersson@sun.com-20100503172109-f9hracq5pqsaomb1) (merge vers: 5.1.47) (pib:16)
[12 May 2010 5:26] Marko Mäkelä
For the documentation:

InnoDB stores redo log records in a hash table during recovery. On 64-bit systems, this hash table was 1/8 of the buffer pool size. The dimension of the hash table was reduced to 1/64 of the buffer pool size (or 1/128 on 32-bit systems).
[12 May 2010 19:28] Paul Dubois
Noted in 5.1.47, 5.5.5 changelogs.
[28 May 2010 6:03] Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100524190136-egaq7e8zgkwb9aqi) (version source revid:alik@sun.com-20100512070920-xgpmqeytp0gc183c) (pib:16)
[28 May 2010 6:31] Bugs System
Pushed into 6.0.14-alpha (revid:alik@sun.com-20100524190941-nuudpx60if25wsvx) (version source revid:alik@sun.com-20100507093037-7cykrx1n73v0tetc) (merge vers: 6.0.14-alpha) (pib:16)
[28 May 2010 6:59] Bugs System
Pushed into 5.5.5-m3 (revid:alik@sun.com-20100524185725-c8k5q7v60i5nix3t) (version source revid:alexey.kopytov@sun.com-20100507164602-8w09samq3mpvbxbn) (merge vers: 5.5.5-m3) (pib:16)
[29 May 2010 22:42] Paul Dubois
Noted in 6.0.14 changelog.
[17 Jun 2010 12:07] Bugs System
Pushed into 5.1.47-ndb-7.0.16 (revid:martin.skold@mysql.com-20100617114014-bva0dy24yyd67697) (version source revid:martin.skold@mysql.com-20100616204905-jxjg342w35ks9vfy) (merge vers: 5.1.47-ndb-7.0.16) (pib:16)
[17 Jun 2010 12:52] Bugs System
Pushed into 5.1.47-ndb-6.2.19 (revid:martin.skold@mysql.com-20100617115448-idrbic6gbki37h1c) (version source revid:martin.skold@mysql.com-20100615090726-jotpykke96le59w5) (merge vers: 5.1.47-ndb-6.2.19) (pib:16)
[17 Jun 2010 13:34] Bugs System
Pushed into 5.1.47-ndb-6.3.35 (revid:martin.skold@mysql.com-20100617114611-61aqbb52j752y116) (version source revid:martin.skold@mysql.com-20100616120453-jh7wr05z1vf7r8pm) (merge vers: 5.1.47-ndb-6.3.35) (pib:16)