Bug #81615 oltp run will hang while resizing innodb buffer pool size online
Submitted: 26 May 2016 13:56 Modified: 8 Jul 2016 8:09
Reporter: Shahriyar Rzayev Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S2 (Serious)
Version:5.7.12, 5.7.13 OS:Any
Assigned to: CPU Architecture:Any

[26 May 2016 13:56] Shahriyar Rzayev
Description:
Running sysbench:

sysbench --db-driver=mysql --mysql-table-engine=InnoDB --mysql-db=dbtest --mysql-user=root --mysql-password=Baku12345#  --num-threads=10 --test=/usr/share/doc/sysbench/tests/db/oltp.lua --max-requests=0 --report-interval=2 run

If you try to resize buffer pool size:

mysql> set global innodb_buffer_pool_size=107374182400;
Query OK, 0 rows affected (0.02 sec)

The result from sysbench side you will see some drops:

[ 288s] threads: 10, tps: 446.35, reads: 6307.62, writes: 1792.86, response time: 29.20ms (95%), errors: 3.99, reconnects:  0.00
[ 290s] threads: 10, tps: 429.37, reads: 6041.26, writes: 1729.01, response time: 30.64ms (95%), errors: 2.51, reconnects:  0.00
[ 292s] threads: 10, tps: 24.01, reads: 288.65, writes: 80.04, response time: 54.22ms (95%), errors: 0.00, reconnects:  0.00
[ 294s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 296s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 298s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 300s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 302s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 304s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 306s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 308s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 310s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 312s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 314s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 316s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 318s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 320s] threads: 10, tps: 270.37, reads: 3874.16, writes: 1097.98, response time: 46.45ms (95%), errors: 2.50, reconnects:  0.00
[ 322s] threads: 10, tps: 436.64, reads: 6159.96, writes: 1753.06, response time: 31.50ms (95%), errors: 3.50, reconnects:  0.00

From error log:

2016-05-26T13:49:49.926640Z 25 [Note] InnoDB:  (new size: 107374182400 bytes)
2016-05-26T13:49:49.938676Z 0 [Note] InnoDB: Resizing buffer pool from 96636764160 to 107374182400 (unit=134217728).
2016-05-26T13:49:49.938707Z 0 [Note] InnoDB: Disabling adaptive hash index.
2016-05-26T13:50:17.953850Z 0 [Note] InnoDB: disabled adaptive hash index.
2016-05-26T13:50:17.961212Z 0 [Note] InnoDB: Withdrawing blocks to be shrunken.
2016-05-26T13:50:17.961229Z 0 [Note] InnoDB: Latching whole of buffer pool.
2016-05-26T13:50:17.963972Z 0 [Note] InnoDB: buffer pool 0 : resizing with chunks 720 to 800.
2016-05-26T13:50:18.178812Z 0 [ERROR] InnoDB: mmap(135069696 bytes) failed; errno 12
2016-05-26T13:50:18.178858Z 0 [ERROR] InnoDB: buffer pool 0 : failed to allocate new memory.
2016-05-26T13:50:18.178865Z 0 [Note] InnoDB: buffer pool 0 : 22 chunks (45056 blocks) were added.
2016-05-26T13:50:18.390841Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 27033ms. The settings might not be optimal. (flushed=200 and evicted=0, during the time.)
2016-05-26T13:50:18.476769Z 0 [Note] InnoDB: Completed to resize buffer pool from 96636764160 to 107374182400.
2016-05-26T13:50:18.476791Z 0 [Note] InnoDB: Re-enabled adaptive hash index.
2016-05-26T13:50:18.476826Z 0 [Note] InnoDB: Resizing buffer pool failed, finished resizing at 160526 17:50:18.

So based on this we can say that, this is a "online" resize impacting performance.

How to repeat:
See description

Suggested fix:
Maybe a doc note should be added based on this issue, to be clear how actual resizing impacts running statements?
[26 May 2016 19:56] MySQL Verification Team
>
> mmap(135069696 bytes) failed; errno 12
>

does it work better if you have enough free memory ;) ?
[27 May 2016 11:13] Shahriyar Rzayev
Hehe, okey, checked :)

Here is the result while trying to decrease buffer pool size. Resizing from 50G to 1G:

From error log:

2016-05-27T11:09:52.069811Z 0 [Note] InnoDB: Resizing buffer pool from 53687091200 to 1073741824 (unit=134217728).
2016-05-27T11:09:52.082130Z 0 [Note] InnoDB: Disabling adaptive hash index.
2016-05-27T11:09:52.069866Z 2 [Note] InnoDB: Resizing buffer pool from 53687091200 to 1073741824 (unit=134217728). (new size: 1073741824 bytes)
2016-05-27T11:10:05.308972Z 0 [Note] InnoDB: disabled adaptive hash index.
2016-05-27T11:10:05.334533Z 0 [Note] InnoDB: Withdrawing blocks to be shrunken.
2016-05-27T11:10:05.356329Z 0 [Note] InnoDB: buffer pool 0 : start to withdraw the last 802816 blocks.
2016-05-27T11:10:05.570576Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 8723ms. The settings might not be optimal. (flushed=139 and evicted=0, during the time.)
2016-05-27T11:10:05.635378Z 0 [Note] InnoDB: buffer pool 0 : withdrawing blocks. (802816/802816)
2016-05-27T11:10:05.635410Z 0 [Note] InnoDB: buffer pool 0 : withdrew 802816 blocks from free list. Tried to relocate 0 pages (802816/802816).
2016-05-27T11:10:05.645168Z 0 [Note] InnoDB: buffer pool 0 : withdrawn target 802816 blocks.
2016-05-27T11:10:05.645289Z 0 [Note] InnoDB: Latching whole of buffer pool.
2016-05-27T11:10:05.645312Z 0 [Note] InnoDB: buffer pool 0 : resizing with chunks 400 to 8.
2016-05-27T11:10:16.449675Z 0 [Note] InnoDB: buffer pool 0 : 392 chunks (802816 blocks) were freed.
2016-05-27T11:10:16.449745Z 0 [Note] InnoDB: Resizing hash tables.
2016-05-27T11:10:17.162075Z 0 [Note] InnoDB: buffer pool 0 : hash tables were resized.
2016-05-27T11:10:17.162272Z 0 [Note] InnoDB: Resizing also other hash tables.
2016-05-27T11:10:19.947222Z 0 [Note] InnoDB: Resized hash tables at lock_sys, adaptive hash index, dictionary.
2016-05-27T11:10:19.947607Z 0 [Note] InnoDB: Completed to resize buffer pool from 53687091200 to 1073741824.
2016-05-27T11:10:19.947624Z 0 [Note] InnoDB: Re-enabled adaptive hash index.
2016-05-27T11:10:19.947653Z 0 [Note] InnoDB: Completed resizing buffer pool at 160527 15:10:19.

From sysbench side:

[  40s] threads: 10, tps: 458.27, reads: 6426.76, writes: 1822.10, response time: 31.16ms (95%), errors: 1.50, reconnects:  0.00
[  42s] threads: 10, tps: 70.16, reads: 1005.34, writes: 276.14, response time: 61.08ms (95%), errors: 1.00, reconnects:  0.00
[  44s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[  46s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[  48s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[  50s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[  52s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[  54s] threads: 10, tps: 0.00, reads: 0.98, writes: 0.98, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[  56s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[  58s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[  60s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[  62s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[  64s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[  66s] threads: 10, tps: 0.00, reads: 0.00, writes: 1.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[  68s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[  70s] threads: 10, tps: 398.66, reads: 5644.14, writes: 1599.14, response time: 34.44ms (95%), errors: 5.99, reconnects:  0.00
[  72s] threads: 10, tps: 453.08, reads: 6427.13, writes: 1829.32, response time: 28.92ms (95%), errors: 4.50, reconnects:  0.00
[30 May 2016 11:16] zhai weixiang
According the description, i think it's possible that the hang may happen if there are too many blocks that need to be deleted from buffer pool. Because before deleting blocks, it will acquire all buf_pool_mutex/hash_lock

Then it will free block->mutex and block->lock for all the blocks that need to be deleted, which may be a slow progress. And during this progress, the user clients will get blocked if it wants to read any blocks from buffer pool.
[30 May 2016 11:22] Shahriyar Rzayev
Thanks Zhai for comment :)
Pasting here link for pt-pmp output:

https://gist.github.com/ShahriyarR/c3452f5b25b969daa1779f4d70445871
[7 Jul 2016 22:10] Roel Van de Paar
Please verify/update?
[8 Jul 2016 8:09] MySQL Verification Team
Thank you for the report.

Thanks,
Umesh
[21 Jul 2016 3:15] Roel Van de Paar
Also, http://bugs.mysql.com/bug.php?id=82301
[19 May 2019 9:36] WANG GUANGYOU
Simple and direct

From the doc: 
Active transactions and operations performed through InnoDB APIs should be completed before resizing the buffer pool. When initiating a resizing operation, the operation does not start until all active transactions are completed. Once the resizing operation is in progress, new transactions and operations that require access to the buffer pool must wait until the resizing operation finishes. The exception to the rule is that concurrent access to the buffer pool is permitted while the buffer pool is defragmented and pages are withdrawn when buffer pool size is decreased. A drawback of allowing concurrent access is that it could result in a temporary shortage of available pages while pages are being withdrawn.

From the code:
/* Acquire all buffer pool mutexes and hash table locks */
  /* TODO: while we certainly lock a lot here, it does not necessarily
  buy us enough correctness, see a comment at buf_block_align. */
  for (ulint i = 0; i < srv_buf_pool_instances; ++i)
    mutex_enter(&(buf_pool_from_array(i)->LRU_list_mutex));
  for (ulint i = 0; i < srv_buf_pool_instances; ++i)
    hash_lock_x_all(buf_pool_from_array(i)->page_hash);
  for (ulint i = 0; i < srv_buf_pool_instances; ++i)
    mutex_enter(&(buf_pool_from_array(i)->zip_free_mutex));
  for (ulint i = 0; i < srv_buf_pool_instances; ++i)
    mutex_enter(&(buf_pool_from_array(i)->free_list_mutex));
  for (ulint i = 0; i < srv_buf_pool_instances; ++i)
    mutex_enter(&(buf_pool_from_array(i)->zip_hash_mutex));
  for (ulint i = 0; i < srv_buf_pool_instances; ++i)
    mutex_enter(&(buf_pool_from_array(i)->flush_state_mutex));