MySQL Bugs: #68077: Mixing 4K/16K blocks increases 8K free blocks and holds buf pool mutex longer

Bug #68077	Mixing 4K/16K blocks increases 8K free blocks and holds buf pool mutex longer
Submitted:	14 Jan 2013 3:37	Modified:	3 Apr 2013 11:33
Reporter:	Yoshinori Matsunobu (OCA)
Status:	Closed
Category:	MySQL Server: InnoDB storage engine	Severity:	S5 (Performance)
Version:	5.6	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
1. InnoDB buffer pool is almost filled with 4KB compressed pages
2. Inserting into 16KB (uncompressed/compact) tables

If both of the above conditions are met, pages_free for the 8KB page size increases a lot.

mysql> select * from innodb_cmpmem;
+-----------+----------------------+------------+------------+----------------+-----------------+
| page_size | buffer_pool_instance | pages_used | pages_free | relocation_ops | relocation_time |
+-----------+----------------------+------------+------------+----------------+-----------------+
|      1024 |                    0 |          0 |          0 |              0 |               0 |
|      2048 |                    0 |          0 |          0 |              0 |               0 |
|      4096 |                    0 |     453427 |          1 |         606502 |               0 |
|      8192 |                    0 |          0 |      62170 |              0 |               0 |
|     16384 |                    0 |          0 |          0 |              0 |               0 |
....

This is actually a big problem. A very large zip_free list causes both stalls and insert slowdowns. When this problem happened, one user thread spent quite a long time in buf_buddy_free_low(), and buffer_pool_mutex is held during the whole process. This blocks almost all operations for a long time, including transaction commit (which locks log_sys and buffer_pool_mutex). CPU utilization also dropped to ~5%.

Here is an example stack trace.

buf_buddy_free_low,buf_buddy_free,buf_LRU_block_remove_hashed_page,buf_LRU_free_block,buf_flush_LRU_list_batch,buf_do_LRU_batch,buf_flush_batch,page_cleaner_flush_LRU_tail,buf_flush_page_cleaner_thread,start_thread,clone

#0  buf_buddy_free_low (buf_pool=0x1404530, buf=0x7f13ea86a000, i=3) at /export/home/pb2/build/sb_0-7655600-1353595193.21/mysql-5.6.9-rc/storage/innobase/buf/buf0buddy.cc:482
#1  0x0000000000a6d773 in buf_buddy_free (size=<optimized out>, buf=0xffffff00, buf_pool=<optimized out>) at /export/home/pb2/build/sb_0-7655600-1353595193.21/mysql-5.6.9-rc/storage/innobase/include/buf0buddy.ic:137
#2  buf_LRU_block_remove_hashed_page (bpage=0x7f076e07aa50, zip=1) at /export/home/pb2/build/sb_0-7655600-1353595193.21/mysql-5.6.9-rc/storage/innobase/buf/buf0lru.cc:2283
#3  0x0000000000a6ed0b in buf_LRU_free_block (bpage=0x7f076e07aa50, zip=1) at /export/home/pb2/build/sb_0-7655600-1353595193.21/mysql-5.6.9-rc/storage/innobase/buf/buf0lru.cc:1855
#4  0x0000000000a6a78d in buf_flush_LRU_list_batch (max=100, buf_pool=<optimized out>) at /export/home/pb2/build/sb_0-7655600-1353595193.21/mysql-5.6.9-rc/storage/innobase/buf/buf0flu.cc:1453
#5  buf_do_LRU_batch (max=<optimized out>, buf_pool=<optimized out>) at /export/home/pb2/build/sb_0-7655600-1353595193.21/mysql-5.6.9-rc/storage/innobase/buf/buf0flu.cc:1514
#6  buf_flush_batch (buf_pool=0x1404530, flush_type=<optimized out>, min_n=100, lsn_limit=0) at /export/home/pb2/build/sb_0-7655600-1353595193.21/mysql-5.6.9-rc/storage/innobase/buf/buf0flu.cc:1667
#7  0x0000000000a6bd07 in page_cleaner_flush_LRU_tail () at /export/home/pb2/build/sb_0-7655600-1353595193.21/mysql-5.6.9-rc/storage/innobase/buf/buf0flu.cc:1830
#8  buf_flush_page_cleaner_thread (arg=<optimized out>) at /export/home/pb2/build/sb_0-7655600-1353595193.21/mysql-5.6.9-rc/storage/innobase/buf/buf0flu.cc:2372
#9  0x0000003eef0062f7 in start_thread () from /lib64/libpthread.so.0
#10 0x0000003eee4d1e3d in clone () from /lib64/libc.so.6

How to repeat:
innodb options:
innodb_file_format=Barracuda
innodb_file_per_table=1
innodb_log_compressed_pages=0
innodb_flush_neighbors=0
innodb_buffer_pool_size=54G
innodb_log_file_size=2000M
innodb_flush_method=O_DIRECT
innodb_thread_concurrency=256
thread_cache_size=2000
innodb_flush_log_at_trx_commit=0

Fill the buffer pool with 4KB compressed pages:
- Create databases db1 .. db50.
- For each database, create a 4KB compressed table and insert many rows
 (until LRU len : unzip_LRU len == 10:1).
- Stop inserting into the 4KB compressed tables, then do the same thing for 16KB compact tables.
- Run select * from information_schema.innodb_cmpmem and watch pages_free increase. Also check that innodb_rows_inserted drops.

Suggested fix:
a) I do not understand why 8KB pages_free increased even though I didn't use 8KB pages at all. If the pages_free length is small enough, this problem won't happen.

b) Around buf0buddy.cc:482:
---
for (bpage = UT_LIST_GET_FIRST(buf_pool->zip_free[i]); bpage; ) {
  ...
  bpage = UT_LIST_GET_NEXT(list, bpage);
}
---
This is O(N). Using a tree (O(log N)) instead of a list would mitigate the problem.
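
To illustrate suggestion (b), here is a minimal sketch that assumes the elided loop body is checking whether one particular block (the buddy of the block being freed) is on the free list. The names Addr, buddy_of, buddy_is_free_list and buddy_is_free_tree are hypothetical and are not InnoDB identifiers; std::set stands in for any balanced tree keyed by block address.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <list>
#include <set>

// Hypothetical stand-in for a block address; offsets from base 0 in this sketch.
using Addr = std::uintptr_t;

// Buddy of a block of the given power-of-two size: flip the size bit.
// Assumes the base address is aligned to 2 * size.
inline Addr buddy_of(Addr block, std::size_t size) {
    return block ^ static_cast<Addr>(size);
}

// O(N): walk the whole free list until the buddy is found (or not).
inline bool buddy_is_free_list(const std::list<Addr>& free_list,
                               Addr block, std::size_t size) {
    return std::find(free_list.begin(), free_list.end(),
                     buddy_of(block, size)) != free_list.end();
}

// O(log N): the same question asked of an address-ordered balanced tree.
inline bool buddy_is_free_tree(const std::set<Addr>& free_tree,
                               Addr block, std::size_t size) {
    return free_tree.count(buddy_of(block, size)) != 0;
}
```

With tens of thousands of entries on one free list, as in the innodb_cmpmem output above, the difference between the two lookups is the time spent holding the buffer pool mutex on every free.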

The allocation of InnoDB compressed pages is based on a binary buddy system, implemented in buf0buddy.c. When the first 4k compressed page is allocated, buf0buddy.c will request a 16k page frame from the main buffer pool. This will then be split into 8k pages and further into 4k pages, to fulfill the request.
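
The splitting described above can be sketched with a toy model that only counts free blocks per size. ToyBuddy and its members are hypothetical names for illustration and do not mirror InnoDB's data structures.

```cpp
#include <cstddef>
#include <map>

// Toy model of a binary buddy allocator; tracks counts only, not addresses.
struct ToyBuddy {
    static constexpr std::size_t kFrame = 16384;    // one buffer pool page frame
    std::map<std::size_t, std::size_t> free_count;  // block size -> free blocks
    std::size_t frames_taken = 0;                   // 16k frames requested so far

    // Allocate one block of `size` bytes (power of two, 1024 <= size <= kFrame).
    void alloc(std::size_t size) {
        std::size_t s = size;
        // Find the smallest free block that is large enough.
        while (s < kFrame && free_count[s] == 0) {
            s *= 2;
        }
        if (s == kFrame) {
            ++frames_taken;   // nothing suitable is free: take a fresh 16k frame
        } else {
            --free_count[s];  // reuse an existing free block
        }
        // Split down to the requested size; each split leaves one free half.
        while (s > size) {
            s /= 2;
            ++free_count[s];
        }
    }
};
```

In this model the first alloc(4096) takes one 16k frame and leaves one free 8k block and one free 4k block behind; the third alloc(4096) consumes the free 8k block by splitting it.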

When a page is freed, the buddy system can try to relocate blocks in order to create bigger free blocks. If you are only using 4k compressed pages, it probably does not make much sense to relocate blocks just to get a free 8k block when releasing a 4k block.

However, this relocation could still make some sense. If you release 4*n blocks of 4k each, the current system could release n full 16k pages to the buffer pool. With relocation disabled, in the worst case you would have one 4k block allocated in each 16k page, and the buffer pool would be underutilized. It is somewhat tricky to test and tweak this, because pages are not freed directly, but through the buffer pool LRU and unzip_LRU mechanism.
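
The worst case mentioned above can be put in numbers with a small sketch. The function names and the idea of describing live blocks by byte offset are assumptions made for illustration; the second function is an idealized lower bound, not what the buddy system is guaranteed to achieve.

```cpp
#include <cstddef>
#include <set>
#include <vector>

constexpr std::size_t kFrame = 16384;  // buffer pool page frame size
constexpr std::size_t kBlock = 4096;   // compressed page size

// Frames that stay pinned if live 4k blocks are never moved:
// every frame that still contains at least one live block.
std::size_t frames_pinned_without_relocation(
        const std::vector<std::size_t>& live_offsets) {
    std::set<std::size_t> frames;
    for (std::size_t off : live_offsets) {
        frames.insert(off / kFrame);
    }
    return frames.size();
}

// Idealized lower bound if live blocks could be packed perfectly.
std::size_t frames_needed_with_ideal_relocation(std::size_t live_blocks) {
    const std::size_t per_frame = kFrame / kBlock;  // 4 blocks per frame
    return (live_blocks + per_frame - 1) / per_frame;
}
```

For four live blocks at offsets 0, 16384, 32768 and 49152, the first function reports four pinned frames while the second reports one, which is the 25% utilization case described in the comment.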

Thank you for the great bug report, Yoshinori! 

I have marked this as verified and moved it into the internal bugs DB for the InnoDB team to examine further.

Note: I have fixed this issue in our branch by changing zip_free from a list to a red-black tree.

Added changelog entry to 5.6.11, 5.7.1:

"When the InnoDB buffer pool is almost filled with 4KB compressed pages, inserting into 16KB compact tables would cause 8KB pages_free to increase, which could potentially slow or stall inserts."