Bug #59727 LRU search can be inefficient with compression
Submitted: 25 Jan 2011 21:36
Reporter: Inaam Rana Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S5 (Performance)
Version:5.1 plugin, 5.5, 5.6 OS:Any
Assigned to: Assigned Account CPU Architecture:Any
Tags: compression, LRU, performance
Triage: Triaged: D3 (Medium)

[25 Jan 2011 21:36] Inaam Rana
Description:
Assume we are using compression on an IO bound workload. This means that our unzip_LRU will be roughly 10% of the common_LRU which implies that when we have to choose a victim for replacement we'll are very likely to find a compressed only page at the tail of LRU.

1) We need to read in a compressed page (say all compressed pages are 8K)
2) Seeing there are no available 8K pages buf_buddy_alloc() calls buf_LRU_get_free_block()
3) Seeing no free pages in the free_list buf_LRU_get_free_block() triggers a scan of common LRU list
4) Finding a clean page in the tail of LRU it calls buf_LRU_free_block()
5) This in turn calls buf_LRU_block_remove_hashed_page()
6) As we remove the page from page_hash and LRU and seeing the page as BUF_BLOCK_ZIP_PAGE we call buf_buddy_free() on it
7) I presume buf_buddy_free() will put the page somewhere on a list of free zip pages
8) But now as we return to buf_LRU_get_free_block() we retry to find block in the buf_pool::free_list.
9) However, the block we have just freed was a zipped block and is not on buf_pool::free_list
10) We keep on doing this until we find a block that has an uncompressed frame with it or buf_buddy_free() is able to put a block back on the free list.

How to repeat:
see above.

Suggested fix:
From Marko:

This sounds like a plausible scenario. When buf_buddy_alloc_zip() fails, buf_buddy_alloc_low() is falling through to the LRU list scan, and will not stop until a complete 16k block has been freed. (As far as I understand, it could be freed as a result of buddy merging, e.g., freeing two 8k blocks, so it should not be as bad as your point 10 says.)

The fix would be to add another step to buf_buddy_alloc_low() after the failing buf_buddy_alloc_zip() call. That step should scan the LRU list for replaceable blocks of the requested or larger size.
[4 Jun 2011 0:13] Mark Callaghan
I think I get this problem while dropping tables. All tables are compressed and the buffer pool is ~60G.  Table use key_block_size=8. There are 6 tables to drop, I used "drop database" and this has taken 30+ minutes. One server crashed from a long lock wait, so I think this is more serious than S5 (performance).

The call stack from MySQL 5.1.52 + Facebook patch is:
buf_buddy_free_low,buf_LRU_block_remove_hashed_page,buf_LRU_invalidate_tablespace,fil_delete_tablespace,row_drop_table_for_mysql,row_drop_tables_for_mysql_in_background,srv_master_thread,start_thread,clone

The first long lock wait message is:

InnoDB: Warning: a long semaphore wait:
--Thread 1201695040 has waited at buf/buf0buf.c line 1671 for 247.00 seconds the semaphore:
Mutex at 0x37fcc00 created file buf/buf0buf.c line 1050, lock var 0
waiters flag 0
wait has ended

Thread 13 (Thread 1246906688 (LWP 31166)):
#0  buf_buddy_free_low (buf=0x2ab7c4382600, i=2) at buf/buf0buddy.c:575
#1  0x0000000000832594 in buf_LRU_block_remove_hashed_page (bpage=0x2ab7c4382700, zip=1) at ./include/buf0buddy.ic:121
#2  0x000000000083467c in buf_LRU_invalidate_tablespace (id=96) at buf/buf0lru.c:469
#3  0x000000000085352f in fil_delete_tablespace (id=96) at fil/fil0fil.c:2401
#4  0x00000000007d300d in row_drop_table_for_mysql (name=0x2ab93809d1d8 "test/purchases_index6", trx=0x2ab97af3daf0, drop_db=<value optimized out>, delayed_drop=0) at row/row0mysql.c:3409
#5  0x00000000007d3f24 in row_drop_tables_for_mysql_in_background (wait=0) at row/row0mysql.c:2153
#6  0x00000000007e9a57 in srv_master_thread (arg=<value optimized out>) at srv/srv0srv.c:3080
#7  0x00000031c6c062f7 in start_thread () from /lib64/libpthread.so.0
#8  0x00000031c64d1e3d in clone () from /lib64/libc.so.6
[4 Jun 2011 0:43] Mark Callaghan
Maybe this isn't the bug, but I noticed this because of buf_LRU_block_remove_hashed_page. I will file a new bug.
[6 Jun 2011 19:22] Inaam Rana
Mark,

As you mentioned the drop table might be a different bug from this one. This concerns mostly with how we get a free page from LRU whereas the drop issue is more about how we free buddies when a compressed table is dropped. Both issues are related to same area but IMO they are distinct issues.
[4 Jan 2013 9:25] zhai weixiang
I had got a similar stall of 'DROP TABLE'  and it lead to crash because of long time semaphore waiting.

backtrace:
#0 0x00000000008a8c2d in buf_buddy_free_low ()
#1 0x0000000000815234 in buf_LRU_block_remove_hashed_page ()
#2 0x0000000000815fb3 in buf_LRU_free_block ()
#3 0x0000000000817143 in buf_LRU_search_and_free_block ()
#4 0x00000000008187bc in buf_LRU_get_free_block ()
#5 0x000000000080bf0e in buf_page_get_gen ()
#6 0x00000000007fdc58 in btr_search_drop_page_hash_when_freed ()
#7 0x000000000084c489 in fseg_free_extent ()
#8 0x000000000084d284 in fseg_free_step ()
#9 0x00000000007ec4a9 in btr_free_but_not_root ()
#10 0x000000000081d50e in dict_drop_index_tree ()
#11 0x00000000007b6035 in row_upd_clust_step ()
#12 0x00000000007b69de in row_upd_step ()
#13 0x0000000000899b88 in que_run_threads ()
#14 0x000000000089a286 in que_eval_sql ()
#15 0x00000000007a5ce9 in row_drop_table_for_mysql ()
#16 0x00000000007918ae in ha_innobase::delete_table(char const*) ()

buf_LRU_get_free_block was always  called in buf_page_get_gen because we are using compressed table.
 
I think we can 'reserve' some blocks (can only be accessed and reused by the owner )for such operation  to avoid frequent call of buf_LRU_get_free_block.