MySQL Bugs: #59733: Possible deadlock when buffered changes are to be discarded in buf_page

Bug #59733	Possible deadlock when buffered changes are to be discarded in buf_page_create()
Submitted:	26 Jan 2011 7:32	Modified:	7 Oct 2011 20:59
Reporter:	Marko Mäkelä	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S2 (Serious)
Version:	3.23.53,4.0,4.1,5.0,5.1,5.5	OS:	Any
Assigned to:	Marko Mäkelä	CPU Architecture:	Any
Tags:	deadlock, innodb_change_buffering, UNIV_SYNC_DEBUG

Description:
This UNIV_SYNC_DEBUG assertion failure was observed during a test:

InnoDB: Failing assertion: sync_thread_levels_g(array, SYNC_TREE_NODE - 1, TRUE)

 #9  0x00000000009144c7 in rw_lock_s_lock_func (lock=0x18fdd0d8, pass=0,
     file_name=0xbc41f0 "btr/btr0cur.c", line=523)
     at include/sync0rw.ic:412
 #10 0x000000000091444d in mtr_s_lock_func (lock=0x18fdd0d8,
     file=0xbc41f0 "btr/btr0cur.c", line=523, mtr=0x49b90930)
     at include/mtr0mtr.ic:252
 #11 0x0000000000912bb1 in btr_cur_search_to_nth_level (index=0x18fdd058,
     level=0, tuple=0x2aaab41198a8, mode=2, latch_mode=2, cursor=0x49b90e00,
     has_search_latch=0,
     file=0xbddbd0 "ibuf/ibuf0ibuf.c", line=4543, mtr=0x49b90930)
     at btr/btr0cur.c:523
 #12 0x000000000092720c in btr_pcur_open_func (index=0x18fdd058,
     tuple=0x2aaab41198a8, mode=2, latch_mode=2, cursor=0x49b90e00,
     file=0xbddbd0 "ibuf/ibuf0ibuf.c", line=4543, mtr=0x49b90930)
     at include/btr0pcur.ic:498
 #13 0x00000000009270dc in btr_pcur_open_on_user_rec_func (index=0x18fdd058,
     tuple=0x2aaab41198a8, mode=2, latch_mode=2, cursor=0x49b90e00,
     file=0xbddbd0 "ibuf/ibuf0ibuf.c", line=4543, mtr=0x49b90930)
     at btr/btr0pcur.c:573
 #14 0x000000000099e078 in ibuf_merge_or_delete_for_page (block=0x0, space=0,
     page_no=2860, zip_size=0, update_ibuf_bitmap=1)
     at ibuf/ibuf0ibuf.c:4541
 #15 0x000000000093c9de in buf_page_create (space=0, offset=2860, zip_size=0,
     mtr=0x49b915c0)
     at buf/buf0buf.c:3890
 #16 0x00000000009831f8 in fsp_alloc_free_page (space=0, zip_size=0, hint=0,
     mtr=0x49b915c0)
     at fsp/fsp0fsp.c:1669
 #17 0x0000000000985b85 in fseg_alloc_free_page_low (space=0, zip_size=0,
     seg_inode=0x2aaaacc30ab2 "", hint=1557, direction=111 'o', mtr=0x49b915c0)
     at fsp/fsp0fsp.c:2708
 #18 0x000000000098604a in fseg_alloc_free_page_general (
     seg_header=0x2aaaac84804a "", hint=1557, direction=111 'o',
     has_done_reservation=1, mtr=0x49b915c0)
     at fsp/fsp0fsp.c:2868
 #19 0x00000000009058fe in btr_page_alloc (index=0x2aaab414ce68,
     hint_page_no=1557, file_direction=111 'o', level=0, mtr=0x49b915c0)
     at btr/btr0btr.c:388
 #20 0x000000000090c8a8 in btr_page_split_and_insert (cursor=0x49b91aa0,
     tuple=0x194b61c0, n_ext=0, mtr=0x49b915c0)
     at btr/btr0btr.c:1976
 #21 0x0000000000919874 in btr_cur_pessimistic_insert (flags=0,
     cursor=0x49b91aa0, entry=0x194b61c0, rec=0x49b91a98, big_rec=0x49b91a90,
     n_ext=0, thr=0x1963f828, mtr=0x49b915c0)
     at btr/btr0cur.c:1535
 #22 0x0000000000a0e12f in row_ins_index_entry_low (mode=33,
     index=0x2aaab414ce68, entry=0x194b61c0, n_ext=0, thr=0x1963f828)
     at row/row0ins.c:2102
 #23 0x0000000000a0e6fb in row_ins_index_entry (index=0x2aaab414ce68,
     entry=0x194b61c0, n_ext=0, foreign=1, thr=0x1963f828)
     at row/row0ins.c:2181
 #24 0x0000000000a0ea59 in row_ins_index_entry_step (node=0x1963f378,
     thr=0x1963f828)
     at row/row0ins.c:2257
 #25 0x0000000000a0eb75 in row_ins (node=0x1963f378, thr=0x1963f828)
     at row/row0ins.c:2389
 #26 0x0000000000a0f322 in row_ins_step (thr=0x1963f828)
     at row/row0ins.c:2499
 #27 0x000000000089e89d in row_insert_for_mysql (
     mysql_rec=0x1974c380 "\205", <incomplete sequence \366\200>,
     prebuilt=0x19292ff8)
     at row/row0mysql.c:1207

The reason is that InnoDB is trying to mtr_s_lock(&ibuf->index->lock, mtr) while already holding a file space latch on the user table (from the btr_page_alloc()) and not holding the insert buffer mutex.

During the tests, also the debug printout that I added in Bug #58212 was triggered. The printout can now be removed. I added the printout to verify that my assumptions about that (for me one-time-occurrence) bug were correct.

How to repeat:
Run suitable tests with UNIV_SYNC_DEBUG.

Suggested fix:
This particular failure seems to be caused by this call in buf_page_create():

	/* Delete possible entries for the page from the insert buffer:
	such can exist if the page belonged to an index which was dropped */

	ibuf_merge_or_delete_for_page(NULL, space, offset, zip_size, TRUE);

Investigate if we can do this earlier, either in one of the fseg_alloc* functions, or when dropping an index tree.

This call could also be removed if we switched to a new format where each index has its own change buffer.

Are there other failures than those caused by this call in buf_page_create()?

/* Delete possible entries for the page from the insert buffer:
such can exist if the page belonged to an index which was dropped */

ibuf_merge_or_delete_for_page(NULL, space, offset, zip_size, TRUE);

This comment and call have been there from the beginning (MySQL 3.23.53).

The comment looks somewhat bogus under MySQL. Until InnoDB Plugin implemented fast index creation and dropping, it was not possible to drop indexes without dropping the entire table.

I am starting to think that the statement is unnecessary. There are three cases:

(1) dropping a table with innodb_file_per_table=0
(2) dropping a table with innodb_file_per_table=1
(3) dropping an index with MySQL 5.1 InnoDB Plugin or later

Obviously, case (2) is safe. When a table is dropped and re-created with innodb_file_per_table=1, the insert buffer bitmap pages for the old space_id will be dropped, and new spaces will see an empty insert buffer bitmap.

Case (3) might be unsafe. An interrupted operation will be retried on crash recovery, but the problem appears to be that dict_drop_index_tree() does not clear the insert buffer bitmap bits or the buffered entries when freeing the pages that belonged to the index.

I did not investigate case (1). It suffices that there is one case where the insert buffer may end up containing garbage.

The correct place for this clearing seems to be in the page allocation. We will already have to latch the allocation bitmap page and the allocated page. We might as well latch the insert buffer bitmap page. If the insert buffer bitmap indicates that there may be some garbage entries in the insert buffer tree, we should delete that garbage after committing the mtr. We should do that even if the page is allocated for something else than a B-tree, such as a BLOB.

In the long term, a new storage format for the change buffer would solve this garbage collection problem. We should make it cheap to drop the change buffer that belongs to an index, e.g., by creating one insert buffer file or file segment per index that contains buffered entries.

from performance perspective, cleaning on allocation doesn't seem the right
thing to do because for the most part there will be no work yet we'll have
to go through what essentially is a DELETE FROM ibuf WHERE space = 0 and page = y,
for the worst case scenario of innodb-file-per-table=0.
The right thing to do is to delete from ibuf asynchronously somehow.
This is another thing that calls for a separate ibuf merging thread. 
It will get a request from the DROP INDEX thread and remove the entries.
This should all be transactional in such a way that the request itself be 
persistent so if we crash we still drop those pages in the background.

Michael, there already is a background merge process, but I guess that it is not aware of pages that have been freed. Because we do not zero out pages that are freed, the insert buffer merge will happily apply the buffered changes, no matter if the page has been freed or it is still in use. No harm is done, except for some CPU and I/O wasted on freed pages.

When we allocate a freed page for something else, we must delete the buffered entries if there were any. Usually, there are none, and this only costs us one additional buffer-fixing of a busy page (the insert buffer bitmap page next to the allocation bitmap page).

If our change buffer records contained the index->id and if all versions of InnoDB had initialized FIL_PAGE_TYPE in all cases, insert buffer merge could check FIL_PAGE_TYPE and PAGE_INDEX_ID and determine that the changes can be discarded after the page was allocated for something else than the old, dropped index.

Again, the cleanest fix would be to organize the change buffer so that all buffered entries for a given index can be dropped quickly. Then, we could trim the change buffer already when dropping an index.

Workaround: use innodb_file_per_table and old_alter_table=0.

Most of the time, the ibuf_merge_or_delete_for_page() call in buf_page_create() would just latch the insert buffer bitmap page and note that there are no buffered changes. The latching order violation occurs when buffered changes exist and they should be discarded.

To fix the latching order violation, we could move the insert buffer tree changes to a separate operation, after the first mtr_commit(). This might break crash recovery, however. The to-be-discarded buffered entries could survive a crash and be accidentally merged to the created page. A possible solution would be to back off and retry after acquiring ibuf_mutex or something that takes care of potential deadlocks.

if we can determine that there are no ibuf entries for a page by just one extra
operation it may be tolerable but I'm not even sure of that.
As for quickly dropping all ibuf entries for an index the best way
could be invalidating them without doing anything. In fact when we do the ibuf
merge, we must already check if the space if is valid. All we need is 
an equivalent of the space id for an index. iirc ibuf is a btree
on something that starts with space id and page number, but the space id 
should really have been index id.

see also bug#59047, an observation by Vadim.

Bug #54583 looks similar/related to me.

Some (not all) possible duplicates: Bug #29068, Bug #31511, Bug #35291, Bug #38750, Bug #47183, Bug #49780, Bug #59047.

A duplicate (bogus "fix" by suppressing the check at startup): Bug #54583.

Again, other known causes of long semaphore waits are CHECK TABLE (Bug
#50723) and (already fixed) Bug #29560 InnoDB >= 5.0.30 hangs on adaptive hash rw-lock 'waiting for an X-lock'.

Marko,

the ibuf is strictly lower than the fsp in the latching order.

00647 #define SYNC_EXTERN_STORAGE 500
00648 #define SYNC_FSP    400
00649 #define SYNC_FSP_PAGE   395
00650 /*------------------------------------- Insert buffer headers */
00651 /*------------------------------------- ibuf_mutex */
00652 /*------------------------------------- Insert buffer tree */
00653 #define SYNC_IBUF_BITMAP_MUTEX  351
00654 #define SYNC_IBUF_BITMAP  350

If we make a pessimistic insert to the ibuf tree, then we have BEFOREHAND allocated pages to the private free list of the ibuf. See the function below.

Please check that the current ibuf code obeys these rules.

http://bugs.mysql.com/bug.php?id=59733 is probably just too strict an assertion, if ibuf still follows the rules given in sync.0sync.h.

Regards,

Heikki

http://drizzle.org/doxygen/d9/d8b/btr0btr_8c_source.html

00313 /**************************************************************/
00317 static
00318 buf_block_t*
00319 btr_page_alloc_for_ibuf(

The ibuf->index->lock is at SYNC_INDEX_TREE (900). I guess that we could introduce a SYNC_IBUF_INDEX_TREE for it that would sit right above SYNC_IBUF_BITMAP_MUTEX in the latching order.

Added to changelog:

This fix corrects cases where the MySQL server could hang or abort
with a long semaphore wait message. (This is a different issue than
when these symptoms occurred during a CHECK TABLE statement.)

The changelog entry went into 5.1.60, 5.5.17, 5.6.4.