Description:
This year we have observed a slave mysqld instance crash 4 times due to a long semaphore wait, even though the master/slave pair runs under a low workload; the master node was healthy when the slave crashed after waiting 600 seconds.
The last semaphore information is posted in the attached file.
From it we can easily see a semaphore deadlock between thread 140244477683456 and thread 140298507421440:
--Thread 140298507421440 has waited at fsp0fsp.cc line 3661 for 241.00 seconds the semaphore:
SX-lock on RW-latch at 0x7f946dd31ad8 created in file buf0buf.cc line 1460
a writer (thread id 140244477683456) has reserved it in mode SX
number of readers 0, waiters flag 1, lock_word: 10000000
Last time read locked in file btr0pcur.cc line 448
Last time write locked in file /home/upsql/chsun/build/upsql2/upsql-2/storage/innobase/include/fut0fut.ic line 57
2020-01-20T02:06:45.251192+08:00 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140244477683456 has waited at fsp0fsp.cc line 656 for 241.00 seconds the semaphore:
SX-lock on RW-latch at 0x7f9835db4938 created in file buf0buf.cc line 1460
a writer (thread id 140298507421440) has reserved it in mode SX
number of readers 0, waiters flag 1, lock_word: 10000000
Last time read locked in file not yet reserved line 0
Last time write locked in file /home/upsql/chsun/build/upsql2/upsql-2/storage/innobase/fsp/fsp0fsp.cc line 167
InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
InnoDB: Pending preads 0, pwrites 0
BTW:
We forked MySQL's repo as upsql, so some code blocks may not match v5.7.23 exactly
(not a big deal, because we did not change InnoDB's B+ tree or transaction system).
Here are our code blocks around the semaphores:
fsp0fsp.cc line 3661:
bool
fsp_reserve_free_extents(
ulint* n_reserved,
ulint space_id,
ulint n_ext,
fsp_reserve_t alloc_type,
mtr_t* mtr,
ulint n_pages)
|- fil_space_t* space = mtr_x_lock_space(space_id, mtr);
/////////////////////////////////////////////
buf0buf.cc line 1460:
static
void
buf_block_init(
/*===========*/
buf_pool_t* buf_pool, /*!< in: buffer pool instance */
buf_block_t* block, /*!< in: pointer to control block */
byte* frame) /*!< in: pointer to buffer frame */
|- rw_lock_create(PFS_NOT_INSTRUMENTED, &block->lock, SYNC_LEVEL_VARYING);
////////////////////////////////////////////
btr0pcur.cc line 448:
void
btr_pcur_move_to_next_page(
/*=======================*/
btr_pcur_t* cursor, /*!< in: persistent cursor; must be on the
last record of the current page */
mtr_t* mtr) /*!< in: mtr */
|- next_block = btr_block_get(
page_id_t(block->page.id.space(), next_page_no),
block->page.size, mode,
btr_pcur_get_btr_cur(cursor)->index, mtr);
//////////////////////////////////////////////
fsp0fsp.cc line 656:
xdes_t*
xdes_get_descriptor(
ulint space_id,
ulint offset,
const page_size_t& page_size,
mtr_t* mtr)
|- block = buf_page_get(page_id_t(space_id, 0), page_size,
RW_SX_LATCH, mtr);
//////////////////////////////////////////////
fsp0fsp.cc line 167:
UNIV_INLINE
fsp_header_t*
fsp_get_space_header(
ulint id,
const page_size_t& page_size,
mtr_t* mtr)
|- block = buf_page_get(page_id_t(id, 0), page_size, RW_SX_LATCH, mtr);
How to repeat:
I don't know how to repeat it, but we have already observed 4 crashes.
I am still reading the source code to work out what happened on the slave node and how to reproduce the issue.
Any comments would be appreciated.