Bug #99243 Don't need to get LRU_list_mutex in buf_page_io_complete if io_type is read
Submitted: 13 Apr 2020 21:53  Modified: 16 Apr 2020 13:15
Reporter: Zongzhi Chen (OCA)  Status: Closed
Category: MySQL Server: InnoDB storage engine  Severity: S3 (Non-critical)
Version: 8.0.*  OS: Any
Assigned to:  CPU Architecture: Any
Tags: buffer pool

[13 Apr 2020 21:53] Zongzhi Chen
Description:
In the buf_page_io_complete() function, if the completing io_type is BUF_IO_READ, the code acquires the LRU_list_mutex only to release it again almost immediately.

I understand that if the io_type is BUF_IO_WRITE and the flush type is BUF_FLUSH_LRU or BUF_FLUSH_SINGLE_PAGE, InnoDB may modify the LRU list, so it needs to hold the LRU_list_mutex.

```
  mutex_enter(&buf_pool->LRU_list_mutex);
  // I think it is not necessary to hold the LRU_list_mutex to protect the BPageMutex;
  // the page_mutex can be taken directly: while the page is io-fixed, the
  // buf_block_t cannot be replaced or removed.
  BPageMutex *page_mutex = buf_page_get_mutex(bpage);
  mutex_enter(page_mutex);

  if (io_type == BUF_IO_WRITE &&
      (
#if defined UNIV_DEBUG || defined UNIV_BUF_DEBUG
          /* to keep consistency at buf_LRU_insert_zip_clean() */
          buf_page_get_state(bpage) == BUF_BLOCK_ZIP_DIRTY ||
#endif /* UNIV_DEBUG || UNIV_BUF_DEBUG */
          buf_page_get_flush_type(bpage) == BUF_FLUSH_LRU ||
          buf_page_get_flush_type(bpage) == BUF_FLUSH_SINGLE_PAGE)) {

    have_LRU_mutex = true; /* optimistic */
  } else {
    mutex_exit(&buf_pool->LRU_list_mutex);
  }
```

As explained above, I think the code can be changed as follows, to avoid holding the LRU_list_mutex when the io_type is BUF_IO_READ in buf_page_io_complete(). Holding the LRU_list_mutex is fairly costly when the server is under heavy load.

```
  BPageMutex *page_mutex = buf_page_get_mutex(bpage);
  mutex_enter(page_mutex);

  if (io_type == BUF_IO_WRITE &&
      (
#if defined UNIV_DEBUG || defined UNIV_BUF_DEBUG
          /* to keep consistency at buf_LRU_insert_zip_clean() */
          buf_page_get_state(bpage) == BUF_BLOCK_ZIP_DIRTY ||
#endif /* UNIV_DEBUG || UNIV_BUF_DEBUG */
          buf_page_get_flush_type(bpage) == BUF_FLUSH_LRU ||
          buf_page_get_flush_type(bpage) == BUF_FLUSH_SINGLE_PAGE)) {
    mutex_enter(&buf_pool->LRU_list_mutex);
    have_LRU_mutex = true; /* optimistic */
  }
  // no else branch needed: on the read path the LRU_list_mutex was never taken
```

How to repeat:
read the code
[14 Apr 2020 4:35] Laurynas Biveinis
If I still remember the code correctly, taking an individual buffer page mutex before the LRU list mutex would be a lock-order violation.
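
For illustration: if one thread takes the two mutexes in the documented order while another takes them in the reverse order, each can end up holding one mutex while waiting for the other. A toy sketch of that cycle with plain std::mutex (not InnoDB code; the names are stand-ins):

```
#include <mutex>
#include <thread>

std::mutex list_mutex;  // stand-in for buf_pool->LRU_list_mutex
std::mutex page_mutex;  // stand-in for an individual buffer page mutex

void ordered_path() {
  // Documented order: list mutex first, then page mutex.
  std::lock_guard<std::mutex> list(list_mutex);
  std::lock_guard<std::mutex> page(page_mutex);
}

void reversed_path() {
  // Reversed order: if this interleaves with ordered_path() so that each
  // thread holds one mutex and waits for the other, both block forever.
  std::lock_guard<std::mutex> page(page_mutex);
  std::lock_guard<std::mutex> list(list_mutex);
}

int main() {
  std::thread t1(ordered_path), t2(reversed_path);
  t1.join();
  t2.join();
}
```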
[14 Apr 2020 11:49] Zongzhi Chen
No, I don't think there is a lock order between the LRU_list_mutex and the buffer page block mutex.
There are many code paths that do not hold the LRU_list_mutex when they acquire the buffer page block mutex.

For example:
buf_flush_page_and_try_neighbors() releases the LRU_list_mutex/flush_list_mutex before calling buf_flush_page(), and buf_flush_page() then acquires the buffer page block mutex.

In fact, in most of the code, InnoDB does not take the LRU_list_mutex or the flush_list_mutex unless it moves a page within a list.
I know we need a lock order between the buffer page block mutex and the buffer page frame rw_lock, but I don't think we need a lock order between the LRU_list_mutex and the buffer page mutex.
[14 Apr 2020 13:32] MySQL Verification Team
Hello Mr. Zongzhi,

Thank you for your bug report.

I think that I agree with you, but I do not think that this is a bug.

I think that this would make a good feature request or performance enhancement.

Hence, I would like to have you choose one of these two.

Thanks in advance.
[15 Apr 2020 7:47] Jakub Lopuszanski
I believe Laurynas Biveinis is right.

As of today, the latch_level_t enum defines the sync level of the LRU_list_mutex (whose latch id LATCH_ID_BUF_POOL_LRU_LIST is mapped to SYNC_BUF_LRU_LIST) as higher than the SYNC_BUF_BLOCK level used for LATCH_ID_BUF_BLOCK_MUTEX:

  SYNC_BUF_BLOCK,
  SYNC_BUF_PAGE_HASH,
  SYNC_BUF_LRU_LIST,

This means that unless we have some clever hacks around latching-order validation specific to these two kinds of latches, we risk creating a deadlock cycle.
That's theory.
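
As an aside, here is a minimal sketch of what such latching-order validation amounts to. The function names and the thread-local level stack are illustrative only, not the actual LatchDebug implementation; the level values are copied from the log output below:

```
#include <cassert>
#include <vector>

// Levels matching the ordering discussed above; the real values live in
// the latch_level_t enum.
constexpr int LVL_SYNC_BUF_BLOCK = 18;
constexpr int LVL_SYNC_BUF_PAGE_HASH = 19;
constexpr int LVL_SYNC_BUF_LRU_LIST = 20;

// Per-thread stack of the sync levels this thread currently holds.
thread_local std::vector<int> held_levels;

void latch_enter(int level) {
  // Latches must be taken in descending level order: a newly requested
  // latch must sit strictly below everything already held, otherwise two
  // threads could interleave their acquisitions into a deadlock cycle.
  for (int held : held_levels) {
    assert(level < held && "latch order violation");
  }
  held_levels.push_back(level);
}

void latch_exit() { held_levels.pop_back(); }

void demo() {
  latch_enter(LVL_SYNC_BUF_LRU_LIST);  // level 20: fine, nothing held yet
  latch_enter(LVL_SYNC_BUF_BLOCK);     // level 18 < 20: allowed
  latch_exit();
  latch_exit();
  latch_enter(LVL_SYNC_BUF_BLOCK);     // level 18: fine, nothing held yet
  latch_enter(LVL_SYNC_BUF_LRU_LIST);  // 20 >= 18: asserts, like the log below
}
```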
In practice, if I simply swap the order of these two lines in buf_page_io_complete():

5268:    mutex_enter(&buf_pool->LRU_list_mutex);
5269:
5270:    mutex_enter(page_mutex);

so that the page_mutex is taken before buf_pool->LRU_list_mutex, and then run:
cd ..;make -j30 && cd - && ./mtr --mem --parallel=20 --retry=0 --force --max-test-fail=0 --testcase-timeout=10 --mysqld=--innodb-sync-debug=1 --suite=innodb | tee mtr.log

Then I soon get an assertion failure originating from buf_page_io_complete():
2020-04-15T07:31:49.621569Z 0 [ERROR] [MY-012988] [InnoDB] Thread 139625888737024 already owns a latch BUF_BLOCK_MUTEX at level 18 (SYNC_BUF_BLOCK ), which is at a lower/same level than the requested latch: 20 (SYNC_BUF_LRU_LIST). Mutex BUF_BLOCK_MUTEX created buf0buf.cc:0 addr: 0x7efd42745488 acquired: buf0buf.cc:5268
2020-04-15T07:31:49.622006Z 0 [ERROR] [MY-012986] [InnoDB] Latches already owned by this thread:
2020-04-15T07:31:49.622442Z 0 [ERROR] [MY-012987] [InnoDB] BUF_BLOCK_MUTEX -> 18 (SYNC_BUF_BLOCK)

One could also be tempted to swap the order of the latches in the latch_level_t enum.
We have to somehow reorder:
  SYNC_BUF_BLOCK,
  SYNC_BUF_PAGE_HASH,
  SYNC_BUF_LRU_LIST,
with the constraint that we want SYNC_BUF_LRU_LIST to have a lower value than SYNC_BUF_BLOCK, which gives us three possibilities to experiment with:
A)
  SYNC_BUF_LRU_LIST,
  SYNC_BUF_BLOCK,
  SYNC_BUF_PAGE_HASH,

B)
  SYNC_BUF_LRU_LIST,
  SYNC_BUF_PAGE_HASH,
  SYNC_BUF_BLOCK,

C)
  SYNC_BUF_PAGE_HASH,
  SYNC_BUF_LRU_LIST,
  SYNC_BUF_BLOCK,

A) fails with an assertion failure in buf_page_init_for_read():
2020-04-15T07:37:35.650285Z 1 [ERROR] [MY-011825] [InnoDB] LatchDebug::relock() latch order violation. level=20, back_latch_level=18, back_level=18.

B) fails with an assertion failure in buf_page_init_for_read():
2020-04-15T07:40:55.647675Z 1 [ERROR] [MY-011825] [InnoDB] LatchDebug::relock() latch order violation. level=19, back_latch_level=18, back_level=18.

C) fails with an assertion failure in buf_page_init_for_read():
2020-04-15T07:43:33.460903Z 1 [ERROR] [MY-012988] [InnoDB] Thread 140389591049984 already owns a latch BUF_POOL_LRU_LIST at level 19 (SYNC_BUF_LRU_LIST ), which is at a lower/same level than the requested latch: 20 (SYNC_BUF_BLOCK). Mutex BUF_POOL_LRU_LIST created buf0buf.cc:1237 addr: 0x7faef83317b0 acquired: buf0buf.cc:4661
2020-04-15T07:43:33.461220Z 1 [ERROR] [MY-012986] [InnoDB] Latches already owned by this thread:
2020-04-15T07:43:33.461520Z 1 [ERROR] [MY-012987] [InnoDB] BUF_POOL_LRU_LIST -> 19 (SYNC_BUF_LRU_LIST)
2020-04-15T07:43:33.461827Z 1 [ERROR] [MY-012987] [InnoDB] HASH_TABLE_MUTEX -> 18 (SYNC_BUF_PAGE_HASH)

Indeed buf_page_init_for_read() contains a subsequence of operations:
4661:   mutex_enter(&buf_pool->LRU_list_mutex);
...
4703:    buf_page_mutex_enter(block);

So, at this point, to argue that we can avoid a deadlock cycle anyway, one has to use some less trivial arguments.
And I am not even sure if we want to have any non-trivial behaviour in this area.
[15 Apr 2020 13:07] MySQL Verification Team
Thank you very much, Jakub.
[16 Apr 2020 3:19] Fungo Wang
Hi,

I guess the reporter's original idea is right: there is no need to take the heavy `buf_pool->LRU_list_mutex` if the io_type is read, but the proposed fix violates the latch order (which we can fix).

If you look at this snippet inside buf_page_io_complete() (on the 8.0.19 tag),

```
5254  mutex_enter(&buf_pool->LRU_list_mutex);
5255
5256  BPageMutex *page_mutex = buf_page_get_mutex(bpage);
5257  mutex_enter(page_mutex);
5258
5259  if (io_type == BUF_IO_WRITE &&
5260      (
5261#if defined UNIV_DEBUG || defined UNIV_BUF_DEBUG
5262          /* to keep consistency at buf_LRU_insert_zip_clean() */
5263          buf_page_get_state(bpage) == BUF_BLOCK_ZIP_DIRTY ||
5264#endif /* UNIV_DEBUG || UNIV_BUF_DEBUG */
5265          buf_page_get_flush_type(bpage) == BUF_FLUSH_LRU ||
5266          buf_page_get_flush_type(bpage) == BUF_FLUSH_SINGLE_PAGE)) {
5267
5268    have_LRU_mutex = true; /* optimistic */
5269  } else {
5270    mutex_exit(&buf_pool->LRU_list_mutex);
5271  }
```

If the io_type is BUF_IO_READ, the LRU_list_mutex is acquired and then released soon after (it is not required by the later logic). So for the BUF_IO_READ type, acquiring the page_mutex is enough, and there is no order violation.

But I guess we can improve it further, because it is not necessary to take the block mutex this early either; we could acquire it later than here:
```
5257  mutex_enter(page_mutex);
```

The reasons we don't need the page_mutex here:

1. io_type is a local variable, so there is no need to protect it.

2. Per the comment about io_fix at the beginning of buf_page_io_complete(), no other thread can change the io_fix of the current block, because the in-progress io_complete is being performed by the current thread.

3. For the same reason as in 2, the flush_type cannot be changed by another thread.

4. What about buf_page_get_state(bpage) == BUF_BLOCK_ZIP_DIRTY? Can anyone else change bpage->state to or from BUF_BLOCK_ZIP_DIRTY while the current thread is doing buf_page_io_complete() on it, without the block mutex held? I believe not (a toy sketch of this io-fix pin follows after the list).

   4.1) From another state to BUF_BLOCK_ZIP_DIRTY:

   Only buf_LRU_free_page() can do this, but buf_LRU_free_page() bails out early if the block cannot be relocated, i.e. if it is buf-fixed or io-fixed. So this cannot happen.

   4.2) From BUF_BLOCK_ZIP_DIRTY to another state:
   a) Changed to BUF_BLOCK_FILE_PAGE (a decompressed block frame) on read, i.e. by buf_page_get_gen(); this cannot happen while the current bpage is io-fixed, see Buf_fetch<T>::zip_page_handler().
   b) Changed to BUF_BLOCK_ZIP_PAGE (a clean compressed block frame); this can only be caused by a page flush, i.e. by the current thread, after flushing the dirty compressed page to storage.
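
In other words, the io-fix acts as a pin on the page for the duration of the I/O. A toy sketch of that protocol (all names here are hypothetical stand-ins, not the actual buf0buf.cc definitions):

```
#include <atomic>

enum toy_io_fix { TOY_IO_NONE, TOY_IO_READ, TOY_IO_WRITE };

struct toy_page {
  std::atomic<toy_io_fix> io_fix{TOY_IO_NONE};
  int state{0};  // stand-in for the BUF_BLOCK_* state
};

// A would-be relocator (something like buf_LRU_free_page()) must give up
// while the page is io-fixed, so the state cannot change under the thread
// that owns the in-flight I/O.
bool toy_try_free_page(toy_page &page) {
  if (page.io_fix.load() != TOY_IO_NONE) {
    return false;  // pinned by an in-progress I/O: bail out, as in 4.1
  }
  page.state = 1;  // relocation / state change would happen here
  return true;
}
```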

So it is not necessary to take the page_mutex at that point, and we could change the code as follows:

```
diff --git a/storage/innobase/buf/buf0buf.cc b/storage/innobase/buf/buf0buf.cc
index 983603e..362f6f5 100644
--- a/storage/innobase/buf/buf0buf.cc
+++ b/storage/innobase/buf/buf0buf.cc
@@ -5251,11 +5251,6 @@ bool buf_page_io_complete(buf_page_t *bpage, bool evict) {
     }
   }

-  mutex_enter(&buf_pool->LRU_list_mutex);
-
-  BPageMutex *page_mutex = buf_page_get_mutex(bpage);
-  mutex_enter(page_mutex);
-
   if (io_type == BUF_IO_WRITE &&
       (
 #if defined UNIV_DEBUG || defined UNIV_BUF_DEBUG
@@ -5265,11 +5260,13 @@ bool buf_page_io_complete(buf_page_t *bpage, bool evict) {
           buf_page_get_flush_type(bpage) == BUF_FLUSH_LRU ||
           buf_page_get_flush_type(bpage) == BUF_FLUSH_SINGLE_PAGE)) {

+    mutex_enter(&buf_pool->LRU_list_mutex);
     have_LRU_mutex = true; /* optimistic */
-  } else {
-    mutex_exit(&buf_pool->LRU_list_mutex);
   }

+  BPageMutex *page_mutex = buf_page_get_mutex(bpage);
+  mutex_enter(page_mutex);
+
 #ifdef UNIV_IBUF_COUNT_DEBUG
   if (io_type == BUF_IO_WRITE || uncompressed) {
     /* For BUF_IO_READ of compressed-only blocks, the

```
[16 Apr 2020 12:14] MySQL Verification Team
Thank you, Mr. Wang,

I have to side with Laurynas and our Jakub.

Also, there are some plans for improving the current code, but they do not go in the direction that you are describing.
[16 Apr 2020 12:28] Jakub Lopuszanski
Hi,

Indeed, your proposal avoids the latching-order issues.
However, I realized that this code already looks different in mysql-trunk (compared to the version you are looking at), so the whole discussion is probably moot, as the new code already looks quite similar to what you propose.

As of today the mysql-trunk looks like this:

  auto page_mutex = buf_page_get_mutex(bpage);

  if (io_type == BUF_IO_WRITE) {
    mutex_enter(&buf_pool->LRU_list_mutex);

    mutex_enter(page_mutex);

    if (
#if defined UNIV_DEBUG || defined UNIV_BUF_DEBUG
        /* to keep consistency at buf_LRU_insert_zip_clean() */
        buf_page_get_state(bpage) == BUF_BLOCK_ZIP_DIRTY ||
#endif /* UNIV_DEBUG || UNIV_BUF_DEBUG */
        buf_page_get_flush_type(bpage) == BUF_FLUSH_LRU ||
        buf_page_get_flush_type(bpage) == BUF_FLUSH_SINGLE_PAGE) {

      have_LRU_mutex = true; /* optimistic */
    } else {
      mutex_exit(&buf_pool->LRU_list_mutex);
    }
  } else {
    mutex_enter(page_mutex);
  }
[16 Apr 2020 13:04] Fungo Wang
Hi Jakub,

Thanks for the info! I guess this new code will come with the upcoming 8.0.20?

Maybe the 

```
mutex_enter(page_mutex);
```

can be acquired later, as I suggested (which looks tidier), if my analysis is right.

I think it's okay to leave it as it is in mysql-trunk, as block->mutex is a small mutex and the change wouldn't bring much improvement :)

Thanks!
[16 Apr 2020 13:15] MySQL Verification Team
Thank you, Mr. Wang.