MySQL Bugs: #114660: miss notification when write larger than 1MB size data

Bug #114660	miss notification when write larger than 1MB size data
Submitted:	16 Apr 2024 21:37	Modified:	20 May 2024 22:07
Reporter:	Zongzhi Chen (OCA)	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S3 (Non-critical)
Version:	8.3	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	innodb, REDO Log

Description:
in function notify_about_advanced_write_lsn()

```
static inline void notify_about_advanced_write_lsn(log_t &log,
                                                   lsn_t old_write_lsn,
                                                   lsn_t new_write_lsn) {
    ...
    const auto first_slot =
        log_compute_write_event_slot(log, old_write_lsn + 1);

    const auto last_slot = log_compute_write_event_slot(log, new_write_lsn);

    if (first_slot == last_slot) {
      log_sync_point("log_write_before_users_notify");
      os_event_set(log.write_events[first_slot]);
    } else {
      log_sync_point("log_write_before_notifier_notify");
      os_event_set(log.write_notifier_event);
    }
    ...
  }

}
```

if the new_write_lsn - old_write_lsn is >= 1MB and < 1MB + 512, then in function  log_compute_write_event_slot() the first_slot will equal to last_slot

funtion log_compute_write_event_slot(const log_t &log, lsn_t lsn) can be simplify like this:

((lsn - 1) / OS_FILE_LOG_BLOCK_SIZE) & (log.write_events_size - 1);

then the log_writer will only wake up one write_events, some other wait event will not be wake up.

the call path looks like this:

```
(gdb) bt
#0  notify_about_advanced_write_lsn () at /home/zongzhi.czz/git/mysql-8.0/storage/innobase/log/log0write.cc:1591
#1  log_write_buffer () at /home/zongzhi.czz/git/mysql-8.0/storage/innobase/log/log0write.cc:1803
#2  log_writer_write_buffer () at /home/zongzhi.czz/git/mysql-8.0/storage/innobase/log/log0write.cc:2195
#3  0x000000000217a608 in log_writer () at /home/zongzhi.czz/git/mysql-8.0/storage/innobase/log/log0write.cc:2292
```

How to repeat:
just read the code

Suggested fix:
change this condition from 

```
    if (first_slot == last_slot) {
      log_sync_point("log_write_before_users_notify");
      os_event_set(log.write_events[first_slot]);

```

to

```
    if (first_slot == last_slot && (new_write_lsn - old_write_lsn) < OS_FILE_LOG_BLOCK_SIZE) {
      log_sync_point("log_write_before_users_notify");
      os_event_set(log.write_events[first_slot]);

```

First of all thanks for reporting this!
Indeed this is a conceptual bug and we should eventually fix it.
But the urgency of it depends on how much it affects you.
Did it cause any problem to you in real deployment, though?
Have you observed this happening ever?

> just read the code

OK, let's do this :)
Let me read aloud the call path, adding my observations as comments.
We start with this call:
```
log_write_buffer(log, 
  /* this is within the log.buf, and aligned down to block boundary */
  buffer = buf_begin,  
  /* this is also within buffer, no cyclic-wrap, no alignment (this is end of "real" data) */
  buffer_size = buf_end - buf_begin, 
  /* this is the lsn corresponding to buf_begin (thus perhaps already written some time ago) */
  start_lsn = align_down(last_write_lsn, OS_FILE_LOG_BLOCK_SIZE) 
) 

/* log.buf has log.buf_size which is configurable, and also can grow on its own if 
there's a big mtr to process. Thus buffer_size argument can be indeed larger than 1MB.*/

static dberr_t log_write_buffer(log_t &log, byte *buffer, size_t buffer_size,
                                lsn_t start_lsn) {
...
  auto write_size = compute_how_much_to_write(log, 
    /* log.m_current_file.offset(start_lsn), just the 
    offset of last_write_lsn within current file */
    real_offset,
    /* passed by caller (can be as much as log.buf_size) */
    buffer_size,
    /* output: */
    write_from_log_buffer);
...
}    
/* We call compute_how_much_to_write to get first guess on how much to write.
It will be later refined by other means. If we strip all comments and asserts,
this is the merit of the function (all comments are mine): */
static inline size_t compute_how_much_to_write(const log_t &log,
                                               os_offset_t real_offset,
                                               size_t buffer_size,
                                               bool &write_from_log_buffer) {
...
  if (!current_file_has_space(log, real_offset, buffer_size)) {
    if (!current_file_has_space(log, real_offset, 1)) {
      write_from_log_buffer = false;
      return 0;
    } else {
      write_size = (log.m_current_file.m_size_in_bytes - real_offset);
    }
  } else {
    write_size = buffer_size;
  } 
  /* at this point write_size is not larger than log.buf_size, which may be huge */
  
  write_from_log_buffer = write_size >= OS_FILE_LOG_BLOCK_SIZE;
  if (!current_write_ahead_enough(log, real_offset, write_size)) {
    /* The write-ahead mechanism conceptually divides the lsn axis into "wa
    regions" of srv_log_write_ahead_size bytes, which is never larger than 16kb.
    Suppose that the buffer_size and then write_size was indeed on the order of
    1MB. Then current_write_ahead_enough(log, real_offset, write_size) will
    return false, so we are here.
    Now we check if at least one byte fits in the current "wa region" - in other
    words, have we finished the "wa region" and should use the next one? */
    if (!current_write_ahead_enough(log, real_offset, 1)) {
      /* The current "wa region" has finished, so lets look where the next one
      finishes and see if the write would fit into it entirely */
      const auto next_wa = compute_next_write_ahead_end(real_offset);
      if (!write_ahead_enough(next_wa, real_offset, write_size)) {
        /* The write size was larger than "wa region", we clamp it */
        write_size = next_wa - real_offset;
        /* The write_size <= 16kb. */
      } else {
        /* It fits, so write_size <= 16kb. */
        write_from_log_buffer = false;
      }
    } else {
      /* There's still something to write in current "wa region", so clamp 
      the write_size to the end of current "wa region": */
      write_size = (log.write_ahead_end_offset - real_offset);
      /* Again, write_size <= 16kb. */      
    }
  } else {
    /* If we are here it means the write_size has fit into current "wa region",
    so must already be <= 16kb */
    if (write_from_log_buffer) {
      write_size = ut_uint64_align_down(write_size, OS_FILE_LOG_BLOCK_SIZE);
    }
  }
  return write_size;
}

// After we exit from compute_how_much_to_write, we do:

static dberr_t log_write_buffer(log_t &log, byte *buffer, size_t buffer_size,
                                lsn_t start_lsn) {
...
  auto write_size = compute_how_much_to_write(log, real_offset, buffer_size,
              write_from_log_buffer);
...
  lsn_t lsn_advance = write_size;
...
  const lsn_t old_write_lsn = log.write_lsn.load();
  const lsn_t new_write_lsn = start_lsn + lsn_advance;
...
  /* At this point it holds that start_lsn <= old_write_lsn and 
  lsn_advance <= 16kb, so new_write_lsn <= start_lsn+16kb <= old_write_lsn + 16kb
  notify_about_advanced_write_lsn(log, old_write_lsn, new_write_lsn);
```       
So, it looks to me that it is impossible to ever have to notify about an advancement larger than 16kb, and you are right that it would take 1MB to wrap around.

Now, the puzzling part to me is this case:
```
      if (!write_ahead_enough(next_wa, real_offset, write_size)) {
        /* ... and also the next write-ahead is too small.
        Therefore we have more data to write than size of
        the write-ahead. We write from the log buffer,
        skipping last fragment for which the write ahead
        is required. */

        ut_a(write_from_log_buffer);

        write_size = next_wa - real_offset;

        ut_a((real_offset + write_size) % srv_log_write_ahead_size == 0);

        ut_a(write_size % OS_FILE_LOG_BLOCK_SIZE == 0);

      } else {
```
the code comment doesn't really match the behaviour here, and the behaviour doesn't make much sense to me.
The way I understand the code comment, the intention here was to see how many full "wa regions" are covered by the write_size, and then full "wa regions" from log.buf to disc,
ignoring the leftover part from the last unfinished "wa region".
That is the comment sounds like it permits/understands the possibility that the number of "wa regions" involved can be larger than two.
But then, what the code actually does is that it is setting the write_size to just one "wa region".
Why not more?
Is there any conceptual problem with writing more than one "wa region" at once?
One possible issue is that the interface only has a method for `compute_next_write_ahead_end` but not for "align down to 'wa region' boundary".
But, I know the author was not shy of adding more functions if needed, so this isn't it.
Honestly, I see no good reason why we clamp it to just one "wa region" here, so it is conceivable that in future we will change this, and then the bug you've reported will definitely manifest.

Hi Mr. Chen,

Thank you for your bug report.

In the future, if the bug manifests itself, please create a new bug report.

Thank you very much for the analysis.

Hi,

This is now a verified bug.

Posted by developer:
 
Fixed as of the upcoming MySQL Server 9.0.0 release, and here's the proposed changelog entry from the documentation team:

Improved the notify_about_advanced_write_lsn() logic to prevent potential log notification delays.

This fix is based on a patch from Zongzhi Chen with Alibaba, thank you for the contribution.

Thank you for the bug report.

Thank you, Philip.