Bug #114660 | miss notification when write larger than 1MB size data | ||
---|---|---|---|
Submitted: | 16 Apr 21:37 | Modified: | 20 May 22:07 |
Reporter: | Zongzhi Chen (OCA) | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: InnoDB storage engine | Severity: | S3 (Non-critical) |
Version: | 8.3 | OS: | Any |
Assigned to: | CPU Architecture: | Any | |
Tags: | innodb, REDO Log |
[16 Apr 21:37]
Zongzhi Chen
[17 Apr 9:50]
Jakub Lopuszanski
First of all thanks for reporting this! Indeed this is a conceptual bug and we should eventually fix it. But the urgency of it depends on how much it affects you. Did it cause any problem to you in real deployment, though? Have you observed this happening ever? > just read the code OK, let's do this :) Let me read aloud the call path, adding my observations as comments. We start with this call: ``` log_write_buffer(log, /* this is within the log.buf, and aligned down to block boundary */ buffer = buf_begin, /* this is also within buffer, no cyclic-wrap, no alignment (this is end of "real" data) */ buffer_size = buf_end - buf_begin, /* this is the lsn corresponding to buf_begin (thus perhaps already written some time ago) */ start_lsn = align_down(last_write_lsn, OS_FILE_LOG_BLOCK_SIZE) ) /* log.buf has log.buf_size which is configurable, and also can grow on its own if there's a big mtr to process. Thus buffer_size argument can be indeed larger than 1MB.*/ static dberr_t log_write_buffer(log_t &log, byte *buffer, size_t buffer_size, lsn_t start_lsn) { ... auto write_size = compute_how_much_to_write(log, /* log.m_current_file.offset(start_lsn), just the offset of last_write_lsn within current file */ real_offset, /* passed by caller (can be as much as log.buf_size) */ buffer_size, /* output: */ write_from_log_buffer); ... } /* We call compute_how_much_to_write to get first guess on how much to write. It will be later refined by other means. If we strip all comments and asserts, this is the merit of the function (all comments are mine): */ static inline size_t compute_how_much_to_write(const log_t &log, os_offset_t real_offset, size_t buffer_size, bool &write_from_log_buffer) { ... if (!current_file_has_space(log, real_offset, buffer_size)) { if (!current_file_has_space(log, real_offset, 1)) { write_from_log_buffer = false; return 0; } else { write_size = (log.m_current_file.m_size_in_bytes - real_offset); } } else { write_size = buffer_size; } /* at this point write_size is not larger than log.buf_size, which may be huge */ write_from_log_buffer = write_size >= OS_FILE_LOG_BLOCK_SIZE; if (!current_write_ahead_enough(log, real_offset, write_size)) { /* The write-ahead mechanism conceptually divides the lsn axis into "wa regions" of srv_log_write_ahead_size bytes, which is never larger than 16kb. Suppose that the buffer_size and then write_size was indeed on the order of 1MB. Then current_write_ahead_enough(log, real_offset, write_size) will return false, so we are here. Now we check if at least one byte fits in the current "wa region" - in other words, have we finished the "wa region" and should use the next one? */ if (!current_write_ahead_enough(log, real_offset, 1)) { /* The current "wa region" has finished, so lets look where the next one finishes and see if the write would fit into it entirely */ const auto next_wa = compute_next_write_ahead_end(real_offset); if (!write_ahead_enough(next_wa, real_offset, write_size)) { /* The write size was larger than "wa region", we clamp it */ write_size = next_wa - real_offset; /* The write_size <= 16kb. */ } else { /* It fits, so write_size <= 16kb. */ write_from_log_buffer = false; } } else { /* There's still something to write in current "wa region", so clamp the write_size to the end of current "wa region": */ write_size = (log.write_ahead_end_offset - real_offset); /* Again, write_size <= 16kb. */ } } else { /* If we are here it means the write_size has fit into current "wa region", so must already be <= 16kb */ if (write_from_log_buffer) { write_size = ut_uint64_align_down(write_size, OS_FILE_LOG_BLOCK_SIZE); } } return write_size; } // After we exit from compute_how_much_to_write, we do: static dberr_t log_write_buffer(log_t &log, byte *buffer, size_t buffer_size, lsn_t start_lsn) { ... auto write_size = compute_how_much_to_write(log, real_offset, buffer_size, write_from_log_buffer); ... lsn_t lsn_advance = write_size; ... const lsn_t old_write_lsn = log.write_lsn.load(); const lsn_t new_write_lsn = start_lsn + lsn_advance; ... /* At this point it holds that start_lsn <= old_write_lsn and lsn_advance <= 16kb, so new_write_lsn <= start_lsn+16kb <= old_write_lsn + 16kb notify_about_advanced_write_lsn(log, old_write_lsn, new_write_lsn); ``` So, it looks to me that it is impossible to ever have to notify about an advancement larger than 16kb, and you are right that it would take 1MB to wrap around. Now, the puzzling part to me is this case: ``` if (!write_ahead_enough(next_wa, real_offset, write_size)) { /* ... and also the next write-ahead is too small. Therefore we have more data to write than size of the write-ahead. We write from the log buffer, skipping last fragment for which the write ahead is required. */ ut_a(write_from_log_buffer); write_size = next_wa - real_offset; ut_a((real_offset + write_size) % srv_log_write_ahead_size == 0); ut_a(write_size % OS_FILE_LOG_BLOCK_SIZE == 0); } else { ``` the code comment doesn't really match the behaviour here, and the behaviour doesn't make much sense to me. The way I understand the code comment, the intention here was to see how many full "wa regions" are covered by the write_size, and then full "wa regions" from log.buf to disc, ignoring the leftover part from the last unfinished "wa region". That is the comment sounds like it permits/understands the possibility that the number of "wa regions" involved can be larger than two. But then, what the code actually does is that it is setting the write_size to just one "wa region". Why not more? Is there any conceptual problem with writing more than one "wa region" at once? One possible issue is that the interface only has a method for `compute_next_write_ahead_end` but not for "align down to 'wa region' boundary". But, I know the author was not shy of adding more functions if needed, so this isn't it. Honestly, I see no good reason why we clamp it to just one "wa region" here, so it is conceivable that in future we will change this, and then the bug you've reported will definitely manifest.
[17 Apr 10:21]
MySQL Verification Team
Hi Mr. Chen, Thank you for your bug report. In the future, if the bug manifests itself, please create a new bug report. Thank you very much for the analysis.
[18 Apr 10:28]
MySQL Verification Team
Hi, This is now a verified bug.
[20 May 22:07]
Philip Olson
Posted by developer: Fixed as of the upcoming MySQL Server 9.0.0 release, and here's the proposed changelog entry from the documentation team: Improved the notify_about_advanced_write_lsn() logic to prevent potential log notification delays. This fix is based on a patch from Zongzhi Chen with Alibaba, thank you for the contribution. Thank you for the bug report.
[21 May 10:07]
MySQL Verification Team
Thank you, Philip.