Bug #94448 Rewrite LOG_BLOCK_FIRST_REC_GROUP during recovery may be dangerous.
Submitted: 22 Feb 15:37 Modified: 23 Mar 13:28
Reporter: Kang Wang Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S3 (Non-critical)
Version:8.0.13 OS:Any
Assigned to: Pawel Olchawa CPU Architecture:Any

[22 Feb 15:37] Kang Wang
Description:
In 8.0, What log_start function does makes me confused, it rewrite LOG_BLOCK_FIRST_REC_GROUP of the block header with the start_lsn, as below:

  log_block_set_hdr_no(block, log_block_convert_lsn_to_no(block_lsn));

  log_block_set_flush_bit(block, true);

  log_block_set_data_len(block, start_lsn - block_lsn);

  log_block_set_first_rec_group(block, start_lsn % OS_FILE_LOG_BLOCK_SIZE);

  /* Do not reorder writes above, below this line. For x86 this
  protects only from unlikely compile-time reordering. */
  std::atomic_thread_fence(std::memory_order_release);
}

In my opinion, It is extremely dangerous to change the  LOG_BLOCK_FIRST_REC_GROUP if it is not zero before.

Since this will change the semantic of LOG_BLOCK_FIRST_REC_GROUP and makes it confusion.

What's more, I doubt the invoking below may encounter correctness problem during recursive crash recovery.

dberr_t recv_recovery_from_checkpoint_start(log_t &log, lsn_t flush_lsn) {
  ...
  
  recv_recovery_begin(log, &contiguous_lsn);
  lsn_t recovered_lsn;
  recovered_lsn = recv_sys->recovered_lsn;
  
  ...
  
  log_start(log, checkpoint_no + 1, checkpoint_lsn, recovered_lsn);
  
  ...
}

How to repeat:
 I add logs in log_start() function and find out it is so common during crash recovery to rewrite LOG_BLOCK_FIRST_REC_GROUP to a larger value.

Suggested fix:
Maybe we could cancel the rewrite of block LOG_BLOCK_FIRST_REC_GROUP in log_start function, at least when it is not zero before.

In my view, it is not inevitable here, right?
[27 Feb 1:18] Pawel Olchawa
Posted by developer:
 
This bug has been present since the first version of the new redo log, so all versions since 8.0.5 are affected.

However it's very hard to be reproduced. The possible data loss is limited to the last block of the redo log (512 bytes).

The first_rec_group field is re-initialized only for the last block of the redo log and it is used only when we start a recovery to locate the beginning of the recovery.

The scenario leading to the data loss is only possible, when:
- redo log is logically non-empty and consists of a single block only (starts and ends within the same block),
- during recovery ibuf merge happens and generates new redo records,
- we crashed after we generated the new redo records but before we flushed the page.

I injected extra debug code, which generates new dummy redo records during recovery, to reproduce the issue.

In theory such records could possibly be generated by the ibuf merges (during recovery).

The fix is not to initialize first_rec_group for the first block in the log_start() unless start_lsn points to the beginning of the block, or it is zero.
[14 Mar 13:51] Pawel Olchawa
Posted by developer:
 
Special thanks for reporting the bug.

The fix we decided for, is not to re-initialize the first_rec_group unless it seems to be corrupted after the recovery. In order to determine that, we track expected values of first_rec_group during the recovery (whenever we progress with recovered_lsn).

Note that in all versions of MySQL earlier than 8.0.5, the bug could manifest in a different way. We were never re-initializing the first_rec_group after recovery ended with success. If the last block had a corrupted first_rec_group on disk, recovery could still end with success if it was started in earlier block. After recovery ended, we would proceed with normal workload, inserting new redo records to the corrupted block. If we crashed, data loss was possible then. All because of the corrupted redo log, which was corrupted before we started the MySQL. That could be misleading because the recovery stated that all was fine within the redo files. This would be unlikely (but still possible to happen) if checksums for redo were enabled. However user could decide to disable those checksums in which case the risk could be higher. The MySQL should not produce such corrupted files itself, so it is not that important to fix this bug for earlier versions of MySQL.
[23 Mar 13:28] Daniel Price
Posted by developer:
 
Fixed as of the upcoming 8.0.17 release, and here's the changelog entry:

There was potential for data loss to occur if the redo log was not
logically empty and comprised a single block, and the server exited during
recovery after an insert buffer merge generated new redo records but
before the new records could be flushed to disk.