MySQL Bugs: #99326: undo truncation might still not be crash safe

Bug #99326	undo truncation might still not be crash safe
Submitted:	23 Apr 2020 2:06	Modified:	25 May 2020 16:18
Reporter:	Zhang JiYang	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server	Severity:	S3 (Non-critical)
Version:		OS:	Any
Assigned to:	MySQL Verification Team	CPU Architecture:	Any

Description:
It's a variant of the bug https://bugs.mysql.com/bug.php?id=93170.

Now the undo space id may be reused after 512 truncation iterations. It is possible that the checkpoint is so old that the space id has since been reused by undo, and then the page id is unexpectedly used during recovery.

How to repeat:
N/A

Hi,

Thanks for the report. I understand the logic but I'm not able to make a test case to reproduce this. Lemme get back on it.

all best
Bogdan

This whole undo truncation process is done while an undo trunc log file exists in the undo directory (or datadir if innodb_undo_directory is not defined). This temporary file, named "undo%lu_trunc.log", is created at the start of the undo truncate process and is deleted at the end. This is what assures that the process is crash safe. We have test cases that introduce crashes at 9 different places along the process. The existence of that file at startup will cause its associated undo tablespace to be deleted and replaced (a full truncation) at startup.
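To illustrate the marker-file idea described above, here is a minimal sketch in Python. It is not InnoDB code: the function names, the `.ibu` data file name, and the `crash_before_cleanup` switch are all hypothetical, and only the "undo%lu_trunc.log" naming pattern comes from the comment above.

```python
import os
import tempfile


def marker_path(undo_dir, space_num):
    # Follows the "undo%lu_trunc.log" pattern mentioned above.
    return os.path.join(undo_dir, f"undo{space_num}_trunc.log")


def data_path(undo_dir, space_num):
    # Hypothetical data file name, used only for this sketch.
    return os.path.join(undo_dir, f"undo_space_{space_num}.ibu")


def truncate_undo(undo_dir, space_num, crash_before_cleanup=False):
    """Replace an undo tablespace file while a marker file exists."""
    # 1. Create the marker and make it durable before touching the tablespace.
    with open(marker_path(undo_dir, space_num), "w") as f:
        f.flush()
        os.fsync(f.fileno())
    # 2. Delete the old tablespace file.
    if os.path.exists(data_path(undo_dir, space_num)):
        os.remove(data_path(undo_dir, space_num))
    if crash_before_cleanup:
        return  # simulated crash: the marker is left behind
    # 3. Create the replacement and flush it to disk.
    with open(data_path(undo_dir, space_num), "w") as f:
        f.write("fresh undo tablespace")
        f.flush()
        os.fsync(f.fileno())
    # 4. Remove the marker only after the new file is durable.
    os.remove(marker_path(undo_dir, space_num))


def recover(undo_dir, space_num):
    """At startup, a leftover marker means the truncation is redone in full."""
    if os.path.exists(marker_path(undo_dir, space_num)):
        truncate_undo(undo_dir, space_num)
        return True
    return False
```

Calling `truncate_undo(d, 1, crash_before_cleanup=True)` on a temporary directory `d` leaves the marker in place, and a following `recover(d, 1)` redoes the truncation and removes it.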

It is not possible for undo tablespaces from a previous incarnation (one of the 512 possible space IDs assigned to an undo tablespace) to interfere with another one since the buffer pool is cleaned up of all pages from the old space_id before the tablespace with the new space ID is created during undo truncation. And the space IDs are assigned on a round-robin basis each time an undo tablespace is truncated. Undo truncation currently removes all pages from the old undo tablespace when it is deleted. Then the new tablespace is flushed to disk before it is put online.

So the undo truncation process is indeed crash safe.

After an internal discussion with Sunny Bains, I think I understand the concern better. Let's assume that a redo log is so large that it contains redo entries for all 512 Space IDs of an undo tablespace that is being truncated too often. In other words, even though each truncate removes old pages from the buffer pool and flushes newly created pages, it does not actually cause a checkpoint for each truncation like it did in 5.7. So the redo log can possibly contain records for more than 512 space IDs.
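The wrap-around concern can be sketched with a small simulation. This is only a model under stated assumptions: slots are numbered 0..511 rather than using real InnoDB space IDs, no checkpoint ever happens, and the function names are hypothetical.

```python
NUM_IDS = 512  # space IDs available to one undo tablespace, per the comment above


def next_slot(current):
    """Round-robin over slots 0..511 (slot numbers, not real space IDs)."""
    return (current + 1) % NUM_IDS


def reused_slots(truncations):
    """Slots that appear under more than one incarnation in the redo log,
    assuming no checkpoint happened during the given number of truncations."""
    current = 0
    logged = [current]  # slot of every incarnation since the last checkpoint
    for _ in range(truncations):
        current = next_slot(current)
        logged.append(current)
    seen, reused = set(), set()
    for slot in logged:
        if slot in seen:
            reused.add(slot)
        seen.add(slot)
    return reused
```

In this model, 511 truncations without a checkpoint leave every slot unique in the redo log, while the 512th wraps around to slot 0, so redo records written for the first incarnation now carry the same ID as the newest tablespace.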

There is a worklog tested and pushed to the 8.0.21 release branch that fixes this highly unlikely possibility. 

As part of WL#11819, we keep a count of the number of truncations that have happened between checkpoints. So if there are more than (512 / 8 = 64) truncations between checkpoints, then no more truncations can happen on that undo space until the next checkpoint happens.
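A rough sketch of such a counter follows. It is not the WL#11819 implementation: the class and method names are hypothetical, and the exact boundary (the 65th truncation is refused here) is an assumption based on the wording above.

```python
class TruncationLimiter:
    """Caps truncations per undo space between checkpoints (hypothetical sketch)."""

    IDS_PER_SPACE = 512
    MAX_BETWEEN_CHECKPOINTS = IDS_PER_SPACE // 8  # 64

    def __init__(self):
        # undo space number -> truncations since the last checkpoint
        self.counts = {}

    def try_truncate(self, space_num):
        """Return True and count the truncation if it is still allowed."""
        done = self.counts.get(space_num, 0)
        if done >= self.MAX_BETWEEN_CHECKPOINTS:
            return False  # must wait for the next checkpoint
        self.counts[space_num] = done + 1
        return True

    def on_checkpoint(self):
        """A checkpoint resets the counters for every undo space."""
        self.counts.clear()
```

Because the cap is far below 512, the redo log written since the last checkpoint cannot span a full cycle of space IDs for any one undo space in this model.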

Kevin, thanks for the clarification! This explains why I could not reproduce :)

thanks
Bogdan

How come a theoretically confirmed problem, fixed in some internal bug report or as part of some worklog, is not a duplicate but "Not a Bug"? It is a bug in all currently released versions of 8.0, up to 8.0.20. Please set the proper status and document this until the fix is released.

Hi Val,

Thanks, you are right.

Fixed in 8.0.21