MySQL Bugs: #116209: Crash recovery is slow because of repeated scans of double-wrtie

Bug #116209	Crash recovery is slow because of repeated scans of double-wrtie
Submitted:	24 Sep 2024 7:44	Modified:	24 Sep 2024 11:37
Reporter:	Ke Yu (OCA)	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S5 (Performance)
Version:	8.0	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
My mysql instance is crashed, recovery took 2 hours. It is doing repeated double-write recovery during crash recovery.

When redo of type MLOG_FILE_EXTEND is applied during crash recovery, double-write of this space is restored.

If a large number of redo files of type MLOG_FILE_EXTEND need to be applied to a space, and there are a 
large number of pages corresponding to the space in double-write, crash recovery will be slow even if there is no half-write problem.

How to repeat:
It is difficult to reproduce the problem.

We can descript the problem from the code, every time you do a redo application of MLOG_FILE_EXTEND, you will do a double-write recovery 
of the space, even if there is no half-write problem.

For example, if 1000 redo files of MLOG_FILE_EXTEND of the same space need to be applied, it
will call dblwr_recover_page() 1000 times. Every time dblwr_recover_page() is called, a double-write recovery of the space is done, 
even if no half-write problem has occurred. This causes a lot of disk IO and increases the instance recovery time. Only the first 
time it is necessary to do double-write recovery, the rest is not necessary.

The code path:
fil_tablespace_redo_extend
|-> fil_tablespace_open_for_recovery
|    |-> Fil_system::open_for_recovery
|    |    |->  recover
|    |    |    |->  dblwr::recv::Pages::recover
|    |    |    |    |->  dblwr_recover_page

Suggested fix:
The main reason is that the dblwr_recover_page() function returns false when no page corruption occurs, resulting in repeated
dblwr_recover_page() function calls.

dblwr_recover_page() should return true when no page corruption occurs to avoid repeated dblwr_recover_page() function calls.

Hi Mr. Yu,

Thank you very much for your bug report.

However, let us inform you that this is a forum for the reports with fully repeatable test cases. Each of the test cases should consist of a set of SQL statements that always lead to the problem that is reported.

We also accept bugs based on the code analysis. However, that code analysis should cover all aspects of the problem and should pinpoint all problematic code in the high details.

We have not received nothing of the sort from you.

However, we think that you are capable of doing it and hence, we are waiting on your feedback.

Thanks in advance.

This is my code to reproduce this problem, based on 8.0.38

Attachment: crash_recovery_is_slow.diff (application/octet-stream, text), 4.13 KiB.

I built a test case named crash_recovery_is_slow.diff based on 8.0.38 to illustrate the problem.

The test case requires some code changes to quickly reproduce this problem.
You can compile my code and run the test case in debug mode:
./mtr crash_recovery_is_slow

Then you can check the error log. I printed some information in the error log.
It likes:
...
2024-09-24T11:03:42.668508Z 1 [Note] [MY-011825] [InnoDB] MLOG_FILE_EXTEND: space 6 page 0
2024-09-24T11:03:42.668556Z 1 [Note] [MY-013553] [InnoDB] do io for dblwr page 6 page 1
2024-09-24T11:03:42.668674Z 1 [Note] [MY-013553] [InnoDB] do io for dblwr page 6 page 0
2024-09-24T11:03:42.668789Z 1 [Note] [MY-013553] [InnoDB] do io for dblwr page 6 page 2
2024-09-24T11:03:42.668909Z 1 [Note] [MY-013553] [InnoDB] do io for dblwr page 6 page 3
2024-09-24T11:03:42.669026Z 1 [Note] [MY-013553] [InnoDB] do io for dblwr page 6 page 4
2024-09-24T11:03:42.669804Z 1 [Note] [MY-011825] [InnoDB] MLOG_FILE_EXTEND: space 6 page 0
2024-09-24T11:03:42.669873Z 1 [Note] [MY-013553] [InnoDB] do io for dblwr page 6 page 1
2024-09-24T11:03:42.669996Z 1 [Note] [MY-013553] [InnoDB] do io for dblwr page 6 page 0
2024-09-24T11:03:42.670111Z 1 [Note] [MY-013553] [InnoDB] do io for dblwr page 6 page 2
2024-09-24T11:03:42.670300Z 1 [Note] [MY-013553] [InnoDB] do io for dblwr page 6 page 3
2024-09-24T11:03:42.670415Z 1 [Note] [MY-013553] [InnoDB] do io for dblwr page 6 page 4
2024-09-24T11:03:42.670728Z 1 [Note] [MY-011825] [InnoDB] MLOG_FILE_EXTEND: space 6 page 0
2024-09-24T11:03:42.670777Z 1 [Note] [MY-013553] [InnoDB] do io for dblwr page 6 page 1
2024-09-24T11:03:42.670902Z 1 [Note] [MY-013553] [InnoDB] do io for dblwr page 6 page 0
2024-09-24T11:03:42.671018Z 1 [Note] [MY-013553] [InnoDB] do io for dblwr page 6 page 2
2024-09-24T11:03:42.671133Z 1 [Note] [MY-013553] [InnoDB] do io for dblwr page 6 page 3
2024-09-24T11:03:42.671247Z 1 [Note] [MY-013553] [InnoDB] do io for dblwr page 6 page 4
2024-09-24T11:03:42.671543Z 1 [Note] [MY-011825] [InnoDB] MLOG_FILE_EXTEND: space 6 page 0
2024-09-24T11:03:42.671590Z 1 [Note] [MY-013553] [InnoDB] do io for dblwr page 6 page 1
2024-09-24T11:03:42.671708Z 1 [Note] [MY-013553] [InnoDB] do io for dblwr page 6 page 0
2024-09-24T11:03:42.671881Z 1 [Note] [MY-013553] [InnoDB] do io for dblwr page 6 page 2
2024-09-24T11:03:42.671998Z 1 [Note] [MY-013553] [InnoDB] do io for dblwr page 6 page 3
2024-09-24T11:03:42.672112Z 1 [Note] [MY-013553] [InnoDB] do io for dblwr page 6 page 4
2024-09-24T11:03:42.672415Z 1 [Note] [MY-011825] [InnoDB] MLOG_FILE_EXTEND: space 6 page 0
2024-09-24T11:03:42.672462Z 1 [Note] [MY-013553] [InnoDB] do io for dblwr page 6 page 1
2024-09-24T11:03:42.672604Z 1 [Note] [MY-013553] [InnoDB] do io for dblwr page 6 page 0
2024-09-24T11:03:42.672720Z 1 [Note] [MY-013553] [InnoDB] do io for dblwr page 6 page 2
2024-09-24T11:03:42.672840Z 1 [Note] [MY-013553] [InnoDB] do io for dblwr page 6 page 3
2024-09-24T11:03:42.672958Z 1 [Note] [MY-013553] [InnoDB] do io for dblwr page 6 page 4
...

We can see that every time MLOG_FILE_EXTEND redo is applied, a double-write check will be done, which is not necessary.
When there are too many redo logs to apply, the time of crash recovery is increased.

Hi Mr. Yu,

You have sent us a diff for that crash recovery test.

We do not have such a test case in our test suite.

Hence, we do not have a test to which we could apply that patch.

You can send us entire test and result files.

However, with those numbers of tables, long crash recovery is expected behaviour. Also, it can be tuned, at the cost of ACID security.

Hence, what you are reporting is expected behaviour.

Not a bug.

The test case is named mysql-test/t/crash_recovery_is_slow.test, which is included in the diff I uploaded.
You can compile the code with my diff file, then run the test case and see the results in the error log.

Hi,

Thank you Mr. Yu.

This is now a verified bug report.

It impacts version 8.0 and higher.

Hi,
   Thank you for compiling my code and reproducing the problem.
I will try to fix the problem and submit my fix code.

This is my fix code, based on 8.0.39. (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: bug116239.diff (application/octet-stream, text), 9.89 KiB.

Thank you, Mr. Yu.