Bug #107594 crash recovery process stuck with "Out of space in the redo log"
Submitted: 19 Jun 2022 21:34  Modified: 27 Jun 2022 18:14
Reporter: Raigedas Radisauskas
Status: Can't repeat  Impact on me: None
Category: MySQL Server: InnoDB storage engine  Severity: S3 (Non-critical)
Version: 8.0.28  OS: Debian (docker)
Assigned to: MySQL Verification Team  CPU Architecture: Any
Tags: redo

[19 Jun 2022 21:34] Raigedas Radisauskas
Description:
Context:
We had quite a big database; at some point disk usage was ~1TB. Most of the data was in two related tables (let's call them PARENTS and CHILDS) and a third unrelated to the previous two (let's call it EFFICIENT). Several weeks ago we redesigned our app to a new, more efficient data model which uses ~10x less space. We had started doing the data migration in steps; we were just doing the last of those steps, and I was about to remove almost the last of the data from the old tables.

Steps:
* There was an app running which was inserting rows into the PARENTS/CHILDS tables at a rate of several hundred to several thousand records per second.
* During that time I deleted ~40,000,000 records (older ones, not the newest ones being inserted by the app) from the PARENTS table (which must also have deleted several times more records from the CHILDS table). No error.
* Immediately afterwards I executed 'SELECT COUNT(1) FROM PARENTS WHERE <condition for old rows>' to make sure that the rows had actually been deleted. There the database's behavior started to diverge from the norm: the SELECT kept running and running instead of returning a result within 1-5 seconds as usual. I waited for an hour but the SELECT was still running. Then I checked processes with a select from `information_schema`.`PROCESSLIST` (first sketch after this list) and saw two long-running processes: the COUNT select and another from the app with an INSERT (or maybe UPDATE) on the same PARENTS table. I tried to kill those two processes but they did not want to end; they kept sitting in the "Kill" state.
* Next morning I found those two processes still hanging in that state.
* I restarted the MySQL server. It did not start; I found this in the log:
--------
2022-06-19T08:07:51.445764Z 0 [ERROR] [MY-013772] [InnoDB] [FATAL] There is not enough free space in the redo log during recovery to perform pending ibuf merges. Please retry starting MySQL with --innodb-force-recovery=4.
2022-06-19T08:07:51.445797Z 0 [ERROR] [MY-013183] [InnoDB] Assertion failure: log0write.cc:1967:ib::fatal triggered thread 139743082768128
<...>
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
--------
* I increased the log level, restarted again, and saw that MySQL was stuck in an (infinite?) loop with "Out of space in the redo log" messages, stating that it would wait for space to free up in the redo log (I have lost the original messages).
* I tried increasing the redo log file size and count in the config (and restarted MySQL), but that did not help (second sketch after this list).
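
For reference, a minimal sketch of how such stuck sessions can be inspected and killed (the 60-second threshold and the session ID are placeholders):
--------
-- list long-running sessions (TIME is in seconds)
SELECT ID, USER, COMMAND, TIME, STATE, INFO
  FROM information_schema.PROCESSLIST
 WHERE TIME > 60
 ORDER BY TIME DESC;

-- ask the server to terminate a session by its ID; the thread only
-- dies once it checks the kill flag, so a session rolling back a
-- huge DELETE can sit in the "Kill"/"Killed" state for a long time
KILL 12345;
--------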
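
And a sketch of the config change from the last step, for a pre-8.0.30 server such as this one (the sizes are only an illustration; a restart is required for these settings to take effect):
--------
# my.cnf
[mysqld]
innodb_log_file_size      = 1G   # size of each redo log file
innodb_log_files_in_group = 2    # number of redo log files
--------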

How to repeat:
I cannot tell exactly how to reproduce it, but I have explained above what led to the issue.

Suggested fix:
As a workaround I have (see the command sketch after this list):
* (made a backup)
* removed the redo files
* set 'force recovery' to 5
* restarted MySQL 
* dumped data from readonly running database
* reinstalled a fresh MySQL instance and imported the data dump.
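
Roughly, assuming a stock Debian layout (paths, file names, and commands are illustrative only):
--------
# back up the data directory, then remove the redo log files
cp -a /var/lib/mysql /var/lib/mysql.bak
rm /var/lib/mysql/ib_logfile*     # redo file names before 8.0.30

# in my.cnf:  [mysqld]  innodb_force_recovery = 5
# (at level 4 and above InnoDB is effectively read-only)
systemctl restart mysql

# dump everything from the read-only instance
mysqldump -u root -p --all-databases > dump.sql
--------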

Ideally, the redo files should never run out of space.

If the redo files do run out of space, then the recovery process should do something about it instead of just waiting forever.
[20 Jun 2022 14:16] MySQL Verification Team
Hi,

I will try to reproduce this, but it looks like a "disk full" scenario. Are you sure you checked your system log for quota warnings, disk space on your system, process limits...?

kind regards
[22 Jun 2022 20:23] Raigedas Radisauskas
Regarding the "disk full" scenario: I doubt it very much, but unfortunately I cannot disprove it. However:
1. If I remember correctly, soon after receiving those errors and before applying the workaround, I even started (but cancelled midway) making a backup of the entire /var/lib/mysql volume, and did not receive any "disk full" error from cp/mc.
2. My knowledge here is very limited, but it seems to me that the redo log is a kind of circular buffer, which means a situation of actually running out of disk space should never arise; to me this 'out of space' error means something more like 'all entries in the buffer are used up', doesn't it? Or maybe the DML queries were coming in at too fast a rate for the configured redo log?
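
For what it's worth, the distance between the head of the log and the last checkpoint (i.e. how much of that circular buffer is currently used up) can be watched on a running server:
--------
-- the LOG section reports "Log sequence number" and
-- "Last checkpoint at"; their difference is the redo that is
-- occupied and cannot be reclaimed until the checkpoint advances
SHOW ENGINE INNODB STATUS\G
--------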
[27 Jun 2022 18:14] MySQL Verification Team
Hi,

The only way I can see this happening is some kind of "disk full" scenario. It does not have to be the disk literally full; a disk quota, a process quota, or something like that can trigger it too. I cannot see another path to "out of space", and since I cannot reproduce it (I tried a 1TB database filled with dummy data and could not reproduce it, and I inspected the source code leading to the error message; I cannot see anything wrong), there is very little we can do. Our InnoDB team will also review the source code leading to this error.
[8 Oct 2024 13:10] David Flammer
This has been happening to us about once a week on version 8.0.22.

Our database size is very large (~11TB)

It looks like some work was done in 8.0.30 to revamp how redo log sizing is done.
For our version, sizing is explained here:
https://dev.mysql.com/doc/refman/8.0/en/innodb-redo-log.html#innodb-redo-log-file-reconfig...

Our settings are:
innodb_log_file_size      | 50331648 
innodb_log_files_in_group | 2  
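
Those appear to be the 8.0 defaults (48MiB per file, two files, so 96MiB of redo in total). For illustration only, not a recommendation, raising the capacity looks different depending on version:
--------
# before 8.0.30 (e.g. our 8.0.22): my.cnf, restart required
[mysqld]
innodb_log_file_size      = 4G
innodb_log_files_in_group = 2

# from 8.0.30 on: one dynamic variable
# SET GLOBAL innodb_redo_log_capacity = 8589934592;  -- 8GiB
--------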

I don't think this is due to the disk being out of space. The disk monitors say there is room on all drives, but the only way to fix this is a hard kill of the mysql process.

Do you have recommendations on what we should set those values to given our conditions? Any other information I can provide?