Description:
MySQL crash recovery seems to be overly verbose. Some detail is really good, to ensure the user knows what the server is doing and how things are progressing but is the current logging useful except to developers?
I wonder if we could perhaps provide a slightly more concise view of the recovery process to ensure that logging is not excessive while keeping the user aware of progress and ideally expected time to completion.
How to repeat:
I see the detection of the crash:
2021-04-05T13:34:50.065922Z 1 [Note] [MY-012560] [InnoDB] The log sequence number 66721837640536 in the system tablespace does not match the log sequence number 67437848865241 in the ib_logfiles!
2021-04-05T13:34:50.066054Z 1 [Note] [MY-012551] [InnoDB] Database was not shutdown normally!
2021-04-05T13:34:50.066161Z 1 [Note] [MY-012552] [InnoDB] Starting crash recovery.
2021-04-05T13:34:50.120944Z 1 [Note] [MY-013086] [InnoDB] Starting to parse redo log at lsn = 67437848864827, whereas checkpoint_lsn = 67437848865241
followed by 103 lines like the one below between 34:50 and 35:09:
2021-04-05T13:34:50.244421Z 1 [Note] [MY-012550] [InnoDB] Doing recovery: scanned up to log sequence number 67437854107648
then 4947 lines like the one below between 35:30 and 35:51:
2021-04-05T13:35:20.464986Z 1 [Note] [MY-013029] [InnoDB] Transaction 82672092687 was in the XA prepared state.
Then 4948 groups like the ones below between 36:08 and 40:25:
2021-04-05T13:36:08.185917Z 0 [Note] [MY-013034] [InnoDB] Transaction contains changes to 1 rows
2021-04-05T13:36:08.198243Z 0 [Note] [MY-013033] [InnoDB] Transaction 82672096307 in prepared state after recovery
Suggested fix:
Consider limiting
Doing recovery: scanned up to log sequence number XXXXX
Transaction XXXXX was in the XA prepared state.
Transaction XXXXX in prepared state after recovery
message group(s) to appear at most secondly. 5- or 10-secondly (configurable?) rates might also be reasonable.
After the first scanning you know how many transactions need processing, so adding a percentage indicator indicating progress and perhaps expected time to finish might be useful. That would perhaps help someone decide if it might be more appropriate to failover to a different server rather than recover the situation that asynchronous replication is being used.
Certainly in some cases MySQL recovery may take a very long time especially if the storage is not fast. That may be less of a concern with SSDs or NVMe storage but all the faster storage does is allow you to write more and thus potentially build up a larger dataset size to recover than before leaving you in the same position as before with a potentially long time to wait for recovery.