MySQL Bugs: #81753: Make crash recovery parallel

Bug #81753	Make crash recovery parallel
Submitted:	7 Jun 2016 15:32	Modified:	8 Jun 2016 9:14
Reporter:	Vadim Tkachenko	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S4 (Feature request)
Version:	5.7	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Based on https://www.percona.com/blog/2016/06/07/severe-performance-regression-mysql-5-7-crash-reco...
and
https://www.percona.com/blog/2016/05/31/what-is-a-big-innodb_log_file_size/
Crash recovery should be parallel on order to utilize multiple CPUs and fast storage.

How to repeat:
NA

Suggested fix:
Implement parallel crash recovery in InnoDB

Hello Vadim,

Thank you for the reasonable feature request!

Thanks,
Umesh

Related - Bug #81648, Bug #80788

Bug #81853 is a duplicate of this bug

I think the title is too generic. The apply part of crash recovery is already parallel because it is done by i/o threads.

It is only the parse of redo that is sequential.

I believe, that reading ibd files is also sequential, and with environment with thousands of thousands of tables it can take up to few hours just to scan all those files

Not anymore in 5.7.

http://dev.mysql.com/worklog/task/?id=7142 removes the need for .ibd scan.

This is basically related to my duplicate bug#81853.

I noted in my logs: "InnoDB: 1655 transaction(s) which must be rolled back or cleaned up"

The time for InnoDB recovery takes to run after this point is significantly higher than the time to appy the logs forward which happens earlier.

Also this UNDO stage seems from what I can see to be done in a single threaded manner "highest transaction id" first, working backwards.

Looking at top the CPU is pegged at 100% so only a single core is being used.

So figuring out a way to do some of this in parallel would be probably quite interesting.

Simon,

You bug is different. In your case, it is the rollback of uncomitted transactions (from undo log).
that is happening serially and taking time.

This bug is about recovery and it is from redo only.

Your bug is not duplicate.

We should change title of your bug to "Make rollback of transactions parallel"

Bug #81853 Synopsis changed and processed

To add more, the rollback of uncommitted transactions happens in background thread and it shouldn't block the server startup. So server should be ready to accept connections while the transactions are rolled back in background.

I wonder how this increased the startup time in your case.

The server in question is used to accept heavy BULK inserts/updates into a number of tables in parallel.
Load on the server due to these writes is around 5-10 and network traffic is ~60-80 MB/s (replicating to a few slaves) with disk I/O about 50-100 MB/s during _normal_ processing.

So if you startup a server when it's crashed and allow these busy writers back into the system you probably won't be surprised to see that they interfere with each other and total throughput goes down as some new writes get blocked by the undo process that's running from recovery.
Basically these writers get blocked by the ongoing recovery process.

In my case the number of readers compared to the number of writers is minimal (and slaves will pick up changes and be queried for reads).
So for low write rates maybe the server won't be affected much and queries can continue but that is not always the case.

It's also the reason why I would like to see an optional recovery behaviour where access to the server is blocked until recovery has completed.
This allows the recovery process to proceed as fast as it can unimpeded by other activity at which case normal access would work.
(There's another FR for this elsewhere.)  Configuring this manually is possible but not easy to do cleanly.

FWIW on the server in question the recovery process took > 45 minutes (and hadn't completed) when inserts were allowed in vs it ran for 20 minutes with network traffic blocked.
That's quite a difference.

Naga: You said: "I wonder how this increased the startup time in your case."

For me startup time is the time for the server to be working "normally". That is the time when it can accept reads and writes and process them in a normal manner.  In my opinion accepting network SQL connections and allowing queries to run does not mean that the server has fully started up as some writes will get blocked.

It's true that some writes may be allowed to make progress if they affect tables where transactions are not being rolled back but in the end the tables affected by the UNDO are likely to be the busiest tables and they're also likely to be the tables that take new writes. Hence interaction between new writes and the UNDO process is pretty likely (especially on a master).

It is also worth noting that this is not only about the performance of crash recovery. It's also about performance while in normal production service, which can be limited by crash recovery speed.

The limit happens because people can have a maximum amount of time that they are willing to wait for crash recovery. Duplicated redundant servers can help with single server failures via failover arrangements but typically not if all are in one data center that suffers a power outage and that is the worst case recovery time that must be limited.

To achieve that target they must do things like increasing page flushing via innodb_io_capacity and/or innodb_io_capacity_max or decreasing redo log size. Each of those reduces potential foreground thread throughput because of the loss of the i/o operations wasted on flushing dirty pages earlier than the buffer pool size could handle.

As InnoDB gets faster in general, this can end up being the limiting throughput factor because of the sub-optimal configurations that are necessary to hit the target. Fortunately it's still rare for us to have to recommend such approaches but that is the way things are trending as RAM sizes increase and storage doesn't match RAM speed.

SSDs and bus attached storage would just hit the problem at higher throughput levels, though they can help initially if the price per gigabyte is tolerable for the data capacity needed. Both are still way more expensive in $/GB than SSDs for the bulk end of the data size range where archival SATA drives are currently the clear leader but there are many i/o capacity vs price points where they can fit the need and budget.

James Day, MySQL Senior Principal Support Engineer, Oracle