Bug #81753 Make crash recovery parallel
Submitted: 7 Jun 2016 15:32 Modified: 8 Jun 2016 9:14
Reporter: Vadim Tkachenko Email Updates:
Status: Verified Impact on me:
Category:MySQL Server: InnoDB storage engine Severity:S4 (Feature request)
Version:5.7 OS:Any
Assigned to: CPU Architecture:Any

[7 Jun 2016 15:32] Vadim Tkachenko
Based on https://www.percona.com/blog/2016/06/07/severe-performance-regression-mysql-5-7-crash-reco...
Crash recovery should be parallel on order to utilize multiple CPUs and fast storage.

How to repeat:

Suggested fix:
Implement parallel crash recovery in InnoDB
[8 Jun 2016 9:14] MySQL Verification Team
Hello Vadim,

Thank you for the reasonable feature request!

[8 Jun 2016 9:15] MySQL Verification Team
Related - Bug #81648, Bug #80788
[15 Jun 2016 7:41] Daniël van Eeden
Bug #81853 is a duplicate of this bug
[28 Jun 2016 16:21] Naga Satyanarayana Bodapati
I think the title is too generic. The apply part of crash recovery is already parallel because it is done by i/o threads.

It is only the parse of redo that is sequential.
[28 Jun 2016 17:04] Adrian Słowik
I believe, that reading ibd files is also sequential, and with environment with thousands of thousands of tables it can take up to few hours just to scan all those files
[28 Jun 2016 17:22] Naga Satyanarayana Bodapati
Not anymore in 5.7.

http://dev.mysql.com/worklog/task/?id=7142 removes the need for .ibd scan.
[30 Jun 2016 5:53] Simon Mudd
This is basically related to my duplicate bug#81853.

I noted in my logs: "InnoDB: 1655 transaction(s) which must be rolled back or cleaned up"

The time for InnoDB recovery takes to run after this point is significantly higher than the time to appy the logs forward which happens earlier.

Also this UNDO stage seems from what I can see to be done in a single threaded manner "highest transaction id" first, working backwards.

Looking at top the CPU is pegged at 100% so only a single core is being used.

So figuring out a way to do some of this in parallel would be probably quite interesting.
[30 Jun 2016 6:20] Naga Satyanarayana Bodapati

You bug is different. In your case, it is the rollback of uncomitted transactions (from undo log).
that is happening serially and taking time.

This bug is about recovery and it is from redo only.

Your bug is not duplicate.

We should change title of your bug to "Make rollback of transactions parallel"
[30 Jun 2016 6:27] MySQL Verification Team
Bug #81853 Synopsis changed and processed
[30 Jun 2016 6:27] Naga Satyanarayana Bodapati
To add more, the rollback of uncommitted transactions happens in background thread and it shouldn't block the server startup. So server should be ready to accept connections while the transactions are rolled back in background.

I wonder how this increased the startup time in your case.
[30 Jun 2016 8:16] Simon Mudd
The server in question is used to accept heavy BULK inserts/updates into a number of tables in parallel.
Load on the server due to these writes is around 5-10 and network traffic is ~60-80 MB/s (replicating to a few slaves) with disk I/O about 50-100 MB/s during _normal_ processing.

So if you startup a server when it's crashed and allow these busy writers back into the system you probably won't be surprised to see that they interfere with each other and total throughput goes down as some new writes get blocked by the undo process that's running from recovery.
Basically these writers get blocked by the ongoing recovery process.

In my case the number of readers compared to the number of writers is minimal (and slaves will pick up changes and be queried for reads).
So for low write rates maybe the server won't be affected much and queries can continue but that is not always the case.

It's also the reason why I would like to see an optional recovery behaviour where access to the server is blocked until recovery has completed.
This allows the recovery process to proceed as fast as it can unimpeded by other activity at which case normal access would work.
(There's another FR for this elsewhere.)  Configuring this manually is possible but not easy to do cleanly.

FWIW on the server in question the recovery process took > 45 minutes (and hadn't completed) when inserts were allowed in vs it ran for 20 minutes with network traffic blocked.
That's quite a difference.
[30 Jun 2016 8:28] Simon Mudd
Naga: You said: "I wonder how this increased the startup time in your case."

For me startup time is the time for the server to be working "normally". That is the time when it can accept reads and writes and process them in a normal manner.  In my opinion accepting network SQL connections and allowing queries to run does not mean that the server has fully started up as some writes will get blocked.

It's true that some writes may be allowed to make progress if they affect tables where transactions are not being rolled back but in the end the tables affected by the UNDO are likely to be the busiest tables and they're also likely to be the tables that take new writes. Hence interaction between new writes and the UNDO process is pretty likely (especially on a master).
[17 Sep 2016 0:08] James Day
It is also worth noting that this is not only about the performance of crash recovery. It's also about performance while in normal production service, which can be limited by crash recovery speed.

The limit happens because people can have a maximum amount of time that they are willing to wait for crash recovery. Duplicated redundant servers can help with single server failures via failover arrangements but typically not if all are in one data center that suffers a power outage and that is the worst case recovery time that must be limited.

To achieve that target they must do things like increasing page flushing via innodb_io_capacity and/or innodb_io_capacity_max or decreasing redo log size. Each of those reduces potential foreground thread throughput because of the loss of the i/o operations wasted on flushing dirty pages earlier than the buffer pool size could handle.

As InnoDB gets faster in general, this can end up being the limiting throughput factor because of the sub-optimal configurations that are necessary to hit the target. Fortunately it's still rare for us to have to recommend such approaches but that is the way things are trending as RAM sizes increase and storage doesn't match RAM speed.

SSDs and bus attached storage would just hit the problem at higher throughput levels, though they can help initially if the price per gigabyte is tolerable for the data capacity needed. Both are still way more expensive in $/GB than SSDs for the bulk end of the data size range where archival SATA drives are currently the clear leader but there are many i/o capacity vs price points where they can fit the need and budget.

James Day, MySQL Senior Principal Support Engineer, Oracle