Bug #90993 | InnoDB crashes from long lock wait - log writer waiting for checkpointer | ||
---|---|---|---|
Submitted: | 23 May 2018 14:21 | Modified: | 20 Aug 2018 12:24 |
Reporter: | Mark Callaghan | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: InnoDB storage engine | Severity: | S2 (Serious) |
Version: | 8.0.11 | OS: | Any |
Assigned to: | Pawel Olchawa | CPU Architecture: | Any |
[23 May 2018 14:21]
Mark Callaghan
[23 May 2018 14:26]
Mark Callaghan
innodb monitor output
Attachment: 92a (application/octet-stream, text), 37.76 KiB.
[24 May 2018 13:49]
Mark Callaghan
This reproduces for me too when I use 32 clients & 32 tables. it did not crash with 32 clients sharing 1 table. This error message is confusing. The semaphore wait didn't last > 42675766 seconds. 2018-05-24T04:46:59.669746Z 0 [ERROR] [MY-012872] [InnoDB] InnoDB: Semaphore wait has lasted > 42675766 seconds. We intentionally crash the server because it appears to be hung .[FATAL]Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
[27 May 2018 13:47]
Mark Callaghan
Test server has 2 sockets, 14 cores / socket and HT enabled so user land see 56 cores. I ran the test starting with 24 clients and then increased it by 1 each time the test was OK. The first failure occurs at 29 clients. Either the ingest rate is too high starting with 29 clients or there is some kind of starvation preventing background threads from getting work done fast enough. It is interesting this doesn't happen until the number of clients is slightly larger than the number of real CPU cores. I wonder if part of the problem is the more frequent use of spin waiting by InnoDB in 8.0.11.
[18 Jun 2018 12:30]
Daniel Price
Posted by developer: BUG#28072385 INNODB CRASHES FROM LONG LOCK WAIT - LOG WRITER WAITING FOR CHECKPOINTER 1. We missed a margin between boundary up to which log writer may write and boundary up to which log_free_check() calls do not have to wait (after they included the concurrency margin). This margin has been restored and it is called "extra_margin". 2. There was a completely useless safety-margin equal to 10% of true redo capacity. The safety margin is supposed not to be used (never, ever). We decrease size of the margin to: min(10% of true redo capacity, 1024 * UNIV_PAGE_SIZE). 3. The concurrency margin became increased by 5% of redo capacity (after the safety_margin is subtracted from the redo capacity). Therefore the initial value of concurrency_margin is now greater by a value which does not depend on number of concurrent threads. This is an extra protection for two cases: - when innodb_thread_concurrency = 0, - when we miss a call to log_free_check() in some flow (bug). 4. What's more, the concurrency_margin became adaptively increased when log_writer enters the extra_margin. If that happens, we had to either encounter a missing call to log_free_check() or we have the unlimited thread concurrency (innodb_thread_concurrency = 0) case. Then next log_free_check() calls are stopped even earlier and the log_writer thread proceeds further (it still has the extra_margin in which it can operate (5% of redo capacity)). We hopefully will resolve the problems (thanks to log_writer writing further, which allows to unlock the oldest dirty pages, flush them and advance the checkpoint) and next time we will stop users at log_free_check() calls earlier (this point might be moved up to even 50% of redo). When log_writer exits the extra_margin (by checkpoint advanced), it notices the fact and next time it re-enters the extra_margin, it increases the concurrency_margin again. Each time it increases the concurrency_margin, it does that by 20%. 5. Do not kill MySQL explicitly when log_writer runs out of space in redo. Keep waiting and emit error every 5s. In case of real deadlock it is error monitor thread which kills MySQL (after 10s, so we would have the error and we could recognize the reason). 6. Abort redo's archiver when log_writer started to overwrite non-archived yet region of the redo log. 7. Decrease sleep time in loop which is executed when log_writer ran out of free space in redo log (or reached non-archived region). 8. Do not emit warnings about running out of space in redo soon (we emit only error and only when we hit the final limit, and only once per 5s). 9. Emit warning about overwriting non-archived region of redo only once per provided period. 10. Introduced new monitor counters which allow to monitor scenarios in which we run out of free space in redo. 11. Reclaim space in log recent_written buffer when log_writer is waiting for free space in redo log / archiver's progress.
[3 Jul 2018 5:49]
MySQL Verification Team
Bug #91517 marked as duplicate of this one
[20 Aug 2018 12:24]
Daniel Price
Posted by developer: Fixed as of the upcoming 8.0.13 release, and here's the changelog entry: An apparent hang due to the log writer running out of free space in the redo log caused the server to exit. Thank you for the bug report.