MySQL Bugs: #90993: InnoDB crashes from long lock wait - log writer waiting for checkpointer

Bug #90993	InnoDB crashes from long lock wait - log writer waiting for checkpointer
Submitted:	23 May 2018 14:21	Modified:	20 Aug 2018 12:24
Reporter:	Mark Callaghan	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S2 (Serious)
Version:	8.0.11	OS:	Any
Assigned to:	Pawel Olchawa	CPU Architecture:	Any

Description:
InnoDB crashes from a long semaphore wait during the load when I run the insert benchmark with 32 tables, 32 clients and 2B rows. This is an IO-bound setup -- the buffer pool is 35gb. It ran without crashing with 16 clients. The server has 56 CPU cores with HT enabled and fast SSD storage.

Database error log is full of messages like this prior to the crash:
2018-05-22T15:52:27.287624Z 0 [ERROR] [MY-013059] [InnoDB] InnoDB: Log writer overwriting data after checkpoint - waited too long (1 second), lag: 3865468416 bytes, checkpoint LSN: 134684446208
2018-05-22T15:52:27.287686Z 0 [Warning] [MY-013060] [InnoDB] InnoDB: Log writer is waiting for checkpointer to to catch up lag: 3865468928 bytes, checkpoint LSN: 134684446208

Crash stack is:
2018-05-22T20:20:40.539083Z 0 [Warning] [MY-013060] [InnoDB] InnoDB: Log writer is waiting for checkpointer to to catch up lag: 3865954304 bytes, checkpoint LSN: 199802707456
2018-05-22T20:20:40.639192Z 0 [Warning] [MY-013060] [InnoDB] InnoDB: Log writer is waiting for checkpointer to to catch up lag: 3865954304 bytes, checkpoint LSN: 199802707456
InnoDB: ###### Diagnostic info printed to the standard error stream
2018-05-22T20:20:40.657998Z 0 [ERROR] [MY-012872] [InnoDB] InnoDB: Semaphore wait has lasted > 42675766 seconds. We intentionally crash the server because it appears to be hung.[FATAL]Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
2018-05-22T20:20:40.658015Z 0 [ERROR] [MY-000000] [InnoDB] InnoDB: Assertion failure: ut0ut.cc:675
InnoDB: thread 140180706748160
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/8.0/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
20:20:40 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=34
max_threads=5000
thread_count=33
connection_count=32
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1983309 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x46000
/data/users/mcallaghan/orig8011/libexec/mysqld(my_print_stacktrace(unsigned char*, unsigned long)+0x2d) [0x1b077cd]
/data/users/mcallaghan/orig8011/libexec/mysqld(handle_fatal_signal+0x4b0) [0xde60e0]
/usr/local/fbcode/gcc-5-glibc-2.23/lib/libpthread.so.0(+0x11acf) [0x7f8814be9acf]
/usr/local/fbcode/gcc-5-glibc-2.23/lib/libc.so.6(gsignal+0x38) [0x7f88130327c8]
/usr/local/fbcode/gcc-5-glibc-2.23/lib/libc.so.6(abort+0x170) [0x7f8813034590]
/data/users/mcallaghan/orig8011/libexec/mysqld(ut_dbg_assertion_failed(char const*, char const*, unsigned long)+0xe9) [0xb2f524]
/data/users/mcallaghan/orig8011/libexec/mysqld(ib::fatal::~fatal()+0x1e7) [0x1da23b7]
/data/users/mcallaghan/orig8011/libexec/mysqld(srv_error_monitor_thread()+0xbf8) [0x1d40b38]
/data/users/mcallaghan/orig8011/libexec/mysqld(std::thread::_Impl<std::_Bind_simple<Runnable (void (*)())> >::_M_run()+0x68) [0x1c67b78]
/usr/local/fbcode/gcc-5-glibc-2.23/lib/libstdc++.so.6(+0xc7170) [0x7f8813799170]
/usr/local/fbcode/gcc-5-glibc-2.23/lib/libpthread.so.0(+0x77a8) [0x7f8814bdf7a8]
/usr/local/fbcode/gcc-5-glibc-2.23/lib/libc.so.6(clone+0x6c) [0x7f881310aa7c]

How to repeat:
Test is the insert benchmark described at http://smalldatum.blogspot.com/2017/06/the-insert-benchmark.html

Benchmark source is here - https://github.com/mdcallag/mytools/tree/master/bench/ibench

repro script is:
for i in $( seq 1 32 ); do
python iibench.py --db_name=test --rows_per_report=100000 --db_host=127.0.0.1 --db_user=root --db_password=pw --engine=innodb --unique_checks=1 --bulk_load=0 --max_rows=62500000 --table_name=pi${i} --setup --num_secondary_indexes=3 --data_length_min=10 --data_length_max=20 --rows_per_commit=100 --inserts_per_second=0 --query_threads=0 --seed=$(( 1527002255 + $i )) >& ld.$i &
sleep 3
done

my.cnf is:
[mysqld]
sql_mode=no_engine_substitution,no_unsigned_subtraction
default_tmp_storage_engine=MyISAM
tmpdir=/tmp
slow_query_log_file=/data/users/mcallaghan/orig8011/slow.log
slow_query_log=1
long_query_time=5
log_bin=/binlogs/myrocks/bl
max_binlog_size=128M
sync_binlog=0
server_id=77
binlog_format=ROW
binlog_checksum=CRC32

datadir=/data/users/mcallaghan/orig8011/data
innodb_data_home_dir=/data/users/mcallaghan/orig8011/data
innodb_data_file_path=ibdata1:512M:autoextend
innodb_log_group_home_dir=/txlogs/myrocks
innodb_log_file_size=2G

innodb_flush_log_at_trx_commit=2
innodb_flush_method=O_DIRECT
innodb_doublewrite=1

innodb_flush_neighbors=0
innodb_io_capacity=20000
innodb_io_capacity_max=50000
innodb_lru_scan_depth=5000
innodb_buffer_pool_instances=4

performance_schema=1
ssl=0

character_set_server=latin1
collation_server=latin1_swedish_ci
default_authentication_plugin=mysql_native_password

max_connections=5000

innodb_buffer_pool_size=35G

innodb monitor output

Attachment: 92a (application/octet-stream, text), 37.76 KiB.

This reproduces for me too when I use 32 clients & 32 tables. it did not crash with 32 clients sharing 1 table.

This error message is confusing. The semaphore wait didn't last > 42675766 seconds.

2018-05-24T04:46:59.669746Z 0 [ERROR] [MY-012872] [InnoDB] InnoDB: Semaphore wait has lasted > 42675766 seconds. We intentionally crash the server because it appears to be hung
.[FATAL]Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.

Test server has 2 sockets, 14 cores / socket and HT enabled so user land see 56 cores. I ran the test starting with 24 clients and then increased it by 1 each time the test was OK. The first failure occurs at 29 clients. 

Either the ingest rate is too high starting with 29 clients or there is some kind of starvation preventing background threads from getting work done fast enough. It is interesting this doesn't happen until the number of clients is slightly larger than the number of real CPU cores. I wonder if part of the problem is the more frequent use of spin waiting by InnoDB in 8.0.11.

Posted by developer:
 
    BUG#28072385 INNODB CRASHES FROM LONG LOCK WAIT - LOG WRITER
                 WAITING FOR CHECKPOINTER
    
    1. We missed a margin between boundary up to which log writer
    may write and boundary up to which log_free_check() calls do
    not have to wait (after they included the concurrency margin).
    This margin has been restored and it is called "extra_margin".
    
    2. There was a completely useless safety-margin equal to 10%
    of true redo capacity. The safety margin is supposed not to
    be used (never, ever). We decrease size of the margin to:
      min(10% of true redo capacity, 1024 * UNIV_PAGE_SIZE).
    
    3. The concurrency margin became increased by 5% of redo capacity
    (after the safety_margin is subtracted from the redo capacity).
    Therefore the initial value of concurrency_margin is now greater
    by a value which does not depend on number of concurrent threads.
    This is an extra protection for two cases:
      - when innodb_thread_concurrency = 0,
      - when we miss a call to log_free_check() in some flow (bug).
    
    4. What's more, the concurrency_margin became adaptively increased
    when log_writer enters the extra_margin. If that happens, we had to
    either encounter a missing call to log_free_check() or we have the
    unlimited thread concurrency (innodb_thread_concurrency = 0) case.
    Then next log_free_check() calls are stopped even earlier and the
    log_writer thread proceeds further (it still has the extra_margin
    in which it can operate (5% of redo capacity)). We hopefully will
    resolve the problems (thanks to log_writer writing further, which
    allows to unlock the oldest dirty pages, flush them and advance
    the checkpoint) and next time we will stop users at log_free_check()
    calls earlier (this point might be moved up to even 50% of redo).
    When log_writer exits the extra_margin (by checkpoint advanced),
    it notices the fact and next time it re-enters the extra_margin,
    it increases the concurrency_margin again. Each time it increases
    the concurrency_margin, it does that by 20%.
    
    5. Do not kill MySQL explicitly when log_writer runs out of space
    in redo. Keep waiting and emit error every 5s. In case of real
    deadlock it is error monitor thread which kills MySQL (after 10s,
    so we would have the error and we could recognize the reason).
    
    6. Abort redo's archiver when log_writer started to overwrite
    non-archived yet region of the redo log.
    
    7. Decrease sleep time in loop which is executed when log_writer
    ran out of free space in redo log (or reached non-archived region).
    
    8. Do not emit warnings about running out of space in redo soon (we emit
    only error and only when we hit the final limit, and only once per 5s).
    
    9. Emit warning about overwriting non-archived region of redo only once
    per provided period.
    
    10. Introduced new monitor counters which allow to monitor scenarios
    in which we run out of free space in redo.
    
    11. Reclaim space in log recent_written buffer when log_writer is
    waiting for free space in redo log / archiver's progress.

Bug #91517 marked as duplicate of this one

Posted by developer:
 
Fixed as of the upcoming 8.0.13 release, and here's the changelog entry:

An apparent hang due to the log writer running out of free space in the
redo log caused the server to exit. 

Thank you for the bug report.