Bug #96374 binlog rotation deadlock when innodb concurrency limit setted
Submitted: 30 Jul 6:53 Modified: 21 Aug 16:00
Reporter: jia liu Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S3 (Non-critical)
Version:8.0.16 8.0.17 5.7.25 OS:Any
Assigned to: CPU Architecture:Any
Tags: binlog rotaion, innodb-commit-concurrency, innodb-thread-concurrency

[30 Jul 6:53] jia liu
Description:
In version 8.0.16

Under this settings:
log-bin=/data/mysql/mysqldata3307/binlog/mysql-bin
master-info-repository=TABLE
innodb-commit-concurrency=16
innodb-thread-concurrency=64

With high concurrency updates(more than 256 threads), mysql can deadlock when rotating binlog.

Here is the part of the stack trace:
Thread 673 (Thread 0x7f5b3c0be700 (LWP 8905)):
...
#3  srv_conc_enter_innodb (prebuilt=prebuilt@entry=0x7f59b40451d8) at ../../../mysql-8.0.16/storage/innobase/srv/srv0conc.cc:224
...
#5  ha_innobase::write_row(unsigned char*) () at ../../../mysql-8.0.16/storage/innobase/handler/ha_innodb.cc:8374
...
#10 0x000000000198f01d in Gtid_state::save_gtids_of_last_binlog_into_table(bool) () at ../../mysql-8.0.16/sql/rpl_gtid_state.cc:711
#11 0x00000000019bafd4 in MYSQL_BIN_LOG::new_file_impl(bool, Format_description_log_event*) () at ../../mysql-8.0.16/sql/binlog.cc:6394
#12 0x00000000019bb6aa in MYSQL_BIN_LOG::new_file_without_locking (this=<optimized out>, extra_description_event=extra_description_event@entry=0x0) at ../../mysql-8.0.16/sql/binlog.cc:6329
#13 0x00000000019bb6e5 in MYSQL_BIN_LOG::rotate (this=<optimized out>, force_rotate=<optimized out>, check_purge=0x7f5b3c0ba970) at ../../mysql-8.0.16/sql/binlog.cc:6992
#14 0x00000000019c88b7 in MYSQL_BIN_LOG::ordered_commit(THD*, bool, bool) () at ../../mysql-8.0.16/sql/binlog.cc:8934
#15 0x00000000019c97a1 in MYSQL_BIN_LOG::commit(THD*, bool) () at ../../mysql-8.0.16/sql/binlog.cc:8070
#16 0x0000000000fe061e in ha_commit_trans (thd=thd@entry=0x7f5a38083bc0, all=all@entry=false, ignore_global_read_lock=<optimized out>) at ../../mysql-8.0.16/sql/xa.h:401
#17 0x0000000000eaf81d in trans_commit_stmt(THD*, bool) () at ../../mysql-8.0.16/sql/transaction.cc:531
....

Thread 683 (Thread 0x7f5b5d626700 (LWP 8563)):
#0  0x00007f5b722d1965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000019b3c64 in native_cond_wait (mutex=0x3807608 <mysql_bin_log+3912>, cond=0x38075d0 <mysql_bin_log+3856>) at ../../mysql-8.0.16/include/thr_cond.h:162
#2  my_cond_wait (mp=0x3807608 <mysql_bin_log+3912>, cond=0x38075d0 <mysql_bin_log+3856>) at ../../mysql-8.0.16/include/thr_cond.h:162
#3  inline_mysql_cond_wait (src_file=0x2ae0410 "../../mysql-8.0.16/sql/binlog.cc", src_line=2346, mutex=0x3807608 <mysql_bin_log+3912>, that=0x38075d0 <mysql_bin_log+3856>) at ../../mysql-8.0.16/include/mysql/psi/mysql_cond.h:174
#4  Stage_manager::enroll_for (this=this@entry=0x38074f8 <mysql_bin_log+3640>, stage=<optimized out>, thd=0x7f5a68011a70, stage_mutex=<optimized out>) at ../../mysql-8.0.16/sql/binlog.cc:2346
#5  0x00000000019b8b45 in MYSQL_BIN_LOG::change_stage (this=0x38066c0 <mysql_bin_log>, thd=<optimized out>, stage=<optimized out>, queue=<optimized out>, leave_mutex=<optimized out>, enter_mutex=0x38066c8 <mysql_bin_log+8>) at ../../mysql-8.0.16/sql/binlog.cc:8355
#6  0x00000000019c8366 in MYSQL_BIN_LOG::ordered_commit(THD*, bool, bool) () at ../../mysql-8.0.16/sql/binlog.cc:8719
...

Thread 626 (Thread 0x7f59a14bf700 (LWP 9316)):
#0  0x00007f5b722d44ed in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f5b722cfdcb in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007f5b722cfc98 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00000000019b8b8f in native_mutex_lock (mutex=0x38066c8 <mysql_bin_log+8>) at ../../mysql-8.0.16/include/thr_mutex.h:186
#4  my_mutex_lock (mp=0x38066c8 <mysql_bin_log+8>) at ../../mysql-8.0.16/include/thr_mutex.h:186
#5  inline_mysql_mutex_lock (src_line=8373, src_file=0x2ae0410 "../../mysql-8.0.16/sql/binlog.cc", that=0x38066c8 <mysql_bin_log+8>) at ../../mysql-8.0.16/include/mysql/psi/mysql_mutex.h:250
#6  MYSQL_BIN_LOG::change_stage (this=<optimized out>, thd=<optimized out>, stage=<optimized out>, queue=<optimized out>, leave_mutex=<optimized out>, enter_mutex=0x38066c8 <mysql_bin_log+8>) at ../../mysql-8.0.16/sql/binlog.cc:8373
#7  0x00000000019c8366 in MYSQL_BIN_LOG::ordered_commit(THD*, bool, bool) () at ../../mysql-8.0.16/sql/binlog.cc:8719
...

I my guess, thread 673 need to write rows when rotating binlogs,   which requires to enter inside into innodb, but many other thread like 683 and 626 are waiting binlog while holding innodb-thread-concurrency or innodb-commit-concurrency(according to the show engine innodb status result , it is indeed that 64 threads is inside innodb), which lead to deadlock.

How to repeat:
In version 8.0.16

Under this settings:
log-bin=/data/mysql/mysqldata3307/binlog/mysql-bin
master-info-repository=TABLE
innodb-commit-concurrency=16
innodb-thread-concurrency=64

or use attached my.cnf

Then stress MySQL server with high concurrency updates(more than 256 threads), and wait binlog rotates.
It could need many rotates to trigger this problem, but it will happen, we can reproduce this problem.

Suggested fix:
Temporary workaround: set innodb-commit-concurrency=0 and innodb-thread-concurrency=0
Suggested fix: bypass innodb concurrency limits for system internal operations, only enforce it for user request.
[30 Jul 6:54] jia liu
gdb outputs

Attachment: gdb.txt (text/plain), 1.63 MiB.

[30 Jul 6:54] jia liu
show engine innodb status

Attachment: show engine innodb status.txt (text/plain), 70.15 KiB.

[30 Jul 6:55] jia liu
show processlist

Attachment: show porcesslist.txt (text/plain), 135.61 KiB.

[30 Jul 6:55] jia liu
my.cnf

Attachment: my.cnf (application/octet-stream, text), 4.95 KiB.

[12 Aug 5:15] Bogdan Kecman
Hi,

I'm running this for weeks and I'm not able to reproduce the problem.

What environment are you using to reproduce this?

thanks
Bogdan
[13 Aug 11:45] jia liu
this is my comment, it is too long to post

Attachment: analyze.txt (text/plain), 24.03 KiB.

[14 Aug 9:08] jia liu
I am sorry for my misleading suspect,
this bug is nothing to do with master-info-repository=TABLE,
just with innodb-commit-concurrency\innodb-thread-concurrency.
[15 Aug 3:33] jia liu
With more investigation, we find it is more like to to trigger when update 0 rows.(insert ignore\delete 0 rows also possible, but not tested yet)
And it also affects 5.7.25 but not only 5.7.25.In my suspect, it will affect all versions which gtid_executed is an innodb table.

I still suggest to fix this problem by "bypass innodb concurrency limits for system internal operations", not to only patchs for update 0 rows.
[21 Aug 13:58] Sven Petai
We have also run into this deadlock twice in production (didn't have debug tools in place first time around to narrow it down).
Our normal write load is several thousand commits per second 24/7 and it takes 3-4 months of that to trigger the deadlock so it definitely requires careful massaging of both mysqld config and the test script to trigger it reliably in a reasonable amount of time.

To replicate the deadlock reliably on mysql-community-8.0.17 server the following non-default settings seem to be important:
innodb_thread_concurrency = 4 # low value in order to increase probability of the deadlock
max_binlog_size=2M # since deadlock involves binlog rotation we set binlog file size to a low value to increase rotation frequency
binlog_format=mixed # haven't been able to reproduce it with the default ROW yet
gtid_mode = ON # haven't been able to replicate it without GTID 
enforce_gtid_consistency = ON

For the test client I used sysbench 1.1.0-174f3aa with the bundled oltp_update_non_index.lua test with a small change to ./src/lua/oltp_common.lua to throw in some updates that do not match any rows as Jia suggested.
I'm not sure yet if these zero-updates are essential for triggering of the bug or just increase the probability but with some zero-updates thrown in it will hang in a couple of seconds whereas without these I haven't been able to replicate it in an hour.

Replace get_id() function in sysbenches oltp_common.lua with something like this:
local function get_id()
   local id = sysbench.rand.default(1, sysbench.opt.table_size)
   if (id % 10) == 0 then
      return id+10000000
   else
      return id
   end
end

After that populate the test table:
sysbench ./src/lua/oltp_update_non_index.lua --time=00 --threads=230 --mysql-user=sbtest --mysql-password=sbtest --mysql-db=test --tables=1 --table-size=1000000 prepare

and run the test:
sysbench ./src/lua/oltp_update_non_index.lua --time=00 --threads=230 --mysql-user=sbtest --mysql-password=sbtest --mysql-db=test --tables=1 --table-size=1000000 run

You should see hang within a couple of seconds.
[21 Aug 16:00] Bogdan Kecman
Hi,

Thanks for the VM, reproduced on the first try.

All best
Bogdan