Bug #102143 workload stalls while running truncation with large buffer pool
Submitted: 5 Jan 3:06 Modified: 7 Jan 9:46
Reporter: zhai weixiang (OCA) Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S5 (Performance)
Version:8.0 OS:Any
Assigned to: CPU Architecture:Any

[5 Jan 3:06] zhai weixiang
Description:
While running benchamark on production enviroment, we found the tps decrease to zero sometimes and the backtrace looks like this:

   1 buf_page_in_file(buf0buf.ic:338),buf_LRU_remove_all_pages(buf0buf.ic:338),buf_LRU_remove_pages(buf0buf.ic:338),buf_LRU_flush_or_remove_pages(buf0buf.ic:338),Fil_shard::space_delete(fil0fil.cc:4386),fil_delete_tablespace(fil0fil.cc:4485),fil_replace_tablespace(fil0fil.cc:4581),trx_undo_truncate_tablespace(trx0undo.cc:1968),trx_purge_truncate_marked_undo_low(trx0purge.cc:1320),trx_purge_truncate_marked_undo(trx0purge.cc:1320),trx_purge_truncate_history(trx0purge.cc:1602),trx_purge_truncate(trx0purge.cc:2190),trx_purge(trx0purge.cc:2190),srv_do_purge(srv0srv.cc:3000),srv_purge_coordinator_thread(srv0srv.cc:3000),void(srv0srv.cc:3000),__invoke<void(srv0srv.cc:3000),__call<void>(srv0srv.cc:3000),operator()<>(srv0srv.cc:3000),operator()<void(srv0srv.cc:3000),__invoke_impl<void,(srv0srv.cc:3000),__invoke<Runnable,(srv0srv.cc:3000),_M_invoke<0,(srv0srv.cc:3000),operator()(srv0srv.cc:3000),std::thread::_State_impl<std::thread::_Invoker<std::tuple<Runnable,(srv0srv.cc:3000),execute_native_thread_routine,start_thread(libpthread.so.0),clone(libc.so.6)

 200 pthread_cond_wait,wait(os0event.cc:189),os_event::wait_low(os0event.cc:189),os_event_wait_low(os0event.cc:608),sync_array_wait_event(sync0arr.cc:444),rw_lock_s_lock_spin(sync0rw.cc:390),rw_lock_s_lock_func(sync0rw.ic:376),pfs_rw_lock_s_lock_func(sync0rw.ic:376),s_lock(sync0rw.ic:376),get_next_redo_rseg_from_undo_spaces(sync0rw.ic:376),get_next_redo_rseg(sync0rw.ic:376),trx_assign_rseg_durable(sync0rw.ic:376),trx_set_rw_mode(trx0trx.cc:3212),lock_table(lock0lock.cc:3929),row_search_mvcc(row0sel.cc:4828),ha_innobase::index_read(ha_innodb.cc:10375),index_read_map(handler.h:4987),handler::ha_index_read_map(handler.h:4987),handler::read_range_first(handler.cc:7293),ha_innobase::read_range_first(ha_innodb.cc:10746),handler::multi_range_read_next(handler.cc:6450),handler::ha_multi_range_read_next(handler.cc:6391),QUICK_RANGE_SELECT::get_next(opt_range.cc:10948),IndexRangeScanIterator::Read(records.cc:358),Sql_cmd_update::update_single_table(sql_update.cc:815),Sql_cmd_dml::execute(sql_select.cc:704),mysql_execute_command(sql_parse.cc:4754),Prepared_statement::execute(sql_prepare.cc:3244),Prepared_statement::execute_loop(sql_prepare.cc:2806),mysqld_stmt_execute(sql_prepare.cc:1924),dispatch_command(sql_parse.cc:1830),do_command(sql_parse.cc:1304),threadpool_process_request(threadpool_common.cc:258),handle_event(threadpool_unix.cc:1737),worker_main(threadpool_unix.cc:1737),pfs_spawn_thread(pfs.cc:2854),start_thread(libpthread.so.0),clone(libc.so.6)

How to repeat:
1. turn off binlog so that we can run very heavy workload with sysbench.
2. Trigger an undo truncation (for example, create an undo tablespace and mark it inactive)
3, big buffer pool,(we have 256GB buffer pool 

Suggested fix:
Don't hold undo::spaces->x_lock() while running trx_undo_truncate_tablespace() invoked by trx_purge_truncate_marked_undo_low
[6 Jan 8:30] MySQL Verification Team
Hello Zhai,

Thank you for the report and feedback.
May I request you to please provide sysbench test details(unaltered sysbench command with all the options used), server configuration file, server version details and statements which triggered stall etc to reproduce the issue at our end? Thank you.

regards,
Umesh
[6 Jan 9:02] zhai weixiang
sysbench command:
200 tables, each has 10000000 records
sysbench oltp_update_non_index.lua --threads=200 --time=3000 --mysql-user=$user --mysql-password=$password --mysql-host=$host --mysql-port=$port --mysql-db=sysbench --report-interval=1 --tables=200 --table-size=10000000  --events=0 --forced-shutdown=off  run

undo truncation is turn on.

buffer pool configuration:
innodb_buffer_pool_chunk_size=32M
innodb_buffer_pool_size = 500000M
innodb_buffer_pool_instances = 32

Actually after analyzing the code, I think it should be safe to invoke trx_undo_truncate_tablespace without holding undo::spaces->x_lock(). because the undo tablespace to be truncated is marked as inactive and no user thread will use rsegs of the space (ref code in function get_next_redo_rseg_from_undo_spaces)
[7 Jan 2:37] zhai weixiang
btw. I'm using 8.0.18 version. and the version after (WL#11819: InnoDB: Support ACID Undo DDL) should suffer less influence. But may still be affected as the truncation process is holding x lock
[7 Jan 9:46] MySQL Verification Team
Thank you for the feedback.
I tried on 8.0.22 release build with the moderate buffer pool(96G) considering test box capacity:

rm -rf 102143/
bin/mysqld --initialize-insecure --basedir=$PWD --datadir=$PWD/102143 --log-error-verbosity=3
bin/mysqld --no-defaults --basedir=$PWD --datadir=$PWD/102143 --core-file --socket=/tmp/mysql_ushastry.sock --port=3333 --log-error=$PWD/102143/log.err --mysqlx-port=33330 --mysqlx-socket=/tmp/mysql_x_ushastry.sock --log-error-verbosity=3  --secure-file-priv=/tmp/ --innodb_buffer_pool_chunk_size=32M --innodb_buffer_pool_size=96G --innodb_buffer_pool_instances=32 --skip-log-bin 2>&1 &

- prepare
 bin/sysbench --threads=200 --mysql-user=root --mysql-socket=/tmp/mysql_ushastry.sock --mysql-db=sysbench --report-interval=1 --tables=200 --table-size=10000000 --events=0 --forced-shutdown=off ./share/sysbench/oltp_update_non_index.lua prepare

+
 du -h 102143/sysbench/
454G    102143/sysbench/

- run (increased max connections 1000)

 bin/sysbench --threads=500 --time=3000 --mysql-user=root --mysql-socket=/tmp/mysql_ushastry.sock --mysql-db=sysbench --report-interval=1 --tables=200 --table-size=10000000 --events=0 --forced-shutdown=off ./share/sysbench/oltp_update_non_index.lua run

In between:

CREATE UNDO TABLESPACE tablespace_name ADD DATAFILE 'file_name.ibu';
ALTER UNDO TABLESPACE tablespace_name SET INACTIVE;

Observed decrease in tps:

 777s ] thds: 500 tps: 544.82 qps: 544.82 (r/w/o: 0.00/544.82/0.00) lat (ms,95%): 1938.16 err/s: 0.00 reconn/s: 0.00
[ 778s ] thds: 500 tps: 590.21 qps: 590.21 (r/w/o: 0.00/590.21/0.00) lat (ms,95%): 1109.09 err/s: 0.00 reconn/s: 0.00
[ 779s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 780s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 781s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 782s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 783s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 784s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 785s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 786s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 787s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 788s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 789s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 790s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 791s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 792s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 793s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 794s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 795s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 796s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 797s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 798s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 799s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 800s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 801s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 802s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 803s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 804s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 805s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 806s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 807s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 808s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 809s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 810s ] thds: 500 tps: 1199.01 qps: 1199.01 (r/w/o: 0.00/1199.01/0.00) lat (ms,95%): 32161.14 err/s: 0.00 reconn/s: 0.00
[ 811s ] thds: 500 tps: 942.74 qps: 942.74 (r/w/o: 0.00/942.74/0.00) lat (ms,95%): 1213.57 err/s: 0.00 reconn/s: 0.00
[ 812s ] thds: 500 tps: 658.16 qps: 658.16 (r/w/o: 0.00/658.16/0.00) lat (ms,95%): 1069.86 err/s: 0.00 reconn/s: 0.00
[ 813s ] thds: 500 tps: 920.00 qps: 920.00 (r/w/o: 0.00/920.00/0.00) lat (ms,95%): 1149.76 err/s: 0.00 reconn/s: 0.00
[ 814s ] thds: 500 tps: 1102.81 qps: 1102.81 (r/w/o: 0.00/1102.81/0.00) lat (ms,95%): 1032.01 err/s: 0.00 reconn/s: 0.00
[ 815s ] thds: 500 tps: 1448.62 qps: 1448.62 (r/w/o: 0.00/1448.62/0.00) lat (ms,95%): 960.30 err/s: 0.00 reconn/s: 0.00
[ 816s ] thds: 500 tps: 1086.39 qps: 1086.39 (r/w/o: 0.00/1086.39/0.00) lat (ms,95%): 861.95 err/s: 0.00 reconn/s: 0.00

Looking at "Performance Impact of Truncating Undo Tablespace Files" in https://dev.mysql.com/doc/refman/8.0/en/innodb-undo-tablespaces.html it seems to be expected and also didn't observe the stacktraces which you pointed in the report. Most of the issues seems to be handled as part of WL#11819(in 8.0.21) but since you have pointed rightly about truncation process holding x lock and hence verifying so that further it is improved.