Bug #102143 workload stalls while running truncation with large buffer pool
Submitted: 5 Jan 2021 3:06 Modified: 14 May 2021 12:40
Reporter: zhai weixiang (OCA) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S5 (Performance)
Version:8.0 OS:Any
Assigned to: CPU Architecture:Any

[5 Jan 2021 3:06] zhai weixiang
Description:
While running benchamark on production enviroment, we found the tps decrease to zero sometimes and the backtrace looks like this:

   1 buf_page_in_file(buf0buf.ic:338),buf_LRU_remove_all_pages(buf0buf.ic:338),buf_LRU_remove_pages(buf0buf.ic:338),buf_LRU_flush_or_remove_pages(buf0buf.ic:338),Fil_shard::space_delete(fil0fil.cc:4386),fil_delete_tablespace(fil0fil.cc:4485),fil_replace_tablespace(fil0fil.cc:4581),trx_undo_truncate_tablespace(trx0undo.cc:1968),trx_purge_truncate_marked_undo_low(trx0purge.cc:1320),trx_purge_truncate_marked_undo(trx0purge.cc:1320),trx_purge_truncate_history(trx0purge.cc:1602),trx_purge_truncate(trx0purge.cc:2190),trx_purge(trx0purge.cc:2190),srv_do_purge(srv0srv.cc:3000),srv_purge_coordinator_thread(srv0srv.cc:3000),void(srv0srv.cc:3000),__invoke<void(srv0srv.cc:3000),__call<void>(srv0srv.cc:3000),operator()<>(srv0srv.cc:3000),operator()<void(srv0srv.cc:3000),__invoke_impl<void,(srv0srv.cc:3000),__invoke<Runnable,(srv0srv.cc:3000),_M_invoke<0,(srv0srv.cc:3000),operator()(srv0srv.cc:3000),std::thread::_State_impl<std::thread::_Invoker<std::tuple<Runnable,(srv0srv.cc:3000),execute_native_thread_routine,start_thread(libpthread.so.0),clone(libc.so.6)

 200 pthread_cond_wait,wait(os0event.cc:189),os_event::wait_low(os0event.cc:189),os_event_wait_low(os0event.cc:608),sync_array_wait_event(sync0arr.cc:444),rw_lock_s_lock_spin(sync0rw.cc:390),rw_lock_s_lock_func(sync0rw.ic:376),pfs_rw_lock_s_lock_func(sync0rw.ic:376),s_lock(sync0rw.ic:376),get_next_redo_rseg_from_undo_spaces(sync0rw.ic:376),get_next_redo_rseg(sync0rw.ic:376),trx_assign_rseg_durable(sync0rw.ic:376),trx_set_rw_mode(trx0trx.cc:3212),lock_table(lock0lock.cc:3929),row_search_mvcc(row0sel.cc:4828),ha_innobase::index_read(ha_innodb.cc:10375),index_read_map(handler.h:4987),handler::ha_index_read_map(handler.h:4987),handler::read_range_first(handler.cc:7293),ha_innobase::read_range_first(ha_innodb.cc:10746),handler::multi_range_read_next(handler.cc:6450),handler::ha_multi_range_read_next(handler.cc:6391),QUICK_RANGE_SELECT::get_next(opt_range.cc:10948),IndexRangeScanIterator::Read(records.cc:358),Sql_cmd_update::update_single_table(sql_update.cc:815),Sql_cmd_dml::execute(sql_select.cc:704),mysql_execute_command(sql_parse.cc:4754),Prepared_statement::execute(sql_prepare.cc:3244),Prepared_statement::execute_loop(sql_prepare.cc:2806),mysqld_stmt_execute(sql_prepare.cc:1924),dispatch_command(sql_parse.cc:1830),do_command(sql_parse.cc:1304),threadpool_process_request(threadpool_common.cc:258),handle_event(threadpool_unix.cc:1737),worker_main(threadpool_unix.cc:1737),pfs_spawn_thread(pfs.cc:2854),start_thread(libpthread.so.0),clone(libc.so.6)

How to repeat:
1. turn off binlog so that we can run very heavy workload with sysbench.
2. Trigger an undo truncation (for example, create an undo tablespace and mark it inactive)
3, big buffer pool,(we have 256GB buffer pool 

Suggested fix:
Don't hold undo::spaces->x_lock() while running trx_undo_truncate_tablespace() invoked by trx_purge_truncate_marked_undo_low
[6 Jan 2021 8:30] MySQL Verification Team
Hello Zhai,

Thank you for the report and feedback.
May I request you to please provide sysbench test details(unaltered sysbench command with all the options used), server configuration file, server version details and statements which triggered stall etc to reproduce the issue at our end? Thank you.

regards,
Umesh
[6 Jan 2021 9:02] zhai weixiang
sysbench command:
200 tables, each has 10000000 records
sysbench oltp_update_non_index.lua --threads=200 --time=3000 --mysql-user=$user --mysql-password=$password --mysql-host=$host --mysql-port=$port --mysql-db=sysbench --report-interval=1 --tables=200 --table-size=10000000  --events=0 --forced-shutdown=off  run

undo truncation is turn on.

buffer pool configuration:
innodb_buffer_pool_chunk_size=32M
innodb_buffer_pool_size = 500000M
innodb_buffer_pool_instances = 32

Actually after analyzing the code, I think it should be safe to invoke trx_undo_truncate_tablespace without holding undo::spaces->x_lock(). because the undo tablespace to be truncated is marked as inactive and no user thread will use rsegs of the space (ref code in function get_next_redo_rseg_from_undo_spaces)
[7 Jan 2021 2:37] zhai weixiang
btw. I'm using 8.0.18 version. and the version after (WL#11819: InnoDB: Support ACID Undo DDL) should suffer less influence. But may still be affected as the truncation process is holding x lock
[7 Jan 2021 9:46] MySQL Verification Team
Thank you for the feedback.
I tried on 8.0.22 release build with the moderate buffer pool(96G) considering test box capacity:

rm -rf 102143/
bin/mysqld --initialize-insecure --basedir=$PWD --datadir=$PWD/102143 --log-error-verbosity=3
bin/mysqld --no-defaults --basedir=$PWD --datadir=$PWD/102143 --core-file --socket=/tmp/mysql_ushastry.sock --port=3333 --log-error=$PWD/102143/log.err --mysqlx-port=33330 --mysqlx-socket=/tmp/mysql_x_ushastry.sock --log-error-verbosity=3  --secure-file-priv=/tmp/ --innodb_buffer_pool_chunk_size=32M --innodb_buffer_pool_size=96G --innodb_buffer_pool_instances=32 --skip-log-bin 2>&1 &

- prepare
 bin/sysbench --threads=200 --mysql-user=root --mysql-socket=/tmp/mysql_ushastry.sock --mysql-db=sysbench --report-interval=1 --tables=200 --table-size=10000000 --events=0 --forced-shutdown=off ./share/sysbench/oltp_update_non_index.lua prepare

+
 du -h 102143/sysbench/
454G    102143/sysbench/

- run (increased max connections 1000)

 bin/sysbench --threads=500 --time=3000 --mysql-user=root --mysql-socket=/tmp/mysql_ushastry.sock --mysql-db=sysbench --report-interval=1 --tables=200 --table-size=10000000 --events=0 --forced-shutdown=off ./share/sysbench/oltp_update_non_index.lua run

In between:

CREATE UNDO TABLESPACE tablespace_name ADD DATAFILE 'file_name.ibu';
ALTER UNDO TABLESPACE tablespace_name SET INACTIVE;

Observed decrease in tps:

 777s ] thds: 500 tps: 544.82 qps: 544.82 (r/w/o: 0.00/544.82/0.00) lat (ms,95%): 1938.16 err/s: 0.00 reconn/s: 0.00
[ 778s ] thds: 500 tps: 590.21 qps: 590.21 (r/w/o: 0.00/590.21/0.00) lat (ms,95%): 1109.09 err/s: 0.00 reconn/s: 0.00
[ 779s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 780s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 781s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 782s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 783s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 784s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 785s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 786s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 787s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 788s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 789s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 790s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 791s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 792s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 793s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 794s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 795s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 796s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 797s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 798s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 799s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 800s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 801s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 802s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 803s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 804s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 805s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 806s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 807s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 808s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 809s ] thds: 500 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 810s ] thds: 500 tps: 1199.01 qps: 1199.01 (r/w/o: 0.00/1199.01/0.00) lat (ms,95%): 32161.14 err/s: 0.00 reconn/s: 0.00
[ 811s ] thds: 500 tps: 942.74 qps: 942.74 (r/w/o: 0.00/942.74/0.00) lat (ms,95%): 1213.57 err/s: 0.00 reconn/s: 0.00
[ 812s ] thds: 500 tps: 658.16 qps: 658.16 (r/w/o: 0.00/658.16/0.00) lat (ms,95%): 1069.86 err/s: 0.00 reconn/s: 0.00
[ 813s ] thds: 500 tps: 920.00 qps: 920.00 (r/w/o: 0.00/920.00/0.00) lat (ms,95%): 1149.76 err/s: 0.00 reconn/s: 0.00
[ 814s ] thds: 500 tps: 1102.81 qps: 1102.81 (r/w/o: 0.00/1102.81/0.00) lat (ms,95%): 1032.01 err/s: 0.00 reconn/s: 0.00
[ 815s ] thds: 500 tps: 1448.62 qps: 1448.62 (r/w/o: 0.00/1448.62/0.00) lat (ms,95%): 960.30 err/s: 0.00 reconn/s: 0.00
[ 816s ] thds: 500 tps: 1086.39 qps: 1086.39 (r/w/o: 0.00/1086.39/0.00) lat (ms,95%): 861.95 err/s: 0.00 reconn/s: 0.00

Looking at "Performance Impact of Truncating Undo Tablespace Files" in https://dev.mysql.com/doc/refman/8.0/en/innodb-undo-tablespaces.html it seems to be expected and also didn't observe the stacktraces which you pointed in the report. Most of the issues seems to be handled as part of WL#11819(in 8.0.21) but since you have pointed rightly about truncation process holding x lock and hence verifying so that further it is improved.
[14 May 2021 12:40] Daniel Price
Posted by developer:
 
Fixed as of the upcoming 8.0.26 release, and here's the proposed changelog entry from the documentation team:

A workload stalled while executing a undo tablespace truncation operation
on an instance with a large buffer pool. The function that truncates
marked undo tablespaces now takes a shared latch instead of an exclusive
latch, and the shared latch is taken for a shorter period of time.