Bug #115982 The replication thread accumulates in get_default_db_collation, causing delays
Submitted: 2 Sep 13:41 Modified: 5 Sep 2:36
Reporter: xilin Chen Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version:8.0.39 OS:Any
Assigned to: CPU Architecture:Any

[2 Sep 13:41] xilin Chen
Description:
In the scenario where there is no use db or specified database name, directly insert into db.table, and the begin event in the binlog will not record db information. When the slave parses these events, the dbname obtained is empty, and dd::schema cache misses frequently occur during get_default_db_collation, causing a large number of replication threads in the standby database to accumulate in get_default_db_collation, causing delays

How to repeat:
First, start two MySQL nodes and build a master-slave replication relationship. I only tested the gtid replication scenario here
gtid_mode=on
enforce_gtid_consistency=on
slave_parallel_workers=64

Then create a test table on the master database
create database mydb;
create table mydb.t(id int auto_increment primary key);

Then debug the standby database through gdb, set breakpoints, and execute command c to continue running
b dd::cache::Shared_dictionary_cache::get_uncached<dd::Item_name_key, dd::Schema>

Then relogin the master database again. Please note that you should not execute use mydb here, but directly execute insert into mydb.t values();

At this time, the standby database will trigger a breakpoint, and the dd::Schema cache misses. You can try to execute the above insert multiple times, and you will find that the cache misses will occur each time. The corresponding statement is BEGIN, and db is an empty string. The stack that triggers the breakpoint is as follows
#0  dd::cache::Shared_dictionary_cache::get_uncached<dd::Item_name_key, dd::Schema> (this=0x8a77660 <dd::cache::Shared_dictionary_cache::instance()::s_cache>, thd=0x7fe56c000940, key=..., isolation=ISO_READ_COMMITTED, object=0x7fe61428d470) at /data/github/tmp/mysql-server/sql/dd/impl/cache/shared_dictionary_cache.cc:113
#1  0x000000000497532d in dd::cache::Shared_dictionary_cache::get<dd::Item_name_key, dd::Schema> (this=0x8a77660 <dd::cache::Shared_dictionary_cache::instance()::s_cache>, thd=0x7fe56c000940, key=..., element=0x7fe61428d4e8) at /data/github/tmp/mysql-server/sql/dd/impl/cache/shared_dictionary_cache.cc:99
#2  0x00000000048a880d in dd::cache::Dictionary_client::acquire<dd::Item_name_key, dd::Schema> (this=0x7fe56c004700, key=..., object=0x7fe61428d558, local_committed=0x7fe61428d557, local_uncommitted=0x7fe61428d556) at /data/github/tmp/mysql-server/sql/dd/impl/cache/dictionary_client.cc:913
#3  0x00000000048859e9 in dd::cache::Dictionary_client::acquire<dd::Schema> (this=0x7fe56c004700, object_name="", object=0x7fe61428d648) at /data/github/tmp/mysql-server/sql/dd/impl/cache/dictionary_client.cc:1285
#4  0x00000000033f6096 in get_default_db_collation (thd=0x7fe56c000940, db_name=0x7fe56c00a310 "", collation=0x7fe56c002b38) at /data/github/tmp/mysql-server/sql/sql_db.cc:156
#5  0x0000000004688a91 in Query_log_event::do_apply_event (this=0x7fe5721d3e70, rli=0x7fe57001f440, query_arg=0x7fe5721ca7b4 "BEGIN", q_len_arg=5) at /data/github/tmp/mysql-server/sql/log_event.cc:4499
#6  0x0000000004687f15 in Query_log_event::do_apply_event (this=0x7fe5721d3e70, rli=0x7fe57001f440) at /data/github/tmp/mysql-server/sql/log_event.cc:4406
#7  0x00000000046b0b86 in Log_event::do_apply_event_worker (this=0x7fe5721d3e70, w=0x7fe57001f440) at /data/github/tmp/mysql-server/sql/log_event.cc:1036
#8  0x000000000477ac1e in Slave_worker::slave_worker_exec_event (this=0x7fe57001f440, ev=0x7fe5721d3e70) at /data/github/tmp/mysql-server/sql/rpl_rli_pdb.cc:1739
#9  0x000000000477d784 in slave_worker_exec_job_group (worker=0x7fe57001f440, rli=0xb4de870) at /data/github/tmp/mysql-server/sql/rpl_rli_pdb.cc:2504
#10 0x0000000004798b3d in handle_slave_worker (arg=0x7fe57001f440) at /data/github/tmp/mysql-server/sql/rpl_replica.cc:6093
#11 0x00000000056840a2 in pfs_spawn_thread (arg=0x7fe5700246c0) at /data/github/tmp/mysql-server/storage/perfschema/pfs.cc:3050
#12 0x00007fe627865ea5 in start_thread (arg=0x7fe61428f700) at pthread_create.c:307
#13 0x00007fe625df2b0d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) f 5
#5  0x0000000004688a91 in Query_log_event::do_apply_event (this=0x7fe5721d3e70, rli=0x7fe57001f440, query_arg=0x7fe5721ca7b4 "BEGIN", q_len_arg=5) at /data/github/tmp/mysql-server/sql/log_event.cc:4499
warning: Source file is more recent than executable.
4499        }
(gdb) p db
$1 = 0x7fe5721ca82d ""
(gdb) p query_arg
$2 = 0x7fe5721ca7b4 "BEGIN"

Then execute use mydb on the master database and insert into t values() again. At this time, the standby database will not trigger a breakpoint, and the dd::Schema cache hits

By viewing the binlog in hexadecimal through hexdump -C binlog.00001, you can find that there is no db information recorded in the BEGIN event of the transaction without use mydb, resulting in the db parameter in get_default_db_collation being empty, resulting in frequent dd::Schema cache misses. When the concurrent write volume is large and the standby database replication thread is high, the delay here is getting larger and larger. After trying to change slave_parallel_workers from 64 to 8, the delay changes from an upward trend to a downward trend. Multiple attempts can stably reproduce this phenomenon

Stack in the scenario of slave_parallel_workers=64
     96 epoll_wait(libc.so.6),::??,::??,::??,start_thread(libpthread.so.0),clone(libc.so.6)
     83 pthread_cond_timedwait,::??,::??,start_thread(libpthread.so.0),clone(libc.so.6)
     29 pthread_cond_wait,bool,bool,bool,bool,get_default_db_collation,Query_log_event::do_apply_event,slave_worker_exec_job_group,::??,::??,start_thread(libpthread.so.0),clone(libc.so.6)
     26 pthread_cond_wait,Commit_stage_manager::enroll_for,MYSQL_BIN_LOG::change_stage,MYSQL_BIN_LOG::ordered_commit,MYSQL_BIN_LOG::commit,ha_commit_trans,trans_commit,non-virtual,Xid_apply_log_event::do_apply_event_worker,slave_worker_exec_job_group,::??,::??,start_thread(libpthread.so.0),clone(libc.so.6)
     22 __io_getevents_0_4(libaio.so.1),LinuxAIOHandler::collect,LinuxAIOHandler::poll,os_aio_handler,fil_aio_wait,::??,std::thread::_State_impl::_M_run,::??,start_thread(libpthread.so.0),clone(libc.so.6)
     18 ppoll(libc.so.6),vio_io_wait,vio_socket_io_wait,vio_read,::??,::??,my_net_read,Protocol_classic::get_command,do_command,::??,::??,start_thread(libpthread.so.0),clone(libc.so.6)
     16 pthread_cond_timedwait,CThdBottomHalfAnsThread::run,CLocalMysqlThread::startThread,start_thread(libpthread.so.0),clone(libc.so.6)
     15 pthread_cond_wait,os_event::wait_low,srv_worker_thread,std::thread::_State_impl::_M_run,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      4 pthread_cond_timedwait,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      3 pthread_cond_wait,Per_thread_connection_handler::block_until_new_connection,::??,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      3 pthread_cond_wait,os_event::wait_low,::??,std::thread::_State_impl::_M_run,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      3 pthread_cond_timedwait,background_thread_sleep,background_work_sleep_once,background_work,background_thread_entry,start_thread(libpthread.so.0),clone(libc.so.6)
      2 __lll_lock_wait(libpthread.so.0),_L_cond_lock_1074(libpthread.so.0),__pthread_mutex_cond_lock(libpthread.so.0),pthread_cond_wait,bool,bool,bool,bool,get_default_db_collation,Query_log_event::do_apply_event,slave_worker_exec_job_group,::??,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 timer_helper_thread(librt.so.1),start_thread(libpthread.so.0),clone(libc.so.6)
      1 sigwaitinfo(libc.so.6),signal_hand,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 sched_yield(libc.so.6),Mts_submode_logical_clock::get_least_occupied_worker,Log_event::get_slave_worker,Log_event::apply_event,::??,::??,handle_slave_sql,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 recvfrom(libpthread.so.0),CUdpServer::loopdealreq,non-virtual,CLocalMysqlThread::startThread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,std::condition_variable::wait(libstdc++.so.6),operator(Executor.cpp:85),__invoke_impl<void,(Executor.cpp:85),__invoke<TencentCloud::Executor::Start()::<lambda()>(Executor.cpp:85),_M_invoke<0>(Executor.cpp:85),operator()(Executor.cpp:85),std::thread::_State_impl<std::thread::_Invoker<std::tuple<TencentCloud::Executor::Start()::<lambda()>(Executor.cpp:85),::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,::??,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,os_event::wait_low,buf_resize_thread,std::thread::_State_impl::_M_run,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,os_event::wait_low,buf_dump_thread,std::thread::_State_impl::_M_run,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,rpl_slave_ack_thread::run,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait,os_event::wait_time_low,::??,std::thread::_State_impl::_M_run,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait,os_event::wait_time_low,srv_purge_coordinator_thread,std::thread::_State_impl::_M_run,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait,os_event::wait_time_low,srv_monitor_thread,std::thread::_State_impl::_M_run,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait,os_event::wait_time_low,srv_error_monitor_thread,std::thread::_State_impl::_M_run,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait,os_event::wait_time_low,log_writer,std::thread::_State_impl::_M_run,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait,os_event::wait_time_low,log_flush_notifier,std::thread::_State_impl::_M_run,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait,os_event::wait_time_low,log_flusher,std::thread::_State_impl::_M_run,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait,os_event::wait_time_low,log_checkpointer,std::thread::_State_impl::_M_run,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait,os_event::wait_time_low,lock_wait_timeout_thread,std::thread::_State_impl::_M_run,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait,os_event::wait_time_low,ib_wqueue_timedwait,::??,std::thread::_State_impl::_M_run,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait,os_event::wait_time_low,dict_stats_thread,std::thread::_State_impl::_M_run,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait,os_event::wait_time_low,Clone_persist_gtid::periodic_write,std::thread::_State_impl::_M_run,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,background_thread_sleep,background_work_sleep_once,background_thread0_work,background_work,background_thread_entry,start_thread(libpthread.so.0),clone(libc.so.6)
      1 ppoll(libc.so.6),vio_io_wait,vio_socket_io_wait,vio_read,vio_read_buff,::??,::??,my_net_read,cli_safe_read_with_ok,mysql_binlog_fetch,handle_slave_io,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 poll(libc.so.6),Mysqld_socket_listener::listen_for_connection_event,mysqld_main,__libc_start_main(libc.so.6),_start
      1 Opt_trace_start::~Opt_trace_start,mysql_execute_command,dispatch_sql_command,Query_log_event::do_apply_event,slave_worker_exec_job_group,::??,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 nanosleep(libpthread.so.0),srv_master_thread,std::thread::_State_impl::_M_run,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 nanosleep(libc.so.6),sleep(libc.so.6),::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 nanosleep(libc.so.6),sleep(libc.so.6),CThdBottomHalf::do_timeout_loop,CThdBottomHalf::threadfun_do_timeout,start_thread(libpthread.so.0),clone(libc.so.6)
      1 mysql_lock_tables,lock_tables,open_and_lock_tables,Rows_log_event::do_apply_event,slave_worker_exec_job_group,::??,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 __lll_lock_wait(libpthread.so.0),pthread_cond_wait,bool,bool,bool,bool,get_default_db_collation,Query_log_event::do_apply_event,slave_worker_exec_job_group,::??,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 __lll_lock_wait(libpthread.so.0),_L_lock_1093(libpthread.so.0),pthread_mutex_lock(libpthread.so.0),dd::cache::Shared_multi_map::put(libpthread.so.0),bool,bool,bool,get_default_db_collation,Query_log_event::do_apply_event,slave_worker_exec_job_group,::??,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 __lll_lock_wait(libpthread.so.0),_L_lock_1093(libpthread.so.0),pthread_mutex_lock(libpthread.so.0),Commit_stage_manager::enroll_for,MYSQL_BIN_LOG::change_stage,MYSQL_BIN_LOG::ordered_commit,MYSQL_BIN_LOG::commit,ha_commit_trans,trans_commit,non-virtual,Xid_apply_log_event::do_apply_event_worker,slave_worker_exec_job_group,::??,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 fdatasync(libc.so.6),my_sync,IO_CACHE_ostream::sync,MYSQL_BIN_LOG::sync_binlog_file,MYSQL_BIN_LOG::ordered_commit,MYSQL_BIN_LOG::commit,ha_commit_trans,trans_commit,non-virtual,Xid_apply_log_event::do_apply_event_worker,slave_worker_exec_job_group,::??,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 dict_index_zip_pad_optimal_page_size,btr_cur_optimistic_insert,row_ins_clust_index_entry_low,row_ins_clust_index_entry,row_ins_step,::??,ha_innobase::write_row,non-virtual,Partition_helper::ph_write_row,ha_innopart::write_row,handler::ha_write_row,Write_rows_log_event::write_row,Write_rows_log_event::do_exec_row,Rows_log_event::do_apply_row,Rows_log_event::do_apply_event,slave_worker_exec_job_group,::??,::??,start_thread(libpthread.so.0),clone(libc.so.6)
      1 clock_gettime,clock_gettime(libc.so.6),os_event::get_wait_timelimit,os_event::wait_time_low,log_write_notifier,std::thread::_State_impl::_M_run,::??,start_thread(libpthread.so.0),clone(libc.so.6)
[4 Sep 14:59] MySQL Verification Team
Hi,

Thanks for the report. Do you have any replication related config in your my.cnf?
[5 Sep 2:36] xilin Chen
gtid:
gtid_mode=on
enforce_gtid_consistency=on

MTS:
slave_parallel_type=LOGICAL_CLOCK
slave_parallel_workers=64

other:
log_slave_updates=ON
log_slow_slave_statements = OFF
pseudo_slave_mode=OFF
rpl_stop_slave_timeout = 31536000
slave_allow_batching = OFF
slave_checkpoint_group = 512
slave_checkpoint_period = 300
slave_compressed_protocol = OFF
slave_exec_mode = STRICT
slave_pending_jobs_size_max = 1677721600
slave_preserve_commit_order = OFF
slave_rows_search_algorithms = INDEX_SCAN,HASH_SCAN
[18 Sep 23:15] Marcos Albe
Flamgraph showing 10% of time used by get_default_db_collation when using rewrite_db

Attachment: get_default_db_collation_flamegraph.svg (image/svg+xml, text), 1.67 MiB.