Bug #109031 | 'start slave' reports error after stopping slave during IO thd reconnecting | | |
---|---|---|---|
Submitted: | 9 Nov 2022 2:47 | Modified: | 29 Nov 2022 1:02 |
Reporter: | Weijie Kong | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S3 (Non-critical) |
Version: | 5.7.38 | OS: | Any |
Assigned to: | | CPU Architecture: | Any |
[9 Nov 2022 2:47]
Weijie Kong
[9 Nov 2022 9:34]
MySQL Verification Team
Hi, which part of the slave needs to lag for your test: the part that fetches data from the master, or the part that executes it? To delay the slave receiving data, the easiest test would be to cut the network connection for a while, but I'm not sure that is what you are testing here. Running the slave on a slower server will delay execution (increasing the Seconds_Behind_Master value). Which of these matters more for your test? Just running the test as you described did not reproduce the issue. Thanks
[9 Nov 2022 11:35]
Weijie Kong
The part that fetches data from the master is the important one. Specifically, we make the IO thread lag so that it does not fetch data from the binlog newly created after the master restarts. The slave server does not need to be slow to reproduce this issue; because we wait for the slave SQL thread to catch up with the IO thread in step 4, a faster slave server is preferred.

The keys to reproducing this issue are step 3 and step 6. In step 3, the IO thread is stopped (the master is killed or shut down) as near as possible to the end of the master's binlog file, but not at the end. The nearer it is to the end of the master's binlog file, the more time the restarted master spends skipping events while handling the slave IO thread's reconnection, and the more time we have to do step 6. That is why we use an HDD for the master: it lengthens the event-skipping time. Cutting the network alone is not enough, because it only makes the IO thread fall behind (seconds behind master); it does not give the master more events to skip.

In short, we must guarantee that the slave's Read_Master_Log_Pos is almost at the end of the master's binlog when the master is shut down or killed, and then stop the slave while the restarted master is skipping the events the slave has already executed.
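The precondition described above can be checked by hand. The following is only a sketch, assuming a MySQL 5.7 master and slave and the standard status statements; how close counts as "almost at the end" is a judgment call that this report does not quantify.

```sql
-- On the master, before it is shut down or killed:
-- File and Position give the current end of the active binlog.
SHOW MASTER STATUS;

-- On the slave (\G is the mysql client's vertical-output terminator):
-- Master_Log_File and Read_Master_Log_Pos show how far the IO thread has read;
-- Exec_Master_Log_Pos shows how far the SQL thread has applied.
SHOW SLAVE STATUS\G

-- On the master, after it has been restarted:
-- File_size of the old binlog can be compared with the slave's Read_Master_Log_Pos.
SHOW BINARY LOGS;
```

If Read_Master_Log_Pos on the slave is only slightly smaller than the master's position in the same file, the slave is in the state the reporter describes for step 3.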
[10 Nov 2022 7:20]
MySQL Verification Team
Hi, I'm still having trouble reproducing the issue. I'll try a few more things. Thanks for the update.
[30 Nov 2022 7:30]
huahua xu
Hi Weijie Kong,

The following steps may reproduce your case:

(1). On slave: set up replication

    mysql> change master to master_host=..., MASTER_AUTO_POSITION=1;
    mysql> set global slave_parallel_workers = 2;
    mysql> SET GLOBAL slave_parallel_type = LOGICAL_CLOCK;
    mysql> set global relay_log_purge = off;
    mysql> start slave;

It is important to disable automatic purging of relay logs. (GTID mode is on, MTS is on.)

(2). On master: prepare the test data

    mysql> create database test;
    mysql> use test;
    mysql> create table t (id int);

Ensure that replication is healthy.

(3). On slave: lock the table that holds the checkpoint of the SQL thread (coordinator and scheduler)

    mysql> lock table mysql.slave_relay_log_info read;

It is also important to make the SQL thread lag behind the worker threads.

(4). On master: produce some test data

    mysql> flush binary logs;
    mysql> insert into test.t values (2);

(5). On slave: show the checkpoints of the SQL thread and the worker threads

    mysql> select Relay_log_name,Relay_log_pos, Master_log_name, Master_log_pos from mysql.slave_relay_log_info;
    +------------------------------------+---------------+------------------+----------------+
    | Relay_log_name                     | Relay_log_pos | Master_log_name  | Master_log_pos |
    +------------------------------------+---------------+------------------+----------------+
    | .\DESKTOP-45GUNCI-relay-bin.000002 |           688 | mysql-bin.000001 |            475 |
    +------------------------------------+---------------+------------------+----------------+
    1 row in set (0.00 sec)

    mysql> select Relay_log_name,Relay_log_pos, Master_log_name, Master_log_pos from mysql.slave_worker_info;
    +------------------------------------+---------------+------------------+----------------+
    | Relay_log_name                     | Relay_log_pos | Master_log_name  | Master_log_pos |
    +------------------------------------+---------------+------------------+----------------+
    | .\DESKTOP-45GUNCI-relay-bin.000004 |           658 | mysql-bin.000002 |            445 |
    |                                    |             0 |                  |              0 |
    +------------------------------------+---------------+------------------+----------------+

(6). On slave: force-kill the slave instance

(7). On slave: start the slave instance

Do not let the replication threads start automatically when the server starts. (You can start the slave with --skip-slave-start.)

(8). On slave: restart replication

    mysql> set global slave_parallel_workers = 2;
    mysql> SET GLOBAL slave_parallel_type = LOGICAL_CLOCK;
    mysql> set global relay_log_purge = off;
    mysql> start slave;
    ERROR 1201 (HY000): Could not initialize master info structure; more error messages can be found in the MySQL error log
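Before running the final `start slave`, it may help to confirm that the slave is configured as the steps above assume. This is a minimal sketch using MySQL 5.7 system variables. The expectation that relay_log_info_repository is TABLE is an assumption inferred from step (3), which locks mysql.slave_relay_log_info; the steps do not state it explicitly.

```sql
-- Run on the slave. Values expected by the steps above:
--   gtid_mode                 = ON            (GTID mode is on)
--   slave_parallel_workers    = 2             (MTS is on)
--   slave_parallel_type       = LOGICAL_CLOCK
--   relay_log_purge           = 0             (automatic purging disabled)
--   relay_log_info_repository = TABLE         (assumption, see lead-in)
SELECT @@global.gtid_mode,
       @@global.slave_parallel_workers,
       @@global.slave_parallel_type,
       @@global.relay_log_purge,
       @@global.relay_log_info_repository;
```

The three `set global` statements are repeated in step (8) because they are issued again after the slave instance is restarted.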
[30 Nov 2022 9:00]
huahua xu
The bug has been fixed in MySQL 8.0 through the commit: https://github.com/mysql/mysql-server/commit/77c7d1e43de3ef25e50d18a1b0a6ae52d5fe65d6