Description:
In the MTS environment, stop slave was successfully executed. Then after 60s, if stop slave is executed again, the ER_MTS_INCONSISTENT_DATA(1756) error may be reported immediately.
How to repeat:
Testcase opt and test file is as follows:
testcase-slave.opt
===================
--slave_parallel_type=LOGICAL_CLOCK
--slave_parallel_workers=2
testcase.test
===================
--source include/master-slave.inc
--source include/have_binlog_format_statement.inc
--echo # ==============================================================
--echo # Prepare
connection master;
CREATE TABLE t1(
id int not null,
a varchar(1000) not null
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE t2(
id int not null,
a varchar(1000) not null
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
connect (con1,127.0.0.1,root,,test,$MASTER_MYPORT,);
call mtr.add_suppression("Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT");
--echo # ==============================================================
--echo # phase1. Insert sleep() data
--disable_warnings
connection master;
--send insert into t1 values (sleep(2), "sleep1");
connection con1;
--send insert into t2 values (sleep(2.2), "sleep2");
connection master;
--reap;
connection con1;
--reap;
--enable_warnings
--echo # on slave
connection slave;
--sleep 1
stop slave;
start slave;
--echo # ==============================================================
--echo # phase2. Wait for more than 60s before inserting data
--disable_warnings
connection master;
--sleep 60
--send insert into t1 values (sleep(2), "sleep3");
connection con1;
--send insert into t2 values (sleep(2.2), "sleep4");
connection master;
--reap;
connection con1;
--reap;
--enable_warnings
--echo # on slave
connection slave;
--sleep 1
stop slave;
start slave;
--echo # ==============================================================
--echo # Cleanup
connection master;
drop table t1;
drop table t2;
--source include/rpl_end.inc
When the slave is stopped for the second time, the following error will be reported:
[ERROR] Slave SQL for channel '': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756
Suggested fix:
Analyse as below, this code is based on 5.7.34.
step1. When executing stop slave, sql_slave_killed function will be executed every time the event is read to check whether the sql thread is in the killed status;
step2. Execute the sql_slave_killed function, where rli->sql_thread_kill_accepted is true and because the coordinator thread is distributing the event of a certain transaction in the MTS environment, is_parallel_warn also is true
2251 if (abort_loop || thd->killed || rli->abort_slave)
2252 {
2253 rli->sql_thread_kill_accepted= true;
2254 is_parallel_warn= (rli->is_parallel_exec() &&
2255 (rli->is_mts_in_group() || thd->killed));
step3. Then in sql_slave_killed, because rli->last_event_start_time is 0, set rli->last_event_start_time to the current time, and then enter the sql_slave_killed function again. if the current time and rli->last_event_start_time are different from SLAVE_WAIT_GROUP_DONE (60s) or more, then rli-> sql_thread_kill_accepted is set to true.
2301 if (rli->last_event_start_time == 0)
2302 rli->last_event_start_time= my_time(0);
2303 rli->sql_thread_kill_accepted= difftime(my_time(0),
2304 rli->last_event_start_time) <=
2305 SLAVE_WAIT_GROUP_DONE ?
2306 FALSE : TRUE;
step4. Then if the coordinator thread is distributing the event of a certain transaction within 60s, it returns TRUE, but rli->last_event_start_time is not reset.
step5. After 60s, execute start slave; stop slave; and the situation is the same as step 2;
step6. Then repeat the check of step 3, you will find that the value of rli->last_event_start_time is greater than 60, so rli->sql_thread_kill_accepted is set to true;
step7. Report ER_MTS_INCONSISTENT_DATA error afterwards
2333 if (is_parallel_warn)
2334 rli->report(!rli->is_error() ? ERROR_LEVEL :
2335 WARNING_LEVEL, // an error was reported by Worker
2336 ER_MTS_INCONSISTENT_DATA,
2337 ER(ER_MTS_INCONSISTENT_DATA),
2338 msg_stopped_mts);
2339 else
2340 rli->report(ERROR_LEVEL, ER_SLAVE_FATAL_ERROR,
2341 ER(ER_SLAVE_FATAL_ERROR), msg_stopped);
The fix method from Tencent TXSQL Team, it is when the stop slave is executed successfully, reset rli->last_event_start_time to 0.
diff --git a/sql/rpl_slave.cc b/sql/rpl_slave.cc
index 0ec73c3907c..87d1022e2db 100644
--- a/sql/rpl_slave.cc
+++ b/sql/rpl_slave.cc
@@ -2342,6 +2342,10 @@ bool sql_slave_killed(THD* thd, Relay_log_info* rli)
}
}
}
+ if (rli->sql_thread_kill_accepted)
+ rli->last_event_start_time= 0;
+
+
DBUG_RETURN(rli->sql_thread_kill_accepted);
}