MySQL Bugs: #103405: stop slave reporting ER_MTS_INCONSISTENT

Bug #103405	stop slave reporting ER_MTS_INCONSISTENT_DATA(1756) error within 60s
Submitted:	21 Apr 2021 14:01	Modified:	26 Apr 2021 10:08
Reporter:	Xiaodong Huang (OCA)	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	5.7.34,8.0.24	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
In the MTS environment, stop slave was successfully executed. Then after 60s, if stop slave is executed again, the ER_MTS_INCONSISTENT_DATA(1756) error may be reported immediately.

How to repeat:
Testcase opt and test file is as follows：

testcase-slave.opt
===================

--slave_parallel_type=LOGICAL_CLOCK
--slave_parallel_workers=2

testcase.test
===================

--source include/master-slave.inc
--source include/have_binlog_format_statement.inc

--echo # ==============================================================
--echo # Prepare

connection master;

CREATE TABLE t1(
  id int not null,
  a varchar(1000) not null
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE t2(
  id int not null,
  a varchar(1000) not null
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

connect (con1,127.0.0.1,root,,test,$MASTER_MYPORT,);
call mtr.add_suppression("Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT");

--echo # ==============================================================
--echo # phase1. Insert sleep() data
--disable_warnings  
connection master;
--send insert into t1 values (sleep(2), "sleep1");
connection con1;
--send insert into t2 values (sleep(2.2), "sleep2");
connection master;
--reap;
connection con1;
--reap;
--enable_warnings   

--echo # on slave
connection slave;
--sleep 1
stop slave;
start slave;

--echo # ==============================================================
--echo # phase2. Wait for more than 60s before inserting data
--disable_warnings 
connection master;
--sleep 60
--send insert into t1 values (sleep(2), "sleep3");
connection con1;
--send insert into t2 values (sleep(2.2), "sleep4");
connection master;
--reap;
connection con1;
--reap;
--enable_warnings

--echo # on slave
connection slave;
--sleep 1
stop slave;
start slave;

--echo # ==============================================================
--echo # Cleanup

connection master;
drop table t1;
drop table t2;
--source include/rpl_end.inc

When the slave is stopped for the second time, the following error will be reported:

 [ERROR] Slave SQL for channel '': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756

Suggested fix:
Analyse as below, this code is based on 5.7.34.

step1. When executing stop slave, sql_slave_killed function will be executed every time the event is read to check whether the sql thread is in the killed status;

step2. Execute the sql_slave_killed function, where rli->sql_thread_kill_accepted is true and because the coordinator thread is distributing the event of a certain transaction in the MTS environment, is_parallel_warn also is true

  2251   if (abort_loop || thd->killed || rli->abort_slave)
  2252   {
  2253     rli->sql_thread_kill_accepted= true;
  2254     is_parallel_warn= (rli->is_parallel_exec() &&
  2255                        (rli->is_mts_in_group() || thd->killed));

step3. Then in sql_slave_killed, because rli->last_event_start_time is 0, set rli->last_event_start_time to the current time, and then enter the sql_slave_killed function again. if the current time and rli->last_event_start_time are different from SLAVE_WAIT_GROUP_DONE (60s) or more, then rli-> sql_thread_kill_accepted is set to true.

  2301         if (rli->last_event_start_time == 0)
  2302           rli->last_event_start_time= my_time(0);
  2303         rli->sql_thread_kill_accepted= difftime(my_time(0),
  2304                                                rli->last_event_start_time) <=
  2305                                                SLAVE_WAIT_GROUP_DONE ?
  2306                                                FALSE : TRUE;

step4. Then if the coordinator thread is distributing the event of a certain transaction within 60s, it returns TRUE, but rli->last_event_start_time is not reset.

step5. After 60s, execute start slave; stop slave; and the situation is the same as step 2;

step6. Then repeat the check of step 3, you will find that the value of rli->last_event_start_time is greater than 60, so rli->sql_thread_kill_accepted is set to true;

step7. Report ER_MTS_INCONSISTENT_DATA error afterwards
  2333         if (is_parallel_warn)
  2334           rli->report(!rli->is_error() ? ERROR_LEVEL :
  2335                       WARNING_LEVEL,    // an error was reported by Worker
  2336                       ER_MTS_INCONSISTENT_DATA,
  2337                       ER(ER_MTS_INCONSISTENT_DATA),
  2338                       msg_stopped_mts);
  2339         else
  2340           rli->report(ERROR_LEVEL, ER_SLAVE_FATAL_ERROR,
  2341                       ER(ER_SLAVE_FATAL_ERROR), msg_stopped);

The fix method from Tencent TXSQL Team, it is when the stop slave is executed successfully, reset rli->last_event_start_time to 0.

diff --git a/sql/rpl_slave.cc b/sql/rpl_slave.cc
index 0ec73c3907c..87d1022e2db 100644
--- a/sql/rpl_slave.cc
+++ b/sql/rpl_slave.cc
@@ -2342,6 +2342,10 @@ bool sql_slave_killed(THD* thd, Relay_log_info* rli)
       }
     }
   }
+  if (rli->sql_thread_kill_accepted)
+    rli->last_event_start_time= 0;
+
+
   DBUG_RETURN(rli->sql_thread_kill_accepted);
 }

Hi,

Thanks for the report and for the test case

all best
Bogdan

This problem can reference: https://dev.mysql.com/doc/refman/5.7/en/replication-features-transaction-inconsistencies.h...

  "If the last transaction in the relay log is only half-received and the 
   multithreaded replica coordinator has started to schedule the transaction to a 
   worker, then STOP SLAVE waits up to 60 seconds for the transaction to be 
   received. After this timeout, the coordinator gives up and aborts the 
   transaction. If the transaction is mixed, it may be left half-completed."

we is informed that After stopping slave, wait at least 60s before issuing the ER_MTS_INCONSISTENT_DATA(1756). But in here , it didn't wait for 60s at all then reported the error.