Description:
First: In a slave server of master-slave replication system, a worker thread execute a trx has the follow error: "...Lock wait timeout exceeded; try restarting transaction, Error_code: 1205". Then the trx retry 10 times and cause the sql thread and worker thread exit.
Second: We execute show slave status\G and find the execute relay log pos is not consistent with the Executed_Gtid_Set.
Third: If we execute start slave to continue, we can see gaps in Executed_Gtid_Set.
At the end , we think the slave has lost some trxs.
slave configs:
log-slave-updates=1
slave-skip-errors=1007,1008,1050,1060,1061,1062,1068
slave_parallel_workers = 32
slave-parallel-type = LOGICAL_CLOCK
slave_preserve_commit_order = 1
relay_log_recovery = 0
innodb_flush_log_at_trx_commit = 1
How to repeat:
First: Make a master-slave server. Do the change master command:
(slave_session_1)
reset master;
stop slave;
reset salve;
reset slave all;
change master to Master_Host='127.0.0.1',Master_User='rpl',Master_Port=5518, Master_Password='123456',MASTER_AUTO_POSITION=1;
start slave;
(master_session)
drop database if exists abczyy_test;
create database abczyy_test;
create table abczyy_test.tb1(a int key, b int);
Second: Do the following cmds.
(master_session)
reset master;
(slave_session_1)
reset master;
stop slave;
reset slave;
start slave;
(master_session)
truncate table abczyy_test.tb1;
(slave_session_1)
stop slave sql_thread;
(master_session)
insert into abczyy_test.tb1(a,b) values(1,1);
insert into abczyy_test.tb1(a,b) values(2,2);
flush logs;
insert into abczyy_test.tb1(a,b) values(3,3);
insert into abczyy_test.tb1(a,b) values(4,4);
flush logs;
(slave_session_2)
start transaction;
insert into abczyy_test.tb1(a,b) values(2,2);
(slave_session_1)
start slave;
show slave status\G
Third: Wait until the sql thread and worker threads exit.
[error log in mysqld.log]
...
2019-04-19T15:50:38.460226+08:00 583 [Warning] Slave SQL for channel '': Worker 1 failed executing transaction 'a902ca7e-2e4a-11e8-a450-00e0ec27ea92:3' at master log mysql-bin.000001, end_log_pos 820; Could not execute Write_rows event on table abczyy_test.tb1; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log mysql-bin.000001, end_log_pos 820, Error_code: 1205
2019-04-19T15:50:38.460359+08:00 583 [ERROR] Slave SQL for channel '': worker thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable. Error_code: 1205
2019-04-19T15:50:38.460397+08:00 583 [ERROR] Slave SQL for channel '': Worker 1 failed executing transaction 'a902ca7e-2e4a-11e8-a450-00e0ec27ea92:3' at master log mysql-bin.000001, end_log_pos 820; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205
2019-04-19T15:50:38.460484+08:00 582 [Warning] Slave SQL for channel '': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756
2019-04-19T16:30:20.408096+08:00 582 [Note] Slave SQL thread for channel '' exiting, replication stopped in log 'mysql-bin.000002' at position 4
[result of show slave status]
Relay_Log_File: relay-bin.000004
Relay_Log_Pos: 241
Relay_Master_Log_File: mysql-bin.000002
Exec_Master_Log_Pos: 4
Retrieved_Gtid_Set: a902ca7e-2e4a-11e8-a450-00e0ec27ea92:1-5
Executed_Gtid_Set: a902ca7e-2e4a-11e8-a450-00e0ec27ea92:1-2
Fourth: start slave and see the gaps.
(slave_session_2)
exit;
(slave_session_1)
start slave;
show slave status\G
[result of show slave status]
Retrieved_Gtid_Set: a902ca7e-2e4a-11e8-a450-00e0ec27ea92:1-5
Executed_Gtid_Set: a902ca7e-2e4a-11e8-a450-00e0ec27ea92:1-2:4-5
Suggested fix:
This bug is caused by the slave apply the next master rotate_event in sql_thread, which the rli->sql_thread_kill_accepted is set to TURE after the lock wait timeout error.
[code fix:]
In Log_event::apply_event function
......
if (rli->current_mts_submode->wait_for_workers_to_finish(rli) == -1)
{
// handle synchronization error
rli->report(WARNING_LEVEL, 0,
"Slave worker thread has failed to apply an event. As a "
"consequence, the coordinator thread is stopping "
"execution.");
DBUG_RETURN(-1);
}
+
+ if(rli->sql_thread_kill_accepted == TRUE &&
+ get_type_code() == binary_log::ROTATE_EVENT &&
+ rli->current_mts_submode->get_type() == MTS_PARALLEL_TYPE_LOGICAL_CLOCK)
+ {
+ rli->report(WARNING_LEVEL, 0,
+ "Slave worker thread has failed. Do not need to "
+ "execute the master rotate event.");
+ DBUG_RETURN(-1);
+ }
+
/*
Given not in-group mark the event handler can invoke checkpoint
update routine in the following course.
*/
DBUG_ASSERT(rli->mts_group_status == Relay_log_info::MTS_NOT_IN_GROUP
|| !is_mts_db_partitioned(rli));
At the end, If the code fix is not better please give me the suggested fix.
ThankYou!