Description:
First: In a slave server of master-slave replication system, a worker thread execute a trx has the follow error: "...Lock wait timeout exceeded; try restarting transaction, Error_code: 1205". Then the trx retry 10 times and cause the sql thread and worker thread exit.
Second: We execute show slave status\G and find the execute relay log pos is not consistent with the Executed_Gtid_Set.
Third: If we execute start slave to continue, we can see gaps in Executed_Gtid_Set.
At the end , we think the slave has lost some trxs.
slave configs:
log-slave-updates=1
slave-skip-errors=1007,1008,1050,1060,1061,1062,1068
slave_parallel_workers = 32
slave-parallel-type = LOGICAL_CLOCK
slave_preserve_commit_order = 1
relay_log_recovery = 0
innodb_flush_log_at_trx_commit = 1
How to repeat:
First: Make a master-slave server. Do the change master command:
(slave_session_1)
reset master;
stop slave;
reset salve;
reset slave all;
change master to Master_Host='127.0.0.1',Master_User='rpl',Master_Port=5518, Master_Password='123456',MASTER_AUTO_POSITION=1;
start slave;
(master_session)
drop database if exists abczyy_test;
create database abczyy_test;
create table abczyy_test.tb1(a int key, b int);
Second: Do the following cmds.
(master_session)
reset master;
(slave_session_1)
reset master;
stop slave;
reset slave;
start slave;
(master_session)
truncate table abczyy_test.tb1;
(slave_session_1)
stop slave sql_thread;
(master_session)
insert into abczyy_test.tb1(a,b) values(1,1);
insert into abczyy_test.tb1(a,b) values(2,2);
flush logs;
insert into abczyy_test.tb1(a,b) values(3,3);
insert into abczyy_test.tb1(a,b) values(4,4);
flush logs;
(slave_session_2)
start transaction;
insert into abczyy_test.tb1(a,b) values(2,2);
(slave_session_1)
start slave;
show slave status\G
Third: Wait until the sql thread and worker threads exit.
[error log in mysqld.log]
...
2019-04-19T15:50:38.460226+08:00 583 [Warning] Slave SQL for channel '': Worker 1 failed executing transaction 'a902ca7e-2e4a-11e8-a450-00e0ec27ea92:3' at master log mysql-bin.000001, end_log_pos 820; Could not execute Write_rows event on table abczyy_test.tb1; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log mysql-bin.000001, end_log_pos 820, Error_code: 1205
2019-04-19T15:50:38.460359+08:00 583 [ERROR] Slave SQL for channel '': worker thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable. Error_code: 1205
2019-04-19T15:50:38.460397+08:00 583 [ERROR] Slave SQL for channel '': Worker 1 failed executing transaction 'a902ca7e-2e4a-11e8-a450-00e0ec27ea92:3' at master log mysql-bin.000001, end_log_pos 820; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205
2019-04-19T15:50:38.460484+08:00 582 [Warning] Slave SQL for channel '': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756
2019-04-19T16:30:20.408096+08:00 582 [Note] Slave SQL thread for channel '' exiting, replication stopped in log 'mysql-bin.000002' at position 4
[result of show slave status]
Relay_Log_File: relay-bin.000004
Relay_Log_Pos: 241
Relay_Master_Log_File: mysql-bin.000002
Exec_Master_Log_Pos: 4
Retrieved_Gtid_Set: a902ca7e-2e4a-11e8-a450-00e0ec27ea92:1-5
Executed_Gtid_Set: a902ca7e-2e4a-11e8-a450-00e0ec27ea92:1-2
Fourth: start slave and see the gaps.
(slave_session_2)
exit;
(slave_session_1)
start slave;
show slave status\G
[result of show slave status]
Retrieved_Gtid_Set: a902ca7e-2e4a-11e8-a450-00e0ec27ea92:1-5
Executed_Gtid_Set: a902ca7e-2e4a-11e8-a450-00e0ec27ea92:1-2:4-5
Suggested fix:
This bug is caused by the slave apply the next master rotate_event in sql_thread, which the rli->sql_thread_kill_accepted is set to TURE after the lock wait timeout error.
[code fix:]
In Log_event::apply_event function
......
if (rli->current_mts_submode->wait_for_workers_to_finish(rli) == -1)
{
// handle synchronization error
rli->report(WARNING_LEVEL, 0,
"Slave worker thread has failed to apply an event. As a "
"consequence, the coordinator thread is stopping "
"execution.");
DBUG_RETURN(-1);
}
+
+ if(rli->sql_thread_kill_accepted == TRUE &&
+ get_type_code() == binary_log::ROTATE_EVENT &&
+ rli->current_mts_submode->get_type() == MTS_PARALLEL_TYPE_LOGICAL_CLOCK)
+ {
+ rli->report(WARNING_LEVEL, 0,
+ "Slave worker thread has failed. Do not need to "
+ "execute the master rotate event.");
+ DBUG_RETURN(-1);
+ }
+
/*
Given not in-group mark the event handler can invoke checkpoint
update routine in the following course.
*/
DBUG_ASSERT(rli->mts_group_status == Relay_log_info::MTS_NOT_IN_GROUP
|| !is_mts_db_partitioned(rli));
At the end, If the code fix is not better please give me the suggested fix.
ThankYou!
Description: First: In a slave server of master-slave replication system, a worker thread execute a trx has the follow error: "...Lock wait timeout exceeded; try restarting transaction, Error_code: 1205". Then the trx retry 10 times and cause the sql thread and worker thread exit. Second: We execute show slave status\G and find the execute relay log pos is not consistent with the Executed_Gtid_Set. Third: If we execute start slave to continue, we can see gaps in Executed_Gtid_Set. At the end , we think the slave has lost some trxs. slave configs: log-slave-updates=1 slave-skip-errors=1007,1008,1050,1060,1061,1062,1068 slave_parallel_workers = 32 slave-parallel-type = LOGICAL_CLOCK slave_preserve_commit_order = 1 relay_log_recovery = 0 innodb_flush_log_at_trx_commit = 1 How to repeat: First: Make a master-slave server. Do the change master command: (slave_session_1) reset master; stop slave; reset salve; reset slave all; change master to Master_Host='127.0.0.1',Master_User='rpl',Master_Port=5518, Master_Password='123456',MASTER_AUTO_POSITION=1; start slave; (master_session) drop database if exists abczyy_test; create database abczyy_test; create table abczyy_test.tb1(a int key, b int); Second: Do the following cmds. (master_session) reset master; (slave_session_1) reset master; stop slave; reset slave; start slave; (master_session) truncate table abczyy_test.tb1; (slave_session_1) stop slave sql_thread; (master_session) insert into abczyy_test.tb1(a,b) values(1,1); insert into abczyy_test.tb1(a,b) values(2,2); flush logs; insert into abczyy_test.tb1(a,b) values(3,3); insert into abczyy_test.tb1(a,b) values(4,4); flush logs; (slave_session_2) start transaction; insert into abczyy_test.tb1(a,b) values(2,2); (slave_session_1) start slave; show slave status\G Third: Wait until the sql thread and worker threads exit. [error log in mysqld.log] ... 2019-04-19T15:50:38.460226+08:00 583 [Warning] Slave SQL for channel '': Worker 1 failed executing transaction 'a902ca7e-2e4a-11e8-a450-00e0ec27ea92:3' at master log mysql-bin.000001, end_log_pos 820; Could not execute Write_rows event on table abczyy_test.tb1; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log mysql-bin.000001, end_log_pos 820, Error_code: 1205 2019-04-19T15:50:38.460359+08:00 583 [ERROR] Slave SQL for channel '': worker thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable. Error_code: 1205 2019-04-19T15:50:38.460397+08:00 583 [ERROR] Slave SQL for channel '': Worker 1 failed executing transaction 'a902ca7e-2e4a-11e8-a450-00e0ec27ea92:3' at master log mysql-bin.000001, end_log_pos 820; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205 2019-04-19T15:50:38.460484+08:00 582 [Warning] Slave SQL for channel '': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756 2019-04-19T16:30:20.408096+08:00 582 [Note] Slave SQL thread for channel '' exiting, replication stopped in log 'mysql-bin.000002' at position 4 [result of show slave status] Relay_Log_File: relay-bin.000004 Relay_Log_Pos: 241 Relay_Master_Log_File: mysql-bin.000002 Exec_Master_Log_Pos: 4 Retrieved_Gtid_Set: a902ca7e-2e4a-11e8-a450-00e0ec27ea92:1-5 Executed_Gtid_Set: a902ca7e-2e4a-11e8-a450-00e0ec27ea92:1-2 Fourth: start slave and see the gaps. (slave_session_2) exit; (slave_session_1) start slave; show slave status\G [result of show slave status] Retrieved_Gtid_Set: a902ca7e-2e4a-11e8-a450-00e0ec27ea92:1-5 Executed_Gtid_Set: a902ca7e-2e4a-11e8-a450-00e0ec27ea92:1-2:4-5 Suggested fix: This bug is caused by the slave apply the next master rotate_event in sql_thread, which the rli->sql_thread_kill_accepted is set to TURE after the lock wait timeout error. [code fix:] In Log_event::apply_event function ...... if (rli->current_mts_submode->wait_for_workers_to_finish(rli) == -1) { // handle synchronization error rli->report(WARNING_LEVEL, 0, "Slave worker thread has failed to apply an event. As a " "consequence, the coordinator thread is stopping " "execution."); DBUG_RETURN(-1); } + + if(rli->sql_thread_kill_accepted == TRUE && + get_type_code() == binary_log::ROTATE_EVENT && + rli->current_mts_submode->get_type() == MTS_PARALLEL_TYPE_LOGICAL_CLOCK) + { + rli->report(WARNING_LEVEL, 0, + "Slave worker thread has failed. Do not need to " + "execute the master rotate event."); + DBUG_RETURN(-1); + } + /* Given not in-group mark the event handler can invoke checkpoint update routine in the following course. */ DBUG_ASSERT(rli->mts_group_status == Relay_log_info::MTS_NOT_IN_GROUP || !is_mts_db_partitioned(rli)); At the end, If the code fix is not better please give me the suggested fix. ThankYou!