Description:
If the relay log index file is corrupted, or if an error occurs while opening a new relay log file, an MTS slave may deadlock when the coordinator (i.e., the SQL thread) is in the middle of a group (the MTS_IN_GROUP value of the mts_group_status enum in the code). The coordinator thread errors out and tries to stop the worker threads cleanly, so it waits for the workers to finish their currently scheduled transactions. However, the worker that received a partial transaction is waiting on the coordinator for the terminal event, hence the deadlock. Subsequently, STOP SLAVE is blocked, which in turn blocks SHOW SLAVE STATUS.
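The wait cycle described above can be illustrated with a small toy model. This is only a sketch and not the server's code: ToyRli, worker_wait, simulate_relay_log_error, and the 200 ms timeout are hypothetical names and values introduced for illustration, and the way the worker observes the group status here is an assumption. The timeout exists only so that the "deadlocked" case returns instead of hanging.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Simplified stand-in for Relay_log_info::mts_group_status; only the two
// values named in this report are modelled.
enum class GroupStatus { MTS_IN_GROUP, MTS_KILLED_GROUP };

// Hypothetical, heavily reduced replacement for Relay_log_info.
struct ToyRli {
  std::mutex lock;
  std::condition_variable cond;
  GroupStatus status = GroupStatus::MTS_IN_GROUP;  // coordinator is mid-group
  bool terminal_event_queued = false;  // never set: the relay log read failed
};

// A worker holding a partial transaction waits for the terminal event or for
// the group to be marked killed (assumed behaviour for this sketch).
// Returns true if the worker stopped, false if it was still waiting when the
// timeout expired (which stands in for "blocked forever").
bool worker_wait(ToyRli& rli, std::chrono::milliseconds timeout) {
  std::unique_lock<std::mutex> guard(rli.lock);
  return rli.cond.wait_for(guard, timeout, [&] {
    return rli.terminal_event_queued ||
           rli.status == GroupStatus::MTS_KILLED_GROUP;
  });
}

// The coordinator hits a relay log read error while in the middle of a group.
// With apply_fix it marks the group as killed before waiting for the worker.
bool simulate_relay_log_error(bool apply_fix) {
  ToyRli rli;
  bool worker_stopped = false;
  std::thread worker([&] {
    worker_stopped = worker_wait(rli, std::chrono::milliseconds(200));
  });
  {
    std::lock_guard<std::mutex> guard(rli.lock);
    if (apply_fix) rli.status = GroupStatus::MTS_KILLED_GROUP;
  }
  rli.cond.notify_all();
  worker.join();  // the coordinator waits for the worker to finish
  return worker_stopped;
}
```

In this model, simulate_relay_log_error(false) leaves the worker waiting until the timeout, while simulate_relay_log_error(true) lets it stop, because the group status no longer claims that a terminal event is still coming.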
How to repeat:
1. Stop the slave while the IO thread is in the middle of a transaction (group).
2. Rotate the relay log and introduce corruption in the relay log index file. Alternatively, removing the newly created relay log file should also trigger the bug.
3. Start the slave.
4. Check for the deadlock.
SHOW PROCESSLIST should show "Waiting for Slave Worker to release partition" for the SQL thread and "Waiting for an event from Coordinator" for the worker threads.
Suggested fix:
mts_group_status should be set properly if the SQL thread hits any error while executing relay log events:
diff --git a/sql/rpl_slave.cc b/sql/rpl_slave.cc
index d002873..8ad3810 100644
--- a/sql/rpl_slave.cc
+++ b/sql/rpl_slave.cc
@@ -4534,6 +4534,7 @@ static int exec_relay_log_event(THD* thd, Relay_log_info* rli)
delete ev;
DBUG_RETURN(exec_res);
}
+ rli->mts_group_status= Relay_log_info::MTS_KILLED_GROUP;
mysql_mutex_unlock(&rli->data_lock);
rli->report(ERROR_LEVEL, ER_SLAVE_RELAY_LOG_READ_FAILURE,
ER(ER_SLAVE_RELAY_LOG_READ_FAILURE), "\