MySQL Bugs: #78686: MTS slave deadlock if relay log index file is corrupted

Bug #78686	MTS slave deadlock if relay log index file is corrupted
Submitted:	2 Oct 2015 18:25	Modified:	27 Nov 2015 12:43
Reporter:	Santosh Praneeth Banda
Status:	Can't repeat
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.6.24	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
If the relay log index file is corrupted, or if there is an error opening a new relay log file, an MTS slave may deadlock when the coordinator (i.e., the SQL thread) is in the middle of a group (mts_group_status::MTS_IN_GROUP enum in the code). The coordinator thread errors out and tries to stop the worker threads cleanly, so it waits for the workers to finish their currently scheduled transactions. However, the worker that received a partial transaction is waiting on the coordinator for the terminal event, hence the deadlock. Subsequently, STOP SLAVE is blocked, which in turn blocks SHOW SLAVE STATUS.
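The circular wait described above can be illustrated with a small standalone C++ sketch. This is a toy model, not server code: ToyCoordinator, worker_loop, stop_completes and the abort_simulation flag are hypothetical names, and the assumption that a worker wakes up as soon as it sees a killed group status exists only in this model. A timed wait stands in for the hang so the example always returns.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Toy model only: all names here are hypothetical, not MySQL server code.
enum class GroupStatus { InGroup, KilledGroup };

struct ToyCoordinator {
  std::mutex lock;
  std::condition_variable cond;
  GroupStatus status = GroupStatus::InGroup;  // the error hits mid-group
  bool terminal_event_queued = false;         // never arrives after the error
  bool worker_done = false;
  bool abort_simulation = false;              // lets the toy worker exit after the test
};

// Worker: holds a partial transaction and waits for the rest of the group.
static void worker_loop(ToyCoordinator& c) {
  std::unique_lock<std::mutex> guard(c.lock);
  c.cond.wait(guard, [&] {
    return c.terminal_event_queued ||
           c.status == GroupStatus::KilledGroup ||
           c.abort_simulation;
  });
  c.worker_done = true;
  c.cond.notify_all();
}

// Coordinator: hits a relay log read error and waits for the worker to finish.
// Returns true if the stop completes, false if it would have hung.
bool stop_completes(bool mark_group_killed_on_error) {
  ToyCoordinator c;
  std::thread worker(worker_loop, std::ref(c));
  bool finished = false;
  {
    std::unique_lock<std::mutex> guard(c.lock);
    if (mark_group_killed_on_error) {
      c.status = GroupStatus::KilledGroup;  // models the one-line suggested fix
      c.cond.notify_all();
    }
    // A real deadlock would wait forever; the timeout makes it observable.
    finished = c.cond.wait_for(guard, std::chrono::milliseconds(500),
                               [&] { return c.worker_done; });
    c.abort_simulation = true;  // release the worker so join() returns
    c.cond.notify_all();
  }
  worker.join();
  return finished;
}
```

In this model, stop_completes(false) times out because neither side can make progress, while stop_completes(true) returns promptly once the group is marked as killed.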

How to repeat:
1. Stop the slave while the IO thread is in the middle of a transaction (group).
2. Rotate the relay log and introduce corruption in the relay log index file. Alternatively, removing the newly created relay log file should also trigger the bug.
3. Start the slave.
4. Check for the deadlock.

SHOW PROCESSLIST should show "Waiting for Slave Worker to release partition" for the SQL thread and "Waiting for an event from Coordinator" for the worker threads.
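As a rough aid for step 4, the check can be expressed as a small C++ helper that scans the State column values taken from SHOW PROCESSLIST. The function name looks_like_mts_deadlock is hypothetical, and how the states are fetched from the server is left out. Note that idle workers also wait for events from the coordinator during normal operation, so a single matching snapshot is only a hint; the pattern is meaningful when it persists and STOP SLAVE hangs.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns true when the given thread states contain both
// halves of the signature quoted in this report. It is a heuristic on one
// snapshot, not proof of a deadlock.
bool looks_like_mts_deadlock(const std::vector<std::string>& states) {
  const std::string coordinator_wait =
      "Waiting for Slave Worker to release partition";
  const std::string worker_wait = "Waiting for an event from Coordinator";

  bool coordinator_stuck = false;
  bool worker_waiting = false;
  for (const std::string& state : states) {
    if (state.find(coordinator_wait) != std::string::npos) {
      coordinator_stuck = true;
    }
    if (state.find(worker_wait) != std::string::npos) {
      worker_waiting = true;
    }
  }
  return coordinator_stuck && worker_waiting;
}
```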

Suggested fix:
mts_group_status should be set properly if any error occurs in the SQL thread while it is executing relay log events.

diff --git a/sql/rpl_slave.cc b/sql/rpl_slave.cc
index d002873..8ad3810 100644
--- a/sql/rpl_slave.cc
+++ b/sql/rpl_slave.cc
@@ -4534,6 +4534,7 @@ static int exec_relay_log_event(THD* thd, Relay_log_info* rli)
       delete ev;
     DBUG_RETURN(exec_res);
   }
+  rli->mts_group_status= Relay_log_info::MTS_KILLED_GROUP;
   mysql_mutex_unlock(&rli->data_lock);
   rli->report(ERROR_LEVEL, ER_SLAVE_RELAY_LOG_READ_FAILURE,
               ER(ER_SLAVE_RELAY_LOG_READ_FAILURE), "\

Thanks for the bug report. I was able to reproduce this on 5.6.24, but not on 5.6.26. The patch for bug#75525 fixes this one as well.