MySQL Bugs: #74906: MSR: slave worker recovery is effectively skipped

Bug #74906	MSR: slave worker recovery is effectively skipped
Submitted:	17 Nov 2014 19:53	Modified:	30 Jan 2015 17:16
Reporter:	Andrei Elkin
Status:	Closed
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.7.6	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
After a channel's applier running in MTS (multi-threaded slave) mode stops
because of an error, the expected slave worker recovery routine has no effect:
no attempt is made to fix the gaps in transaction execution left by the
stopped session.
As a result, transactions that were already applied can be applied again on
restart, which causes all sorts of errors, both idempotent ones (such as
duplicate-key) and others.
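
The effect of skipping gap recovery can be sketched with a toy model. This is a minimal Python illustration, not MySQL's implementation: the function `transactions_to_apply`, its parameters, and the representation of transactions as strings are all hypothetical. It assumes the applier restarts from a low-water mark and that recovery, when it runs, filters out transactions that workers already committed beyond that mark.

```python
def transactions_to_apply(log, low_water_mark, executed, recover_gaps):
    """Return the transactions a restarted applier would execute (toy model).

    log            -- ordered list of transactions received from the master
    low_water_mark -- number of leading transactions known to be committed
    executed       -- transactions committed by workers beyond the mark
    recover_gaps   -- True if the worker recovery routine actually runs
    """
    pending = log[low_water_mark:]
    if recover_gaps:
        # Recovery skips work that some worker already committed.
        return [t for t in pending if t not in executed]
    # Recovery skipped: everything after the mark is applied again.
    return pending


log = ["insert into d2.t set a=2", "insert into d1.t set a=10"]
# The d2 worker failed, while the d1 worker committed its transaction.
executed = {"insert into d1.t set a=10"}

with_recovery = transactions_to_apply(log, 0, executed, recover_gaps=True)
without_recovery = transactions_to_apply(log, 0, executed, recover_gaps=False)
```

In this model, `with_recovery` holds only the d2 insert, whereas `without_recovery` also holds the d1 insert. Re-running that insert is the kind of repeat that produces the duplicate-key error in this report.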

How to repeat:
--connection slave

# the first channel is a dummy, but it is needed
CHANGE MASTER TO MASTER_HOST='localhost', MASTER_USER='root', MASTER_PORT=13010 FOR CHANNEL 'ch_a';
CHANGE MASTER TO MASTER_HOST='localhost', MASTER_USER='root', MASTER_PORT=13000 FOR CHANNEL 'ch_b';
start slave sql_thread;

--connection master

insert into d2.t set a=2; insert into d1.t set a=10;

--connection slave
start slave sql_thread for channel 'ch_b';
# create blocking records on the slave so that the applier stops
begin; insert into d2.t set a=2; insert into d2.t set a=3;
start slave io_thread for channel 'ch_b';

# wait for the applier to stop, remove the offending records, and retry;
# the retry fails anyway:

start slave sql_thread for channel 'ch_b';

[ERROR] Slave SQL for channel 'ch_b': Worker 0 failed executing transaction '' at master log master-bin.000001, end_log_pos 532; Error 'Duplicate entry '10' for key 'PRIMARY'' on query. Default database: 'test'. Query: 'insert into d1.t set a=10', Error_code: 1062

Suggested fix:
Refine the MTS initialization in the MSR branch so that worker recovery is effective again.
A patch template is almost ready and will be uploaded.

Based on email discussion with Andrei, the following text was added to the 5.7.6 changelog:
When using multi-source replication and a multi-threaded slave in a situation that required recovery of a channel, such as after a slave applier thread error, or after a crash, the channel was not being recovered correctly. This meant there was no attempt to fix gaps in transaction execution left by the stopped session, which led to some transactions being applied repeatedly. The fix ensures that in such a situation, the correct channel is passed through to multi-threaded slave recovery.
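
Why the channel matters can be sketched as a per-channel lookup. This is a hypothetical Python model, not the server's code: `worker_state`, `committed_beyond_checkpoint`, and the use of the empty string for the wrong channel are assumptions made for illustration. The report says only that the correct channel must be passed to recovery, not which channel was passed before the fix.

```python
def committed_beyond_checkpoint(worker_state, channel):
    """Return what the workers of `channel` committed past the checkpoint.

    worker_state is a hypothetical dict mapping a channel name to the set
    of transactions its workers committed out of order before stopping.
    """
    # An unknown channel has no recorded worker state, so no gaps are seen.
    return worker_state.get(channel, set())


worker_state = {"ch_b": {"insert into d1.t set a=10"}}

# Assumed pre-fix behaviour: recovery is asked about the wrong channel.
wrong = committed_beyond_checkpoint(worker_state, "")
# Post-fix behaviour: recovery is asked about the channel being started.
right = committed_beyond_checkpoint(worker_state, "ch_b")
```

With the wrong channel the lookup comes back empty, so recovery has nothing to skip and the applier repeats committed work. With `'ch_b'` the committed insert is found and can be excluded on restart.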