Bug #72824 Can't restart replication after restart with slave_parallel_workers and error
Submitted: 1 Jun 2014 3:13 Modified: 19 Sep 2016 8:22
Reporter: Jesper Wisborg Krogh
Status: Duplicate
Category: MySQL Server: Replication    Severity: S2 (Serious)
Version: 5.6.19    OS: Linux
Assigned to:    CPU Architecture: Any

[1 Jun 2014 3:13] Jesper Wisborg Krogh
Description:
With slave_parallel_workers > 0, if the slave is restarted while replication is stopped due to an error, it is not possible to get replication running again.

Confirmed in 5.6.19 and 5.7.4.

How to repeat:
Replication has been set up with:

Master:
=======

[mysqld]
binlog_format                = ROW
binlog_rows_query_log_events = ON

Slave:
======

[mysqld]
binlog_format                = ROW
binlog_rows_query_log_events = ON
master_info_repository       = TABLE
relay_log_info_repository    = TABLE
relay_log_recovery           = ON
slave_parallel_workers       = 8

and the world sample database has been loaded.
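
For reference (not part of the original report), the world sample database is available from https://dev.mysql.com/doc/index-other.html and can be loaded on the master with something like the following, where the file path is an assumption:

master> SOURCE /path/to/world.sql;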

On the slave:
=============

INSERT INTO world.City VALUES (4080, 'Vejle', 'DNK', 'Jylland', 65000);

On the master:
==============

CREATE DATABASE db1;
CREATE DATABASE db2;
CREATE DATABASE db3;
CREATE DATABASE db4;
CREATE DATABASE db5;
CREATE DATABASE db6;
CREATE TABLE db1.t1 (id int unsigned NOT NULL auto_increment PRIMARY KEY, val int NOT NULL DEFAULT 0);
CREATE TABLE db2.t1 (id int unsigned NOT NULL auto_increment PRIMARY KEY, val int NOT NULL DEFAULT 0);
CREATE TABLE db3.t1 (id int unsigned NOT NULL auto_increment PRIMARY KEY, val int NOT NULL DEFAULT 0);
CREATE TABLE db4.t1 (id int unsigned NOT NULL auto_increment PRIMARY KEY, val int NOT NULL DEFAULT 0);
CREATE TABLE db5.t1 (id int unsigned NOT NULL auto_increment PRIMARY KEY, val int NOT NULL DEFAULT 0);
CREATE TABLE db6.t1 (id int unsigned NOT NULL auto_increment PRIMARY KEY, val int NOT NULL DEFAULT 0);

Start 6 concurrent threads, each inserting into its own database.
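
One possible way to generate this load (an illustrative sketch, not part of the original report) is to open six mysql client sessions against the master and repeatedly insert in each, e.g. in session 1:

master> INSERT INTO db1.t1 (val) VALUES (1);

with db2.t1 through db6.t1 in sessions 2 to 6 correspondingly, until the duplicate-key INSERT below has replicated.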

While the INSERTs are executing:

INSERT INTO world.City VALUES (4080, 'Vejle', 'DNK', 'Jylland', 65000);

On the slave:
=============

Replication has now stopped with:

slave> SELECT * FROM performance_schema.replication_execute_status_by_coordinator\G
*************************** 1. row ***************************
           THREAD_ID: NULL
       SERVICE_STATE: OFF
   LAST_ERROR_NUMBER: 1062
  LAST_ERROR_MESSAGE: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '' at master log , end_log_pos 587543. See error log and/or performance_schema.replication_execute_status_by_worker table for more details about this failure or others, if any.
LAST_ERROR_TIMESTAMP: 2014-06-01 13:10:07
1 row in set (0.01 sec)

slave> SELECT * FROM performance_schema.replication_execute_status_by_worker WHERE LAST_ERROR_NUMBER <> 0\G
*************************** 1. row ***************************
            WORKER_ID: 1
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: ANONYMOUS
    LAST_ERROR_NUMBER: 1062
   LAST_ERROR_MESSAGE: Worker 1 failed executing transaction '' at master log , end_log_pos 587543; Could not execute Write_rows event on table world.City; Duplicate entry '4080' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 587543
 LAST_ERROR_TIMESTAMP: 2014-06-01 13:10:07
1 row in set (0.01 sec)

Restart the slave. Replication does not start by itself:

slave> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
               Slave_IO_State:                                
...
             Slave_IO_Running: No                             
            Slave_SQL_Running: No
...
                   Last_Errno: 0                              
                   Last_Error:                                
...
                Last_IO_Errno: 0                              
                Last_IO_Error:                                
               Last_SQL_Errno: 0                              
               Last_SQL_Error:
...

slave> START SLAVE;
ERROR 1872 (HY000): Slave failed to initialize relay log info structure from the repository

Suggested fix:
N/A
[1 Jun 2014 3:13] Jesper Wisborg Krogh
Workaround:
===========

Use RESET SLAVE ALL followed by a restart. It is then possible to use CHANGE MASTER TO to re-initialise replication.
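
A sketch of the workaround; the host, user, and replication coordinates below are illustrative assumptions (use the values recorded from SHOW SLAVE STATUS before the reset):

slave> STOP SLAVE;
slave> RESET SLAVE ALL;

(restart mysqld on the slave, then re-initialise replication:)

slave> CHANGE MASTER TO
    ->     MASTER_HOST='master.example.com',
    ->     MASTER_USER='repl',
    ->     MASTER_PASSWORD='...',
    ->     MASTER_LOG_FILE='mysql-bin.000001',
    ->     MASTER_LOG_POS=4;
slave> START SLAVE;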
[1 Jun 2014 22:11] Jesper Wisborg Krogh
Note that the workaround only works if all the other worker threads happen to be at the same point in the binary log as the thread that encountered the error; otherwise resetting replication will cause the slave to become out of sync.
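
One way to check this (a sketch; mysql.slave_worker_info is populated because relay_log_info_repository=TABLE, and the column names, as of 5.6, are an assumption to verify):

slave> SELECT Id, Master_log_name, Master_log_pos FROM mysql.slave_worker_info;

If all rows report the same coordinates, the RESET SLAVE ALL workaround is safe.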
[9 Jul 2014 8:03] qinglin zhang
RESET SLAVE ALL won't work; in my instance, I use the following method (sketched below):
1) keep a record of the info in the relay log info (SHOW SLAVE STATUS);
2) stop slave;
3) reset slave;
4) start slave;
5) stop slave;
6) change master to the position recorded in step 1);
7) start slave;
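
A sketch of those steps; the MASTER_LOG_FILE and MASTER_LOG_POS values are illustrative and must be replaced with the Relay_Master_Log_File and Exec_Master_Log_Pos noted in step 1:

slave> SHOW SLAVE STATUS\G
slave> STOP SLAVE;
slave> RESET SLAVE;
slave> START SLAVE;
slave> STOP SLAVE;
slave> CHANGE MASTER TO
    ->     MASTER_LOG_FILE='mysql-bin.000001',
    ->     MASTER_LOG_POS=4;
slave> START SLAVE;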
[20 Nov 2014 7:11] WANG GUANGYOU
Hi, 
   Can you skip the error temporarily?
   SET GLOBAL sql_slave_skip_counter = 1;
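
For context (a sketch, not part of the comment): sql_slave_skip_counter can only be set while the slave threads are stopped, so the usual sequence is:

slave> STOP SLAVE;
slave> SET GLOBAL sql_slave_skip_counter = 1;
slave> START SLAVE;

In the scenario reported here, however, START SLAVE itself fails with error 1872 before any event can be skipped.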
[22 Jan 2016 3:04] Jesper Wisborg Krogh
Posted by developer:
 
See also Bug 21507981/Bug 77496
[19 Sep 2016 8:22] Erlend Dahl
Duplicate of 

Bug#77496 Replication position lost after crash on MTS configured slave