Bug #86641 Repeated multi-threaded slave replication failures
Submitted: 9 Jun 2017 19:38 Modified: 15 Oct 2017 4:00
Reporter: monty solomon Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.7.18, 5.7.19 OS:CentOS
Assigned to: CPU Architecture:Any

[9 Jun 2017 19:38] monty solomon
Description:
I enabled multi-threaded slave replication and we are experiencing repeated replication failures where the slave stops with an error message similar to this one.

                   Last_Errno: 1032
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 2 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:242101388' at master log bin.000406, end_log_pos 906634781. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

I executed stop slave; start slave; and it failed again after a while

                   Last_Errno: 1032
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 3 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:242473841' at master log bin.000407, end_log_pos 436498711. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

Sometimes I am unable to resume replication with the server reporting the error

                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'

How to repeat:
Enable MTS replication.

mysql> show global variables like 'slave_parallel%'\G
*************************** 1. row ***************************
Variable_name: slave_parallel_type
        Value: LOGICAL_CLOCK
*************************** 2. row ***************************
Variable_name: slave_parallel_workers
        Value: 8
2 rows in set (0.00 sec)

mysql> select @@relay_log_info_repository\G
*************************** 1. row ***************************
@@relay_log_info_repository: TABLE
1 row in set (0.00 sec)
[10 Jun 2017 6:26] monty solomon
Last_SQL_Errno: 1032
               Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 6 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:314434033' at master log bin.000524, end_log_pos 168699733. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
[10 Jun 2017 20:17] monty solomon
Last_SQL_Errno: 1032
               Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 7 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:324784349' at master log bin.000541, end_log_pos 604451328. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
[20 Jun 2017 22:28] monty solomon
I had to disable multi-threaded slave replication to prevent the failures.
[28 Jun 2017 18:42] MySQL Verification Team
Hi,

do you have primary key on all tables being replicated?

thanks
Bogdan
[29 Jun 2017 4:53] monty solomon
Yes, each table uses InnoDB and each table has a primary key.
[29 Jun 2017 8:33] MySQL Verification Team
The server error log and performance_schema.replication_applier_status_by_worker table may have more information about why the worker thread is failing. 

Can you get us this info?

If this us the same env as 86643, the slave might be re-applying transactions already applied by the mysqldump restore, and this may lead to unexpected behavior from the SQL thread. 

all best
Bogdan
[29 Jun 2017 21:04] monty solomon
I posted the error entries from the performance_schema.replication_applier_status_by_worker table here previously.

This is the same environment as 86643

I use CHANGE MASTER TO MASTER_AUTO_POSITION = 1 after the restore. How would it re-apply transactions that were in the restore?
[29 Jun 2017 21:29] MySQL Verification Team
Hi,

This issue is, after consulting with the dev team, result of the issue from Bug #86643. So in order to save time I'm setting this one to duplicate and we should focus on the 86643.

best regards
Bogdan
[30 Jun 2017 8:50] monty solomon
I'm not sure that this is a duplicate.

It occurs on multiple MySQL clusters when parallel replication is enabled.
[30 Jun 2017 11:13] MySQL Verification Team
Hi,

I do believe the reason for both bugs is the same, anyhow I managed to reproduce this behavior last night on my test rig so I'm setting this bug as verified. Dev team can later on determine if the same fix for 86643 fixes this one too.

In order to reproduce the bug

1. create 3 databases
2. load 3 databases with data using 3 different stress testing apps (I used sysbench and dtm)
3. dump the db and restore to multithreaded slave
4. start the slave (replication at all)
5. start 3 stress testing apps (maybe one will work alone but I started like this)

let it run for a while, slave will break with 1032

tested with 5.7.18
[30 Jun 2017 18:38] monty solomon
FYI, on most of the clusters where we experienced the replication failure there was no dump and restore involved.

We enabled parallel replication and started seeing the issue within a day or two on clusters that are very active.

We saw the replication failures on the cluster referenced in this bug and in bug #86643 both before and after we used the dump and restore.
[18 Jul 2017 15:13] monty solomon
I'm not sure I understand how this bug is related to Bug #86643.

You wrote that you were able to reproduce this bug on 5.7.18. Did you expect this bug to be fixed in 5.7.19? Now that 5.7.19 is released, can you run your test again?

Thanks.
[18 Jul 2017 15:38] MySQL Verification Team
Hi,

>  on most of the clusters where we experienced the replication 
> failure there was no dump and restore involved.

how was the slave created? you started with blank system?

> I'm not sure I understand how this bug is related to Bug #86643.

I don't know for sure that it is, that is why I set this bug as "verified" and not as "duplicate". So it is a separate bug. 

> You wrote that you were able to reproduce this bug on 5.7.18.

yes

> Did you expect this bug to be fixed in 5.7.19? 

only if it really has the same root cause as 86643 but I'm not sure that is the case

> Now that 5.7.19 is released, can you run your test again?

of course, and I'm sure you will be running them too.

The problem I have is that I can't reproduce this issue "every time". And it is possible that 86643 is masking this issue for me and that I'm reproducing 86643 and not this one. Now the fix for 86643 is to just not dump that one table. When I do that I can't reproduce the issue. When I start with blank system, I can't reproduce the issue, so it is very possible 86643 is masking it.
[19 Jul 2017 4:24] monty solomon
We don't usually use mysqldump to create slaves. I used mysqldump to update some of the slaves described in Bug # 86643 because they were unable to keep up with replication and got very far behind. The database is small enough that it was faster to just take a fresh mysqldump from the master and load it on the lagging slaves.

> how was the slave created? you started with blank system?

We run in AWS and use EBS for the MySQL volumes.

For a new cluster, we create the master, and clone it to create a slave. We take a consistent snapshot of the MySQL volumes, create new volumes from the snapshots, create a new instance, attach the new volumes to the new instance and start mysql.

For an existing cluster, we clone a new slave using snapshots taken from a running slave.
[19 Jul 2017 10:40] MySQL Verification Team
Hi,

> For a new cluster, we create the master, and clone it to create a slave.
> We take a consistent snapshot of the MySQL volumes, create new volumes
> from the snapshots, create a new instance, attach the new volumes to
> the new instance and start mysql.

> For an existing cluster, we clone a new slave using snapshots taken 
> from a running slave.

5.7.19 will not fix that I think. You are copying this way the same table that should not be copied to the slave (mysql.gtid_executed) and the problem is there. 5.7.19 "solves" this by mysqldump not exporting mysql.gtid_executed but if you do a physical snapshot that goes around the "fix" and the same behavior you have now you will have with 5.7.19.

all best
Bogdan
[19 Jul 2017 18:53] monty solomon
We have not had any issues with the GTID failures reported in 86643 other than when we used mysqldump to create the slaves.
[19 Jul 2017 18:57] monty solomon
The multi-threaded replication failures continue to happen in 5.7.19. Stopping the slave and starting the slave enables it to resume replication.

               Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 4 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:2724231382' at master log bin.002932, end_log_pos 499799016. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

         CHANNEL_NAME: 
            WORKER_ID: 4
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:2724231382
    LAST_ERROR_NUMBER: 1032
   LAST_ERROR_MESSAGE: Worker 4 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:2724231382' at master log bin.002932, end_log_pos 499799016; Could not execute Update_rows event on table access_tokens; Can't find record in 'access_tokens', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log bin.002932, end_log_pos 499799016
 LAST_ERROR_TIMESTAMP: 2017-07-19 04:47:39
[19 Jul 2017 18:58] monty solomon
Another failure a little later

               Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 3 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:2724333862' at master log bin.002932, end_log_pos 610274461. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

         CHANNEL_NAME: 
            WORKER_ID: 3
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:2724333862
    LAST_ERROR_NUMBER: 1032
   LAST_ERROR_MESSAGE: Worker 3 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:2724333862' at master log bin.002932, end_log_pos 610274461; Could not execute Update_rows event on table access_tokens; Can't find record in 'access_tokens', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log bin.002932, end_log_pos 610274461
 LAST_ERROR_TIMESTAMP: 2017-07-19 18:56:21
[19 Jul 2017 19:06] monty solomon
Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:2724874197' at master log bin.002933, end_log_pos 71696440. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

        CHANNEL_NAME: 
            WORKER_ID: 1
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:2724874197
    LAST_ERROR_NUMBER: 1032
   LAST_ERROR_MESSAGE: Worker 1 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:2724874197' at master log bin.002933, end_log_pos 71696440; Could not execute Update_rows event on table access_tokens; Can't find record in 'access_tokens', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log bin.002933, end_log_pos 71696440
 LAST_ERROR_TIMESTAMP: 2017-07-19 19:00:50
[19 Jul 2017 19:06] monty solomon
Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 3 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:2724877000' at master log bin.002933, end_log_pos 76721639. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

         CHANNEL_NAME: 
            WORKER_ID: 3
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:2724877000
    LAST_ERROR_NUMBER: 1032
   LAST_ERROR_MESSAGE: Worker 3 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:2724877000' at master log bin.002933, end_log_pos 76721639; Could not execute Update_rows event on table access_tokens; Can't find record in 'access_tokens', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log bin.002933, end_log_pos 76721639
 LAST_ERROR_TIMESTAMP: 2017-07-19 19:05:11
[19 Jul 2017 19:16] monty solomon
Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:2725353237' at master log bin.002934, end_log_pos 387270122. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

         CHANNEL_NAME: 
            WORKER_ID: 1
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:2725353237
    LAST_ERROR_NUMBER: 1032
   LAST_ERROR_MESSAGE: Worker 1 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:2725353237' at master log bin.002934, end_log_pos 387270122; Could not execute Update_rows event on table access_tokens; Can't find record in 'access_tokens', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log bin.002934, end_log_pos 387270122
 LAST_ERROR_TIMESTAMP: 2017-07-19 19:08:48
[19 Jul 2017 19:25] monty solomon
Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 3 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:2725522615' at master log bin.002934, end_log_pos 550005288. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

         CHANNEL_NAME: 
            WORKER_ID: 3
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:2725522615
    LAST_ERROR_NUMBER: 1032
   LAST_ERROR_MESSAGE: Worker 3 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:2725522615' at master log bin.002934, end_log_pos 550005288; Could not execute Update_rows event on table access_tokens; Can't find record in 'access_tokens', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log bin.002934, end_log_pos 550005288
 LAST_ERROR_TIMESTAMP: 2017-07-19 19:16:40
[21 Jul 2017 18:56] monty solomon
Unable to stop the slave.

mysql> show processlist;
+--------+----------------+-----------------+------+---------+--------+---------------------------------------------+------------------+
| Id     | User           | Host            | db   | Command | Time   | State                                       | Info             |
+--------+----------------+-----------------+------+---------+--------+---------------------------------------------+------------------+
| 147223 | system user    |                 | NULL | Connect |   3549 | Waiting for master to send event            | NULL             |
| 147224 | system user    |                 | NULL | Connect |   3466 | Waiting for workers to exit                 | NULL             |
| 147228 | system user    |                 | NULL | Connect | 138023 | Waiting for preceding transaction to commit | NULL             |
| 149552 | root           | localhost       | NULL | Query   |   1528 | Killing slave                               | stop slave       |
| 150282 | root           | localhost       | NULL | Query   |    907 | starting                                    | FLUSH LOCAL LOGS |
| 150471 | SUSR_Heartbeat | 127.0.0.1:46474 | NULL | Sleep   |      0 |                                             | NULL             |
| 151152 | SUSR_Heartbeat | 127.0.0.1:48306 | NULL | Sleep   |      0 |                                             | NULL             |
| 151209 | root           | localhost       | NULL | Query   |      0 | starting                                    | show processlist |
+--------+----------------+-----------------+------+---------+--------+---------------------------------------------+------------------+
8 rows in set (0.00 sec)
[7 Aug 2017 1:49] ashe sun
Please check  if  M/S data is  coincident.
[10 Aug 2017 4:56] monty solomon
I'm not sure I understand the last comment. What does "Please check  if  M/S data is  coincident." mean?
[10 Aug 2017 5:02] monty solomon
The bug seems to be caused by some sort of race or edge condition.

We are running many distinct clusters in Q/A and production. We have not seen this error occur on any of the Q/A slaves.

The errors seem to occur only on production clusters where there are a significant number of INSERTs and/or UPDATEs.
[10 Aug 2017 5:03] monty solomon
Is setting slave_parallel_workers to 1 the same as setting it to 0?

The error doesn't seem to occur when slave_parallel_workers is set to 1.
[10 Aug 2017 5:37] MySQL Verification Team
Hi Monty,

what I mentioned is that when you create a slave with the copy of the gtid_executed the slave can diverge from master and that's where all the problems start.

Did you try to, as I asked, create a slave without copying the gtid_executed table? So not by cloning the whole master (including gtid_executed) but running a dump (.19 will not dump gtid_executed or you can dump from any version and remove the content of the gtid_executed manually from dump file). And then trying to reproduce this issue. As when I create a slave without this table I cannot reproduce the issue and neither can our dev team. So making this test would definitively show if this is or not the same bug we know about.

all best
Bogdan
[14 Aug 2017 21:10] monty solomon
All of the slaves in this cluster were created without copying the gtid_executed table and they all exhibit the problem. The method I described in another comment to create slaves using snapshots should be fine too otherwise a crashed slave would never be able to be restarted.

I just took a fresh dump using

mysqldump --master-data=1 --single-transaction --all-databases --triggers --routines --events --ignore-table=mysql.gtid_executed -r dump.sql

copied it to the slaves and used the following commands to reload and start the slaves

mysql -e 'stop slave ; reset master; reset slave all' ; mysql < dump.sql 

mysql -e "CHANGE MASTER TO MASTER_HOST='sweet-hyena', MASTER_AUTO_POSITION = 1, MASTER_CONNECT_RETRY=15, MASTER_RETRY_COUNT=0, MASTER_HEARTBEAT_PERIOD=1 , MASTER_USER='SUSR_Repl', MASTER_PASSWORD='--redacted--'"

mysql -e 'stop slave; set global slave_parallel_workers = 4; start slave'
[14 Aug 2017 21:17] monty solomon
The slave stopped replicating after a while and it resumes replication after executing

mysql> stop slave; start slave;

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: sweet-hyena
                  Master_User: SUSR_Repl
                  Master_Port: 3306
                Connect_Retry: 15
              Master_Log_File: bin.007250
          Read_Master_Log_Pos: 698552842
               Relay_Log_File: relay.000014
                Relay_Log_Pos: 33087845
        Relay_Master_Log_File: bin.007248
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1032
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:5676193820' at master log bin.007248, end_log_pos 33668377. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 33087684
              Relay_Log_Space: 2298432823
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1032
               Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:5676193820' at master log bin.007248, end_log_pos 33668377. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 293473754
                  Master_UUID: c7258e28-4c06-11e7-b189-0a26f8645d98
             Master_Info_File: /opt/mysql/dbdata1/monty-p0/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: 
           Master_Retry_Count: 0
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 170814 21:08:12
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: c7258e28-4c06-11e7-b189-0a26f8645d98:5676071495-5677768129
            Executed_Gtid_Set: 06970d75-3105-11e7-9e18-1228337aa004:1-2368691277,
0e6d000d-16e1-11e6-a80c-0a48c05654c7:1-11939041,
12459d26-f870-11e5-a18b-12544ded6a95:1-5628574,
2bdcda39-47c3-11e7-a5f6-0ac0ba22019e:1-144095557,
85ebe027-a5b6-11e5-861c-129ccaa4cbf7:1-4991016,
c7258e28-4c06-11e7-b189-0a26f8645d98:1-5676193816,
d2ecb5d0-3a7e-11e6-a6bb-0a0ad9f3138d:1-71377440,
d7fbf9fd-81fd-11e6-9f00-1244bf4cd67b:1-975482865
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

mysql> select * from performance_schema.replication_applier_status_by_worker\G
*************************** 1. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 1
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:5676193819
    LAST_ERROR_NUMBER: 0
   LAST_ERROR_MESSAGE: 
 LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
*************************** 2. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 2
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:5676193820
    LAST_ERROR_NUMBER: 1032
   LAST_ERROR_MESSAGE: Worker 1 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:5676193820' at master log bin.007248, end_log_pos 33668377; Could not execute Update_rows event on table access_tokens; Can't find record in 'access_tokens', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log bin.007248, end_log_pos 33668377
 LAST_ERROR_TIMESTAMP: 2017-08-14 21:08:12
*************************** 3. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 3
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:5676193817
    LAST_ERROR_NUMBER: 0
   LAST_ERROR_MESSAGE: 
 LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
*************************** 4. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 4
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:5676193818
    LAST_ERROR_NUMBER: 0
   LAST_ERROR_MESSAGE: 
 LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
4 rows in set (0.00 sec)

mysql> select @@slave_parallel_workers\G
*************************** 1. row ***************************
@@slave_parallel_workers: 4
1 row in set (0.00 sec)
[14 Aug 2017 21:28] monty solomon
On another slave

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: sweet-hyena
                  Master_User: SUSR_Repl
                  Master_Port: 3306
                Connect_Retry: 15
              Master_Log_File: bin.007252
          Read_Master_Log_Pos: 601487153
               Relay_Log_File: relay.000014
                Relay_Log_Pos: 500636601
        Relay_Master_Log_File: bin.007249
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1032
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 4 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:5677273820' at master log bin.007249, end_log_pos 500696162. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 500636440
              Relay_Log_Space: 3275125235
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1032
               Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 4 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:5677273820' at master log bin.007249, end_log_pos 500696162. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 293473754
                  Master_UUID: c7258e28-4c06-11e7-b189-0a26f8645d98
             Master_Info_File: /opt/mysql/dbdata1/monty-p0/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: 
           Master_Retry_Count: 0
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 170814 21:08:00
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: c7258e28-4c06-11e7-b189-0a26f8645d98:5674300718-5679198991
            Executed_Gtid_Set: 06970d75-3105-11e7-9e18-1228337aa004:1-2368691277,
0e6d000d-16e1-11e6-a80c-0a48c05654c7:1-11939041,
12459d26-f870-11e5-a18b-12544ded6a95:1-5628574,
2bdcda39-47c3-11e7-a5f6-0ac0ba22019e:1-144095557,
85ebe027-a5b6-11e5-861c-129ccaa4cbf7:1-4991016,
c7258e28-4c06-11e7-b189-0a26f8645d98:1-5677273816,
d2ecb5d0-3a7e-11e6-a6bb-0a0ad9f3138d:1-71377440,
d7fbf9fd-81fd-11e6-9f00-1244bf4cd67b:1-975482865
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

mysql> select * from performance_schema.replication_applier_status_by_worker\G
*************************** 1. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 1
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:5677273817
    LAST_ERROR_NUMBER: 0
   LAST_ERROR_MESSAGE: 
 LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
*************************** 2. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 2
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:5677273818
    LAST_ERROR_NUMBER: 0
   LAST_ERROR_MESSAGE: 
 LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
*************************** 3. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 3
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:5677273819
    LAST_ERROR_NUMBER: 0
   LAST_ERROR_MESSAGE: 
 LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
*************************** 4. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 4
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:5677273820
    LAST_ERROR_NUMBER: 1032
   LAST_ERROR_MESSAGE: Worker 4 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:5677273820' at master log bin.007249, end_log_pos 500696162; Could not execute Update_rows event on table access_tokens; Can't find record in 'access_tokens', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log bin.007249, end_log_pos 500696162
 LAST_ERROR_TIMESTAMP: 2017-08-14 21:08:00
4 rows in set (0.00 sec)
[14 Aug 2017 21:34] monty solomon
mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: sweet-hyena
                  Master_User: SUSR_Repl
                  Master_Port: 3306
                Connect_Retry: 15
              Master_Log_File: bin.007253
          Read_Master_Log_Pos: 528121501
               Relay_Log_File: relay.000021
                Relay_Log_Pos: 156637395
        Relay_Master_Log_File: bin.007250
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1032
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 2 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:5677561177' at master log bin.007250, end_log_pos 388604272. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 388336794
              Relay_Log_Space: 3749404267
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1032
               Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 2 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:5677561177' at master log bin.007250, end_log_pos 388604272. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 293473754
                  Master_UUID: c7258e28-4c06-11e7-b189-0a26f8645d98
             Master_Info_File: /opt/mysql/dbdata1/monty-p0/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: 
           Master_Retry_Count: 0
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 170814 21:27:08
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: c7258e28-4c06-11e7-b189-0a26f8645d98:5676071495-5679907088
            Executed_Gtid_Set: 06970d75-3105-11e7-9e18-1228337aa004:1-2368691277,
0e6d000d-16e1-11e6-a80c-0a48c05654c7:1-11939041,
12459d26-f870-11e5-a18b-12544ded6a95:1-5628574,
2bdcda39-47c3-11e7-a5f6-0ac0ba22019e:1-144095557,
85ebe027-a5b6-11e5-861c-129ccaa4cbf7:1-4991016,
c7258e28-4c06-11e7-b189-0a26f8645d98:1-5677561173,
d2ecb5d0-3a7e-11e6-a6bb-0a0ad9f3138d:1-71377440,
d7fbf9fd-81fd-11e6-9f00-1244bf4cd67b:1-975482865
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

mysql> select * from performance_schema.replication_applier_status_by_worker\G
*************************** 1. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 1
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:5677561175
    LAST_ERROR_NUMBER: 0
   LAST_ERROR_MESSAGE: 
 LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
*************************** 2. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 2
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:5677561176
    LAST_ERROR_NUMBER: 0
   LAST_ERROR_MESSAGE: 
 LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
*************************** 3. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 3
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:5677561177
    LAST_ERROR_NUMBER: 1032
   LAST_ERROR_MESSAGE: Worker 2 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:5677561177' at master log bin.007250, end_log_pos 388604272; Could not execute Update_rows event on table access_tokens; Can't find record in 'access_tokens', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log bin.007250, end_log_pos 388604272
 LAST_ERROR_TIMESTAMP: 2017-08-14 21:27:08
*************************** 4. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 4
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:5677561174
    LAST_ERROR_NUMBER: 0
   LAST_ERROR_MESSAGE: 
 LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
4 rows in set (0.00 sec)
[14 Aug 2017 21:40] monty solomon
Does it make sense that the worker number in the error message doesn't match the worker ID?

For example, here are two excerpts from two of the query results above. These are from separate failures (only one worker fails at a time).

         CHANNEL_NAME: 
            WORKER_ID: 2
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:5676193820
    LAST_ERROR_NUMBER: 1032
   LAST_ERROR_MESSAGE: Worker 1 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:5676193820' at master log bin.007248, end_log_pos 33668377; Could not execute Update_rows event on table access_tokens; Can't find record in 'access_tokens', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log bin.007248, end_log_pos 33668377
 LAST_ERROR_TIMESTAMP: 2017-08-14 21:08:12

         CHANNEL_NAME: 
            WORKER_ID: 3
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:5677561177
    LAST_ERROR_NUMBER: 1032
   LAST_ERROR_MESSAGE: Worker 2 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:5677561177' at master log bin.007250, end_log_pos 388604272; Could not execute Update_rows event on table access_tokens; Can't find record in 'access_tokens', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log bin.007250, end_log_pos 388604272
 LAST_ERROR_TIMESTAMP: 2017-08-14 21:27:08
[14 Aug 2017 21:43] monty solomon
Over time, each slave fails.

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: sweet-hyena
                  Master_User: SUSR_Repl
                  Master_Port: 3306
                Connect_Retry: 15
              Master_Log_File: bin.007254
          Read_Master_Log_Pos: 959327805
               Relay_Log_File: relay.000028
                Relay_Log_Pos: 27820335
        Relay_Master_Log_File: bin.007254
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1032
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 2 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:5680177426' at master log bin.007254, end_log_pos 28289313. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 27820174
              Relay_Log_Space: 959328600
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1032
               Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 2 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:5680177426' at master log bin.007254, end_log_pos 28289313. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 293473754
                  Master_UUID: c7258e28-4c06-11e7-b189-0a26f8645d98
             Master_Info_File: /opt/mysql/dbdata1/monty-p0/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: 
           Master_Retry_Count: 0
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 170814 21:34:42
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: c7258e28-4c06-11e7-b189-0a26f8645d98:5678799962-5680845672
            Executed_Gtid_Set: 06970d75-3105-11e7-9e18-1228337aa004:1-2368691277,
0e6d000d-16e1-11e6-a80c-0a48c05654c7:1-11939041,
12459d26-f870-11e5-a18b-12544ded6a95:1-5628574,
2bdcda39-47c3-11e7-a5f6-0ac0ba22019e:1-144095557,
85ebe027-a5b6-11e5-861c-129ccaa4cbf7:1-4991016,
c7258e28-4c06-11e7-b189-0a26f8645d98:1-5680177425,
d2ecb5d0-3a7e-11e6-a6bb-0a0ad9f3138d:1-71377440,
d7fbf9fd-81fd-11e6-9f00-1244bf4cd67b:1-975482865
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

mysql> select * from performance_schema.replication_applier_status_by_worker\G
*************************** 1. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 1
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:5680177425
    LAST_ERROR_NUMBER: 0
   LAST_ERROR_MESSAGE: 
 LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
*************************** 2. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 2
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:5680177426
    LAST_ERROR_NUMBER: 1032
   LAST_ERROR_MESSAGE: Worker 2 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:5680177426' at master log bin.007254, end_log_pos 28289313; Could not execute Delete_rows event on table access_tokens; Can't find record in 'access_tokens', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log bin.007254, end_log_pos 28289313
 LAST_ERROR_TIMESTAMP: 2017-08-14 21:34:42
*************************** 3. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 3
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:5680177427
    LAST_ERROR_NUMBER: 0
   LAST_ERROR_MESSAGE: 
 LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
*************************** 4. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 4
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:5680177428
    LAST_ERROR_NUMBER: 0
   LAST_ERROR_MESSAGE: 
 LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
4 rows in set (0.00 sec)
[14 Aug 2017 21:45] monty solomon
And another.

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: sweet-hyena
                  Master_User: SUSR_Repl
                  Master_Port: 3306
                Connect_Retry: 15
              Master_Log_File: bin.007255
          Read_Master_Log_Pos: 254137852
               Relay_Log_File: relay.000026
                Relay_Log_Pos: 510084889
        Relay_Master_Log_File: bin.007248
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1032
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:5676522435' at master log bin.007248, end_log_pos 510676523. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 510084728
              Relay_Log_Space: 7029912221
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1032
               Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:5676522435' at master log bin.007248, end_log_pos 510676523. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 293473754
                  Master_UUID: c7258e28-4c06-11e7-b189-0a26f8645d98
             Master_Info_File: /opt/mysql/dbdata1/monty-p0/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: 
           Master_Retry_Count: 0
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 170814 21:28:48
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: c7258e28-4c06-11e7-b189-0a26f8645d98:5676024446-5681106604
            Executed_Gtid_Set: 06970d75-3105-11e7-9e18-1228337aa004:1-2368691277,
0e6d000d-16e1-11e6-a80c-0a48c05654c7:1-11939041,
12459d26-f870-11e5-a18b-12544ded6a95:1-5628574,
2bdcda39-47c3-11e7-a5f6-0ac0ba22019e:1-144095557,
85ebe027-a5b6-11e5-861c-129ccaa4cbf7:1-4991016,
c7258e28-4c06-11e7-b189-0a26f8645d98:1-5676522431,
d2ecb5d0-3a7e-11e6-a6bb-0a0ad9f3138d:1-71377440,
d7fbf9fd-81fd-11e6-9f00-1244bf4cd67b:1-975482865
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

mysql> select * from performance_schema.replication_applier_status_by_worker\G
*************************** 1. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 1
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:5676522435
    LAST_ERROR_NUMBER: 1032
   LAST_ERROR_MESSAGE: Worker 1 failed executing transaction 'c7258e28-4c06-11e7-b189-0a26f8645d98:5676522435' at master log bin.007248, end_log_pos 510676523; Could not execute Update_rows event on table access_tokens; Can't find record in 'access_tokens', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log bin.007248, end_log_pos 510676523
 LAST_ERROR_TIMESTAMP: 2017-08-14 21:28:48
*************************** 2. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 2
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:5676522432
    LAST_ERROR_NUMBER: 0
   LAST_ERROR_MESSAGE: 
 LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
*************************** 3. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 3
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:5676522433
    LAST_ERROR_NUMBER: 0
   LAST_ERROR_MESSAGE: 
 LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
*************************** 4. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 4
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:5676522434
    LAST_ERROR_NUMBER: 0
   LAST_ERROR_MESSAGE: 
 LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
4 rows in set (0.00 sec)
[14 Aug 2017 21:56] monty solomon
I executed the following command on each slave to turn off parallel replication and it caused unrecoverable replication failures on a couple of the slaves.

mysql -e 'stop slave; set global slave_parallel_workers = 0 ; start slave'

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: sweet-hyena
                  Master_User: SUSR_Repl
                  Master_Port: 3306
                Connect_Retry: 15
              Master_Log_File: bin.007256
          Read_Master_Log_Pos: 693738906
               Relay_Log_File: relay.000027
                Relay_Log_Pos: 28289087
        Relay_Master_Log_File: bin.007254
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1032
                   Last_Error: Could not execute Delete_rows event on table access_tokens; Can't find record in 'access_tokens', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log bin.007254, end_log_pos 28289313
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 28288926
              Relay_Log_Space: 2841254734
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1032
               Last_SQL_Error: Could not execute Delete_rows event on table access_tokens; Can't find record in 'access_tokens', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log bin.007254, end_log_pos 28289313
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 293473754
                  Master_UUID: c7258e28-4c06-11e7-b189-0a26f8645d98
             Master_Info_File: /opt/mysql/dbdata1/monty-p0/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: 
           Master_Retry_Count: 0
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 170814 21:52:32
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: c7258e28-4c06-11e7-b189-0a26f8645d98:5674300718-5682084979
            Executed_Gtid_Set: 06970d75-3105-11e7-9e18-1228337aa004:1-2368691277,
0e6d000d-16e1-11e6-a80c-0a48c05654c7:1-11939041,
12459d26-f870-11e5-a18b-12544ded6a95:1-5628574,
2bdcda39-47c3-11e7-a5f6-0ac0ba22019e:1-144095557,
85ebe027-a5b6-11e5-861c-129ccaa4cbf7:1-4991016,
c7258e28-4c06-11e7-b189-0a26f8645d98:1-5680177425,
d2ecb5d0-3a7e-11e6-a6bb-0a0ad9f3138d:1-71377440,
d7fbf9fd-81fd-11e6-9f00-1244bf4cd67b:1-975482865
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

mysql> select * from performance_schema.replication_applier_status_by_worker\G
*************************** 1. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 0
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:5680177426
    LAST_ERROR_NUMBER: 1032
   LAST_ERROR_MESSAGE: Could not execute Delete_rows event on table access_tokens; Can't find record in 'access_tokens', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log bin.007254, end_log_pos 28289313
 LAST_ERROR_TIMESTAMP: 2017-08-14 21:52:32
1 row in set (0.00 sec)
[14 Aug 2017 21:59] monty solomon
mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: sweet-hyena
                  Master_User: SUSR_Repl
                  Master_Port: 3306
                Connect_Retry: 15
              Master_Log_File: bin.007257
          Read_Master_Log_Pos: 71753930
               Relay_Log_File: relay.000028
                Relay_Log_Pos: 28289087
        Relay_Master_Log_File: bin.007254
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1032
                   Last_Error: Could not execute Delete_rows event on table access_tokens; Can't find record in 'access_tokens', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log bin.007254, end_log_pos 28289313
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 28288926
              Relay_Log_Space: 3293047523
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1032
               Last_SQL_Error: Could not execute Delete_rows event on table access_tokens; Can't find record in 'access_tokens', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log bin.007254, end_log_pos 28289313
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 293473754
                  Master_UUID: c7258e28-4c06-11e7-b189-0a26f8645d98
             Master_Info_File: /opt/mysql/dbdata1/monty-p0/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: 
           Master_Retry_Count: 0
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 170814 21:47:33
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: c7258e28-4c06-11e7-b189-0a26f8645d98:5678799962-5682397007
            Executed_Gtid_Set: 06970d75-3105-11e7-9e18-1228337aa004:1-2368691277,
0e6d000d-16e1-11e6-a80c-0a48c05654c7:1-11939041,
12459d26-f870-11e5-a18b-12544ded6a95:1-5628574,
2bdcda39-47c3-11e7-a5f6-0ac0ba22019e:1-144095557,
85ebe027-a5b6-11e5-861c-129ccaa4cbf7:1-4991016,
c7258e28-4c06-11e7-b189-0a26f8645d98:1-5680177425,
d2ecb5d0-3a7e-11e6-a6bb-0a0ad9f3138d:1-71377440,
d7fbf9fd-81fd-11e6-9f00-1244bf4cd67b:1-975482865
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

mysql> select * from performance_schema.replication_applier_status_by_worker\G
*************************** 1. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 0
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: c7258e28-4c06-11e7-b189-0a26f8645d98:5680177426
    LAST_ERROR_NUMBER: 1032
   LAST_ERROR_MESSAGE: Could not execute Delete_rows event on table access_tokens; Can't find record in 'access_tokens', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log bin.007254, end_log_pos 28289313
 LAST_ERROR_TIMESTAMP: 2017-08-14 21:47:33
1 row in set (0.00 sec)
[14 Aug 2017 22:36] monty solomon
The master is executing around 7k QPS with about 25% of them being UPDATE statements.
[15 Aug 2017 5:58] MySQL Verification Team
Hi,
I'm having issues reproducing this, and as you say you can't reproduce it neither on any of your other instances, only on that one production setup, right?

> Does it make sense that the worker number in the error message 
> doesn't match the worker ID?

yes, that should not be relevant

> Over time, each slave fails.

From the errors I see, they diverge from master in time. If you compare the data in table access_tokens between servers it will show to be different. Now, looking at the errors, it seems to always be access_tokens table. Can you give me the show create table for that one?

All best
Bogdan
[15 Aug 2017 13:54] monty solomon
No, I am just using this cluster for the testing for this bug. We have many MySQL clusters and enabling multi-threaded slave replication on them will cause replication to stop on the more active clusters.

Where do you see a data discrepancy and the divergence? One of the slave worker threads stops with an error but executing "stop slave; start slave;" causes it to resume without error. I didn't run pt-table-checksum yesterday but previous runs of pt-table-checksum on the clusters that experienced the multi-threaded slave replication failures did not find any discrepancies.
[15 Aug 2017 15:56] monty solomon
This cluster uses foreign keys. Is it possible there is a problematic interaction between foreign keys and multi-threaded slave replication?
[15 Aug 2017 16:20] MySQL Verification Team
Hi,

Well the "Can't find record in 'access_tokens'" is a discrepancy for start. That's why I asked for access_tokens create table. Would be good to compare the master and slave content of that table.

Did you ever seen any other table being the one with error?

all best
Bogdan

p.s. as for the fk's, dunno, maybe it makes a change, my tests were all without fk's, maybe you can share the db structure (not data) so our test can be more like yours?
[15 Aug 2017 16:22] MySQL Verification Team
got them, thanks, will redo the tests, maybe it will make a difference :D

thanks
Bogdan
[21 Aug 2017 18:49] MySQL Verification Team
I did manage to reproduce the problem again but I did not find any useful info apart that "it happens". Bug is verified so it will be tackled and if we get to have more questions we will update the report here.

all best
Bogdan
[31 Aug 2017 15:09] monty solomon
I re-enabled parallel replication on other clusters and we are getting some replication failures on the busier clusters. Stopping and starting the slave resumes replication.
[31 Aug 2017 15:38] monty solomon
Slave_IO_State: Waiting for master to send event
                  Master_Host: fair-mud
                  Master_User: SUSR_Repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: bin.047456
          Read_Master_Log_Pos: 586828898
               Relay_Log_File: relay.028529
                Relay_Log_Pos: 732379994
        Relay_Master_Log_File: bin.047453
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1032
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 2 failed executing transaction 'b4fbc30c-47d1-11e7-87bf-0e31271f3a2e:2839441659' at master log bin.047453, end_log_pos 799953010. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 799910909
              Relay_Log_Space: 2978955384
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1032
               Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 2 failed executing transaction 'b4fbc30c-47d1-11e7-87bf-0e31271f3a2e:2839441659' at master log bin.047453, end_log_pos 799953010. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 2896803318
                  Master_UUID: b4fbc30c-47d1-11e7-87bf-0e31271f3a2e
             Master_Info_File: /opt/mysql/dbdata1/connector-p0/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: 
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 170831 14:45:30
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: b4fbc30c-47d1-11e7-87bf-0e31271f3a2e:715582749-2840156709
            Executed_Gtid_Set: 1dd5e429-6905-11e6-a6a8-1240cbe55d21:1-1117754036,
23fcb3fe-68f3-11e6-9ef5-0ac31ecd1189:1-315562957,
5938e60a-807a-11e6-9e2d-0ea51abb0757:1-2413579,
76d8efc6-d585-11e5-bddc-0e645e8f7631:1-4211572860,
7cb15337-8158-11e6-818e-1272624dc303:1-4205513337,
8d1d310a-6253-11e6-80d1-0ac31ecd1189:1-192152346,
b4fbc30c-47d1-11e7-87bf-0e31271f3a2e:1-2839441656,
b7feee7e-811f-11e6-afe0-1240cbe55d21:1-4636584277,
ca8e364d-b663-11e4-ad86-0ae0ca1e8c90:1-8391,
d11ea3ee-624b-11e6-813f-0e8670919271:1-128632911,
eebd589d-aa73-11e5-a504-0eda5c5fcf19:1-1963253628
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)
[31 Aug 2017 15:38] monty solomon
CHANNEL_NAME: 
            WORKER_ID: 2
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: b4fbc30c-47d1-11e7-87bf-0e31271f3a2e:2839441659
    LAST_ERROR_NUMBER: 1032
   LAST_ERROR_MESSAGE: Worker 2 failed executing transaction 'b4fbc30c-47d1-11e7-87bf-0e31271f3a2e:2839441659' at master log bin.047453, end_log_pos 799953010; Could not execute Update_rows event on table SyncQueue; Can't find record in 'SyncQueue', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log bin.047453, end_log_pos 799953010
 LAST_ERROR_TIMESTAMP: 2017-08-31 14:45:30
[15 Oct 2017 3:48] monty solomon
2017-09-05T02:57:35.159591Z 10 [ERROR] Slave SQL for channel '': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756
[15 Oct 2017 3:48] monty solomon
2017-10-14T00:40:23.875283Z 221 [Warning] Slave SQL for channel '': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756
[15 Oct 2017 3:49] monty solomon
2017-10-14T00:40:38.541857Z 245 [Warning] Slave SQL for channel '': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756
[15 Oct 2017 3:56] monty solomon
More examples of both warnings and errors with code 1756 from slaves in different clusters.

angry-fog:5
big-bar:1
faint-bar:3
faint-dew:2
fair-mud:6
large-bonus:20
rainy-quail:14
weak-squid:1

2017-10-13T22:17:43.002032Z 39 [Warning] Slave SQL for channel '': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756

2017-10-13T22:18:08.510625Z 76 [Warning] Slave SQL for channel '': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756

2017-10-13T22:21:44.980605Z 502 [Warning] Slave SQL for channel '': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756

2017-10-14T00:02:38.853039Z 190 [Warning] Slave SQL for channel '': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756

2017-10-14T05:19:57.334780Z 8828 [Warning] Slave SQL for channel '': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756

2017-09-11T16:47:20.809361Z 254 [Warning] Slave SQL for channel '': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756

2017-09-11T17:11:17.553141Z 593046 [Warning] Slave SQL for channel '': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756

2017-09-11T17:56:50.817219Z 596878 [Warning] Slave SQL for channel '': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756

2017-09-05T02:57:35.159591Z 10 [ERROR] Slave SQL for channel '': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756
[28 Jan 2018 3:41] Yan Huang
Hello Bogdan Kecman,
May I know the cause of error 1032? 
I met the same error in parallel replication, restart slave will work but cannot tell if there is any impact.
[30 May 2018 2:51] Bin Hong
Hello
We're affected, too. Is there any progress ?
Thanks
[23 Dec 2019 13:14] Shubhra Prakash Nandi
High OLTP load along with large batch jobs can trigger this issue consistently.