Bug #97540 errors not showing on ps.replication_applier_status_by_worker on single applier
Submitted: 7 Nov 2019 17:59  Modified: 12 Nov 2019 15:38
Reporter: Nuno Carvalho
Status: Closed
Category: MySQL Server: Group Replication  Severity: S3 (Non-critical)
Version: 8.0  OS: Any  CPU Architecture: Any

[7 Nov 2019 17:59] Nuno Carvalho
Description:
When a member joins a group, it goes through a recovery
stage during which it updates its state to match the group's.
If an error happens at this stage, the new member retries a number
of times set by the group_replication_recovery_retry_count system variable.
The error that happened on each recovery retry is displayed in the
performance_schema.replication_applier_status_by_worker table.
Example output of the attached x.test, in which recovery attempts to
create an already existing table:
```
SELECT * FROM performance_schema.replication_applier_status_by_worker WHERE CHANNEL_NAME = "group_replication_recovery";
CHANNEL_NAME    group_replication_recovery
WORKER_ID       1
THREAD_ID       NULL
SERVICE_STATE   OFF
LAST_ERROR_NUMBER       1050
LAST_ERROR_MESSAGE      Worker 1 failed executing transaction 'c2e4ddd7-0183-11ea-9a3a-0010e0734796:2' at master log server-binary-log.000001, end_log_pos 685; Error 'Table 't1' already exists' on query. Default database: 'test'. Query: 'CREATE TABLE t1 (c1 INT NOT NULL PRIMARY KEY) ENGINE=InnoDB'
LAST_ERROR_TIMESTAMP    2019-11-07 20:26:49.801122
```

The above example was taken with the replication applier configured
to use parallel workers:
```
--slave-parallel-workers=4 --slave-parallel-type=logical_clock
```
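The applier configuration in effect on a running member can be checked directly; a minimal sketch, assuming a live MySQL 8.0 server (these are the standard system variables the options above map to):

```sql
-- Check how the replication applier is configured on this member
SHOW VARIABLES LIKE 'slave_parallel_workers';
SHOW VARIABLES LIKE 'slave_parallel_type';
```

With the options shown above, these report 4 and LOGICAL_CLOCK; with the defaults, slave_parallel_workers is 0, i.e. a single applier thread.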

However, it was found that when the applier uses a single worker (the
default), the error is not shown; the same x.test produces the output:
```
SELECT * FROM performance_schema.replication_applier_status_by_worker WHERE CHANNEL_NAME = "group_replication_recovery";
CHANNEL_NAME    group_replication_recovery
WORKER_ID       0
THREAD_ID       NULL
SERVICE_STATE   OFF
LAST_ERROR_NUMBER       0
LAST_ERROR_MESSAGE
LAST_ERROR_TIMESTAMP    0000-00-00 00:00:00.000000
```
which is incorrect.

How to repeat:
Please run attached x.test with parallel applier:
```
$ ./mtr --mem --nocheck-testcase group_replication.x --mysqld=--slave-parallel-workers=4 --mysqld=--slave-parallel-type=logical_clock --mysqld=--slave_preserve_commit_order=ON
```

Please run attached x.test with sequential applier:
```
$ ./mtr --mem --nocheck-testcase group_replication.x
```
[12 Nov 2019 15:38] Margaret Fisher
Posted by developer:
 
Changelog entry added for MySQL 8.0.19 and 5.7.29:

When a member is joining or rejoining a replication group, if Group Replication detects an error in the distributed recovery process (during which the joining member receives state transfer from an existing online member), it automatically switches over to a new donor, and retries the state transfer. The number of times the joining member retries before giving up is set by the group_replication_recovery_retry_count system variable. The Performance Schema table replication_applier_status_by_worker displays the error that caused the last retry.

Previously, this error was only shown if the group member was configured with parallel replication applier threads (as set by the slave_parallel_workers system variable). If the group member was configured with a single applier thread, the error was cleared after each retry by an internal RESET SLAVE operation, so it could not be viewed. This was also the case for the output of the SHOW SLAVE STATUS statement whether there were single or multiple applier threads.

The RESET SLAVE operation is now no longer carried out after retrying distributed recovery, so the error that caused the last retry can always be viewed.
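The behavior described in the changelog entry can be observed from SQL; a minimal sketch, assuming a running group member on a fixed version (8.0.19/5.7.29 or later):

```sql
-- How many times distributed recovery retries before giving up
SELECT @@group_replication_recovery_retry_count;

-- After the fix, the error that caused the last recovery retry is
-- retained here even when the applier runs with a single worker
SELECT LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE, LAST_ERROR_TIMESTAMP
  FROM performance_schema.replication_applier_status_by_worker
 WHERE CHANNEL_NAME = 'group_replication_recovery';
```

On an affected (pre-fix) server with a single applier thread, the second query returns LAST_ERROR_NUMBER 0 and an empty message, as shown in the report above.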