Bug #89145 Provide relay log details in case of Group Replication applier failure.
Submitted: 8 Jan 22:15 Modified: 11 Jan 9:47
Reporter: Jean-François Gagné
Status: Verified
Category: MySQL Server: Group Replication Severity: S4 (Feature request)
Version: 5.7.20, 8.0.3 OS: Any
Assigned to: CPU Architecture: Any
Triage: Needs Triage: D5 (Feature request)

[8 Jan 22:15] Jean-François Gagné
Description:
Hi,

In Bug#89141, I describe a situation generating an error in the Group Replication applier.  We have the following in P_S:

> SELECT * FROM performance_schema.replication_applier_status_by_coordinator
    ->   WHERE CHANNEL_NAME = 'group_replication_applier'\G
*************************** 1. row ***************************
        CHANNEL_NAME: group_replication_applier
           THREAD_ID: 4531
       SERVICE_STATE: ON
   LAST_ERROR_NUMBER: 1062
  LAST_ERROR_MESSAGE: Coordinator stopped because there were error(s) in the worker(s).
The most recent failure being: Worker 2 failed executing transaction 'UUID:147' at
master log , end_log_pos 168. See error log and/or performance_schema.replication_applier_status_by_worker
table for more details about this failure or others, if any.
LAST_ERROR_TIMESTAMP: 2018-01-01 19:29:30
1 row in set (0.00 sec)

And we have the following in the error log:

2018-01-01T18:29:30.880298Z 4499 [ERROR] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'UUID:147' at master log , end_log_pos 168; Could not execute Write_rows event on table test_jfg_ws.test_jfg_ws; Duplicate entry 'c' for key 'str', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 168, Error_code: 1062

Neither of these two messages includes a position in the relay logs.  To investigate the error, we can only rely on the GTID and on parsing the relay logs, which is not very practical.
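To illustrate the workaround this leaves us with, here is a minimal Python sketch that pulls the worker id, GTID and end_log_pos out of the message quoted above.  The function name parse_applier_error and the regular expression are hypothetical: the pattern is inferred only from the sample message in this report and is not a documented format, so it may need adjusting for other versions.

```python
import re

# Pattern inferred from the single sample message in this report (an
# assumption, not a documented format). Whitespace is matched loosely
# because the P_S output wraps the message across lines.
APPLIER_ERROR = re.compile(
    r"Worker\s+(?P<worker>\d+)\s+failed\s+executing\s+transaction\s+"
    r"'(?P<gtid>[^']+)'\s+at\s+master\s+log\s+(?P<log>\S*?),\s+"
    r"end_log_pos\s+(?P<pos>\d+)"
)


def parse_applier_error(message):
    """Return the fields found in an applier error message, or None."""
    match = APPLIER_ERROR.search(message)
    if match is None:
        return None
    return {
        "worker": int(match.group("worker")),
        "gtid": match.group("gtid"),
        # The log name is empty in the sample, so report it as None.
        "master_log": match.group("log") or None,
        "end_log_pos": int(match.group("pos")),
    }


if __name__ == "__main__":
    sample = (
        "Worker 2 failed executing transaction 'UUID:147' at master log , "
        "end_log_pos 168; Could not execute Write_rows event"
    )
    print(parse_applier_error(sample))
```

The extracted GTID is then the only handle for locating the transaction: one still has to decode each relay log (for example with mysqlbinlog) and search the output for that GTID, which is exactly the step that a relay log filename and position in the message would avoid.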

Please add relay log positional information to Group Replication error messages.  In addition to the relay log filename, this could include the position of the beginning of the failed transaction in the relay logs.  Note that the offset of the transaction is already present as end_log_pos, but this field is strangely named (I will open another bug/feature request for that and put the bug number in the comments).

Many thanks,

JFG

How to repeat:
Not a bug but a feature request.

See Bug#89141 for how to reproduce the errors quoted in the description.

Suggested fix:
Add relay log positional information to Group Replication error messages.  The position could include the relay log filename and the position of the beginning of the failed transaction in the relay logs (the offset of the transaction is already present as end_log_pos).
[8 Jan 22:19] Jean-François Gagné
About the strangely named end_log_pos: see Bug #89147 - The field end_log_pos in Group Replication error messages is ambiguous.
[11 Jan 9:47] Umesh Shastry
Hello Jean,

Thank you for the report and feature request!

Thanks,
Umesh