Bug #89146 Binlog name and Pos are wrong in group_replication_applier channel's error msgs
Submitted: 8 Jan 22:16 Modified: 10 Jan 13:00
Reporter: Jean-François Gagné Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Group Replication Severity:S3 (Non-critical)
Version:5.7.20, 8.0.3 OS:Any
Assigned to: Venkatesh Duggirala CPU Architecture:Any

[8 Jan 22:16] Jean-François Gagné
Description:
Hi,

In Bug#89141, I describe a situation generating an error in the Group Replication applier.  This bug can produce something like below in the error log:

2018-01-01T18:29:30.880298Z 4499 [ERROR] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'UUID:147' at master log , end_log_pos 168; Could not execute Write_rows event on table test_jfg_ws.test_jfg_ws; Duplicate entry 'c' for key 'str', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 168, Error_code: 1062

And the corresponding message in P_S:

> SELECT * FROM performance_schema.replication_applier_status_by_coordinator
    ->   WHERE CHANNEL_NAME = 'group_replication_applier'\G
*************************** 1. row ***************************
        CHANNEL_NAME: group_replication_applier
           THREAD_ID: 4531
       SERVICE_STATE: ON
   LAST_ERROR_NUMBER: 1062
  LAST_ERROR_MESSAGE: Coordinator stopped because there were error(s) in the worker(s).
The most recent failure being: Worker 2 failed executing transaction 'UUID:147' at master log , end_log_pos 168. See error log and/or performance_schema.replication_applier_status_by_worker
table for more details about this failure or others, if any.
LAST_ERROR_TIMESTAMP: 2018-01-01 19:29:30
1 row in set (0.00 sec)

In the error log and in P_S, we can read the following:

Worker 2 failed executing transaction 'UUID:147' at master log , end_log_pos 168

It looks like we are missing a word between “master log” and the following coma.  It looks like this word is “FIRST” as we get the following more detailed error message in the error log:

the event's master log FIRST, end_log_pos 168

Btw, master log FIRST and end_log_pos 168 are not very useful: I will open a bug/feature request for that and put the bug number in the comments.

Many thanks for looking into that,

JFG

How to repeat:
See Bug#89141 to know how to get the errors quoted in the description.
[8 Jan 22:22] Jean-François Gagné
About end_log_pos 168 are not very useful: Bug #89145 - Provide relay log details in case of Group Replication applier failure.

About end_log_pos that can be improved: Bug #89147 - The field end_log_pos in Group Replication error messages is ambiguous.
[10 Jan 13:00] Umesh Shastry
Hello Jean,

Thank you for the report and feedback!

Thanks,
Umesh
[10 Jan 13:01] Umesh Shastry
Taken from Bug#89141

Attachment: 89141_5.7.20.results (application/octet-stream, text), 46.86 KiB.

[12 Mar 5:15] Venkatesh Duggirala
Post by Developer:
==================

Hello Jean,
Thank you for the bug report. The issue happens only in Group Replication(GR)
and the channel is "group_replication_applier" channel. 

In GR, the events are broadcasted to other nodes before the events are written
on originated Node's binary log. So the node which is applying the events does not know the exact binary log name and position (in the above example 168). 

We will analyze more and see what can be done in this situation.

Changing the title, category according to the above description.

Thanks once again for using Group Replication.

Regards,
Venkatesh.