MySQL Bugs: #89146: Binlog name and Pos are wrong in group_replication

Bug #89146	Binlog name and Pos are wrong in group_replication_applier channel's error msgs
Submitted:	8 Jan 2018 22:16	Modified:	20 Jun 2018 16:45
Reporter:	Jean-François Gagné	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Group Replication	Severity:	S3 (Non-critical)
Version:	5.7.20, 8.0.3	OS:	Any
Assigned to:	Venkatesh Duggirala	CPU Architecture:	Any

Description:
Hi,

In Bug#89141, I describe a situation generating an error in the Group Replication applier.  This bug can produce something like below in the error log:

2018-01-01T18:29:30.880298Z 4499 [ERROR] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'UUID:147' at master log , end_log_pos 168; Could not execute Write_rows event on table test_jfg_ws.test_jfg_ws; Duplicate entry 'c' for key 'str', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 168, Error_code: 1062

And the corresponding message in P_S:

> SELECT * FROM performance_schema.replication_applier_status_by_coordinator
    ->   WHERE CHANNEL_NAME = 'group_replication_applier'\G
*************************** 1. row ***************************
        CHANNEL_NAME: group_replication_applier
           THREAD_ID: 4531
       SERVICE_STATE: ON
   LAST_ERROR_NUMBER: 1062
  LAST_ERROR_MESSAGE: Coordinator stopped because there were error(s) in the worker(s).
The most recent failure being: Worker 2 failed executing transaction 'UUID:147' at master log , end_log_pos 168. See error log and/or performance_schema.replication_applier_status_by_worker
table for more details about this failure or others, if any.
LAST_ERROR_TIMESTAMP: 2018-01-01 19:29:30
1 row in set (0.00 sec)

In the error log and in P_S, we can read the following:

Worker 2 failed executing transaction 'UUID:147' at master log , end_log_pos 168

It looks like we are missing a word between “master log” and the following coma.  It looks like this word is “FIRST” as we get the following more detailed error message in the error log:

the event's master log FIRST, end_log_pos 168

Btw, master log FIRST and end_log_pos 168 are not very useful: I will open a bug/feature request for that and put the bug number in the comments.

Many thanks for looking into that,

JFG

How to repeat:
See Bug#89141 to know how to get the errors quoted in the description.

About end_log_pos 168 are not very useful: Bug #89145 - Provide relay log details in case of Group Replication applier failure.

About end_log_pos that can be improved: Bug #89147 - The field end_log_pos in Group Replication error messages is ambiguous.

Hello Jean,

Thank you for the report and feedback!

Thanks,
Umesh

Taken from Bug#89141

Attachment: 89141_5.7.20.results (application/octet-stream, text), 46.86 KiB.

Post by Developer:
==================

Hello Jean,
Thank you for the bug report. The issue happens only in Group Replication(GR)
and the channel is "group_replication_applier" channel. 

In GR, the events are broadcasted to other nodes before the events are written
on originated Node's binary log. So the node which is applying the events does not know the exact binary log name and position (in the above example 168). 

We will analyze more and see what can be done in this situation.

Changing the title, category according to the above description.

Thanks once again for using Group Replication.

Regards,
Venkatesh.

Posted by developer:
 
Thank you for your feedback, this has been fixed in upcoming versions and the following was added to the 5.7.23 / 8.0.12 changelog:
When the group_replication_applier channel's applier thread encountered an error, the master_log_name and end_log_pos in the error message were incorrect.  In Group Replication, the events of a transaction are replicated before they are written to the binary log of the member where the transaction originated. The result is that the final master_log_name and the end_log_pos of those events are unknown at the time they are applied on the replica by group_replication_applier channel's applier thread. To avoid confusion, now any such error messages encountered by a group_replication_applier channel do not contain the binary log name and the binary log position.