Bug #89146 Binlog name and Pos are wrong in group_replication_applier channel's error msgs
Submitted: 8 Jan 2018 22:16 Modified: 20 Jun 2018 16:45
Reporter: Jean-François Gagné Email Updates:
Status: Closed Impact on me:
Category:MySQL Server: Group Replication Severity:S3 (Non-critical)
Version:5.7.20, 8.0.3 OS:Any
Assigned to: Venkatesh Duggirala CPU Architecture:Any

[8 Jan 2018 22:16] Jean-François Gagné

In Bug#89141, I describe a situation generating an error in the Group Replication applier.  This bug can produce something like below in the error log:

2018-01-01T18:29:30.880298Z 4499 [ERROR] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'UUID:147' at master log , end_log_pos 168; Could not execute Write_rows event on table test_jfg_ws.test_jfg_ws; Duplicate entry 'c' for key 'str', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 168, Error_code: 1062

And the corresponding message in P_S:

> SELECT * FROM performance_schema.replication_applier_status_by_coordinator
    ->   WHERE CHANNEL_NAME = 'group_replication_applier'\G
*************************** 1. row ***************************
        CHANNEL_NAME: group_replication_applier
           THREAD_ID: 4531
  LAST_ERROR_MESSAGE: Coordinator stopped because there were error(s) in the worker(s).
The most recent failure being: Worker 2 failed executing transaction 'UUID:147' at master log , end_log_pos 168. See error log and/or performance_schema.replication_applier_status_by_worker
table for more details about this failure or others, if any.
LAST_ERROR_TIMESTAMP: 2018-01-01 19:29:30
1 row in set (0.00 sec)

In the error log and in P_S, we can read the following:

Worker 2 failed executing transaction 'UUID:147' at master log , end_log_pos 168

It looks like we are missing a word between “master log” and the following coma.  It looks like this word is “FIRST” as we get the following more detailed error message in the error log:

the event's master log FIRST, end_log_pos 168

Btw, master log FIRST and end_log_pos 168 are not very useful: I will open a bug/feature request for that and put the bug number in the comments.

Many thanks for looking into that,


How to repeat:
See Bug#89141 to know how to get the errors quoted in the description.
[8 Jan 2018 22:22] Jean-François Gagné
About end_log_pos 168 are not very useful: Bug #89145 - Provide relay log details in case of Group Replication applier failure.

About end_log_pos that can be improved: Bug #89147 - The field end_log_pos in Group Replication error messages is ambiguous.
[10 Jan 2018 13:00] MySQL Verification Team
Hello Jean,

Thank you for the report and feedback!

[10 Jan 2018 13:01] MySQL Verification Team
Taken from Bug#89141

Attachment: 89141_5.7.20.results (application/octet-stream, text), 46.86 KiB.

[12 Mar 2018 5:15] Venkatesh Duggirala
Post by Developer:

Hello Jean,
Thank you for the bug report. The issue happens only in Group Replication(GR)
and the channel is "group_replication_applier" channel. 

In GR, the events are broadcasted to other nodes before the events are written
on originated Node's binary log. So the node which is applying the events does not know the exact binary log name and position (in the above example 168). 

We will analyze more and see what can be done in this situation.

Changing the title, category according to the above description.

Thanks once again for using Group Replication.

[20 Jun 2018 16:45] David Moss
Posted by developer:
Thank you for your feedback, this has been fixed in upcoming versions and the following was added to the 5.7.23 / 8.0.12 changelog:
When the group_replication_applier channel's applier thread encountered an error, the master_log_name and end_log_pos in the error message were incorrect.  In Group Replication, the events of a transaction are replicated before they are written to the binary log of the member where the transaction originated. The result is that the final master_log_name and the end_log_pos of those events are unknown at the time they are applied on the replica by group_replication_applier channel's applier thread. To avoid confusion, now any such error messages encountered by a group_replication_applier channel do not contain the binary log name and the binary log position.