MySQL Bugs: #89147: The field end_log_pos in Group Replication error messages is ambiguous.

Bug #89147	The field end_log_pos in Group Replication error messages is ambiguous.
Submitted:	8 Jan 2018 22:18	Modified:	11 Jan 2018 9:51
Reporter:	Jean-François Gagné	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Group Replication	Severity:	S4 (Feature request)
Version:	5.7.20,8.0.3	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Hi,

In Bug#89141, I describe a situation generating an error in the Group Replication applier.  This bug can produce something like below in the error log (this is not an exact quote of Bug#89141 but it is generated the same way):

2018-01-06T18:41:23.331363Z 2242 [ERROR] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'UUID:126' at master log , end_log_pos 168; Could not execute Write_rows event on table test_jfg_ws.test_jfg_ws; Duplicate entry 'C' for key 'str', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 168, Error_code: 1062
2018-01-06T18:41:31.885927Z 2242 [ERROR] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'UUID:132' at master log , end_log_pos 168; Could not execute Write_rows event on table test_jfg_ws.test_jfg_ws; Duplicate entry 'C' for key 'str', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 168, Error_code: 1062
2018-01-06T18:41:52.143778Z 2242 [ERROR] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'UUID:147' at master log , end_log_pos 168; Could not execute Write_rows event on table test_jfg_ws.test_jfg_ws; Duplicate entry 'c' for key 'str', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 168, Error_code: 1062
2018-01-06T18:41:56.857187Z 2242 [ERROR] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'UUID:150' at master log , end_log_pos 168; Could not execute Write_rows event on table test_jfg_ws.test_jfg_ws; Duplicate entry 'C' for key 'str', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 168, Error_code: 1062
2018-01-06T18:42:05.385074Z 2242 [ERROR] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'UUID:156' at master log , end_log_pos 168; Could not execute Write_rows event on table test_jfg_ws.test_jfg_ws; Duplicate entry 'C' for key 'str', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 168, Error_code: 1062
2018-01-06T18:42:25.254811Z 2242 [ERROR] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'UUID:171' at master log , end_log_pos 168; Could not execute Write_rows event on table test_jfg_ws.test_jfg_ws; Duplicate entry 'c' for key 'str', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 168, Error_code: 1062
2018-01-06T18:42:33.694658Z 2242 [ERROR] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'UUID:177' at master log , end_log_pos 168; Could not execute Write_rows event on table test_jfg_ws.test_jfg_ws; Duplicate entry 'c' for key 'str', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 168, Error_code: 1062

And this is what we get in P_S for the last error:

> SELECT * FROM performance_schema.replication_applier_status_by_coordinator
    ->   WHERE CHANNEL_NAME = 'group_replication_applier'\G
*************************** 1. row ***************************
        CHANNEL_NAME: group_replication_applier
           THREAD_ID: 2274
       SERVICE_STATE: ON
   LAST_ERROR_NUMBER: 1062
  LAST_ERROR_MESSAGE: Coordinator stopped because there were error(s) in the worker(s).
The most recent failure being: Worker 2 failed executing transaction 'UUID:177' at
master log , end_log_pos 168. See error log and/or performance_schema.replication_applier_status_by_worker
table for more details about this failure or others, if any.
LAST_ERROR_TIMESTAMP: 2018-01-06 19:42:33
1 row in set (0.00 sec)

Note: those error quotes are from this blog post:
https://jfg-mysql.blogspot.com/2018/01/write-set-in-mysql-5-7-group-replication.html

As you can see, the 7 lines above are referring to the same end_log_pos (168) but with different GITD.  If we parse the relay logs for the last 2 lines, we get the following:

# at 13288
#180106 19:40:07 [...] end_log_pos 61 GTID last_committed=55 sequence_number=57 [...]
SET @@SESSION.GTID_NEXT= 'UUID:171'/*!*/;
# at 13349
#180106 19:42:13 server id 198208002  end_log_pos 69  Query thread_id=2488 exec_time=0  error_code=0
SET TIMESTAMP=1515264133/*!*/;
BEGIN
# at 13418
#180106 19:42:13 [...] end_log_pos 129 Table_map: `test_jfg_ws`.`test_jfg_ws` mapped to number 331
# at 13478
#180106 19:42:13 [...] end_log_pos 168 Write_rows: table id 331 flags: STMT_END_F
[...]
# at 13517
#180106 19:42:13 [...] end_log_pos 195 Xid = 1094765
COMMIT/*!*/;

# at 14722
#180106 19:40:07 [...] end_log_pos 61 GTID last_committed=61 sequence_number=63 [...]
SET @@SESSION.GTID_NEXT= 'UUID:177'/*!*/;
# at 14783
#180106 19:42:20 [...] end_log_pos 69 Query thread_id=2517 exec_time=0 error_code=0
SET TIMESTAMP=1515264140/*!*/;
BEGIN
# at 14852
#180106 19:42:20 [...] end_log_pos 129 Table_map: `test_jfg_ws`.`test_jfg_ws` mapped to number 331
# at 14912
#180106 19:42:20 [...] end_log_pos 168 Write_rows: table id 331 flags: STMT_END_F
[...]
# at 14951
#180106 19:42:20 [...] end_log_pos 195   Xid = 1094789
COMMIT/*!*/;

The COMMIT of those two transactions are at the relay log positions 13517 and 14951 but their end_log_pos values for the previous event are 168.

Usually, the end_log_pos of the relay log is from the binary log (this is how the slave knows at which position it is in the binary log of its master).  I can guess that the end_log_pos of Group Replication match the offset of that event in the transaction.  However, this information is not very useful in the error messages of Group Replication.

Two things:

1) the error message should probably be improved (some suggestions in Bug#89145),

2) for Group Replication, end_log_pos should be renamed trx_offset or similar, both in the error message (error log and P_S) and in mysqlbinlog.

Many thanks,

JFG

How to repeat:
See Bug#89141.

Suggested fix:
Two things:

1) the error message should probably be improved (some suggestions in Bug#89145),

2) for Group Replication, end_log_pos should be renamed trx_offset or similar, both in the error message (error log and P_S) and in mysqlbinlog.

Update title because typo.

Hello Jean,

Thank you for the report and feature request!

Thanks,
Umesh