Description:
Hi,
In Bug#89141, I describe a situation generating an error in the Group Replication applier. This bug can produce something like below in the error log (this is not an exact quote of Bug#89141 but it is generated the same way):
2018-01-06T18:41:23.331363Z 2242 [ERROR] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'UUID:126' at master log , end_log_pos 168; Could not execute Write_rows event on table test_jfg_ws.test_jfg_ws; Duplicate entry 'C' for key 'str', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 168, Error_code: 1062
2018-01-06T18:41:31.885927Z 2242 [ERROR] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'UUID:132' at master log , end_log_pos 168; Could not execute Write_rows event on table test_jfg_ws.test_jfg_ws; Duplicate entry 'C' for key 'str', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 168, Error_code: 1062
2018-01-06T18:41:52.143778Z 2242 [ERROR] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'UUID:147' at master log , end_log_pos 168; Could not execute Write_rows event on table test_jfg_ws.test_jfg_ws; Duplicate entry 'c' for key 'str', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 168, Error_code: 1062
2018-01-06T18:41:56.857187Z 2242 [ERROR] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'UUID:150' at master log , end_log_pos 168; Could not execute Write_rows event on table test_jfg_ws.test_jfg_ws; Duplicate entry 'C' for key 'str', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 168, Error_code: 1062
2018-01-06T18:42:05.385074Z 2242 [ERROR] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'UUID:156' at master log , end_log_pos 168; Could not execute Write_rows event on table test_jfg_ws.test_jfg_ws; Duplicate entry 'C' for key 'str', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 168, Error_code: 1062
2018-01-06T18:42:25.254811Z 2242 [ERROR] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'UUID:171' at master log , end_log_pos 168; Could not execute Write_rows event on table test_jfg_ws.test_jfg_ws; Duplicate entry 'c' for key 'str', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 168, Error_code: 1062
2018-01-06T18:42:33.694658Z 2242 [ERROR] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'UUID:177' at master log , end_log_pos 168; Could not execute Write_rows event on table test_jfg_ws.test_jfg_ws; Duplicate entry 'c' for key 'str', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 168, Error_code: 1062
And this is what we get in P_S for the last error:
> SELECT * FROM performance_schema.replication_applier_status_by_coordinator
-> WHERE CHANNEL_NAME = 'group_replication_applier'\G
*************************** 1. row ***************************
CHANNEL_NAME: group_replication_applier
THREAD_ID: 2274
SERVICE_STATE: ON
LAST_ERROR_NUMBER: 1062
LAST_ERROR_MESSAGE: Coordinator stopped because there were error(s) in the worker(s).
The most recent failure being: Worker 2 failed executing transaction 'UUID:177' at
master log , end_log_pos 168. See error log and/or performance_schema.replication_applier_status_by_worker
table for more details about this failure or others, if any.
LAST_ERROR_TIMESTAMP: 2018-01-06 19:42:33
1 row in set (0.00 sec)
Note: those error quotes are from this blog post:
https://jfg-mysql.blogspot.com/2018/01/write-set-in-mysql-5-7-group-replication.html
As you can see, the 7 lines above are referring to the same end_log_pos (168) but with different GITD. If we parse the relay logs for the last 2 lines, we get the following:
# at 13288
#180106 19:40:07 [...] end_log_pos 61 GTID last_committed=55 sequence_number=57 [...]
SET @@SESSION.GTID_NEXT= 'UUID:171'/*!*/;
# at 13349
#180106 19:42:13 server id 198208002 end_log_pos 69 Query thread_id=2488 exec_time=0 error_code=0
SET TIMESTAMP=1515264133/*!*/;
BEGIN
# at 13418
#180106 19:42:13 [...] end_log_pos 129 Table_map: `test_jfg_ws`.`test_jfg_ws` mapped to number 331
# at 13478
#180106 19:42:13 [...] end_log_pos 168 Write_rows: table id 331 flags: STMT_END_F
[...]
# at 13517
#180106 19:42:13 [...] end_log_pos 195 Xid = 1094765
COMMIT/*!*/;
# at 14722
#180106 19:40:07 [...] end_log_pos 61 GTID last_committed=61 sequence_number=63 [...]
SET @@SESSION.GTID_NEXT= 'UUID:177'/*!*/;
# at 14783
#180106 19:42:20 [...] end_log_pos 69 Query thread_id=2517 exec_time=0 error_code=0
SET TIMESTAMP=1515264140/*!*/;
BEGIN
# at 14852
#180106 19:42:20 [...] end_log_pos 129 Table_map: `test_jfg_ws`.`test_jfg_ws` mapped to number 331
# at 14912
#180106 19:42:20 [...] end_log_pos 168 Write_rows: table id 331 flags: STMT_END_F
[...]
# at 14951
#180106 19:42:20 [...] end_log_pos 195 Xid = 1094789
COMMIT/*!*/;
The COMMIT of those two transactions are at the relay log positions 13517 and 14951 but their end_log_pos values for the previous event are 168.
Usually, the end_log_pos of the relay log is from the binary log (this is how the slave knows at which position it is in the binary log of its master). I can guess that the end_log_pos of Group Replication match the offset of that event in the transaction. However, this information is not very useful in the error messages of Group Replication.
Two things:
1) the error message should probably be improved (some suggestions in Bug#89145),
2) for Group Replication, end_log_pos should be renamed trx_offset or similar, both in the error message (error log and P_S) and in mysqlbinlog.
Many thanks,
JFG
How to repeat:
See Bug#89141.
Suggested fix:
Two things:
1) the error message should probably be improved (some suggestions in Bug#89145),
2) for Group Replication, end_log_pos should be renamed trx_offset or similar, both in the error message (error log and P_S) and in mysqlbinlog.