Bug #88304 SOMETIMES MEMBERS ENTER IN ERROR STATE AFTER LONG RECOVERY
Submitted: 31 Oct 2017 11:16
Modified: 11 Jan 2018 15:56
Reporter: Vitor Oliveira
Status: Closed
Category: MySQL Server: Group Replication
Severity: S3 (Non-critical)
Version: 8.0.3
OS: Any
Assigned to:
CPU Architecture: Any

[31 Oct 2017 11:16] Vitor Oliveira
Description:
In Group Replication, when a member recovers while a significant amount of load is being applied, the recovering node sometimes fails right after recovery with an error on the binary log.

This happens immediately after finishing recovery as can be seen in this message log:
2017-10-26T14:15:37.315509Z 0 [Note] [000000] Plugin group_replication reported: 'This server was declared online within the replication group' (Recovering node)
2017-10-26T14:15:37.315598Z 0 [Note] [000000] Plugin group_replication reported: 'The member with address siv30:29543 was declared online within the replication group' (other node)
2017-10-26T14:15:37.316237Z 0 [Note] [000000] Plugin group_replication reported: 'The member with address siv30:29543 was declared online within the replication group' (other node)

2017-10-26T14:15:37.316508Z 441 [ERROR] [001782] Slave SQL for channel 'group_replication_applier': Worker 1 failed executing transaction 'NOT_YET_DETERMINED' at master log , end_log_pos 563; Error executing row event: '@@SESSION.GTID_NEXT cannot be set to ANONYMOUS when @@GLOBAL.GTID_MODE = ON.', Error_code: 1782
2017-10-26T14:15:37.316600Z 440 [Warning] [001756] Slave SQL for channel 'group_replication_applier': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756
2017-10-26T14:15:37.316815Z 440 [ERROR] [000000] Plugin group_replication reported: 'The applier thread execution was aborted. Unable to process more transactions, this member will now leave the group.'
2017-10-26T14:15:37.316872Z 92 [ERROR] [000000] Plugin group_replication reported: 'Fatal error during execution on the Applier process of Group Replication. The server will now leave the group.'
2017-10-26T14:15:37.316946Z 92 [ERROR] [000000] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
2017-10-26T14:15:37.318555Z 92 [Note] [000000] Plugin group_replication reported: 'The group replication applier thread was killed'

At this point the member trying to join the group enters the ERROR state.
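For anyone who wants to check their own error log for the same sequence, here is a minimal sketch in Python. It assumes the line layout seen in the excerpt above (timestamp, thread id, level, bracketed error code, message); the names parse_line and applier_failed_after_online are made up for this example and are not part of any MySQL tool.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Assumed layout, taken from the excerpt above:
# <timestamp> <thread id> [<level>] [<error code>] <message>
LOG_LINE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<thread>\d+)\s+\[(?P<level>\w+)\]\s+\[(?P<code>\d+)\]\s+(?P<msg>.*)$"
)


class Entry(NamedTuple):
    ts: str
    thread: int
    level: str
    code: int
    msg: str


def parse_line(line: str) -> Optional[Entry]:
    """Parse one error log line, or return None if it does not match the assumed layout."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    return Entry(m["ts"], int(m["thread"]), m["level"], int(m["code"]), m["msg"])


def applier_failed_after_online(lines: Iterable[str]) -> bool:
    """True if error 1782 on the group_replication_applier channel is logged
    after the 'This server was declared online' note."""
    online_seen = False
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        if "This server was declared online" in entry.msg:
            online_seen = True
        elif (
            online_seen
            and entry.level == "ERROR"
            and entry.code == 1782
            and "group_replication_applier" in entry.msg
        ):
            return True
    return False


# Two lines copied from the log excerpt in this report.
SAMPLE = [
    "2017-10-26T14:15:37.315509Z 0 [Note] [000000] Plugin group_replication reported: "
    "'This server was declared online within the replication group' (Recovering node)",
    "2017-10-26T14:15:37.316508Z 441 [ERROR] [001782] Slave SQL for channel "
    "'group_replication_applier': Worker 1 failed executing transaction 'NOT_YET_DETERMINED' "
    "at master log , end_log_pos 563; Error executing row event: '@@SESSION.GTID_NEXT cannot "
    "be set to ANONYMOUS when @@GLOBAL.GTID_MODE = ON.', Error_code: 1782",
]

if __name__ == "__main__":
    print(applier_failed_after_online(SAMPLE))
```

The check only flags the pattern when the error follows the "declared online" note, which is the ordering described in this report. Other log formats or versions may need a different regular expression.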

How to repeat:
This is a follow-up to BUG#26731317. The conditions are similar, although it seems to happen less frequently and only under the highest load.

Suggested fix:
The message "Worker 1 failed executing transaction 'NOT_YET_DETERMINED' at master log , end_log_pos 563; Error executing row event: '@@SESSION.GTID_NEXT cannot be set to ANONYMOUS when @@GLOBAL.GTID_MODE = ON.', Error_code: 1782" seems to indicate that there is corruption in the binary log, something that was not visible before the fix for BUG#26731317.
[15 Jan 2018 13:54] David Moss
Posted by developer:
 
After review by the team it was decided to add a change log entry. Please ignore previous message.
----
Thank you for your feedback. This has been fixed in upcoming versions, and the following was added to the 5.7.21 / 8.0.4 change logs:
During distributed recovery as part of joining the group, when the applier was signaling that it had applied all transactions, it was also blindly searching for partial transactions. This was to avoid future applier errors, which would happen if the applier stopped at this point. However, this search and remove only made sense for applier stop cases. Upon execution completeness it should not be done, otherwise it can corrupt or purge the applier relay log, which can lead to data loss. To solve this issue, when the applier is waiting for execution completeness, it no longer searches for and removes partial transactions.
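
The decision described in the change log entry can be illustrated with a toy model. This is only a sketch and not the server code: the relay log is modeled as a plain list of event names, and the event kinds, the reason strings and all function names are hypothetical.

```python
from typing import List

# Hypothetical event kinds for a toy relay log; real relay log events differ.
GTID, ROWS, COMMIT = "GTID", "ROWS", "COMMIT"


def drop_partial_tail(events: List[str]) -> List[str]:
    """Return the events up to and including the last COMMIT, which removes
    an unfinished trailing transaction."""
    for i in range(len(events) - 1, -1, -1):
        if events[i] == COMMIT:
            return events[: i + 1]
    return []


def old_behaviour(events: List[str], reason: str) -> List[str]:
    # Toy version of the behavior before the fix: the cleanup ran
    # regardless of why the applier signaled.
    return drop_partial_tail(events)


def fixed_behaviour(events: List[str], reason: str) -> List[str]:
    # Toy version of the behavior after the fix: the cleanup runs only
    # when the applier actually stopped.
    if reason == "stopped":
        return drop_partial_tail(events)
    return list(events)


# One complete transaction followed by one that is still being received.
relay_log = [GTID, ROWS, COMMIT, GTID, ROWS]

if __name__ == "__main__":
    print(old_behaviour(relay_log, "execution_complete"))
    print(fixed_behaviour(relay_log, "execution_complete"))
```

In this model, the old path discards the in-flight transaction even though the applier is still running, while the fixed path leaves the log untouched unless the reason is an applier stop.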
[1 Feb 2018 14:03] David Moss
Now published.