Bug #112779 Not auto rejoin after failing to parse relay log
Submitted: 20 Oct 2023 3:57 Modified: 24 Oct 2023 6:39
Reporter: zetang zeng (OCA)
Status: Can't repeat
Category: MySQL Server: Group Replication    Severity: S3 (Non-critical)
Version: 8.0.33    OS: Any
Assigned to: MySQL Verification Team    CPU Architecture: Any

[20 Oct 2023 3:57] zetang zeng
Description:
After some network errors, one node of the MySQL cluster left the group and did not rejoin automatically, even though group_replication_autorejoin_tries is set to 3.

log:

2023-09-10T13:05:33.220875Z 14 [ERROR] [MY-010596] [Repl] Error reading relay log event for channel 'group_replication_applier': corrupted data in log event
2023-09-10T13:05:33.221651Z 14 [ERROR] [MY-013121] [Repl] Replica SQL for channel 'group_replication_applier': Relay log read failure: Could not parse relay log event entry. The possible reasons are: the source's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the replica's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, the server was unable to fetch a keyring key required to open an encrypted relay log file, or a bug in the source's or replica's MySQL code. If you want to check the source's binary log or replica's relay log, you will be able to know their names by issuing 'SHOW REPLICA STATUS' on this replica. Error_code: MY-013121

How to repeat:
Introduce random packet drops on the network between group members.

Suggested fix:
Nodes should rejoin automatically, as configured by group_replication_autorejoin_tries.

[21 Oct 2023 1:37] MySQL Verification Team
Hi,

I cannot reproduce this. When I introduce packet drops, the server tries to rejoin 3 times, at 5-minute intervals, and then stops trying, as expected.

[24 Oct 2023 6:39] zetang zeng
According to the source code, when a network problem causes a failure to parse a relay log event, the member leaves the group without HANDLE_AUTO_REJOIN being set. My question is: why is auto-rejoin not attempted in this case?

The applier error path in applier.cc does not request auto-rejoin:

    leave_group_on_failure::mask leave_actions;
    /*
      Only follow exit_state_action if we were already inside a group. We may
      happen to come across an applier error during the startup of GR (i.e.
      during the execution of the START GROUP_REPLICATION command). We must not
      follow exit_state_action on that situation.
    */
    leave_actions.set(leave_group_on_failure::HANDLE_EXIT_STATE_ACTION,
                      gcs_module->belongs_to_group());
    leave_group_on_failure::leave(leave_actions,
                                  ER_GRP_RPL_APPLIER_EXECUTION_FATAL_ERROR,
                                  nullptr, exit_state_action_abort_log_message);

Another case, where the member is expelled from the group, does request auto-rejoin:

    const char *exit_state_action_abort_log_message =
        "Member was expelled from the group due to network failures.";
    leave_group_on_failure::mask leave_actions;
    leave_actions.set(leave_group_on_failure::ALREADY_LEFT_GROUP, true);
    leave_actions.set(leave_group_on_failure::CLEAN_GROUP_MEMBERSHIP, true);
    leave_actions.set(leave_group_on_failure::STOP_APPLIER, true);
    leave_actions.set(leave_group_on_failure::HANDLE_EXIT_STATE_ACTION, true);
    leave_actions.set(leave_group_on_failure::HANDLE_AUTO_REJOIN, true);
    leave_group_on_failure::leave(leave_actions, ER_GRP_RPL_MEMBER_EXPELLED,
                                  &m_notification_ctx,
                                  exit_state_action_abort_log_message);
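
To make the difference between the two excerpts concrete, here is a minimal standalone sketch that models the action mask as a std::bitset. It is not the plugin's code: the namespace leave_group_on_failure_sketch, the helper names applier_error_actions, expelled_actions and would_attempt_rejoin, and the reduced list of flags are all hypothetical, and the rule that a rejoin is attempted only when HANDLE_AUTO_REJOIN is set and the configured number of tries is above zero is an assumption based on reading the two excerpts above.

```cpp
#include <bitset>
#include <cassert>

// Hypothetical, simplified stand-in for the plugin's leave_group_on_failure
// namespace. Only the flags that appear in the two excerpts are modelled.
namespace leave_group_on_failure_sketch {
enum enum_actions {
  ALREADY_LEFT_GROUP = 0,
  CLEAN_GROUP_MEMBERSHIP,
  STOP_APPLIER,
  HANDLE_EXIT_STATE_ACTION,
  HANDLE_AUTO_REJOIN,
  ACTION_MAX  // number of flags, used to size the bitset
};
using mask = std::bitset<ACTION_MAX>;
}  // namespace leave_group_on_failure_sketch

namespace lgf = leave_group_on_failure_sketch;

// Mirrors the applier.cc excerpt: only HANDLE_EXIT_STATE_ACTION is set,
// and only if the member already belongs to a group.
lgf::mask applier_error_actions(bool belongs_to_group) {
  lgf::mask actions;
  actions.set(lgf::HANDLE_EXIT_STATE_ACTION, belongs_to_group);
  return actions;
}

// Mirrors the expel excerpt: HANDLE_AUTO_REJOIN is set explicitly.
lgf::mask expelled_actions() {
  lgf::mask actions;
  actions.set(lgf::ALREADY_LEFT_GROUP, true);
  actions.set(lgf::CLEAN_GROUP_MEMBERSHIP, true);
  actions.set(lgf::STOP_APPLIER, true);
  actions.set(lgf::HANDLE_EXIT_STATE_ACTION, true);
  actions.set(lgf::HANDLE_AUTO_REJOIN, true);
  return actions;
}

// Assumption: a rejoin is attempted only when the flag is set and
// group_replication_autorejoin_tries is greater than zero.
bool would_attempt_rejoin(const lgf::mask &actions,
                          unsigned autorejoin_tries) {
  return actions.test(lgf::HANDLE_AUTO_REJOIN) && autorejoin_tries > 0;
}
```

Under these assumptions, would_attempt_rejoin(applier_error_actions(true), 3) is false while would_attempt_rejoin(expelled_actions(), 3) is true, which matches the behaviour described in this report: the value of group_replication_autorejoin_tries has no effect on the applier error path because the flag is never set there.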