MySQL Bugs: #104980: After secondary node is killed, it can not rejoined

Bug #104980	After secondary node is killed, it can not rejoined
Submitted:	18 Sep 2021 1:33	Modified:	14 Nov 2022 23:40
Reporter:	Ye Jinrong
Status:	Closed
Category:	MySQL Server: Group Replication	Severity:	S2 (Serious)
Version:	8.0.26	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
After a secondary node is killed, it cannot rejoin the group when group_replication_consistency is set to BEFORE_AND_AFTER or AFTER.

How to repeat:
0. Set up a three-node MGR (MySQL Group Replication) cluster in single-primary mode.

1. Set group_replication_consistency to either BEFORE_AND_AFTER or AFTER (the problem does not occur with the other consistency levels).

2. Start sysbench to run a continuous benchmark against the MGR cluster.

3. During the test, kill a randomly chosen secondary node.

4. After multiple retries, the secondary node will probably fail to rejoin the cluster. The error message is similar to the following:
```
[ERROR] [MY-013309] [Repl] Plugin group_replication reported: 'Transaction '2:39976870' does not exist on Group Replication consistency manager while receiving remote transaction prepare.'
[ERROR] [MY-011452] [Repl] Plugin group_replication reported: 'Fatal error during execution on the Applier process of Group Replication. The server will now leave the group.'
[ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
```
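
To check whether a rejoin failure matches this report, the error log can be scanned for the MY-013309 message. The following is only a minimal sketch: the helper name `find_missing_prepare_ids` is hypothetical, and the pattern assumes the message wording shown in the log excerpt above.

```python
import re

# Pattern based on the MY-013309 line quoted in this report; the wording
# may differ in other server versions (assumption).
PREPARE_ERROR = re.compile(
    r"\[ERROR\] \[MY-013309\].*Transaction '([^']+)' does not exist on "
    r"Group Replication consistency manager"
)


def find_missing_prepare_ids(lines):
    """Return the transaction identifiers named in MY-013309 errors."""
    ids = []
    for line in lines:
        match = PREPARE_ERROR.search(line)
        if match:
            ids.append(match.group(1))
    return ids


if __name__ == "__main__":
    sample = [
        "[ERROR] [MY-013309] [Repl] Plugin group_replication reported: "
        "'Transaction '2:39976870' does not exist on Group Replication "
        "consistency manager while receiving remote transaction prepare.'",
        "[ERROR] [MY-011452] [Repl] Plugin group_replication reported: "
        "'Fatal error during execution on the Applier process of Group "
        "Replication. The server will now leave the group.'",
    ]
    print(find_missing_prepare_ids(sample))
```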

P.S. The sysbench Lua script is:
```
require("oltp_common")

local runtype = 0;

function prepare_statements()
   -- use 1 query per event, rather than sysbench.opt.point_selects which
   -- defaults to 10 in other OLTP scripts
   sysbench.opt.point_selects=1

   runtype = (10 * sysbench.tid + 10) / sysbench.opt.threads

   if runtype <= 6 then
     prepare_point_selects()
   else
     prepare_non_index_updates()
   end
end

function event(thread_id)
   if runtype <= 6 then
     execute_point_selects()
   else
     execute_non_index_updates()
   end
end
```
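
For reference, the `runtype` formula in the script above splits the worker threads into readers and writers. The sketch below redoes that arithmetic in Python; it assumes `sysbench.tid` is zero-based and that `/` is floating-point division in the Lua runtime sysbench uses. The function names are illustrative only.

```python
def runtype(tid: int, threads: int) -> float:
    """Same expression as in the Lua script: (10 * tid + 10) / threads."""
    return (10 * tid + 10) / threads


def split_threads(threads: int):
    """Return (point-select thread ids, non-index-update thread ids)."""
    readers = [t for t in range(threads) if runtype(t, threads) <= 6]
    writers = [t for t in range(threads) if runtype(t, threads) > 6]
    return readers, writers


if __name__ == "__main__":
    readers, writers = split_threads(16)
    print("point-select threads:", readers)
    print("non-index-update threads:", writers)
```

Under those assumptions, with --threads=16, threads 0 to 8 run point selects and threads 9 to 15 run non-index updates, so roughly 44% of the threads are writing.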

sysbench parameters:
- --tables=10
- --table_size=100000
- --threads=16
- --report-interval=1

and my.cnf:
```
innodb_buffer_pool_size = 256M

slave_parallel_type = LOGICAL_CLOCK
slave_parallel_workers = 64
binlog_transaction_dependency_tracking = WRITESET
slave_preserve_commit_order = 1
slave_checkpoint_period = 2

group_replication_flow_control_mode = "DISABLED"

```

Hi, 

Thanks for the report and the script to reproduce.

Hi,

It took me almost 3 days to reproduce this, and now our dev team is having trouble reproducing it too. Can you share more details? A full config file would help for a start. Also, how exactly are you "killing the node": with kill -9, a shutdown, or something else?

Thanks

Bug #105748 is marked as duplicate of this one

Documented fix as follows in the MySQL 8.0.32 changelog:

        When a group was run with group_replication_consistency = AFTER
        and a secondary failed due to external conditions such as an
        unstable network, the secondary could sometimes encounter the
        error "Transaction 'GTID' does not exist on Group Replication
        consistency manager while receiving remote transaction prepare."

        The root cause of this issue was that the primary might log out
        of order the View_change_log_event with which the secondary
        rejoined; when the secondary used the primary as the group
        donor, this could cause the secondary to catch up with the group
        improperly and, eventually, generate incorrect GTIDs for the
        group transactions. The group replication primary ensures that
        the View_change_log_event is logged after all preceding
        transactions, but there was a window during which transactions
        ordered after the View_change_log_event on the group global
        order could be logged before the View_change_log_event.

        To solve this issue, we now make sure that transactions ordered
        before a view are always logged before the
        View_change_log_event, and that transactions ordered after a
        view are always logged after the View_change_log_event. This is
        now done by the binary log ticket manager, which guarantees the
        order in which transactions in the binary log group commit are
        committed.

Closed.
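
The ordering guarantee described in the changelog entry above can be illustrated with a toy model. This is not MySQL's binary log ticket manager; the class and function names are hypothetical, and the model assumes only what the changelog states: transactions ordered before a view must be logged before the View_change_log_event, and transactions ordered after it must be logged after it.

```python
from dataclasses import dataclass


@dataclass
class ToyTicketManager:
    """Hands out tickets in group global order (illustrative only)."""
    current: int = 1

    def transaction_ticket(self) -> int:
        # Transactions between two view changes share the current ticket.
        return self.current

    def view_change_ticket(self) -> int:
        # The view change closes the current ticket, takes a ticket of its
        # own, and moves later transactions to a newer ticket.
        self.current += 1
        ticket = self.current
        self.current += 1
        return ticket


def assign_tickets(global_order):
    """Assign a ticket to each event, walking the group global order."""
    manager = ToyTicketManager()
    tickets = {}
    for name in global_order:
        if name.startswith("VIEW"):
            tickets[name] = manager.view_change_ticket()
        else:
            tickets[name] = manager.transaction_ticket()
    return tickets


def log_order(arrival_order, tickets):
    """Log events by ticket; sorted() is stable, so events that share a
    ticket keep their arrival order."""
    return sorted(arrival_order, key=lambda name: tickets[name])


if __name__ == "__main__":
    tickets = assign_tickets(["T1", "T2", "VIEW", "T3", "T4"])
    # Events reach the commit stage out of order, e.g. T3 before VIEW.
    print(log_order(["T3", "T1", "VIEW", "T4", "T2"], tickets))
```

In this model, T3 and T4 can never be logged ahead of the view change even if they reach the commit stage first, which is the window the fix is described as closing.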

Hit the same issue. Is there a workaround for this bug for now?