| Bug #104980 | After secondary node is killed, it cannot rejoin | ||
|---|---|---|---|
| Submitted: | 18 Sep 2021 1:33 | Modified: | 14 Nov 2022 23:40 |
| Reporter: | Ye Jinrong | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Server: Group Replication | Severity: | S2 (Serious) |
| Version: | 8.0.26 | OS: | Any |
| Assigned to: | | CPU Architecture: | Any |
[23 Sep 2021 13:48]
MySQL Verification Team
Hi, Thanks for the report and the script to reproduce.
[8 Oct 2021 20:52]
MySQL Verification Team
Hi, This took me almost 3 days to reproduce, and now our dev team is having trouble reproducing it too. Can you share more details? A full config file would help, for a start. Can you also tell me how you are "killing the node": kill -9, a shutdown, or something else? Thanks
[13 Dec 2021 13:08]
MySQL Verification Team
Bug #105748 is marked as a duplicate of this one.
[14 Nov 2022 23:40]
Jon Stephens
Documented fix as follows in the MySQL 8.0.32 changelog:
When a group was run with group_replication_consistency = AFTER
and a secondary failed due to external conditions such as an
unstable network, the secondary could sometimes encounter the
error "Transaction 'GTID' does not exist on Group Replication
consistency manager while receiving remote transaction prepare."
The root cause of this issue was that the primary might log out
of order the View_change_log_event with which the secondary
rejoined; when the secondary used the primary as the group
donor, this could cause the secondary to catch up with the group
improperly and, eventually, generate incorrect GTIDs for the
group transactions. The group replication primary ensures that
the View_change_log_event is logged after all preceding
transactions, but there was a window during which transactions
ordered after the View_change_log_event on the group global
order could be logged before the View_change_log_event.
To solve this issue, we now make sure that transactions ordered
before a view are always logged before the
View_change_log_event, and that transactions ordered after a
view are always logged after the View_change_log_event. This is
now done by the binary log ticket manager, which guarantees the
order in which transactions in the binary log group commit are
committed.
Closed.
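The fix described above relies on releasing writers to the binary log strictly in global-order (ticket) sequence. As an illustration only (a toy sketch, not MySQL internals; `TicketManager` and all names here are hypothetical), the ordering guarantee can be modeled like this: each event carries a ticket reflecting the group's global order, and a writer that arrives early is held back until every lower ticket has been logged.

```python
# Toy model of ticket-ordered logging (hypothetical names, not MySQL code).
# A transaction ordered after the View_change_log_event can arrive at the
# log layer first, but it is only written once all earlier tickets are in.
import heapq

class TicketManager:
    def __init__(self):
        self._next = 1      # next ticket allowed to write
        self._waiting = []  # min-heap of (ticket, event) that arrived early
        self.log = []       # the "binary log": events in logged order

    def commit(self, ticket, event):
        """Writer arrives with its global-order ticket; drain the heap
        while the lowest waiting ticket is the next one due."""
        heapq.heappush(self._waiting, (ticket, event))
        while self._waiting and self._waiting[0][0] == self._next:
            _, ev = heapq.heappop(self._waiting)
            self.log.append(ev)
            self._next += 1

tm = TicketManager()
# Commit threads arrive out of order: the transaction ordered *after* the
# view change reaches the log layer first (the window described in the fix).
tm.commit(3, "trx-after-view")
tm.commit(1, "trx-before-view")
tm.commit(2, "View_change_log_event")
print(tm.log)  # logged order follows tickets, not arrival order
```

With this discipline, the View_change_log_event always lands between the transactions ordered before and after it, which is the invariant the changelog entry says the binary log ticket manager now enforces.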
[22 Dec 2022 5:15]
ZhaoPing Lu
Hit the same issue. Is there a workaround for this bug for now?

Description:
After a secondary node is killed, it cannot rejoin when group_replication_consistency = BEFORE_AND_AFTER or AFTER.

How to repeat:
0. Set up a 3-node MGR cluster in single-primary mode.
1. Set group_replication_consistency = BEFORE_AND_AFTER or AFTER (either one; the other consistency modes show no problem).
2. Start sysbench to run a continuous benchmark against the MGR cluster.
3. During the test, randomly kill a secondary node.
4. After multiple retries, the secondary node will probably fail to rejoin the cluster. The error messages are similar to the following:
```
[ERROR] [MY-013309] [Repl] Plugin group_replication reported: 'Transaction '2:39976870' does not exist on Group Replication consistency manager while receiving remote transaction prepare.'
[ERROR] [MY-011452] [Repl] Plugin group_replication reported: 'Fatal error during execution on the Applier process of Group Replication. The server will now leave the group.'
[ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
```
P.S. The sysbench lua script is:
```
require("oltp_common")

local runtype = 0

function prepare_statements()
   -- use 1 query per event, rather than sysbench.opt.point_selects which
   -- defaults to 10 in other OLTP scripts
   sysbench.opt.point_selects = 1
   runtype = (10 * sysbench.tid + 10) / sysbench.opt.threads
   if runtype <= 6 then
      prepare_point_selects()
   else
      prepare_non_index_updates()
   end
end

function event(thread_id)
   if runtype <= 6 then
      execute_point_selects()
   else
      execute_non_index_updates()
   end
end
```
sysbench parameters:
- --tables=10
- --table_size=100000
- --threads=16
- --report-interval=1

and my.cnf:
```
innodb_buffer_pool_size = 256M
slave_parallel_type = LOGICAL_CLOCK
slave_parallel_workers = 64
binlog_transaction_dependency_tracking = WRITESET
slave_preserve_commit_order = 1
slave_checkpoint_period = 2
group_replication_flow_control_mode = "DISABLED"
```
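The lua script above assigns each sysbench thread a fixed role from its `runtype` value: threads with `runtype <= 6` run point selects, the rest run non-index updates. A quick check (a sketch reproducing the script's formula in Python, assuming the reported --threads=16) shows the resulting read/write mix:

```python
# Mirror the lua script's per-thread split:
#   runtype = (10 * sysbench.tid + 10) / sysbench.opt.threads
# Threads with runtype <= 6 do point selects; the others do updates.
def runtype(tid, threads):
    return (10 * tid + 10) / threads

threads = 16  # matches the reported --threads=16
readers = [tid for tid in range(threads) if runtype(tid, threads) <= 6]
writers = [tid for tid in range(threads) if runtype(tid, threads) > 6]
print(len(readers), len(writers))  # 9 point-select threads, 7 update threads
```

So the workload is roughly 60/40 reads to writes, keeping a steady stream of write transactions flowing while the secondary is killed and tries to rejoin.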