MySQL Bugs: #105748: Why is group_replication_consistency = AFTER crashing a Group Replication cluste

Bug #105748	Why is group_replication_consistency = AFTER crashing a Group Replication cluste
Submitted:	30 Nov 2021 11:47	Modified:	13 Dec 2021 13:07
Reporter:	Marco Tusa
Status:	Duplicate
Category:	MySQL Server: Group Replication	Severity:	S2 (Serious)
Version:		OS:	Any
Assigned to:	MySQL Verification Team	CPU Architecture:	Any

Description:
Scenario:
    Two data centers: DC1 (production) and DC2 (disaster recovery).
    DC2 replicates from DC1 using asynchronous replication with Asynchronous Connection Failover.
    ProxySQL routes requests to the active node, for DC1 only.
    sysbench is used as the application to generate some traffic.
    The following is the specific group replication configuration:
        ######################################
        #Group Replication
        ######################################
        plugin_load_add                                     ='group_replication.so'
        plugin-load-add                                     ='mysql_clone.so'
        group_replication_start_on_boot                     =off
        group_replication_group_name                        ="dc1aaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"
        group_replication_local_address                     = "10.0.0.14:33061"
        group_replication_group_seeds                       = "10.0.0.14:33061,10.0.0.36:33061,10.0.0.81:33061"
        group_replication_bootstrap_group                   = off
        group_replication_ip_allowlist                      = "10.0.0.0/24,192.168.0.0/24"
        # from 8.0.27
        group_replication_paxos_single_leader               = on
        group_replication_auto_increment_increment          = 1
        group_replication_communication_max_message_size    = 10485760
        group_replication_autorejoin_tries                  = 10
        group_replication_consistency                       = AFTER
        group_replication_flow_control_period               = 10
        group_replication_flow_control_hold_percent         = 25
        group_replication_flow_control_release_percent      = 50
        group_replication_member_expel_timeout              = 20
    
What happens?
When running a minimal load (2 threads executing R/W operations), we bring down the running Primary with a normal shutdown and wait for production to move to the new node.
After waiting a few minutes, we restart the node we stopped; once the node is up and running, we start group_replication on it again.
The node fails to join with an error, and sometimes the whole cluster fails.

How to repeat:
To reproduce:
- Create a cluster using the above settings
- Run some load
- Stop the Primary gracefully
- Let the load run for a bit on the new Primary
- Restart the stopped node
- Make sure group_replication_consistency=AFTER is set
- Start group_replication
On failure:
- Stop group_replication
- Change group_replication_consistency to EVENTUAL
- Restart group_replication
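
For reference, the "on fail" steps above could be expressed as the following statement sequence. This is only a sketch: the helper name workaround_statements is hypothetical and not part of the report, and it only builds the SQL text; running it against a server is left to the reader.

```python
# Hypothetical helper (not from the report): builds the SQL for the
# "on fail" steps so they can be reviewed or piped to a client.

# Values accepted by group_replication_consistency in MySQL 8.0.
VALID_LEVELS = {
    "EVENTUAL",
    "BEFORE_ON_PRIMARY_FAILOVER",
    "BEFORE",
    "AFTER",
    "BEFORE_AND_AFTER",
}


def workaround_statements(fallback="EVENTUAL"):
    """Return the statements for the 'on fail' steps, in order."""
    if fallback not in VALID_LEVELS:
        raise ValueError(f"unknown consistency level: {fallback!r}")
    return [
        "STOP GROUP_REPLICATION",
        f"SET GLOBAL group_replication_consistency = '{fallback}'",
        "START GROUP_REPLICATION",
    ]


if __name__ == "__main__":
    for stmt in workaround_statements():
        print(stmt + ";")
```

The statements would be run on the node that failed to rejoin, not on the current Primary.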

Suggested fix:
What are the expectations?
1) When the node is stopped, there should be no issue closing the pending threads and certifications. Why does this occur: [ERROR] [MY-010207] [Repl] Run function 'before_commit' in plugin 'group_replication' failed
2) On start, the node should not fail to rejoin the group when group_replication_consistency=AFTER.
3) If group_replication_consistency=AFTER is not supported while joining the cluster, the node should automatically shift to a supported level and, once joined, move back to the level declared in the configuration.
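
To illustrate expectation 3, here is a rough sketch of what "shift to a supported level, then move back" might look like if done from outside the server. Everything here is an assumption for illustration: rejoin_with_fallback, execute, and member_state are hypothetical names, and this is not how the plugin behaves today. The caller supplies execute (runs one SQL statement) and member_state (returns this member's MEMBER_STATE, for example from performance_schema.replication_group_members).

```python
import time


def rejoin_with_fallback(execute, member_state, configured="AFTER",
                         fallback="EVENTUAL", timeout_s=300.0, poll_s=1.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Join with a fallback consistency level, then restore the configured one.

    execute:      hypothetical callable that runs one SQL statement.
    member_state: hypothetical callable returning the local MEMBER_STATE
                  string (ONLINE, RECOVERING, ERROR, ...).
    Returns True if the configured level was restored, False on timeout.
    """
    # Join with the level assumed to be safe during rejoin.
    execute(f"SET GLOBAL group_replication_consistency = '{fallback}'")
    execute("START GROUP_REPLICATION")

    # Wait until the member reports ONLINE before restoring the level.
    deadline = clock() + timeout_s
    while member_state() != "ONLINE":
        if clock() >= deadline:
            return False  # still on the fallback level; caller decides
        sleep(poll_s)

    # Joined: move back to the level declared in the configuration.
    execute(f"SET GLOBAL group_replication_consistency = '{configured}'")
    return True
```

On timeout the sketch deliberately leaves the fallback level in place and reports failure, so the configured level is never restored on a node that has not finished joining.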

Files with details that are too long for description

Attachment: node_bug_when_rejoin.txt (text/plain), 17.55 KiB.

This could be a duplicate of #104980

It seems so. I was able to reproduce it without problems by following the steps as indicated,
whereas it seems you had trouble reproducing it in #104980.

Hi,

It took a while to reproduce this one, and it looks like a duplicate of Bug #104980.