Bug #105748 Why is group_replication_consistency = AFTER crashing a Group Replication cluste
Submitted: 30 Nov 2021 11:47 Modified: 13 Dec 2021 13:07
Reporter: Marco Tusa
Status: Duplicate
Category:MySQL Server: Group Replication Severity:S2 (Serious)
Version: OS:Any
Assigned to: MySQL Verification Team CPU Architecture:Any

[30 Nov 2021 11:47] Marco Tusa
Description:
Scenario:
    Two data centers: DC1 is production, DC2 is disaster recovery.
    DC2 replicates from DC1 using asynchronous replication with Asynchronous Connection Failover.
    ProxySQL is used, for DC1 only, to route requests to the active node.
    sysbench is used as the application to generate some traffic.
    The following is the specific group replication configuration:
        ######################################
        #Group Replication
        ######################################
        plugin_load_add                                     ='group_replication.so'
        plugin-load-add                                     ='mysql_clone.so'
        group_replication_start_on_boot                     =off
        group_replication_group_name                        ="dc1aaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"
        group_replication_local_address                     = "10.0.0.14:33061"
        group_replication_group_seeds                       = "10.0.0.14:33061,10.0.0.36:33061,10.0.0.81:33061"
        group_replication_bootstrap_group                   = off
        group_replication_ip_allowlist                      = "10.0.0.0/24,192.168.0.0/24"
        # from 8.0.27
        group_replication_paxos_single_leader               = on
        group_replication_auto_increment_increment          = 1
        group_replication_communication_max_message_size    = 10485760
        group_replication_autorejoin_tries                  = 10
        group_replication_consistency                       = AFTER
        group_replication_flow_control_period               = 10
        group_replication_flow_control_hold_percent         = 25
        group_replication_flow_control_release_percent      = 50
        group_replication_member_expel_timeout              = 20
    
What happens?
With a minimal load running (2 threads executing R/W operations), we take down the current Primary with a normal shutdown and wait for production to move to the new node.
After waiting a few minutes we restart the node we stopped; once it is up and running, we start group_replication on it again.
The node fails to join with an error, and sometimes the whole cluster fails.

How to repeat:
To reproduce:
- create a cluster using the above settings
- run some load
- Stop (gently) the Primary
- let the load run for a bit on the new primary
- Restart the stopped node
- be sure you have group_replication_consistency=AFTER
- start group_replication
On failure:
- stop group_replication
- change group_replication_consistency=EVENTUAL
- restart group_replication
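
The three "on fail" steps above can be written out as SQL statements. The following is only a minimal sketch in Python that builds the statement list without connecting to a server; the helper name workaround_statements is hypothetical, and the use of SET GLOBAL is an assumption, since the steps do not say at which scope the variable is changed. The statements are meant to be run on the node that failed to rejoin.

```python
# Valid values of group_replication_consistency in MySQL 8.0.
ALLOWED_LEVELS = {
    "EVENTUAL",
    "BEFORE_ON_PRIMARY_FAILOVER",
    "BEFORE",
    "AFTER",
    "BEFORE_AND_AFTER",
}


def workaround_statements(fallback_level="EVENTUAL"):
    """Return, in order, the SQL statements for the manual workaround.

    Hypothetical helper for illustration only: it builds strings and does
    not talk to a MySQL server.
    """
    if fallback_level not in ALLOWED_LEVELS:
        raise ValueError(f"unknown consistency level: {fallback_level}")
    return [
        "STOP GROUP_REPLICATION;",
        # Assumption: the level is changed at GLOBAL scope.
        f"SET GLOBAL group_replication_consistency = '{fallback_level}';",
        "START GROUP_REPLICATION;",
    ]


if __name__ == "__main__":
    for statement in workaround_statements():
        print(statement)
```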

Suggested fix:
What are the expectations?
1) When the node is stopped, there should be no issue in closing the pending threads and certifications. Why does this appear: [ERROR] [MY-010207] [Repl] Run function 'before_commit' in plugin 'group_replication' failed
2) When started, the node should not fail to rejoin the group if group_replication_consistency=AFTER
3) If group_replication_consistency=AFTER is not supported when joining the cluster, the node should automatically shift to a supported level and, once joined, move back to the one declared in the configuration.
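
To make expectation 3 concrete, here is a rough client-side sketch in Python of the behaviour being requested. It is not something the server does today, and it is not a proposed patch. The function rejoin_with_fallback and its execute parameter are hypothetical: execute is assumed to take one SQL string and return a list of row tuples. SET GLOBAL scope and the polling parameters are also assumptions.

```python
import time

# Reads this node's own state from the group membership table.
STATE_QUERY = (
    "SELECT MEMBER_STATE FROM performance_schema.replication_group_members "
    "WHERE MEMBER_ID = @@server_uuid"
)


def rejoin_with_fallback(execute, configured="AFTER", fallback="EVENTUAL",
                         attempts=30, delay=1.0, sleep=time.sleep):
    """Join with a temporary consistency level, then restore the configured one.

    `execute` is a hypothetical callable: it takes one SQL string and returns
    a list of row tuples. Returns True once the node reports ONLINE, or False
    if it never does within `attempts` polls (the fallback level is then left
    in place).
    """
    execute(f"SET GLOBAL group_replication_consistency = '{fallback}'")
    execute("START GROUP_REPLICATION")
    for _ in range(attempts):
        rows = execute(STATE_QUERY)
        if rows and rows[0][0] == "ONLINE":
            # Joined: move back to the level declared in the configuration.
            execute(f"SET GLOBAL group_replication_consistency = '{configured}'")
            return True
        sleep(delay)
    return False
```

Passing execute and sleep as parameters keeps the sketch independent of any particular MySQL client library and lets it be exercised without a running cluster.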
[30 Nov 2021 11:48] Marco Tusa
Files with details that are too long for description

Attachment: node_bug_when_rejoin.txt (text/plain), 17.55 KiB.

[7 Dec 2021 8:05] Frederic Descamps
This could be a duplicate of #104980
[7 Dec 2021 8:11] Marco Tusa
It seems so. I was able to reproduce it without problems by following the steps as indicated,
while it seems you had issues reproducing it in #104980.
[13 Dec 2021 13:07] MySQL Verification Team
Hi,

It took a while to reproduce this one, and it looks like a duplicate of Bug #104980.