Description:
Scenario:
Two data centers: DC1 is production, DC2 is Disaster Recovery.
DC2 replicates from DC1 using asynchronous replication with Asynchronous Connection Failover.
ProxySQL is used to route requests to the active node in DC1 only.
sysbench is used as the application to generate some traffic.
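For reference, the DC2 replica channel described above can be set up roughly as follows. This is a sketch, not the exact commands used in the test: the channel name, replication user, and seed host/port are placeholders, and the group name is taken from the configuration below.

```sql
-- On the DC2 replica (channel name and user are illustrative):
CHANGE REPLICATION SOURCE TO
  SOURCE_USER = 'repl',
  SOURCE_AUTO_POSITION = 1,
  SOURCE_CONNECTION_AUTO_FAILOVER = 1,
  SOURCE_RETRY_COUNT = 3,
  SOURCE_CONNECT_RETRY = 10
  FOR CHANNEL 'dc1_to_dc2';

-- Register the DC1 group as a managed source list, so the channel
-- follows the group's Primary automatically:
SELECT asynchronous_connection_failover_add_managed(
  'dc1_to_dc2', 'GroupReplication',
  'dc1aaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa',
  '10.0.0.14', 3306, '', 80, 60);

START REPLICA FOR CHANNEL 'dc1_to_dc2';
```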
The following is the specific group replication configuration:
######################################
#Group Replication
######################################
plugin_load_add = 'group_replication.so'
plugin_load_add = 'mysql_clone.so'
group_replication_start_on_boot =off
group_replication_group_name ="dc1aaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"
group_replication_local_address = "10.0.0.14:33061"
group_replication_group_seeds = "10.0.0.14:33061,10.0.0.36:33061,10.0.0.81:33061"
group_replication_bootstrap_group = off
group_replication_ip_allowlist = "10.0.0.0/24,192.168.0.0/24"
# from 8.0.27
group_replication_paxos_single_leader = on
group_replication_auto_increment_increment = 1
group_replication_communication_max_message_size = 10485760
group_replication_autorejoin_tries = 10
group_replication_consistency = AFTER
group_replication_flow_control_period = 10
group_replication_flow_control_hold_percent = 25
group_replication_flow_control_release_percent = 50
group_replication_member_expel_timeout = 20
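Given group_replication_start_on_boot=off and group_replication_bootstrap_group=off above, the group has to be brought up manually. A minimal sketch of the startup sequence (standard Group Replication commands, not specific to this report):

```sql
-- On the first DC1 node only:
SET GLOBAL group_replication_bootstrap_group = ON;
START GROUP_REPLICATION;
SET GLOBAL group_replication_bootstrap_group = OFF;

-- On each of the remaining DC1 nodes:
START GROUP_REPLICATION;
```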
What happens?
While running a minimal load (2 threads executing R/W operations), we shut down the running Primary with a normal shutdown and wait for production to move to the new node.
After waiting a few minutes we restart the node we stopped; once the node is up and running we start group replication on it again.
The node fails to join, reporting an error, and sometimes the whole cluster fails.
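While the rejoin is failing, the state of the group can be observed on any surviving member with:

```sql
SELECT member_host, member_port, member_state, member_role
  FROM performance_schema.replication_group_members;
```

The restarted node is expected to reach ONLINE; instead it fails to join (and in the worst case the remaining members lose quorum as well).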
How to repeat:
To replicate:
- Create a cluster using the above settings
- Run some load
- Stop (gently) the Primary
- Let the load run for a bit on the new Primary
- Restart the stopped node
- Be sure you have group_replication_consistency=AFTER
- Start group replication
On failure:
- Stop group replication
- Change group_replication_consistency to EVENTUAL
- Restart group replication
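The workaround in the last three steps looks like this on the failing node (restoring AFTER afterwards is optional, to get back to the configured level):

```sql
STOP GROUP_REPLICATION;
SET GLOBAL group_replication_consistency = 'EVENTUAL';
START GROUP_REPLICATION;
-- Once the node is ONLINE, the configured level can be restored:
SET GLOBAL group_replication_consistency = 'AFTER';
```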
Suggested fix:
What are the expectations?
1) When the node is stopped, there should be no issue in closing the pending threads and certifications. Why this: [ERROR] [MY-010207] [Repl] Run function 'before_commit' in plugin 'group_replication' failed
2) When started, the node should not fail to rejoin the group if group_replication_consistency=AFTER
3) If group_replication_consistency=AFTER is not supported when joining the cluster, the node should automatically shift to a supported level and, once joined, move back to the one declared in the configuration.