Bug #92916 Network partition leaves only one node online
Submitted: 24 Oct 2018 2:06 Modified: 31 Oct 2018 15:54
Reporter: Gosin Gu (OCA)
Status: Duplicate Impact on me: None
Category: MySQL Server: Group Replication Severity: S1 (Critical)
Version: 5.7.22 OS: Red Hat (7)
Assigned to: MySQL Verification Team CPU Architecture:Any
Tags: network partition, paxos

[24 Oct 2018 2:06] Gosin Gu
Description:
I have an MGR cluster of three nodes. Due to instability in a core switch in production, the MySQL error log printed a large number of "unreachable" messages, each followed a second or so later by "reachable again".

In the end, two of the cluster's nodes are in ERROR state and one node is ONLINE.

(The production logs cannot be uploaded due to policy, sorry!)

How to repeat:
Write one driver script that calls three scripts, one per node. Each script uses iptables to interrupt the network, sleeps for a random amount of time, and then removes the rules so the node can connect again. An assert in the code verifies the ERROR/ERROR/ONLINE state (rough sketch below).
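A rough sketch of that script plus the state check (a sketch only, assuming the default group communication port 33061, root access for iptables, and mysql-connector-python for the check; hostnames and credentials are placeholders):

#!/usr/bin/env python3
# Sketch: repeatedly partition this node from the group, then verify that the
# cluster never ends up in the ERROR/ERROR/ONLINE state.
import random
import subprocess
import time

import mysql.connector  # pip install mysql-connector-python

GROUP_PORT = "33061"  # assumed group communication port
NODES = [{"host": h, "user": "root", "password": "secret"}   # placeholder credentials
         for h in ("node1", "node2", "node3")]

def block_group_traffic():
    # Drop incoming group-communication packets on this node (simulated partition).
    subprocess.run(["iptables", "-A", "INPUT", "-p", "tcp",
                    "--dport", GROUP_PORT, "-j", "DROP"], check=True)

def unblock_group_traffic():
    # Remove the rule so the node can talk to the group again.
    subprocess.run(["iptables", "-D", "INPUT", "-p", "tcp",
                    "--dport", GROUP_PORT, "-j", "DROP"], check=True)

def local_state(node):
    # Ask one node for its own member state.
    conn = mysql.connector.connect(**node)
    try:
        cur = conn.cursor()
        cur.execute("SELECT MEMBER_STATE FROM performance_schema.replication_group_members "
                    "WHERE MEMBER_ID = @@server_uuid")
        row = cur.fetchone()
        return row[0] if row else "MISSING"
    finally:
        conn.close()

if __name__ == "__main__":
    for attempt in range(100):
        block_group_traffic()
        time.sleep(random.uniform(5, 60))   # hold the partition for a random period
        unblock_group_traffic()
        time.sleep(30)                      # give the group time to settle
        states = sorted(local_state(n) for n in NODES)
        print(attempt, states)
        # The reported bug: two members end up ERROR while one stays ONLINE.
        assert states != ["ERROR", "ERROR", "ONLINE"], states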
[26 Oct 2018 3:17] Gosin Gu
I added some logging in detector_task. I found that when a node goes into ERROR state the task keeps running, because xcom_shutdown has not been called. A node that is already in ERROR state can therefore still take part in expelling another node, so only one node is left.
[26 Oct 2018 6:34] Gosin Gu
view state error
[26 Oct 2018 18:13] MySQL Verification Team
Hi,

I apologize but I do not understand your bug description.

Please upload config files.

Please explain what exactly you consider a bug. From what I understand, you have three nodes, and when you kill the network with iptables, group replication stops working (as expected); then after the network is back up it starts working again after some time (again as expected). I'm not sure what you consider a bug here?

Thanks
Bogdan
[29 Oct 2018 6:15] Gosin Gu
I am very sorry, I have not explained the problem clearly. 

The problem I encountered is this: in my production environment, which uses MySQL Group Replication, the final state of the three nodes is ERROR/ERROR/ONLINE due to network instability.

According to my understanding of the Paxos algorithm, a single node out of three cannot form a majority (it would need at least two of the three nodes), so it should be impossible for any single node to survive and keep providing service externally. I think this should be a bug.

Then I repeatedly simulated network failures against MGR and found that xcom does have a problem. Since detector_task does not stop working when the node goes into ERROR state, it continues to process view messages, so the node is resumed again.
[29 Oct 2018 6:17] Gosin Gu
Since I work at a bank, its regulations do not allow my configuration files and logs to be sent out directly. However, there is a process I can apply through to have the information released. Once I have prepared it, I will send it to you as soon as possible, thank you!
[29 Oct 2018 7:30] MySQL Verification Team
Hi,
Thanks, please prepare and upload that info.
kind regards
Bogdan
[31 Oct 2018 15:54] MySQL Verification Team
Hi,
I consulted our Group Replication development team, and the issue you are having is due to a flaky network (as you already know, since you said your switch was unstable).

There is a workaround in 8.0.13 that helps resolve this: the period of time, in seconds, that a member waits before expelling from the group any member suspected of having failed.

group_replication_member_expel_timeout:
https://docs.oracle.com/cd/E17952_01/mysql-8.0-en/group-replication-options.html#sysvar_gr...
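For example, something along these lines raises the timeout on every member (a sketch only; the value 30 is just illustrative and connection details are placeholders):

import mysql.connector  # pip install mysql-connector-python

for host in ("node1", "node2", "node3"):     # placeholder hostnames
    conn = mysql.connector.connect(host=host, user="root", password="secret")
    cur = conn.cursor()
    # Wait an extra 30 seconds before expelling a suspected member, so that
    # short network blips do not immediately shrink the group.
    cur.execute("SET GLOBAL group_replication_member_expel_timeout = 30")
    conn.close()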

Also check this link for more details
https://mysqlhighavailability.com/group-replication-coping-with-unreliable-failure-detecti...

Duplicate of Bug #84784