Bug #92916 | network partition leaves only one node online | |
---|---|---|---
Submitted: | 24 Oct 2018 2:06 | Modified: | 31 Oct 2018 15:54 |
Reporter: | Gosin Gu (OCA) | Email Updates: | |
Status: | Duplicate | Impact on me: | |
Category: | MySQL Server: Group Replication | Severity: | S1 (Critical) |
Version: | 5.7.22 | OS: | Red Hat (7) |
Assigned to: | MySQL Verification Team | CPU Architecture: | Any |
Tags: | network partition, paxos |
[24 Oct 2018 2:06]
Gosin Gu
[26 Oct 2018 3:17]
Gosin Gu
I added some logging in detector_task. I found that when a node enters the error state, the task keeps running because xcom_shutdown is not set; even in the error state it still allows one node to expel another, so a single node is left online.
[26 Oct 2018 6:34]
Gosin Gu
view state error
[26 Oct 2018 18:13]
MySQL Verification Team
Hi, I apologize, but I don't understand your bug description. Please upload your config files, and please explain what exactly you consider a bug. From what I understand, you have three nodes; when you kill the network with iptables, group replication stops working (as expected), and after the network comes back up it starts working again after some time (also as expected). I'm not sure what you consider a bug here? Thanks, Bogdan
[29 Oct 2018 6:15]
Gosin Gu
I am very sorry, I did not explain the problem clearly. The problem I encountered is this: in my production environment using MySQL Group Replication, the final state of the three nodes was error/error/online due to network instability. According to my understanding of the Paxos algorithm, it should be impossible for a single node to survive and provide service externally, so I think this is a bug. I then repeatedly simulated network failures against MGR and found that xcom does have a problem: since detector_task does not stop working when the node is in the error state, it continues to process view messages, so the node comes back online again.
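To make the majority argument concrete, here is a small arithmetic sketch (my own illustration, runnable in any MySQL client; it is not code from xcom): with three members, a quorum needs FLOOR(3/2) + 1 = 2 votes, so a single online member should not be able to keep serving.

```sql
-- Paxos-style majority quorum: a group of n members needs FLOOR(n/2) + 1 votes.
-- With 3 members and only 1 still online (the error/error/online state above),
-- the survivor has 1 < 2 votes and should not be able to provide service.
SELECT 3                                       AS group_size,
       FLOOR(3 / 2) + 1                        AS majority_needed,
       1                                       AS members_online,
       IF(1 >= FLOOR(3 / 2) + 1, 'yes', 'no')  AS has_quorum;
```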
[29 Oct 2018 6:17]
Gosin Gu
Since I work in a bank, its regulations forbid me from sending out the relevant configuration files and logs. But it doesn't matter; there is a process for applying to release data externally. Once I have prepared this information, I will send it to you right away, thank you!
[29 Oct 2018 7:30]
MySQL Verification Team
Hi, thanks. Please prepare and upload that info. Kind regards, Bogdan
[31 Oct 2018 15:54]
MySQL Verification Team
Hi, I consulted with our Group Replication development team, and the issue you are having is due to the flaky network (as you already know; you said your switch was flaky). There is a workaround in 8.0.13 that helps resolve this: group_replication_member_expel_timeout, the period of time, in seconds, that a member waits before expelling from the group any member suspected of having failed. See https://docs.oracle.com/cd/E17952_01/mysql-8.0-en/group-replication-options.html#sysvar_gr...

Also check this link for more details: https://mysqlhighavailability.com/group-replication-coping-with-unreliable-failure-detecti...

Duplicate of Bug #84784
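For reference, a sketch of how the 8.0.13 workaround mentioned above might be applied on each group member (the 30-second value is only an example, not a recommendation):

```sql
-- Check the current setting; when introduced in 8.0.13 the default is 0,
-- meaning a suspected member is expelled as soon as suspicion is raised.
SELECT @@GLOBAL.group_replication_member_expel_timeout;

-- Give suspected members up to 30 extra seconds to reconnect before being
-- expelled, which helps ride out short network blips. Set on every member.
SET GLOBAL group_replication_member_expel_timeout = 30;
```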