Bug #111589 MySQL Operator - split brain on all data nodes
Submitted: 27 Jun 2023 18:37    Modified: 21 Jul 2023 16:31
Reporter: Carlos Abrantes       Email Updates:
Status: Verified                Impact on me: None
Category: MySQL Operator        Severity: S3 (Non-critical)
Version:                        OS: Any
Assigned to:                    CPU Architecture: Any

[27 Jun 2023 18:37] Carlos Abrantes
Description:

After deploying the MySQL Operator and a MySQL InnoDBCluster I ran some tests; one of them creates a split brain between the nodes.

In this scenario mysql-0, mysql-1 and mysql-2 were not able to see each other, but all of them were able to talk to the operator.

The Operator forced the last primary node to be RW:

# When bringing 1 node down:
[2023-06-22 14:23:14,615] kopf.objects         [INFO    ] No quorum visible from mysql-2.mysql-instances.mysql.svc.cluster.local:3306: status=NO_QUORUM  topology=mysql-0.mysql-instances.my

# When bringing 2 nodes down:
[2023-06-22 14:26:00,668] kopf.objects         [INFO    ] No quorum visible from mysql-1.mysql-instances.mysql.svc.cluster.local:3306: status=NO_QUORUM  topology=mysql-0.mysql-instances.mys
[2023-06-22 14:26:17,325] kopf.objects         [INFO    ] No quorum visible from mysql-0.mysql-instances.mysql.svc.cluster.local:3306: status=NO_QUORUM  topology=mysql-0.mysql-instances.my

2023-06-22 14:28:23: Info: Cluster.status: tid=210916: CONNECTED: mysql-0.mysql-instances.mysql.svc.cluster.local:3306
[2023-06-22 14:28:23,256] kopf.objects         [INFO    ] Force quorum successful. status={"clusterName": "mysql", "defaultReplicaSet": {"name": "default", "primary": "mysql-0.mysql-instances.mysql.svc.cluster.local:3306", "ssl": "REQUIRED", "status": "OK_NO_TOLERANCE_PARTIAL", "statusText": "Cluster is NOT tolerant to any failures. 2 members are not active.", "topology": {"mysql-0.mysql-instances.mysql.svc.cluster.local:3306": {"address": "mysql-0.mysql-instances.mysql.svc.cluster.local:3306", "memberRole": "PRIMARY", "mode": "R/W", "readReplicas": {}, "replicationLag": "applier_queue_applied", "role": "HA", "status": "ONLINE", "version": "8.0.33"}, "mysql-1.mysql-instances.mysql.svc.cluster.local:3306": {"address": "mysql-1.mysql-instances.mysql.svc.cluster.local:3306", "instanceErrors": ["NOTE: group_replication is stopped."], "memberRole": "SECONDARY", "memberState": "OFFLINE", "mode": "n/a", "readReplicas": {}, "role": "HA", "status": "(MISSING)", "version": "8.0.33"}, "mysql-2.mysql-instances.mysql.svc.cluster.local:3306": {"address": "mysql-2.mysql-instances.mysql.svc.cluster.local:3306", "instanceErrors": ["NOTE: group_replication is stopped."], "memberRole": "SECONDARY", "memberState": "OFFLINE", "mode": "n/a", "readReplicas": {}, "role": "HA", "status": "(MISSING)", "version": "8.0.33"}}, "topologyMode": "Single-Primary"}, "groupInformationSourceMember": "mysql-0.mysql-instances.mysql.svc.cluster.local:3306"}
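
For reference, the manual equivalent of this forced-quorum step (which, judging by the "Force quorum successful" log above, the operator performed automatically) would be something like the following in MySQL Shell, Python mode. This is only a sketch of the manual procedure, not what the operator actually runs; the "clusteradmin" account is a placeholder and the hostname is taken from the log:

# mysqlsh --py --uri clusteradmin@mysql-0.mysql-instances.mysql.svc.cluster.local:3306
cluster = dba.get_cluster()
# restore quorum using the only partition still reachable (mysql-0)
cluster.force_quorum_using_partition_of("clusteradmin@mysql-0.mysql-instances.mysql.svc.cluster.local:3306")
cluster.status()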

I would expect that once the network split is gone the operator would rejoin the nodes to the cluster automatically, but nothing happened and there were no logs on the operator side.

I deleted the mysql-1 pod, which triggered it to be rejoined to the cluster.
The same didn't happen with mysql-2, where an errant transaction was detected:
RuntimeError: Cluster.rejoin_instance: The instance 'mysql-2.mysql-instances.mysql.svc.cluster.local:3306' contains errant transactions that did not originate from the cluster.

I had to recover manually by removing the node and adding it again.
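
Roughly, the remove/re-add was something like this in MySQL Shell (Python mode); the "clusteradmin" account is a placeholder, and "clone" re-provisions the node so the errant transactions are discarded:

# mysqlsh --py --uri clusteradmin@mysql-0.mysql-instances.mysql.svc.cluster.local:3306
cluster = dba.get_cluster()
# drop the member that holds the errant transactions...
cluster.remove_instance("mysql-2.mysql-instances.mysql.svc.cluster.local:3306", {"force": True})
# ...and add it back, rebuilding its data from the cluster
cluster.add_instance("mysql-2.mysql-instances.mysql.svc.cluster.local:3306", {"recoveryMethod": "clone"})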

The second and third times I repeated the test, the Operator didn't force any node to be RW, and there was no log explaining why.

Version:
community-operator:8.0.33-2.0.10
community-server:8.0.33
community-router:8.0.33

apiVersion: v2
appVersion: 8.0.33
description: MySQL InnoDB Cluster Helm Chart for deploying MySQL InnoDB Cluster in Kubernetes
icon: https://labs.mysql.com/common/themes/sakila/favicon.ico
name: mysql-innodbcluster
type: application
version: 2.0.10

How to repeat:

1- Deploy the operator and the cluster
2- Apply network policies that prevent communication between the data nodes, but still allow them to talk to the operator

Suggested fix:

The operator should have consistent and predictable behaviour (the same situation should always produce the same behaviour).

Assuming that forcing RW on one node when there is a split brain is the correct behaviour, it should then rejoin the other nodes once they are back, or log a message explaining why that was not possible (so the cluster can be recovered manually)
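
For the nodes without errant transactions the rejoin is a single AdminAPI call (apparently the same Cluster.rejoin_instance call that produced the error above); something like this in MySQL Shell (Python mode), with hostnames as in the logs and a placeholder "clusteradmin" account:

# mysqlsh --py --uri clusteradmin@mysql-0.mysql-instances.mysql.svc.cluster.local:3306
cluster = dba.get_cluster()
# once the network partition is healed, bring the missing member back
cluster.rejoin_instance("mysql-1.mysql-instances.mysql.svc.cluster.local:3306")
# for mysql-2 this call raises the errant-transaction error shown above -
# that is exactly the message the operator should surface in its logs
cluster.rejoin_instance("mysql-2.mysql-instances.mysql.svc.cluster.local:3306")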
[30 Jun 2023 14:29] MySQL Verification Team
Hi,

I'm not able to reproduce any inconsistencies here.

How exactly did you reconfigure the firewall/networking to create this issue? A few more details about how to reproduce it would be helpful. So far everything always behaves the same; maybe not ideal, but it is consistent.

thanks
[12 Jul 2023 10:37] Carlos Abrantes
Hi,

Sorry for the late reply, am I supposed to receive a notification when the ticket is updated? I didn't get one.

Can you describe which of the behaviours you are getting?
From the logs I sent it is possible to see that the first time it forces the quorum on one node; for the other runs there is simply an absence of any logs, so I can't send anything.

I'm running on k8s with Cilium and I'm creating the partition with network policies.

Something like:
apiVersion: "cilium.io/v2"
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: "pod-a"
spec:
  endpointSelector:
    matchLabels:
      statefulset.kubernetes.io/pod-name: mysql-0
  ingress:
  - fromEndpoints:
    - matchExpressions:
      - key: statefulset.kubernetes.io/pod-name
        operator: NotIn
        values:
          - mysql-1
          - mysql-2
  egress:
  - toEndpoints:
    - matchExpressions:
      - key: statefulset.kubernetes.io/pod-name
        operator: NotIn
        values:
          - mysql-1
          - mysql-2

I have 3 of those rules (with changes to the target pod and the src/dst pods), one applied to each pod, preventing communication with the other pods.
The end result is that the MySQL data nodes can't communicate with each other, but they can communicate with the operator.

So the first time the Operator was clever enough to force quorum on one node, keeping the service available (it then wasn't able to recover from that on its own, which was the second problem); the other times the service was simply lost.

Thanks,
Carlos
[21 Jul 2023 16:31] Carlos Abrantes
Hi,

Can you please confirm if you were able to reproduce this problem?

What is expected to happen in this case? We saw logs where the operator forced quorum, and runs where it didn't log anything at all.

Thanks,
Carlos