Bug #113337 Cluster View Inconsistent - New Primary Unable to be Promoted
Submitted: 5 Dec 2023 10:29 Modified: 14 Jan 2024 11:37
Reporter: Sameer Gavaskar
Status: Verified
Category: MySQL Operator
Severity: S4 (Feature request)
Version: 8.2.0-2.1.1
OS: Linux (Amazon Linux 2)
Assigned to:
CPU Architecture: x86 (x86_64)

[5 Dec 2023 10:29] Sameer Gavaskar
Description:
This was in an AWS EKS cluster (k8s version 1.24.17). We configured the InnoDBCluster to have 3 replicas: one primary and two secondaries.

We first changed the instance type of a node group from a node with a large amount of memory and CPU (think AWS r6.xlarge) to a smaller node type (t3.medium) and then began scaling up the auto scaling group. During the scale-up, some nodes were shut down 'non-gracefully', meaning the kubelet was not notified in time when a node was terminated. This presents a problem for StatefulSets (which the InnoDBCluster runs as): a StatefulSet pod can be stuck 'Terminating' even after its node has been shut down (see here for more information about non-graceful shutdowns: https://kubernetes.io/blog/2023/08/16/kubernetes-1-28-non-graceful-node-shutdown-ga/#what-...).
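For illustration, the usual remediation for a pod stuck 'Terminating' on a dead node is a force delete. A minimal sketch using the official kubernetes Python client (pod name and namespace are placeholders for our environment):

# Force-delete a StatefulSet pod stuck 'Terminating' after its node was shut
# down non-gracefully. Roughly equivalent to:
#   kubectl delete pod replica-2 -n domain --force --grace-period=0
# Pod name and namespace are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run in-cluster
v1 = client.CoreV1Api()

pod_name = "replica-2"   # pod that was running on the dead node
namespace = "domain"     # namespace of the InnoDBCluster

v1.delete_namespaced_pod(
    name=pod_name,
    namespace=namespace,
    grace_period_seconds=0,  # skip the graceful termination wait
)
print(f"Requested force deletion of {namespace}/{pod_name}")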

Two out of the three replicas were offline, and at this point the nodes that the offline replicas were on were irrecoverable. Note that of the three replicas (replica-0, replica-1, replica-2), replica-2 was designated as the primary, as indicated by the following log; however, it ran on a node that was non-gracefully shut down and was out of service at this point:

WARNING: Error connecting to Cluster: MYSQLSH 51004: Unable to connect to the primary member of the Cluster: 'Can't connect to MySQL server on 'replica-2.replica-instances.domain.svc.cluster.local:3306' (110)'

The MySQL operator showed the following (the full log for the operator can be provided on request; the following sequence of log lines just kept repeating):

[2023-11-28 17:49:40,943] kopf.objects         [INFO    ] Group view of replica-1.replica-instances.domain-spire.svc.cluster.local:3306 has dict_keys(['replica-0.replica-instances.domain-spire.svc.cluster.local:3306', 'replica-1.replica-instances.domain-spire.svc.cluster.local:3306', 'replica-2.replica-instances.domain-spire.svc.cluster.local:3306']) but these are not ONLINE: {'replica-2.replica-instances.domain.svc.cluster.local:3306'}
[2023-11-28 17:49:40,944] kopf.objects         [ERROR   ] Handler 'on_pod_delete' failed temporarily: Cluster status results inconsistent

...

[2023-11-28 17:49:53,504] kopf.objects         [INFO    ] diag instance replica-2 --> InstanceDiagStatus.OFFLINE quorum=None gtid_executed=None
[2023-11-28 17:50:03,834] kopf.objects         [INFO    ] diag instance replica-0 --> InstanceDiagStatus.OFFLINE quorum=None gtid_executed=None
[2023-11-28 17:50:24,455] kopf.objects         [INFO    ] diag instance replica-1 --> InstanceDiagStatus.ONLINE quorum=True gtid_executed=6cce6f0a-7a59-11ee-bc0d-06bfb4bc7677:1-15,
6e0ab277-7a59-11ee-bd25-8ef6fc2370d1:1-13,

It looks like the operator was attempting to terminate the primary but could not, as the surviving replica (replica-1) had a view of the cluster state that was inconsistent with the operator's own (note the line in the logs: 'Cluster status results inconsistent'). This effectively meant that the operator refused to officially promote the online replica to primary, causing an outage.
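For illustration only (this is not the operator's actual code), a simplified sketch of the kind of cross-check that appears to fail here: compare the operator's own diagnosis of each instance with the group view reported by the surviving member, and refuse to promote while they disagree:

# Simplified illustration (NOT the real mysql-operator code) of a cluster-view
# consistency check like the one behind 'Cluster status results inconsistent'.
# Statuses below are hypothetical, modeled on the logs above.

class ClusterStatusInconsistent(Exception):
    """Raised so the handler retries later instead of promoting a new primary."""

def check_view_consistency(operator_view: dict, member_group_view: dict) -> None:
    """Both arguments map instance address -> 'ONLINE' / 'OFFLINE'."""
    disagreements = {
        addr
        for addr in set(operator_view) | set(member_group_view)
        if operator_view.get(addr) != member_group_view.get(addr)
    }
    if disagreements:
        # Promoting while the views disagree risks two primaries / split brain,
        # so the safe choice is to refuse and retry.
        raise ClusterStatusInconsistent(
            f"Cluster status results inconsistent: {sorted(disagreements)}"
        )

# Example resembling our situation: replica-1's group view still counts
# replica-2 as ONLINE, while the operator has diagnosed it as OFFLINE.
operator_view = {"replica-0": "OFFLINE", "replica-1": "ONLINE", "replica-2": "OFFLINE"}
replica1_view = {"replica-0": "OFFLINE", "replica-1": "ONLINE", "replica-2": "ONLINE"}
try:
    check_view_consistency(operator_view, replica1_view)
except ClusterStatusInconsistent as exc:
    print(exc)  # promotion would be blocked here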

How to repeat:
This might be somewhat difficult to reproduce, but try the following:

1. Create a k8s cluster, preferably in EKS; we're using k8s version 1.24.17, but a different cloud provider might be okay.
2. Create an ASG (Autoscaling Group) or equivalent of 3 worker nodes using the specs provided in the ticket tags (the approximate AMI used can possibly be provided, but might not be important).
3. Run the MySQL operator with a StatefulSet pod, persistent volume, and persistent volume claim corresponding to each of the three worker nodes.
4. 'Non-gracefully terminate' the nodes that the primary and perhaps one of the secondaries are running on. This is the tricky part to reproduce, but one way is to forcibly terminate the VMs that those worker nodes correspond to (see the sketch after this list).
5. Even trickier is simulating inconsistent cluster views, i.e. having one replica think another is online when it's not; try to observe whether the scenario described above, namely an inconsistent cluster view, results.
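
For step 4, a minimal sketch assuming the worker nodes are EC2 instances (instance IDs and region are placeholders); terminating the instances directly, without cordoning or draining, approximates a non-graceful shutdown:

# Approximate a non-graceful node shutdown by terminating the backing EC2
# instances directly so the kubelet gets no warning.
# Instance IDs and region are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Instances backing the nodes hosting the primary and one secondary.
instance_ids = ["i-0123456789abcdef0", "i-0fedcba9876543210"]

resp = ec2.terminate_instances(InstanceIds=instance_ids)
for change in resp["TerminatingInstances"]:
    print(change["InstanceId"], change["PreviousState"]["Name"],
          "->", change["CurrentState"]["Name"])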

Suggested fix:
A temporary workaround for us was simply to manually restart each of the replica pods, which resolved the problem. That said, the operator reacting on its own would be better.
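For reference, a minimal sketch of that workaround using the kubernetes Python client (pod names and namespace are placeholders); deleting each pod lets the StatefulSet controller recreate it:

# Restart the InnoDBCluster replica pods one at a time by deleting them;
# the StatefulSet controller recreates each one.
# Pod names and namespace are placeholders for our environment.
import time
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

namespace = "domain"
for pod_name in ("replica-0", "replica-1", "replica-2"):
    v1.delete_namespaced_pod(name=pod_name, namespace=namespace)
    print(f"restarted {namespace}/{pod_name}")
    # Crude pacing so only one replica is down at a time; in practice,
    # wait for the recreated pod to become Ready before continuing.
    time.sleep(60)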

That being said, I don't think this is entirely a bug. Looking at the mysql operator's code and searching for 'Cluster status results inconsistent' (https://sourcegraph.com/github.com/mysql/mysql-operator/-/blob/mysqloperator/controller/di...), it seems the cluster views need to be consistent before doing something like promoting a new primary, so that there don't accidentally end up being two primaries or a split brain. I'm supposing this guards against the scenario where one of the replicas' views is more correct than the operator's and the primary is not actually offline. However, this favors consistency in the face of network partitions; for our needs, I think availability would be more important in the face of network partitions.

It's not entirely clear why the cluster views were inconsistent to begin with, but ultimately, in scenarios where a primary cannot be promoted because of something like this, would it be possible to have some sort of config that skips the check in the link noted above?
[12 Dec 2023 14:21] MySQL Verification Team
Hi,

I could not reproduce this, but I understand what happened to you and I do not believe this to be a bug. Any decent database system needs to favor consistency over anything else.

Your request to have a config that can change this behavior could be taken in as a feature request if that is OK with you, but I cannot promise that our dev team will agree. If you want me to take this bug report as a feature request, let us know.

Thanks for using MySQL
[18 Dec 2023 21:37] Simon Mudd
"Any decent database system needs to favor consistency to anything else."  Slightly off-topic but I do not always agree with that.

If you take down the database you can provide no service. That means an outage. If you choose not to take down an inconsistent database setup, then you have chosen to accept the inconsistencies and resolve them (somehow) while the system is still up. That MAY trigger more problems, but it does mean that to a percentage of your users/customers the service appears to be working, and possibly the inconsistencies will not affect them.

So in a technically pure database environment I would agree. However, in a business environment it may be better to have a partially working system than no system working at all. That decision cannot be made by software but must be made by humans, those that operate the system in question. MySQL has traditionally been sufficiently flexible that it might trigger inconsistencies, which is the reason for GR: it helps to avoid "lost updates" or "split brains". Clearly we never want this to happen, but if something does break, having more flexibility in handling such issues is important.

So I think the point of the original poster's comment was to ask for more flexibility in the configured setup. Real life is just more complex than we'd like.
[19 Dec 2023 10:12] MySQL Verification Team
Hi Simon,

Of course you are right, and you have a system that demonstrates what you are talking about perfectly. That is why I wrote "this behavior could be taken in as a feature request if that is OK with you", as it would make sense as a FR. I still do not believe it is a bug, but I would accept it as a FR.
[10 Jan 2024 16:25] Sameer Gavaskar
Apologies for the late follow-up here. Yes, I think we would like to have this as a feature if possible.
[14 Jan 2024 11:37] MySQL Verification Team
Thank you