Description:
Hi,
We sometimes notice that a MySQL Pod gets stuck during termination.
We use 3 instances for HA.
The hang is caused by the Pod's finalizers, which never get removed.
Here is a snapshot of the log while the Pod is stuck:
Normal Logging 60m kopf diag instance mysql-2 --> InstanceDiagStatus.OFFLINE quorum=None gtid_executed=None
Normal Logging 60m kopf mysql-2.mysql-instances.dev.svc.cluster.local:3306: pod.phase=Succeeded deleting=True
Normal Logging 60m kopf Could not connect to mysql-0.mysql-instances.dev.svc.cluster.local:3306: error=MySQL Error (2005): mysqlsh.connect_dba: Unknown MySQL server host 'mysql-0.mysql-instances.dev.svc.cluster.local' (-2)
Normal Logging 60m kopf cluster probe: status=ClusterDiagStatus.OFFLINE online=[]
Normal Logging 60m kopf mysql: all={<MySQLPod mysql-2>, <MySQLPod mysql-0>, <MySQLPod mysql-1>} members={<MySQLPod mysql-2>, <MySQLPod mysql-0>, <MySQLPod mysql-1>} online=set() offline={<MySQLPod mysql-2>, <MySQLPod mysql-0>, <MySQLPod mysql-1>} unsure=set()
Normal Logging 60m kopf Could not connect to mysql-2.mysql-instances.dev.svc.cluster.local:3306: error=MySQL Error (2005): mysqlsh.connect_dba: Unknown MySQL server host 'mysql-2.mysql-instances.dev.svc.cluster.local' (-2)
Error Logging 60m kopf Handler 'on_pod_delete' failed temporarily: Cluster cannot be restored because there are unreachable pods
Normal Logging 60m kopf diag instance mysql-0 --> InstanceDiagStatus.OFFLINE quorum=None gtid_executed=None
Normal Logging 60m kopf mysql-0.mysql-instances.dev.svc.cluster.local:3306: pod.phase=Succeeded deleting=True
Normal Logging 60m kopf ATTEMPTING CLUSTER REPAIR
Error Logging 60m kopf Handler 'on_pod_delete' failed temporarily: Cluster cannot be restored because there are unreachable pods
Is it related to Change-Id I5ee1a5af6932b3565d8a1b9d80baba644dbd24c3? It looks like the operator keeps looping, trying to repair the cluster instead of deleting the pod (both InstanceDiagStatus and ClusterDiagStatus are reported as OFFLINE).
Also, the log entry 'RETRYING ON POD DELETE' never appears in the log file.
How to repeat:
This behavior appears randomly, but always after a node pool maintenance in which all Pods are moved from the old node pool to a new one.
Usually, 1 Pod is moved successfully, but the remaining 2 Pods (kept for HA) stay stuck on the old node pool.
Suggested fix:
I suspect the logic in mysqloperator/controller/innodbcluster/cluster_controller.py (on_pod_deleted) does not handle this situation (ClusterDiagStatus.OFFLINE); a sketch of the handling I would expect follows.
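For illustration only, here is a minimal, self-contained sketch of that decision; every name in it (PodInfo, handle_pod_deleted, the returned strings) is a hypothetical stand-in and not the operator's real code:

from dataclasses import dataclass
from enum import Enum
from typing import List

class ClusterDiagStatus(Enum):
    ONLINE = "ONLINE"
    OFFLINE = "OFFLINE"

@dataclass
class PodInfo:
    name: str
    deleting: bool

def handle_pod_deleted(pod: PodInfo, members: List[PodInfo],
                       cluster_status: ClusterDiagStatus) -> str:
    # If the whole cluster is OFFLINE and every member is already terminating,
    # a repair can never succeed (the peers are unreachable), so the pod
    # should be released instead of retrying the repair forever.
    if cluster_status is ClusterDiagStatus.OFFLINE and all(m.deleting for m in members):
        return "release finalizer and finish pod deletion"
    # Otherwise keep the current behaviour: attempt repair and retry the handler.
    return "attempt cluster repair"

# Matches the logged situation: all three pods deleting, cluster OFFLINE.
print(handle_pod_deleted(PodInfo("mysql-2", True),
                         [PodInfo("mysql-0", True), PodInfo("mysql-1", True), PodInfo("mysql-2", True)],
                         ClusterDiagStatus.OFFLINE))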
The current workaround is to remove the Pod's finalizers to unblock the termination.
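For reference, one way to apply that workaround is to null out metadata.finalizers, e.g. with the official kubernetes Python client (the helper below is only an illustration, not part of the operator); the kubectl equivalent is kubectl patch pod mysql-2 -n dev --type=merge -p '{"metadata":{"finalizers":null}}':

from kubernetes import client, config

def clear_pod_finalizers(pod_name: str, namespace: str) -> None:
    # Illustrative helper: removing the finalizers lets the API server
    # complete the pending deletion of the stuck pod.
    config.load_kube_config()  # use load_incluster_config() when running in-cluster
    core = client.CoreV1Api()
    core.patch_namespaced_pod(pod_name, namespace,
                              {"metadata": {"finalizers": None}})

clear_pod_finalizers("mysql-2", "dev")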