Bug #115014 MySQL Pod is stuck during termination
Submitted: 15 May 2024 12:57
Modified: 17 May 2024 20:44
Reporter: Eric Trinh
Status: Duplicate
Category: MySQL Operator
Severity: S2 (Serious)
Version: 2.1.2
OS: Linux
Assigned to:
CPU Architecture: x86

[15 May 2024 12:57] Eric Trinh
Description:

Hi,
We sometimes notice that a MySQL Pod gets stuck during termination.
We run 3 instances for HA.
The hang is caused by the Pod's finalizers, which get stuck.

Here is a snapshot of the log while the pod is stuck:

  Normal  Logging  60m  kopf  diag instance mysql-2 --> InstanceDiagStatus.OFFLINE quorum=None gtid_executed=None
  Normal  Logging  60m  kopf  mysql-2.mysql-instances.dev.svc.cluster.local:3306: pod.phase=Succeeded  deleting=True
  Normal  Logging  60m  kopf  Could not connect to mysql-0.mysql-instances.dev.svc.cluster.local:3306: error=MySQL Error (2005): mysqlsh.connect_dba: Unknown MySQL server host 'mysql-0.mysql-instances.dev.svc.cluster.local' (-2)
  Normal  Logging  60m  kopf  cluster probe: status=ClusterDiagStatus.OFFLINE online=[]
  Normal  Logging  60m  kopf  mysql: all={<MySQLPod mysql-2>, <MySQLPod mysql-0>, <MySQLPod mysql-1>}  members={<MySQLPod mysql-2>, <MySQLPod mysql-0>, <MySQLPod mysql-1>}  online=set()  offline={<MySQLPod mysql-2>, <MySQLPod mysql-0>, <MySQLPod mysql-1>}  unsure=set()
  Normal  Logging  60m  kopf  Could not connect to mysql-2.mysql-instances.dev.svc.cluster.local:3306: error=MySQL Error (2005): mysqlsh.connect_dba: Unknown MySQL server host 'mysql-2.mysql-instances.dev.svc.cluster.local' (-2)
  Error   Logging  60m  kopf  Handler 'on_pod_delete' failed temporarily: Cluster cannot be restored because there are unreachable pods
  Normal  Logging  60m  kopf  diag instance mysql-0 --> InstanceDiagStatus.OFFLINE quorum=None gtid_executed=None
  Normal  Logging  60m  kopf  mysql-0.mysql-instances.dev.svc.cluster.local:3306: pod.phase=Succeeded  deleting=True
  Normal  Logging  60m  kopf  ATTEMPTING CLUSTER REPAIR
  Error   Logging  60m  kopf  Handler 'on_pod_delete' failed temporarily: Cluster cannot be restored because there are unreachable pods

Is this related to Change-Id I5ee1a5af6932b3565d8a1b9d80baba644dbd24c3? It looks like the operator keeps looping, trying to repair the cluster instead of deleting the pod (both InstanceDiagStatus and ClusterDiagStatus are reported as OFFLINE).

Also, the log entry 'RETRYING ON POD DELETE' never appears in the log file.

How to repeat:
This behavior appears randomly, but always after a node pool maintenance in which all pods are moved from one node pool to a new one.
Usually 1 pod is moved successfully, but the remaining 2 pods (for HA) stay stuck in the old node pool.

Suggested fix:
I suspect that the logic in mysqloperator/controller/innodbcluster/cluster_controller.py (on_pod_deleted) does not handle this situation (ClusterDiagStatus.OFFLINE).
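
For illustration only, the case that seems to be unhandled could be sketched as below. This is not the operator's code: the function decide_on_pod_delete, its return strings, and the reduced ClusterDiagStatus enum are hypothetical names that only mirror the values seen in the log above (ClusterDiagStatus.OFFLINE, pod.phase=Succeeded, deleting=True).

```python
from enum import Enum


class ClusterDiagStatus(Enum):
    # Hypothetical, reduced stand-in for the operator's enum; only the
    # two values needed for this sketch are modelled.
    ONLINE = "ONLINE"
    OFFLINE = "OFFLINE"


# Pod phases in which the containers have already exited.
TERMINAL_PHASES = ("Succeeded", "Failed")


def decide_on_pod_delete(cluster_status, pod_deleting, pod_phase):
    """Return the action a delete handler could take (assumed logic).

    "ignore"            -> the pod is not being deleted
    "release_finalizer" -> nothing left to repair, let the pod go
    "remove_and_retry"  -> normal path: remove the member, retry on error
    """
    if not pod_deleting:
        return "ignore"
    # Assumption: if the whole cluster is OFFLINE and the pod has already
    # exited, a repair attempt cannot succeed, so holding the finalizer
    # only blocks termination.
    if cluster_status is ClusterDiagStatus.OFFLINE and pod_phase in TERMINAL_PHASES:
        return "release_finalizer"
    return "remove_and_retry"


if __name__ == "__main__":
    # State taken from the log: cluster OFFLINE, phase=Succeeded, deleting=True.
    print(decide_on_pod_delete(ClusterDiagStatus.OFFLINE, True, "Succeeded"))
```

Whether releasing the finalizer is actually safe in this state is a decision for the operator maintainers; the sketch only names the branch that appears to be missing.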

The current workaround is to remove the Pod's finalizers to unblock the termination.
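
A minimal sketch of that workaround, assuming kubectl is available and configured for the affected cluster. The helper name build_remove_finalizers_command is made up for this example; the pod name mysql-2 and the namespace dev are inferred from the host names in the log. Clearing the finalizers bypasses whatever cleanup the operator would otherwise perform, so it should be treated as a last resort.

```python
import json
import shlex


def build_remove_finalizers_command(pod, namespace):
    """Build a kubectl command that clears all finalizers on a pod.

    Hypothetical helper: it only assembles the command and runs nothing.
    """
    # In a JSON merge patch, setting a field to null removes that field.
    patch = {"metadata": {"finalizers": None}}
    return [
        "kubectl", "patch", "pod", pod,
        "--namespace", namespace,
        "--type", "merge",
        "--patch", json.dumps(patch),
    ]


if __name__ == "__main__":
    cmd = build_remove_finalizers_command("mysql-2", "dev")
    # Print the command for review before running it by hand.
    print(shlex.join(cmd))
```

To apply it from Python instead of copying the printed command, the list can be passed to subprocess.run(cmd, check=True).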
[15 May 2024 13:09] Eric Trinh
It is likely that the cluster cannot be repaired during the maintenance window, hence the looping (Handler 'on_pod_delete' failed temporarily: Cluster cannot be restored because there are unreachable pods) and the exception being thrown.
[16 May 2024 7:29] Eric Trinh
Maybe a duplicate of Bug #114893?
[17 May 2024 20:44] MySQL Verification Team
Hi,

Yes, I think this is a duplicate of Bug #114893.