Description:
Hi,
We sometimes notice that a MySQL Pod gets stuck during termination.
We use 3 instances for HA.
The hang is caused by the Pod's finalizers, which never get removed.
Here is a snapshot of the log while the Pod is stuck:
Normal Logging 60m kopf diag instance mysql-2 --> InstanceDiagStatus.OFFLINE quorum=None gtid_executed=None
Normal Logging 60m kopf mysql-2.mysql-instances.dev.svc.cluster.local:3306: pod.phase=Succeeded deleting=True
Normal Logging 60m kopf Could not connect to mysql-0.mysql-instances.dev.svc.cluster.local:3306: error=MySQL Error (2005): mysqlsh.connect_dba: Unknown MySQL server host 'mysql-0.mysql-instances.dev.svc.cluster.local' (-2)
Normal Logging 60m kopf cluster probe: status=ClusterDiagStatus.OFFLINE online=[]
Normal Logging 60m kopf mysql: all={<MySQLPod mysql-2>, <MySQLPod mysql-0>, <MySQLPod mysql-1>} members={<MySQLPod mysql-2>, <MySQLPod mysql-0>, <MySQLPod mysql-1>} online=set() offline={<MySQLPod mysql-2>, <MySQLPod mysql-0>, <MySQLPod mysql-1>} unsure=set()
Normal Logging 60m kopf Could not connect to mysql-2.mysql-instances.dev.svc.cluster.local:3306: error=MySQL Error (2005): mysqlsh.connect_dba: Unknown MySQL server host 'mysql-2.mysql-instances.dev.svc.cluster.local' (-2)
Error Logging 60m kopf Handler 'on_pod_delete' failed temporarily: Cluster cannot be restored because there are unreachable pods
Normal Logging 60m kopf diag instance mysql-0 --> InstanceDiagStatus.OFFLINE quorum=None gtid_executed=None
Normal Logging 60m kopf mysql-0.mysql-instances.dev.svc.cluster.local:3306: pod.phase=Succeeded deleting=True
Normal Logging 60m kopf ATTEMPTING CLUSTER REPAIR
Error Logging 60m kopf Handler 'on_pod_delete' failed temporarily: Cluster cannot be restored because there are unreachable pods
Is it related to Change-Id I5ee1a5af6932b3565d8a1b9d80baba644dbd24c3? It looks like the operator keeps looping, trying to repair the cluster instead of deleting the pod (both InstanceDiagStatus and ClusterDiagStatus are reported as OFFLINE).
Also, the log entry 'RETRYING ON POD DELETE' never appears in the log file.
How to repeat:
This behavior appears randomly, but always after a node pool maintenance in which all Pods are moved from the old node pool to a new one.
Usually, 1 Pod is moved successfully, but the remaining 2 Pods (kept for HA) stay stuck on the old node pool.
Suggested fix:
I suspect the logic in mysqloperator/controller/innodbcluster/cluster_controller.py (on_pod_deleted) does not handle this situation (ClusterDiagStatus.OFFLINE); a sketch of the handling I would expect follows.
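For illustration only, here is a minimal, self-contained sketch of that decision; every name in it (PodInfo, handle_pod_deleted, the returned strings) is a hypothetical stand-in and not the operator's real code:

from dataclasses import dataclass
from enum import Enum
from typing import List

class ClusterDiagStatus(Enum):
    ONLINE = "ONLINE"
    OFFLINE = "OFFLINE"

@dataclass
class PodInfo:
    name: str
    deleting: bool

def handle_pod_deleted(pod: PodInfo, members: List[PodInfo],
                       cluster_status: ClusterDiagStatus) -> str:
    # If the whole cluster is OFFLINE and every member is already terminating,
    # a repair can never succeed (the peers are unreachable), so the pod
    # should be released instead of retrying the repair forever.
    if cluster_status is ClusterDiagStatus.OFFLINE and all(m.deleting for m in members):
        return "release finalizer and finish pod deletion"
    # Otherwise keep the current behaviour: attempt repair and retry the handler.
    return "attempt cluster repair"

# Matches the logged situation: all three pods deleting, cluster OFFLINE.
print(handle_pod_deleted(PodInfo("mysql-2", True),
                         [PodInfo("mysql-0", True), PodInfo("mysql-1", True), PodInfo("mysql-2", True)],
                         ClusterDiagStatus.OFFLINE))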
The current workaround is to remove the Pod's finalizers to unblock the termination.
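For reference, one way to apply that workaround is to null out metadata.finalizers, e.g. with the official kubernetes Python client (the helper below is only an illustration, not part of the operator); the kubectl equivalent is kubectl patch pod mysql-2 -n dev --type=merge -p '{"metadata":{"finalizers":null}}':

from kubernetes import client, config

def clear_pod_finalizers(pod_name: str, namespace: str) -> None:
    # Illustrative helper: removing the finalizers lets the API server
    # complete the pending deletion of the stuck pod.
    config.load_kube_config()  # use load_incluster_config() when running in-cluster
    core = client.CoreV1Api()
    core.patch_namespaced_pod(pod_name, namespace,
                              {"metadata": {"finalizers": None}})

clear_pod_finalizers("mysql-2", "dev")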