Description:
A deployment got stuck finalizing and never recovered. This happened during a drain-and-reboot of the entire cluster to install underlying OS security updates. The k8s cluster could not be rebooted until the MySQL cluster was force deleted by removing the finalizer in the spec.
This caused downtime that should not have happened.
It looks like draining a node causes the storage to get detached before the finalizer completes. This causes the finalizer to crash and never recover.
How to repeat:
1. Install the MySQL Operator and use Longhorn to manage storage.
2. Deploy a simple MySQL cluster with the following values:
    credentials:
      root:
        user: root
        password: password
        host: "%"
    tls:
      useSelfSigned: true
    serverInstances: 3
    routerInstances: 2
3. Drain a node for reboot:
    kubectl drain node1 --pod-selector='app!=csi-attacher,app!=csi-provisioner' --ignore-daemonsets --delete-emptydir-data
4. The node never finishes draining because the MySQL pod never terminates.
Suggested fix:
Make the finalizer more resilient and handle the case where storage has vanished.
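To illustrate what "more resilient" could mean, here is a minimal sketch in Python. It is not the operator's actual code: `StorageUnavailableError`, `run_finalizer`, and the step callables are hypothetical names made up for this report. The idea is that a cleanup step which fails because the volume is already detached gets skipped instead of crashing the handler, and the finalizer is removed in every case so the pod can terminate.

```python
import time
from typing import Callable, Iterable, List


class StorageUnavailableError(Exception):
    """Hypothetical error: the pod's volume is already detached."""


def run_finalizer(steps: Iterable[Callable[[], None]],
                  remove_finalizer: Callable[[], None],
                  retries: int = 3,
                  delay_seconds: float = 0.0) -> List[str]:
    """Run cleanup steps, then always remove the finalizer.

    Returns the names of the steps that were skipped.
    """
    skipped: List[str] = []
    for step in steps:
        name = getattr(step, "__name__", repr(step))
        for attempt in range(1, retries + 1):
            try:
                step()
                break  # step succeeded, move on to the next one
            except StorageUnavailableError:
                # Assumption: a detached volume will not come back during
                # a drain, so retrying is pointless. Skip this step.
                skipped.append(name)
                break
            except Exception:
                # Any other failure is treated as possibly transient:
                # retry a bounded number of times, then give up on it.
                if attempt == retries:
                    skipped.append(name)
                else:
                    time.sleep(delay_seconds)
    # Remove the finalizer even if some steps were skipped, so that
    # the pod is not left in Terminating forever.
    remove_finalizer()
    return skipped
```

In a real handler, `remove_finalizer` would patch the object's metadata, and the returned list could be logged or emitted as an event. Whether a given cleanup step is safe to skip is a design decision for the operator maintainers; the sketch only shows the control flow.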