Missing Pods in a Kubernetes StatefulSet
I got an alert today about a pod being in CrashLoopBackOff, so I went to investigate:
$ kubectl get pods -A
NAMESPACE   NAME          READY   STATUS    RESTARTS         AGE
(truncated output)
default     webserver-0   0/1     Running   2539 (23s ago)   6d13h
default     webserver-2   1/1     Running   0                6d18h
default     webserver-3   1/1     Running   0                6d18h
default     webserver-6   1/1     Running   0                6d18h
default     webserver-7   1/1     Running   0                6d18h
default     webserver-8   1/1     Running   0                6d18h
default     webserver-9   1/1     Running   0                6d18h
Notice the number of restarts webserver-0 has racked up, and the fact that webserver-1 is missing entirely...
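Before doing anything else, it's worth confirming that the controller itself agrees a replica is missing. A minimal check, assuming the statefulset is named webserver in the default namespace:

$ kubectl get sts webserver -o jsonpath='{.spec.replicas} desired, {.status.readyReplicas} ready{"\n"}'

If the desired count is higher than the ready count, the controller knows about the gap; the interesting question is why it isn't closing it.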
I initially went the Windows way, trying to restart the statefulset:
$ kubectl rollout restart sts webserver
statefulset.apps/webserver restarted
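A rolling restart stamps a new revision onto the statefulset and re-rolls the pods in ordinal order, and you can watch it (or watch it hang) with rollout status. Note that this command blocks until every pod is updated and ready, so with webserver-0 crash-looping it never returns:

$ kubectl rollout status sts webserver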
... but this didn't solve the issue. So what is the issue, actually? Describing the pod showed that webserver-0 is pulling an older tag of the webserver image, 23.03, which isn't compatible with version 23.04 running on all the other pods and microservices in the cluster. This makes it fail its liveness probe and loop forever:
$ kubectl describe pod webserver-0 | grep -i image
Image:          artifact.host.fqdn/some/path/webserver:v23.03.26
  Normal  Pulled   25m                kubelet  Successfully pulled image "artifact.host.fqdn/some/path/webserver:v23.03.26" in 387.441222ms
  Normal  Pulling  24m (x2 over 25m)  kubelet  Pulling image "artifact.host.fqdn/some/path/webserver:v23.03.26"
  Normal  Pulled   24m                kubelet  Successfully pulled image "artifact.host.fqdn/some/path/webserver:v23.03.26" in 314.222387ms
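The pull events only tell half the story; the container's last terminated state shows why it keeps dying. A small sketch reading the fields straight off the pod (a liveness-probe kill usually surfaces as reason Error with exit code 137, though that depends on how the process handles signals):

$ kubectl get pod webserver-0 -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason} (exit code {.status.containerStatuses[0].lastState.terminated.exitCode}){"\n"}'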
Interestingly enough, all the other pods in the statefulset are running the webserver image at tag 23.04:
$ kubectl describe pod webserver-8 | grep -i image
Image: artifact.host.fqdn/some/path/webserver:v23.04.14
$ kubectl describe pod webserver-1 | grep -i image
Error from server (NotFound): pods "webserver-1" not found
$ kubectl describe pod webserver-9 | grep -i image
Image: artifact.host.fqdn/some/path/webserver:v23.04.14
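Describing pods one at a time gets old quickly; a small loop (a sketch that matches pods by name rather than assuming a label selector) prints every pod's image in one go:

$ kubectl get pods -o name | grep '^pod/webserver-' | while read -r p; do
>   echo "$p -> $(kubectl get "$p" -o jsonpath='{.spec.containers[*].image}')"
> done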
The statefulset itself is fine with regard to the webserver image tag:
$ kubectl get sts webserver -o yaml | grep image
image: artifact.host.fqdn/some/path/webserver:v23.04.14
imagePullPolicy: Always
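Since the spec already points at the right image, the real question is which revision each pod was created from. The StatefulSet controller labels every pod it creates with controller-revision-hash, so a stale pod stands out immediately:

$ kubectl get pods -L controller-revision-hash | grep webserver

If webserver-0 carries a different hash than its siblings, it was built from an older revision and never got updated.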
Let's look at the revision history:
$ kubectl rollout history sts webserver --revision 9 | grep -i image
Image: artifact.host.fqdn/some/path/webserver:v23.03.26
$ kubectl rollout history sts webserver --revision 10 | grep -i image
Image: artifact.host.fqdn/some/path/webserver:v23.04.14
$ kubectl rollout history sts webserver --revision 11 | grep -i image
Image: artifact.host.fqdn/some/path/webserver:v23.04.14
$ kubectl rollout undo sts webserver --to-revision 11
statefulset.apps/webserver skipped rollback (current template already matches revision 11)
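Those revisions are stored in the cluster as ControllerRevision objects, one per template version, so the same history can be inspected directly (the custom-columns bit is just for readability):

$ kubectl get controllerrevisions -o custom-columns=NAME:.metadata.name,REVISION:.revision | grep webserver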
Revision 10 was created by a helm upgrade command, and revision 11 by my rolling restart...
Searching the internet led me to this interesting answer, which says: "The more likely event here seems like one of your node deletions took down pod-1 and pod-2 after pod-0 went unhealthy. In that case, we do not attempt to recreate pod-1 or 2 till pod-0 becomes healthy again. The rationale for …"
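That behaviour matches the default OrderedReady pod management policy: pods are created, updated, and recreated strictly in ordinal order, and the controller won't bring back webserver-1 while webserver-0 is unhealthy. You can check which policy a statefulset uses with:

$ kubectl get sts webserver -o jsonpath='{.spec.podManagementPolicy}{"\n"}'

Unless someone explicitly set Parallel, this prints OrderedReady, which explains both the missing pod and why the rolling restart stalled: the rollout can't get past the unhealthy webserver-0 either.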