I got an alert today about a pod being in CrashLoopBackOff
... I went to investigate:
$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
(truncated output)
default webserver-0 0/1 Running 2539 (23s ago) 6d13h
default webserver-2 1/1 Running 0 6d18h
default webserver-3 1/1 Running 0 6d18h
default webserver-6 1/1 Running 0 6d18h
default webserver-7 1/1 Running 0 6d18h
default webserver-8 1/1 Running 0 6d18h
default webserver-9 1/1 Running 0 6d18h
Notice the number of restarts that webserver-0 has, and the fact that webserver-1 is missing...
I initially went the Windows way, trying to restart the statefulset:
$ kubectl rollout restart sts webserver
statefulset.apps/webserver restarted
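(In hindsight, watching the rollout would have made the problem obvious earlier; a hedged aside:)
$ kubectl rollout status sts webserver
# presumably hangs with something like "waiting for 1 pods to be ready..." while webserver-0 keeps crashing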
... but this didn't solve the issue... Actually, what is the issue?
Well, describing the pod showed that webserver-0 is trying to pull an older tag of the webserver image, 23.03, which isn't compatible with version 23.04 running on all the other pods and microservices in the cluster... This causes it to fail the liveness probe and stay in this infinite loop:
$ kubectl describe pod webserver-0 | grep -i image
Image: artifact.host.fqdn/some/path/webserver:v23.03.26
Normal Pulled 25m kubelet Successfully pulled image "artifact.host.fqdn/some/path/webserver:v23.03.26" in 387.441222ms
Normal Pulling 24m (x2 over 25m) kubelet Pulling image "artifact.host.fqdn/some/path/webserver:v23.03.26"
Normal Pulled 24m kubelet Successfully pulled image "artifact.host.fqdn/some/path/webserver:v23.03.26" in 314.222387ms
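The liveness-probe failures themselves show up in the pod's events, which the image grep above filters out; a quick way to see them, sticking to the same describe-and-grep pattern (a sketch):
$ kubectl describe pod webserver-0 | grep -iA2 liveness
# should surface both the Liveness: probe definition and any "Liveness probe failed" warning events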
Interestingly enough, all the other pods in the StatefulSet are running the webserver tag 23.04:
$ kubectl describe pod webserver-8 | grep -i image
Image: artifact.host.fqdn/some/path/webserver:v23.04.14
$ kubectl describe pod webserver-1 | grep -i image
Error from server (NotFound): pods "webserver-1" not found
$ kubectl describe pod webserver-9 | grep -i image
Image: artifact.host.fqdn/some/path/webserver:v23.04.14
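Rather than describing pods one at a time, the image of every pod can be listed in one go (a sketch; assumes a single container per pod):
$ kubectl get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[0].image' | grep webserver
# one NAME IMAGE line per pod, so the odd tag stands out immediately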
The StatefulSet itself is OK with regard to the tag of the webserver image:
$ kubectl get sts webserver -o yaml | grep image
image: artifact.host.fqdn/some/path/webserver:v23.04.14
imagePullPolicy: Always
Let's look at the revision history:
$ kubectl rollout history sts webserver --revision 9 | grep -i image
Image: artifact.host.fqdn/some/path/webserver:v23.03.26
$ kubectl rollout history sts webserver --revision 10 | grep -i image
Image: artifact.host.fqdn/some/path/webserver:v23.04.14
$ kubectl rollout history sts webserver --revision 11 | grep -i image
Image: artifact.host.fqdn/some/path/webserver:v23.04.14
$ kubectl rollout undo sts webserver --to-revision 11
statefulset.apps/webserver skipped rollback (current template already matches revision 11)
Revision 10 was created by a helm upgrade command and revision 11 by my rolling restart...
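That explains where the revisions came from, but not which one webserver-0 is actually pinned to. Each StatefulSet pod carries a controller-revision-hash label that can be compared against the StatefulSet's status (a quick sketch):
$ kubectl get pod webserver-0 -o jsonpath='{.metadata.labels.controller-revision-hash}'
# the revision this pod was created from
$ kubectl get sts webserver -o jsonpath='{.status.currentRevision} {.status.updateRevision}'
# if the pod's hash matches currentRevision but not updateRevision, the rolling update never reached ordinal 0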
Searching the internet got me to this interesting answer that says: "The more likely event here seems like one of your node deletions took down pod-1 and pod-2 after pod-0 went unhealthy. In that case, we do not attempt to recreate pod-1 or 2 till pod-0 becomes healthy again. The rationale for this is that users rely on the deterministic initialization order and write logic around that guarantee. To bring up the pods in arbitrary order would violate this guarantee." This might indeed be true, as this Kubernetes cluster was recently upgraded and, of course, the nodes were taken down during that event. Unfortunately
$ kubectl get events
... didn't bring up any useful information.
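The ordering guarantee that answer refers to comes from the StatefulSet's pod management policy and update strategy, which are easy to check (a sketch):
$ kubectl get sts webserver -o jsonpath='{.spec.podManagementPolicy} {.spec.updateStrategy.type}'
# with OrderedReady, pods are (re)created strictly in ordinal order and the controller waits for each
# to become Ready, so a crashing webserver-0 blocks webserver-1 from ever being recreated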
OK... But how do I fix it?
My initial attempt was to scale down the HorizontalPodAutoscaler and the StatefulSet to 1 replica, hoping that only webserver-0 would stay alive and, in a worst-case scenario, I'd just delete it:
$ kubectl get hpa | grep webserver
webserver StatefulSet/webserver 51%/50% 1 1 1 63d
$ kubectl get sts webserver -o yaml | grep -i replicas
replicas: 1
availableReplicas: 6
currentReplicas: 2
readyReplicas: 6
replicas: 7
updatedReplicas: 5
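For completeness, the scale-down itself was roughly the following (a sketch from memory; the exact invocations may have differed):
$ kubectl patch hpa webserver -p '{"spec":{"minReplicas":1,"maxReplicas":1}}'
# pin the autoscaler so it can't immediately scale the StatefulSet back up
$ kubectl scale sts webserver --replicas=1
# ask the StatefulSet controller to keep only webserver-0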
This didn't work. The StatefulSet didn't want to scale down because it wasn't healthy:
$ kubectl get pods | grep webserver
webserver-0 0/1 Running 1 (42s ago) 106s
webserver-2 1/1 Running 0 6d19h
webserver-3 1/1 Running 0 6d18h
webserver-6 1/1 Running 0 6d18h
webserver-7 1/1 Running 0 6d18h
webserver-8 1/1 Running 0 6d18h
webserver-9 1/1 Running 0 6d18h
So... I thought I'd give the StatefulSet a hand... by patching the pod itself:
$ kubectl patch pod webserver-0 -p '{"spec":{"containers":[{"name":"webserver","image":"artifact.host.fqdn/some/path/webserver:v23.04.14"}]}}'
pod/webserver-0 patched
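Before celebrating, a quick check that the patch really changed the pod spec (a one-liner sketch):
$ kubectl get pod webserver-0 -o jsonpath='{.spec.containers[0].image}'
# should now print artifact.host.fqdn/some/path/webserver:v23.04.14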
It worked!
$ kubectl get pods | grep webserver
webserver-0 1/1 Running 7 (4m26s ago) 16m
webserver-2 1/1 Running 0 6d19h
webserver-3 1/1 Running 0 6d19h
webserver-6 1/1 Running 0 6d19h
webserver-7 1/1 Running 0 6d19h
webserver-8 1/1 Running 0 6d18h
webserver-9 1/1 Terminating 0 6d18h
I then restored the HorizontalPodAutoscaler to its previous settings:
$ kubectl get hpa | grep webserver
webserver StatefulSet/webserver 104%/50% 2 40 10 63d
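The restore was presumably the mirror of the earlier patch, using the original bounds visible in the output above (again a sketch):
$ kubectl patch hpa webserver -p '{"spec":{"minReplicas":2,"maxReplicas":40}}'
# let the autoscaler manage the StatefulSet again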
$ kubectl get pods | grep webserver
webserver-0 1/1 Running 7 (6m20s ago) 18m
webserver-1 0/1 Running 1 (18s ago) 84s
webserver-2 1/1 Running 0 6d19h
webserver-3 1/1 Running 0 6d19h
webserver-6 1/1 Running 0 6d19h
webserver-7 1/1 Running 0 6d19h
webserver-1 doesn't want to start, though. Comparing it with the now-patched webserver-0:
$ kubectl describe pod webserver-0 | grep -i image
Image: artifact.host.fqdn/some/path/webserver:v23.04.14
Normal Pulled 19m kubelet Successfully pulled image "artifact.host.fqdn/some/path/webserver:v23.03.26" in 341.684123ms
Normal Pulling 18m (x2 over 19m) kubelet Pulling image "artifact.host.fqdn/some/path/webserver:v23.03.26"
Normal Pulled 18m kubelet Successfully pulled image "artifact.host.fqdn/some/path/webserver:v23.03.26" in 329.854725ms
$ kubectl describe pod webserver-1 | grep -i image
Image: artifact.host.fqdn/some/path/webserver:v23.03.26
Normal Pulled 2m9s kubelet Successfully pulled image "artifact.host.fqdn/some/path/webserver:v23.03.26" in 3.110530737s
Normal Pulling 67s (x2 over 2m12s) kubelet Pulling image "artifact.host.fqdn/some/path/webserver:v23.03.26"
Normal Pulled 66s kubelet Successfully pulled image "artifact.host.fqdn/some/path/webserver:v23.03.26" in 314.394334ms
Let's fix it as well:
$ kubectl patch pod webserver-1 -p '{"spec":{"containers":[{"name":"webserver","image":"artifact.host.fqdn/some/path/webserver:v23.04.14"}]}}'
pod/webserver-1 patched
Better:
$ kubectl get pods | grep webserver
webserver-0 1/1 Running 7 (8m58s ago) 21m
webserver-1 1/1 Running 3 (44s ago) 4m2s
webserver-2 1/1 Running 0 6d19h
webserver-3 1/1 Running 0 6d19h
webserver-4 1/1 Running 0 32s
webserver-5 1/1 Running 0 21s
webserver-6 1/1 Running 0 6d19h
webserver-7 1/1 Running 0 6d19h
webserver-8 1/1 Running 0 11s
webserver-9 0/1 Init:0/1 0 1s
$ kubectl get hpa | grep webserver
webserver StatefulSet/webserver 54%/50% 2 40 11 63d
$ kubectl get pods | grep webserver
webserver-0 1/1 Running 7 (9m15s ago) 21m
webserver-1 1/1 Running 3 (61s ago) 4m19s
webserver-10 0/1 Running 0 7s
webserver-2 1/1 Running 0 6d19h
webserver-3 1/1 Running 0 6d19h
webserver-4 1/1 Running 0 49s
webserver-5 1/1 Running 0 38s
webserver-6 1/1 Running 0 6d19h
webserver-7 1/1 Running 0 6d19h
webserver-8 1/1 Running 0 28s
webserver-9 1/1 Running 0 18s
Sometimes Kubernetes needs a little help, it seems...