Missing Pods in a Kubernetes StatefulSet

I got an alert today about a pod being in CrashLoopBackOff... I went to investigate:

$ kubectl get pods -A
NAMESPACE         NAME                                                           READY   STATUS        RESTARTS         AGE
(truncated output)
default           webserver-0                                                    0/1     Running       2539 (23s ago)   6d13h
default           webserver-2                                                    1/1     Running       0                6d18h
default           webserver-3                                                    1/1     Running       0                6d18h
default           webserver-6                                                    1/1     Running       0                6d18h
default           webserver-7                                                    1/1     Running       0                6d18h
default           webserver-8                                                    1/1     Running       0                6d18h
default           webserver-9                                                    1/1     Running       0                6d18h

Notice the number of restarts webserver-0 has racked up, and the fact that webserver-1 is missing...
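
Side note: in a busier cluster it's easier to look at just this StatefulSet's pods. The label below is an assumption on my part; the real selector can be read off the StatefulSet first:

$ kubectl get sts webserver -o jsonpath='{.spec.selector.matchLabels}'
$ kubectl get pods -l app=webserver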

I initially went the Windows way and tried restarting the StatefulSet:

$ kubectl rollout restart sts webserver
statefulset.apps/webserver restarted

... but this didn't solve the issue... So what is the issue, actually?

Well, describing the pod showed that webserver-0 is trying to pull an older tag of the webserver image, 23.03, which isn't compatible with version 23.04, the one running on all the other pods and microservices in the cluster... This makes it fail its liveness probe and keeps it stuck in an endless restart loop:

$ kubectl describe pod webserver-0 | grep -i image
    Image:         artifact.host.fqdn/some/path/webserver:v23.03.26
  Normal   Pulled     25m                   kubelet                                Successfully pulled image "artifact.host.fqdn/some/path/webserver:v23.03.26" in 387.441222ms
  Normal   Pulling    24m (x2 over 25m)     kubelet                                Pulling image "artifact.host.fqdn/some/path/webserver:v23.03.26"
  Normal   Pulled     24m                   kubelet                                Successfully pulled image "artifact.host.fqdn/some/path/webserver:v23.03.26" in 314.222387ms
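
The failed probes themselves also show up in the pod's events, so something like this should confirm the liveness part (just one way to slice the output):

$ kubectl get events --field-selector involvedObject.name=webserver-0
$ kubectl describe pod webserver-0 | grep -iA3 liveness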

Interestingly enough, all the other pods in the StatefulSet are running the webserver image tagged 23.04:

$ kubectl describe pod webserver-8 | grep -i image
    Image:          artifact.host.fqdn/some/path/webserver:v23.04.14

$ kubectl describe pod webserver-1 | grep -i image
Error from server (NotFound): pods "webserver-1" not found

$ kubectl describe pod webserver-9 | grep -i image
    Image:          artifact.host.fqdn/some/path/webserver:v23.04.14

The StatefulSet itself is fine with regard to the webserver image tag:

$ kubectl get sts webserver -o yaml | grep image
        image: artifact.host.fqdn/some/path/webserver:v23.04.14
        imagePullPolicy: Always
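
So the template is fine, yet webserver-0 is clearly running an older one, meaning the rollout never converged. The StatefulSet status shows this directly: currentRevision and updateRevision only match once a rollout has finished, and rollout status will sit there complaining in the meantime. Roughly:

$ kubectl get sts webserver -o jsonpath='{.status.currentRevision}{" "}{.status.updateRevision}{"\n"}'
$ kubectl rollout status sts webserver --timeout 10s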

Let's look at the revision history:

$ kubectl rollout history sts webserver --revision 9 | grep -i image
    Image:      artifact.host.fqdn/some/path/webserver:v23.03.26

$ kubectl rollout history sts webserver --revision 10 | grep -i image
    Image:      artifact.host.fqdn/some/path/webserver:v23.04.14

$ kubectl rollout history sts webserver --revision 11 | grep -i image
    Image:      artifact.host.fqdn/some/path/webserver:v23.04.14

$ kubectl rollout undo sts webserver --to-revision 11
statefulset.apps/webserver skipped rollback (current template already matches revision 11)

Revision 10 was created by a helm upgrade command and 11 by my rolling restart...
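
Each pod is also stamped with a controller-revision-hash label pointing at the ControllerRevision it was created from, so the odd pod out can be tied to a specific revision directly:

$ kubectl get pod webserver-0 -o jsonpath='{.metadata.labels.controller-revision-hash}{"\n"}'
$ kubectl get controllerrevisions | grep webserver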

Searching the internet got me to this interesting answer, which says: "The more likely event here seems like one of your node deletions took down pod-1 and pod-2 after pod-0 went unhealthy. In that case, we do not attempt to recreate pod-1 or 2 till pod-0 becomes healthy again. The rationale for …"
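
That lines up with the default OrderedReady podManagementPolicy, as I understand it: pods are (re)created strictly in ordinal order and the controller waits for each one to be Running and Ready before touching the next, so a crashlooping webserver-0 blocks the recreation of webserver-1. Easy enough to double-check which policy the StatefulSet uses:

$ kubectl get sts webserver -o jsonpath='{.spec.podManagementPolicy}{"\n"}'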


Kafka, Kubernetes and Not enough space

I came across a situation that took me a while to figure out, so I'm putting this together as it might help others as well.

Let's say you have Bitnami's packaged Kafka cluster running on Kubernetes as StatefulSets (so not managed by an operator like Strimzi), and the pods start restarting...
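
First stop is the broker logs; the pods here are named kafka-0, kafka-1 and so on (the broker registration line in the log below confirms the naming), so something along these lines, adding --previous if the container has already restarted:

$ kubectl logs kafka-0
$ kubectl logs kafka-0 --previous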

Looking at the logs, you might see something like this:

-9ee616bdcf73, partition=0, highWatermark=0, lastStableOffset=0, logStartOffset=0, logEndOffset=0) with 1 segments in 2ms (32564/32564 loaded in /bitnami/kafka/data) (kafka.log.LogManager)
[2022-07-06 20:46:05,895] INFO Loaded 32564 logs in 161618ms. (kafka.log.LogManager)
[2022-07-06 20:46:05,896] INFO Starting log cleanup with a period of 300000 ms. (kafka.log.LogManager)
[2022-07-06 20:46:05,896] INFO Starting log flusher with a default period of 9223372036854775807 ms. (kafka.log.LogManager)
[2022-07-06 20:46:05,908] INFO Starting the log cleaner (kafka.log.LogCleaner)
[2022-07-06 20:46:05,999] INFO [kafka-log-cleaner-thread-0]: Starting (kafka.log.LogCleaner)
[2022-07-06 20:46:06,296] INFO [BrokerToControllerChannelManager broker=0 name=forwarding]: Starting (kafka.server.BrokerToControllerRequestThread)
[2022-07-06 20:46:06,419] INFO Updated connection-accept-rate max connection creation rate to 2147483647 (kafka.network.ConnectionQuotas)
[2022-07-06 20:46:06,428] INFO Awaiting socket connections on 0.0.0.0:9093. (kafka.network.Acceptor)
[2022-07-06 20:46:06,463] INFO [SocketServer listenerType=ZK_BROKER, nodeId=0] Created data-plane acceptor and processors for endpoint : ListenerName(INTERNAL) (kafka.network.SocketServer)
[2022-07-06 20:46:06,464] INFO Updated connection-accept-rate max connection creation rate to 2147483647 (kafka.network.ConnectionQuotas)
[2022-07-06 20:46:06,464] INFO Awaiting socket connections on 0.0.0.0:9092. (kafka.network.Acceptor)
[2022-07-06 20:46:06,473] INFO [SocketServer listenerType=ZK_BROKER, nodeId=0] Created data-plane acceptor and processors for endpoint : ListenerName(CLIENT) (kafka.network.SocketServer)
[2022-07-06 20:46:06,481] INFO [BrokerToControllerChannelManager broker=0 name=alterIsr]: Starting (kafka.server.BrokerToControllerRequestThread)
[2022-07-06 20:46:06,508] INFO [ExpirationReaper-0-Produce]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2022-07-06 20:46:06,509] INFO [ExpirationReaper-0-Fetch]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2022-07-06 20:46:06,511] INFO [ExpirationReaper-0-DeleteRecords]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2022-07-06 20:46:06,512] INFO [ExpirationReaper-0-ElectLeader]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2022-07-06 20:46:06,529] INFO [LogDirFailureHandler]: Starting (kafka.server.ReplicaManager$LogDirFailureHandler)
[2022-07-06 20:46:06,611] INFO Creating /brokers/ids/0 (is it secure? false) (kafka.zk.KafkaZkClient)
[2022-07-06 20:46:06,642] INFO Stat of the created znode at /brokers/ids/0 is: 511242113,511242113,1657140366624,1657140366624,1,0,0,72062637399277581,364,0,511242113
 (kafka.zk.KafkaZkClient)
[2022-07-06 20:46:06,643] INFO Registered broker 0 at path /brokers/ids/0 with addresses: INTERNAL://kafka-0.kafka-headless.default.svc.cluster.local:9093,CLIENT://kafka-0.kafka-headless.default.svc.cluster.local:9092, czxid (broker epoch): 511242113 (kafka.zk.KafkaZkClient)
[2022-07-06 20:46:06,758] INFO [ControllerEventThread controllerId=0] Starting (kafka.controller.ControllerEventManager$ControllerEventThread)
[2022-07-06 20:46:06,765] INFO …