Node down - pods still shown as Running for hours, others stuck in Terminating

Hi people!

We are running a 3-node K3s cluster on ARM64 with Alpine Linux. K3s is a certified Kubernetes distribution built for resource-constrained (IoT & Edge computing) devices. Our cluster looks like this:

$ kubectl get nodes -o wide
NAME                   STATUS   ROLES                  AGE   VERSION        INTERNAL-IP     EXTERNAL-IP   OS-IMAGE            KERNEL-VERSION     CONTAINER-RUNTIME
e10ccwe080c000001127   Ready    control-plane,master   53d   v1.23.8+k3s2   172.31.0.143    <none>        Alpine Linux v3.9   4.9.291-vsys-1.0   docker://18.9.1
e10ctwe080c000002458   Ready    <none>                 31d   v1.23.8+k3s2   10.101.115.39   <none>        Alpine Linux v3.9   4.9.291-vsys-1.0   docker://18.9.1
e09ctwe080b000102186   Ready    <none>                 31d   v1.23.8+k3s2   10.101.35.11    <none>        Alpine Linux v3.9   4.9.291-vsys-1.0   docker://18.9.1

Under normal conditions the cluster runs just fine. We run a variety of workloads, including some with persistent storage and CSI.

The issue we observe is when we test the redundancy and failover of the cluster. I power off one of the worker nodes to simulate a hardware failure. Shortly after that, Kubernetes indeed detects that the node is down, and we get this:

$ kubectl get nodes -o wide
NAME                   STATUS     ROLES                  AGE   VERSION        INTERNAL-IP     EXTERNAL-IP   OS-IMAGE            KERNEL-VERSION     CONTAINER-RUNTIME
e09ctwe080b000102186   Ready      <none>                 31d   v1.23.8+k3s2   10.101.35.11    <none>        Alpine Linux v3.9   4.9.291-vsys-1.0   docker://18.9.1
e10ccwe080c000001127   Ready      control-plane,master   53d   v1.23.8+k3s2   172.31.0.143    <none>        Alpine Linux v3.9   4.9.291-vsys-1.0   docker://18.9.1
e10ctwe080c000002458   NotReady   <none>                 31d   v1.23.8+k3s2   10.101.115.39   <none>        Alpine Linux v3.9   4.9.291-vsys-1.0   docker://18.9.1

The log entries also indicate that Kubernetes has registered the node-down event and is taking the corresponding actions (both the initial ones and those after the default eviction timeout of 5 minutes):

time="2022-07-19T09:25:55+01:00" level=info msg="error in remotedialer server [400]: read tcp 172.31.0.143:6443->10.101.115.39:35794: i/o timeout"
I0719 09:26:26.486695    6745 event.go:294] "Event occurred" object="e10ctwe080c000002458" kind="Node" apiVersion="v1" type="Normal" reason="NodeNotReady" message="Node e10ctwe080c000002458 status is now: NodeNotReady"
time="2022-07-19T09:26:26+01:00" level=debug msg="Tunnel server egress proxy updating Node e10ctwe080c000002458 IP 10.101.115.39/32"
I0719 09:26:26.531952    6745 event.go:294] "Event occurred" object="default/ubuntu-pod-with-csi" kind="Pod" apiVersion="v1" type="Warning" reason="NodeNotReady" message="Node is not ready"
time="2022-07-19T09:26:26+01:00" level=debug msg="Tunnel server egress proxy updating Node e10ctwe080c000002458 IP 10.101.115.39/32"
I0719 09:26:26.635176    6745 event.go:294] "Event occurred" object="default/seaweedfs-node-xrh6f" kind="Pod" apiVersion="v1" type="Warning" reason="NodeNotReady" message="Node is not ready"
I0719 09:26:26.694040    6745 event.go:294] "Event occurred" object="kube-system/svclb-traefik-adf18ba2-vmssc" kind="Pod" apiVersion="v1" type="Warning" reason="NodeNotReady" message="Node is not ready"
time="2022-07-19T09:26:26+01:00" level=debug msg="DesiredSet - No change(2) apps/v1, Kind=DaemonSet kube-system/svclb-traefik-adf18ba2 for svccontroller kube-system/traefik"
time="2022-07-19T09:26:26+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Service\", Namespace:\"kube-system\", Name:\"traefik\", UID:\"adf18ba2-8ef1-4e7d-b0a1-b77c42015e54\", APIVersion:\"v1\", ResourceVersion:\"1357977\", FieldPath:\"\"}): type: 'Normal' reason: 'AppliedDaemonSet' Applied LoadBalancer DaemonSet kube-system/svclb-traefik-adf18ba2"
time="2022-07-19T09:26:26+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Service\", Namespace:\"kube-system\", Name:\"traefik\", UID:\"adf18ba2-8ef1-4e7d-b0a1-b77c42015e54\", APIVersion:\"v1\", ResourceVersion:\"1357977\", FieldPath:\"\"}): type: 'Normal' reason: 'UpdatedIngressIP' LoadBalancer Ingress IP addresses updated: 10.101.35.11, 172.31.0.143"
I0719 09:26:26.749555    6745 event.go:294] "Event occurred" object="default/ubuntu-pod" kind="Pod" apiVersion="v1" type="Warning" reason="NodeNotReady" message="Node is not ready"
time="2022-07-19T09:26:26+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Service\", Namespace:\"kube-system\", Name:\"traefik\", UID:\"adf18ba2-8ef1-4e7d-b0a1-b77c42015e54\", APIVersion:\"v1\", ResourceVersion:\"1360461\", FieldPath:\"\"}): type: 'Normal' reason: 'AppliedDaemonSet' Applied LoadBalancer DaemonSet kube-system/svclb-traefik-adf18ba2"
I0719 09:26:26.776583    6745 event.go:294] "Event occurred" object="dev/plain-alpine-container" kind="Pod" apiVersion="v1" type="Warning" reason="NodeNotReady" message="Node is not ready"
time="2022-07-19T09:26:26+01:00" level=debug msg="DesiredSet - No change(2) apps/v1, Kind=DaemonSet kube-system/svclb-traefik-adf18ba2 for svccontroller kube-system/traefik"

I0719 09:31:31.934393    6745 taint_manager.go:106] "NoExecuteTaintManager is deleting pod" pod="default/ubuntu-pod-with-csi"
I0719 09:31:31.934393    6745 taint_manager.go:106] "NoExecuteTaintManager is deleting pod" pod="dev/plain-alpine-container"
I0719 09:31:31.934407    6745 taint_manager.go:106] "NoExecuteTaintManager is deleting pod" pod="default/ubuntu-pod"
I0719 09:31:31.934859    6745 event.go:294] "Event occurred" object="default/ubuntu-pod-with-csi" kind="Pod" apiVersion="" type="Normal" reason="TaintManagerEviction" message="Marking for deletion Pod default/ubuntu-pod-with-csi"
I0719 09:31:31.934953    6745 event.go:294] "Event occurred" object="dev/plain-alpine-container" kind="Pod" apiVersion="" type="Normal" reason="TaintManagerEviction" message="Marking for deletion Pod dev/plain-alpine-container"
I0719 09:31:31.935025    6745 event.go:294] "Event occurred" object="default/ubuntu-pod" kind="Pod" apiVersion="" type="Normal" reason="TaintManagerEviction" message="Marking for deletion Pod default/ubuntu-pod"
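
For reference, as far as we understand it, the 5-minute delay before the TaintManagerEviction events above comes from the tolerationSeconds: 300 that the DefaultTolerationSeconds admission plugin adds to every Pod for the not-ready/unreachable taints. It can be checked on one of the affected Pods (shown here for default/ubuntu-pod; the output below is what the defaults typically look like, not a capture from our cluster):

$ kubectl get pod ubuntu-pod -o jsonpath='{.spec.tolerations}'
[{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":300},
 {"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":300}]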

However, the actual result is:

  • the pods that are part of a DaemonSet or StatefulSet remain forever as “Running” (on the node that is already down and gone)
  • the other, simple pods are stuck in the “Terminating” state, also forever

Note below the pods on node e10ctwe080c000002458, long after the node was powered off:

$ kubectl get pods -o wide -A
NAMESPACE     NAME                                      READY   STATUS        RESTARTS       AGE    IP            NODE                   NOMINATED NODE   READINESS GATES
kube-system   local-path-provisioner-6c79684f77-hf7lz   1/1     Running       20 (16h ago)   53d    10.42.0.151   e10ccwe080c000001127   <none>           <none>
kube-system   svclb-traefik-adf18ba2-qhrt8              2/2     Running       2 (16h ago)    17h    10.42.0.149   e10ccwe080c000001127   <none>           <none>
default       seaweedfs-controller-0                    4/4     Running       4 (16h ago)    4d1h   10.42.0.148   e10ccwe080c000001127   <none>           <none>
default       seaweedfs-node-f7vpm                      2/2     Running       2 (16h ago)    10d    10.42.0.150   e10ccwe080c000001127   <none>           <none>
kube-system   coredns-d76bd69b-l96lh                    1/1     Running       17 (16h ago)   53d    10.42.0.155   e10ccwe080c000001127   <none>           <none>
kube-system   traefik-df4ff85d6-7mcql                   1/1     Running       16 (16h ago)   53d    10.42.0.152   e10ccwe080c000001127   <none>           <none>
kube-system   metrics-server-7cd5fcb6b7-s9v4d           1/1     Running       20 (16h ago)   53d    10.42.0.153   e10ccwe080c000001127   <none>           <none>
kube-system   svclb-traefik-adf18ba2-chc8f              2/2     Running       2 (100m ago)   17h    10.42.1.54    e09ctwe080b000102186   <none>           <none>
default       seaweedfs-node-qp56w                      2/2     Running       6 (100m ago)   10d    10.42.1.56    e09ctwe080b000102186   <none>           <none>
default       second-ubuntu-pod-with-csi                1/1     Running       3 (16h ago)    10d    10.42.0.157   e10ccwe080c000001127   <none>           <none>
default       seaweedfs-node-xrh6f                      2/2     Running       8 (17h ago)    10d    10.42.2.70    e10ctwe080c000002458   <none>           <none>
kube-system   svclb-traefik-adf18ba2-vmssc              2/2     Running       0              17h    10.42.2.69    e10ctwe080c000002458   <none>           <none>
dev           plain-alpine-container                    1/1     Terminating   0              17h    10.42.2.72    e10ctwe080c000002458   <none>           <none>
default       ubuntu-pod-with-csi                       1/1     Terminating   0              12h    10.42.2.73    e10ctwe080c000002458   <none>           <none>
default       ubuntu-pod                                1/1     Terminating   0              17h    10.42.2.71    e10ctwe080c000002458   <none>           <none>

This wasn’t quite the result we were expecting. I was hoping that the simple pods would be deleted and recreated on another node, and that those from the DaemonSet would at least be set to ‘Failed’, as the documentation suggests:

“If a node dies or is disconnected from the rest of the cluster, Kubernetes applies a policy for setting the phase of all Pods on the lost node to Failed”
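
The taints the node controller put on the failed node also line up with the TaintManagerEviction log entries above. Checking them (command shown as an illustration; we would expect a node.kubernetes.io/not-ready or node.kubernetes.io/unreachable taint with effect NoExecute, depending on how the node went away):

$ kubectl describe node e10ctwe080c000002458 | grep -A2 Taints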

Has anyone else observed similar behavior? Does anyone have an idea of what might be wrong in our configuration, or where the issue might be?

(This seems like pretty fundamental functionality. The fact that it doesn’t work properly here, yet hasn’t been widely reported, leads us to believe for now that the issue is in our particular setup.)

P.S. I’m posting this on the Kubernetes forum because K3s is a fully compliant Kubernetes distribution and behaves just like K8s. If people think this is rather a K3s-specific issue, I’m happy to move it to the K3s forums.

Thanks in advance to all who will try to help!

This is an old bug from Kubernetes 1.18, caused by the kubelet losing its connection to the apiserver. Restarting the kubelet is the workaround, and upgrading to Kubernetes 1.19 is the fix.

You talk about Kubernetes 1.18 and 1.19.
We run Kubernetes 1.23.

We have the same/related issue (also K3s, v1.21.14), also on-prem.
Sometimes nodes do fail, and then all pods on that node stay in Terminating state. While they are replaced on other nodes, it’s still not possible to connect to some services (example: Kubeflow) until the terminating pods are deleted completely or the node that went down has recovered (and deletes the old pods by itself).
It is mind-boggling to me why that is, as the sole purpose of Kubernetes is to make sure everything is always available.

I believe we actually found the source of our problem, and the right approach. I’ll document it here before closing the issue, for the benefit of those who might stumble across the same.

Our main issue was that our simple pods were not re-created elsewhere when the node they ran on went down; those simple pods were effectively not resilient. After digging deeper into the Kubernetes documentation, we figured out our problem: we were creating the Pod workloads directly (i.e. kind: Pod). According to the Kubernetes documentation:
“[Although] Pods are the smallest deployable units of computing that one can create and manage in Kubernetes. …You’ll rarely create individual Pods directly in Kubernetes—even singleton Pods. This is because Pods are designed as relatively ephemeral, disposable entities. When a Pod gets created (directly by you, or indirectly by a controller), the new Pod is scheduled to run on a Node in the cluster. The Pod remains on that Node until the Pod finishes execution, the Pod object is deleted, the Pod is evicted for lack of resources, or the node fails… Static Pods are managed directly by the kubelet daemon on a specific node, without the API server observing them…” Hence, when a node goes down, the kubelet on that node dies with it, and Pods that are not managed by a controller are not ‘covered’ anymore; there is no process to recreate them on another node.
The proper approach for this case is to use a Deployment (kind: Deployment), even for single Pods with one replica (spec.replicas: 1). When the Pod is part of a Deployment and the serving node fails, the Pod is properly re-created on another node.
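
For completeness, here is roughly what such a Deployment looks like (a minimal sketch with placeholder names and image, not our actual workload):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ubuntu-deployment            # hypothetical name, standing in for a bare 'ubuntu-pod'
spec:
  replicas: 1                        # one replica is enough; the Deployment still reschedules it on node failure
  selector:
    matchLabels:
      app: ubuntu
  template:
    metadata:
      labels:
        app: ubuntu
    spec:
      containers:
      - name: ubuntu
        image: ubuntu:22.04          # placeholder image
        command: ["sleep", "infinity"]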

Indeed, we also noticed that simple Pods, and even Pods from a Deployment, remain indefinitely in the Terminating state on the node that is down, for as long as that node is down. However, if/when the NotReady node gets back into service, those Terminating Pods are properly cleaned up, which is actually a reasonable garbage-collection mechanism.
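
If the node is never coming back, the stuck Pod objects can also be removed by hand (a workaround at your own risk, since it removes the API objects without confirming that the containers on the dead node are really stopped), for example:

$ kubectl delete pod ubuntu-pod --force --grace-period=0
$ kubectl delete node e10ctwe080c000002458    # removing the Node object also garbage-collects the Pods bound to it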

Our second issue - Pods from DaemonSets remaining forever as Running on a node that died - is attributable to the default tolerations of a DaemonSet. Namely: “Toleration Key: node.kubernetes.io/not-ready → Effect: NoExecute, i.e. DaemonSet pods will not be evicted when there are node problems such as a network partition.”
For more details, see the DaemonSet page in the Kubernetes documentation: https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
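
For illustration, these are the NoExecute tolerations that the DaemonSet controller adds to its Pods automatically (per the documentation page above), which is why such Pods are never evicted from a NotReady/unreachable node:

tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute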

I hope this helps.

Ah, I didn’t realize you were actually running Pods without Deployments.

It’s true that it’s fine IF the node comes back online. When it doesn’t, though, the pods stay in this Terminating state indefinitely. The problem then is that we have a Service that is supposed to connect to a running pod of that Deployment, but it seems to keep routing to the pod that is terminating, so the Service is unreachable even though other pods are up and running. This is where the purpose of Kubernetes seems to go out the window.

I haven’t worked with DaemonSets yet, so I have no clue about those… :frowning: