Hi people!
We are running a 3-node K3s cluster on ARM64 with Alpine Linux. K3s is a certified Kubernetes distribution for resource-constrained (IoT and edge computing) devices. Our cluster looks like this:
$ kubectl get nodes -o wide
NAME                   STATUS   ROLES                  AGE   VERSION        INTERNAL-IP     EXTERNAL-IP   OS-IMAGE            KERNEL-VERSION     CONTAINER-RUNTIME
e10ccwe080c000001127   Ready    control-plane,master   53d   v1.23.8+k3s2   172.31.0.143    <none>        Alpine Linux v3.9   4.9.291-vsys-1.0   docker://18.9.1
e10ctwe080c000002458   Ready    <none>                 31d   v1.23.8+k3s2   10.101.115.39   <none>        Alpine Linux v3.9   4.9.291-vsys-1.0   docker://18.9.1
e09ctwe080b000102186   Ready    <none>                 31d   v1.23.8+k3s2   10.101.35.11    <none>        Alpine Linux v3.9   4.9.291-vsys-1.0   docker://18.9.1
Under normal conditions the cluster runs just fine. We run a variety of workloads, including some with persistent storage and CSI.
The issue appears when we test the redundancy and fail-over of the cluster. I power off one of the worker nodes to simulate a hardware failure. Shortly afterwards, Kubernetes detects that the node is down, and we get this:
$ kubectl get nodes -o wide
NAME                   STATUS     ROLES                  AGE   VERSION        INTERNAL-IP     EXTERNAL-IP   OS-IMAGE            KERNEL-VERSION     CONTAINER-RUNTIME
e09ctwe080b000102186   Ready      <none>                 31d   v1.23.8+k3s2   10.101.35.11    <none>        Alpine Linux v3.9   4.9.291-vsys-1.0   docker://18.9.1
e10ccwe080c000001127   Ready      control-plane,master   53d   v1.23.8+k3s2   172.31.0.143    <none>        Alpine Linux v3.9   4.9.291-vsys-1.0   docker://18.9.1
e10ctwe080c000002458   NotReady   <none>                 31d   v1.23.8+k3s2   10.101.115.39   <none>        Alpine Linux v3.9   4.9.291-vsys-1.0   docker://18.9.1
The log entries also seem to indicate that Kubernetes has registered the node-down event and is taking the respective actions (the initial ones, and then those after the default eviction timeout of 5 minutes):
time="2022-07-19T09:25:55+01:00" level=info msg="error in remotedialer server [400]: read tcp 172.31.0.143:6443->10.101.115.39:35794: i/o timeout"
I0719 09:26:26.486695    6745 event.go:294] "Event occurred" object="e10ctwe080c000002458" kind="Node" apiVersion="v1" type="Normal" reason="NodeNotReady" message="Node e10ctwe080c000002458 status is now: NodeNotReady"
time="2022-07-19T09:26:26+01:00" level=debug msg="Tunnel server egress proxy updating Node e10ctwe080c000002458 IP 10.101.115.39/32"
I0719 09:26:26.531952    6745 event.go:294] "Event occurred" object="default/ubuntu-pod-with-csi" kind="Pod" apiVersion="v1" type="Warning" reason="NodeNotReady" message="Node is not ready"
time="2022-07-19T09:26:26+01:00" level=debug msg="Tunnel server egress proxy updating Node e10ctwe080c000002458 IP 10.101.115.39/32"
I0719 09:26:26.635176    6745 event.go:294] "Event occurred" object="default/seaweedfs-node-xrh6f" kind="Pod" apiVersion="v1" type="Warning" reason="NodeNotReady" message="Node is not ready"
I0719 09:26:26.694040    6745 event.go:294] "Event occurred" object="kube-system/svclb-traefik-adf18ba2-vmssc" kind="Pod" apiVersion="v1" type="Warning" reason="NodeNotReady" message="Node is not ready"
time="2022-07-19T09:26:26+01:00" level=debug msg="DesiredSet - No change(2) apps/v1, Kind=DaemonSet kube-system/svclb-traefik-adf18ba2 for svccontroller kube-system/traefik"
time="2022-07-19T09:26:26+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Service\", Namespace:\"kube-system\", Name:\"traefik\", UID:\"adf18ba2-8ef1-4e7d-b0a1-b77c42015e54\", APIVersion:\"v1\", ResourceVersion:\"1357977\", FieldPath:\"\"}): type: 'Normal' reason: 'AppliedDaemonSet' Applied LoadBalancer DaemonSet kube-system/svclb-traefik-adf18ba2"
time="2022-07-19T09:26:26+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Service\", Namespace:\"kube-system\", Name:\"traefik\", UID:\"adf18ba2-8ef1-4e7d-b0a1-b77c42015e54\", APIVersion:\"v1\", ResourceVersion:\"1357977\", FieldPath:\"\"}): type: 'Normal' reason: 'UpdatedIngressIP' LoadBalancer Ingress IP addresses updated: 10.101.35.11, 172.31.0.143"
I0719 09:26:26.749555    6745 event.go:294] "Event occurred" object="default/ubuntu-pod" kind="Pod" apiVersion="v1" type="Warning" reason="NodeNotReady" message="Node is not ready"
time="2022-07-19T09:26:26+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Service\", Namespace:\"kube-system\", Name:\"traefik\", UID:\"adf18ba2-8ef1-4e7d-b0a1-b77c42015e54\", APIVersion:\"v1\", ResourceVersion:\"1360461\", FieldPath:\"\"}): type: 'Normal' reason: 'AppliedDaemonSet' Applied LoadBalancer DaemonSet kube-system/svclb-traefik-adf18ba2"
I0719 09:26:26.776583    6745 event.go:294] "Event occurred" object="dev/plain-alpine-container" kind="Pod" apiVersion="v1" type="Warning" reason="NodeNotReady" message="Node is not ready"
time="2022-07-19T09:26:26+01:00" level=debug msg="DesiredSet - No change(2) apps/v1, Kind=DaemonSet kube-system/svclb-traefik-adf18ba2 for svccontroller kube-system/traefik"
I0719 09:31:31.934393    6745 taint_manager.go:106] "NoExecuteTaintManager is deleting pod" pod="default/ubuntu-pod-with-csi"
I0719 09:31:31.934393    6745 taint_manager.go:106] "NoExecuteTaintManager is deleting pod" pod="dev/plain-alpine-container"
I0719 09:31:31.934407    6745 taint_manager.go:106] "NoExecuteTaintManager is deleting pod" pod="default/ubuntu-pod"
I0719 09:31:31.934859    6745 event.go:294] "Event occurred" object="default/ubuntu-pod-with-csi" kind="Pod" apiVersion="" type="Normal" reason="TaintManagerEviction" message="Marking for deletion Pod default/ubuntu-pod-with-csi"
I0719 09:31:31.934953    6745 event.go:294] "Event occurred" object="dev/plain-alpine-container" kind="Pod" apiVersion="" type="Normal" reason="TaintManagerEviction" message="Marking for deletion Pod dev/plain-alpine-container"
I0719 09:31:31.935025    6745 event.go:294] "Event occurred" object="default/ubuntu-pod" kind="Pod" apiVersion="" type="Normal" reason="TaintManagerEviction" message="Marking for deletion Pod default/ubuntu-pod"
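As a quick check of the timing in the log above, here is a minimal Python sketch that measures the gap between the NodeNotReady event and the first taint manager eviction. The two timestamps are copied from the log lines; the helper name `klog_time` is made up for this illustration, and the year is assumed to be 2022 based on the K3s `time="2022-07-19..."` lines.

```python
from datetime import datetime


def klog_time(stamp: str) -> datetime:
    """Parse the 'MMDD HH:MM:SS.micros' part of a klog line prefix.

    The klog prefix carries no year, so 2022 is assumed here
    (taken from the K3s time="2022-07-19..." lines in the same log).
    """
    return datetime.strptime("2022" + stamp, "%Y%m%d %H:%M:%S.%f")


# "Node e10ctwe080c000002458 status is now: NodeNotReady"
node_not_ready = klog_time("0719 09:26:26.486695")
# First "NoExecuteTaintManager is deleting pod" line
first_eviction = klog_time("0719 09:31:31.934393")

gap = (first_eviction - node_not_ready).total_seconds()
print(f"{gap:.1f} s between NodeNotReady and the first eviction")
```

The gap comes out at roughly 305 seconds, which seems consistent with the 5-minute default eviction timeout mentioned above, so the control plane side appears to react on schedule.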
However, the actual result is:
- the pods that are part of a DaemonSet or StatefulSet remain "Running" forever (on the node that is already down and gone)
- the other, simple pods are stuck in the "Terminating" state, also forever
Note the pods on node e10ctwe080c000002458 below, long after the node was powered off:
$ kubectl get pods -o wide -A
NAMESPACE     NAME                                      READY   STATUS        RESTARTS       AGE    IP            NODE                   NOMINATED NODE   READINESS GATES
kube-system   local-path-provisioner-6c79684f77-hf7lz   1/1     Running       20 (16h ago)   53d    10.42.0.151   e10ccwe080c000001127   <none>           <none>
kube-system   svclb-traefik-adf18ba2-qhrt8              2/2     Running       2 (16h ago)    17h    10.42.0.149   e10ccwe080c000001127   <none>           <none>
default       seaweedfs-controller-0                    4/4     Running       4 (16h ago)    4d1h   10.42.0.148   e10ccwe080c000001127   <none>           <none>
default       seaweedfs-node-f7vpm                      2/2     Running       2 (16h ago)    10d    10.42.0.150   e10ccwe080c000001127   <none>           <none>
kube-system   coredns-d76bd69b-l96lh                    1/1     Running       17 (16h ago)   53d    10.42.0.155   e10ccwe080c000001127   <none>           <none>
kube-system   traefik-df4ff85d6-7mcql                   1/1     Running       16 (16h ago)   53d    10.42.0.152   e10ccwe080c000001127   <none>           <none>
kube-system   metrics-server-7cd5fcb6b7-s9v4d           1/1     Running       20 (16h ago)   53d    10.42.0.153   e10ccwe080c000001127   <none>           <none>
kube-system   svclb-traefik-adf18ba2-chc8f              2/2     Running       2 (100m ago)   17h    10.42.1.54    e09ctwe080b000102186   <none>           <none>
default       seaweedfs-node-qp56w                      2/2     Running       6 (100m ago)   10d    10.42.1.56    e09ctwe080b000102186   <none>           <none>
default       second-ubuntu-pod-with-csi                1/1     Running       3 (16h ago)    10d    10.42.0.157   e10ccwe080c000001127   <none>           <none>
default       seaweedfs-node-xrh6f                      2/2     Running       8 (17h ago)    10d    10.42.2.70    e10ctwe080c000002458   <none>           <none>
kube-system   svclb-traefik-adf18ba2-vmssc              2/2     Running       0              17h    10.42.2.69    e10ctwe080c000002458   <none>           <none>
dev           plain-alpine-container                    1/1     Terminating   0              17h    10.42.2.72    e10ctwe080c000002458   <none>           <none>
default       ubuntu-pod-with-csi                       1/1     Terminating   0              12h    10.42.2.73    e10ctwe080c000002458   <none>           <none>
default       ubuntu-pod                                1/1     Terminating   0              17h    10.42.2.71    e10ctwe080c000002458   <none>           <none>
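To make the stuck pods easier to spot, here is a small Python sketch that groups the pods on the powered-off node by status. It works on rows pasted from the listing above rather than on a live cluster; the function name `pods_on_node` and the choice of sample rows are only for illustration.

```python
from collections import defaultdict

DOWN_NODE = "e10ctwe080c000002458"

# A few rows copied from the listing above (header line omitted).
LISTING = """\
default       seaweedfs-node-qp56w                      2/2     Running       6 (100m ago)   10d    10.42.1.56    e09ctwe080b000102186   <none>           <none>
default       seaweedfs-node-xrh6f                      2/2     Running       8 (17h ago)    10d    10.42.2.70    e10ctwe080c000002458   <none>           <none>
kube-system   svclb-traefik-adf18ba2-vmssc              2/2     Running       0              17h    10.42.2.69    e10ctwe080c000002458   <none>           <none>
dev           plain-alpine-container                    1/1     Terminating   0              17h    10.42.2.72    e10ctwe080c000002458   <none>           <none>
default       ubuntu-pod-with-csi                       1/1     Terminating   0              12h    10.42.2.73    e10ctwe080c000002458   <none>           <none>
default       ubuntu-pod                                1/1     Terminating   0              17h    10.42.2.71    e10ctwe080c000002458   <none>           <none>
"""


def pods_on_node(listing: str, node: str) -> dict:
    """Group 'namespace/name' entries by STATUS for pods on the given node."""
    by_status = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue  # not a complete pod row
        # RESTARTS may contain spaces ("8 (17h ago)"), so NODE is read
        # from the end: the last two columns are single tokens here.
        namespace, name, status = fields[0], fields[1], fields[3]
        if fields[-3] == node:
            by_status[status].append(f"{namespace}/{name}")
    return dict(by_status)


stuck = pods_on_node(LISTING, DOWN_NODE)
for status, pods in sorted(stuck.items()):
    print(status, pods)
```

For these rows it reports two "Running" pods and three "Terminating" pods, all still assigned to the node that is gone.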
This wasn't quite the result we were expecting. I was hoping that the simple pods would be deleted and recreated on another node, and that those from the DaemonSet would at least be set to "Failed", as the documentation suggests:
"If a node dies or is disconnected from the rest of the cluster, Kubernetes applies a policy for setting the phase of all Pods on the lost node to Failed."
Has anyone else observed similar behavior? Does anyone have an idea what might be wrong in our configuration, or where the issue might be?
(This seems like pretty fundamental functionality. That it does not work properly, and that this has not already been widely noticed, leads us to believe for now that the issue is in our particular setup.)
P.S. I am posting this on the Kubernetes forum because K3s is a fully compatible Kubernetes distribution and behaves just like K8s. If people think this is rather a K3s-specific issue, I am happy to move it to the K3s forums.
Thanks in advance to all who will try to help!
