Hi people!
We are running a 3-node K3s cluster on ARM64 with Alpine Linux. K3s is the certified Kubernetes distribution built for resource-constrained (IoT & edge computing) devices. Our cluster looks like this:
$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
e10ccwe080c000001127 Ready control-plane,master 53d v1.23.8+k3s2 172.31.0.143 <none> Alpine Linux v3.9 4.9.291-vsys-1.0 docker://18.9.1
e10ctwe080c000002458 Ready <none> 31d v1.23.8+k3s2 10.101.115.39 <none> Alpine Linux v3.9 4.9.291-vsys-1.0 docker://18.9.1
e09ctwe080b000102186 Ready <none> 31d v1.23.8+k3s2 10.101.35.11 <none> Alpine Linux v3.9 4.9.291-vsys-1.0 docker://18.9.1
Under normal conditions the cluster runs just fine. We run a variety of workloads, including some with persistent storage via CSI.
The issue shows up when we test the redundancy and fail-over of the cluster: I power off one of the worker nodes to simulate a hardware failure. Shortly after that, Kubernetes indeed detects that the node is down, and we get this:
$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
e09ctwe080b000102186 Ready <none> 31d v1.23.8+k3s2 10.101.35.11 <none> Alpine Linux v3.9 4.9.291-vsys-1.0 docker://18.9.1
e10ccwe080c000001127 Ready control-plane,master 53d v1.23.8+k3s2 172.31.0.143 <none> Alpine Linux v3.9 4.9.291-vsys-1.0 docker://18.9.1
e10ctwe080c000002458 NotReady <none> 31d v1.23.8+k3s2 10.101.115.39 <none> Alpine Linux v3.9 4.9.291-vsys-1.0 docker://18.9.1
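For completeness, the taints that the node controller puts on the lost node can also be checked; we would expect node.kubernetes.io/unreachable with NoSchedule and NoExecute effects once the node is considered gone (the node name below is the one from our cluster):
$ kubectl describe node e10ctwe080c000002458 | grep -A3 Taints
$ kubectl get node e10ctwe080c000002458 -o jsonpath='{.spec.taints}'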
The log entries also seem to indicate that Kubernetes has registered the node-down event and is taking the corresponding actions, e.g. the initial events and then the ones after the default eviction timeout of 5 minutes (see also the toleration check after the log excerpt):
time="2022-07-19T09:25:55+01:00" level=info msg="error in remotedialer server [400]: read tcp 172.31.0.143:6443->10.101.115.39:35794: i/o timeout"
I0719 09:26:26.486695 6745 event.go:294] "Event occurred" object="e10ctwe080c000002458" kind="Node" apiVersion="v1" type="Normal" reason="NodeNotReady" message="Node e10ctwe080c000002458 status is now: NodeNotReady"
time="2022-07-19T09:26:26+01:00" level=debug msg="Tunnel server egress proxy updating Node e10ctwe080c000002458 IP 10.101.115.39/32"
I0719 09:26:26.531952 6745 event.go:294] "Event occurred" object="default/ubuntu-pod-with-csi" kind="Pod" apiVersion="v1" type="Warning" reason="NodeNotReady" message="Node is not ready"
time="2022-07-19T09:26:26+01:00" level=debug msg="Tunnel server egress proxy updating Node e10ctwe080c000002458 IP 10.101.115.39/32"
I0719 09:26:26.635176 6745 event.go:294] "Event occurred" object="default/seaweedfs-node-xrh6f" kind="Pod" apiVersion="v1" type="Warning" reason="NodeNotReady" message="Node is not ready"
I0719 09:26:26.694040 6745 event.go:294] "Event occurred" object="kube-system/svclb-traefik-adf18ba2-vmssc" kind="Pod" apiVersion="v1" type="Warning" reason="NodeNotReady" message="Node is not ready"
time="2022-07-19T09:26:26+01:00" level=debug msg="DesiredSet - No change(2) apps/v1, Kind=DaemonSet kube-system/svclb-traefik-adf18ba2 for svccontroller kube-system/traefik"
time="2022-07-19T09:26:26+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Service\", Namespace:\"kube-system\", Name:\"traefik\", UID:\"adf18ba2-8ef1-4e7d-b0a1-b77c42015e54\", APIVersion:\"v1\", ResourceVersion:\"1357977\", FieldPath:\"\"}): type: 'Normal' reason: 'AppliedDaemonSet' Applied LoadBalancer DaemonSet kube-system/svclb-traefik-adf18ba2"
time="2022-07-19T09:26:26+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Service\", Namespace:\"kube-system\", Name:\"traefik\", UID:\"adf18ba2-8ef1-4e7d-b0a1-b77c42015e54\", APIVersion:\"v1\", ResourceVersion:\"1357977\", FieldPath:\"\"}): type: 'Normal' reason: 'UpdatedIngressIP' LoadBalancer Ingress IP addresses updated: 10.101.35.11, 172.31.0.143"
I0719 09:26:26.749555 6745 event.go:294] "Event occurred" object="default/ubuntu-pod" kind="Pod" apiVersion="v1" type="Warning" reason="NodeNotReady" message="Node is not ready"
time="2022-07-19T09:26:26+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Service\", Namespace:\"kube-system\", Name:\"traefik\", UID:\"adf18ba2-8ef1-4e7d-b0a1-b77c42015e54\", APIVersion:\"v1\", ResourceVersion:\"1360461\", FieldPath:\"\"}): type: 'Normal' reason: 'AppliedDaemonSet' Applied LoadBalancer DaemonSet kube-system/svclb-traefik-adf18ba2"
I0719 09:26:26.776583 6745 event.go:294] "Event occurred" object="dev/plain-alpine-container" kind="Pod" apiVersion="v1" type="Warning" reason="NodeNotReady" message="Node is not ready"
time="2022-07-19T09:26:26+01:00" level=debug msg="DesiredSet - No change(2) apps/v1, Kind=DaemonSet kube-system/svclb-traefik-adf18ba2 for svccontroller kube-system/traefik"
I0719 09:31:31.934393 6745 taint_manager.go:106] "NoExecuteTaintManager is deleting pod" pod="default/ubuntu-pod-with-csi"
I0719 09:31:31.934393 6745 taint_manager.go:106] "NoExecuteTaintManager is deleting pod" pod="dev/plain-alpine-container"
I0719 09:31:31.934407 6745 taint_manager.go:106] "NoExecuteTaintManager is deleting pod" pod="default/ubuntu-pod"
I0719 09:31:31.934859 6745 event.go:294] "Event occurred" object="default/ubuntu-pod-with-csi" kind="Pod" apiVersion="" type="Normal" reason="TaintManagerEviction" message="Marking for deletion Pod default/ubuntu-pod-with-csi"
I0719 09:31:31.934953 6745 event.go:294] "Event occurred" object="dev/plain-alpine-container" kind="Pod" apiVersion="" type="Normal" reason="TaintManagerEviction" message="Marking for deletion Pod dev/plain-alpine-container"
I0719 09:31:31.935025 6745 event.go:294] "Event occurred" object="default/ubuntu-pod" kind="Pod" apiVersion="" type="Normal" reason="TaintManagerEviction" message="Marking for deletion Pod default/ubuntu-pod"
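As a side note, the roughly 5-minute gap between the NodeNotReady events (09:26) and the TaintManagerEviction events (09:31) seems to match the default tolerationSeconds of 300 that Kubernetes normally adds to pods for the not-ready/unreachable taints; a quick way to check that on one of our pods (name from the listing further below) would be:
$ kubectl get pod ubuntu-pod -n default -o jsonpath='{.spec.tolerations}'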
However, the actual result is:
- the pods that are part of a DaemonSet or StatefulSet remain “Running” forever (on the node that is already down and gone)
- the other, standalone pods are stuck in the “Terminating” state, also forever
Note below the pods on node e10ctwe080c000002458, long after the node was powered off:
$ kubectl get pods -o wide -A
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system local-path-provisioner-6c79684f77-hf7lz 1/1 Running 20 (16h ago) 53d 10.42.0.151 e10ccwe080c000001127 <none> <none>
kube-system svclb-traefik-adf18ba2-qhrt8 2/2 Running 2 (16h ago) 17h 10.42.0.149 e10ccwe080c000001127 <none> <none>
default seaweedfs-controller-0 4/4 Running 4 (16h ago) 4d1h 10.42.0.148 e10ccwe080c000001127 <none> <none>
default seaweedfs-node-f7vpm 2/2 Running 2 (16h ago) 10d 10.42.0.150 e10ccwe080c000001127 <none> <none>
kube-system coredns-d76bd69b-l96lh 1/1 Running 17 (16h ago) 53d 10.42.0.155 e10ccwe080c000001127 <none> <none>
kube-system traefik-df4ff85d6-7mcql 1/1 Running 16 (16h ago) 53d 10.42.0.152 e10ccwe080c000001127 <none> <none>
kube-system metrics-server-7cd5fcb6b7-s9v4d 1/1 Running 20 (16h ago) 53d 10.42.0.153 e10ccwe080c000001127 <none> <none>
kube-system svclb-traefik-adf18ba2-chc8f 2/2 Running 2 (100m ago) 17h 10.42.1.54 e09ctwe080b000102186 <none> <none>
default seaweedfs-node-qp56w 2/2 Running 6 (100m ago) 10d 10.42.1.56 e09ctwe080b000102186 <none> <none>
default second-ubuntu-pod-with-csi 1/1 Running 3 (16h ago) 10d 10.42.0.157 e10ccwe080c000001127 <none> <none>
default seaweedfs-node-xrh6f 2/2 Running 8 (17h ago) 10d 10.42.2.70 e10ctwe080c000002458 <none> <none>
kube-system svclb-traefik-adf18ba2-vmssc 2/2 Running 0 17h 10.42.2.69 e10ctwe080c000002458 <none> <none>
dev plain-alpine-container 1/1 Terminating 0 17h 10.42.2.72 e10ctwe080c000002458 <none> <none>
default ubuntu-pod-with-csi 1/1 Terminating 0 12h 10.42.2.73 e10ctwe080c000002458 <none> <none>
default ubuntu-pod 1/1 Terminating 0 17h 10.42.2.71 e10ctwe080c000002458 <none> <none>
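For the pods stuck in “Terminating”, kubectl shows that status because their deletionTimestamp is already set; the deletion simply never completes. The timestamp and any finalizers can be inspected on one of them with something like:
$ kubectl get pod ubuntu-pod -n default -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'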
This wasn’t quite the result we were expecting. I was hoping that the standalone pods would be deleted and recreated on another node, and that the DaemonSet ones would at least be set to “Failed”, as the documentation suggests:
“If a node dies or is disconnected from the rest of the cluster, Kubernetes applies a policy for setting the phase of all Pods on the lost node to Failed”
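For reference, the phase the documentation refers to can be read directly from the pod status, e.g. for the DaemonSet pod that is still shown as Running on the dead node:
$ kubectl get pod seaweedfs-node-xrh6f -n default -o jsonpath='{.status.phase}'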
Has anyone else observed similar behavior? Does anyone have an idea what might be wrong in our configuration, or where the issue might lie?
(This seems like pretty fundamental functionality. The fact that it doesn’t work properly for us, yet apparently hasn’t been widely noticed, leads us to believe for now that the issue is in our particular setup.)
P.S. I’m posting this on the Kubernetes forum because K3s is a fully conformant Kubernetes distribution and behaves just like K8s. If people think this is rather a K3s-specific issue, I’m happy to move it to the K3s forums.
Thanks in advance to everyone who tries to help!