Testing NIC failure on a bare-metal Kubernetes cluster

urvik · May 5, 2020, 11:35pm

I have a 4-node cluster running in HA mode, i.e. 3 masters and 1 worker node. I am trying to test NIC failure on a node. To simulate it, I ran ifdown eth0 on node 2; eth0 is the public NIC that k8s runs on.

I was expecting the node to be marked NotReady and its pods to be evicted and fail over to other nodes. However, that isn’t the case: the node and its pods remain in the “Ready” state. I have defined readiness probes for some of the pods, but the failing probes aren’t causing pod eviction. On pod describe I see:
Warning Unhealthy 71m (x2 over 78m) kubelet, nbso-2 Readiness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded

On the node where I ran ifdown:

systemctl status kubelet reports healthy.
kubectl from that node continues to work (get nodes, get pods, etc.).
The node is reported as healthy.
Pods on the node are reported as healthy.

However, when I try to exec into a pod on that node, or when other pods try to communicate with it, an error is thrown, as expected:
Error from server: error dialing backend: dial tcp xx.xx.xx.xx:10250: connect: no route to host

I want to understand whether this is the expected behavior. Shouldn’t the node be marked NotReady in this case, which would lead to eviction of the pods? And if that doesn’t happen for some reason, shouldn’t the pod at least be marked “not ready” because of the failing readiness probe?
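
For reference, here is a minimal sketch of the behavior I was expecting, written as a simplified model rather than the actual controller logic. It assumes the control plane marks a node NotReady once its last heartbeat is older than a grace period, and that pods become eligible for eviction after a further toleration window. The 40 s and 300 s values are what I believe the upstream defaults to be; they are assumptions and may be set differently on this cluster. The function name node_state is hypothetical.

```python
from datetime import datetime, timedelta

# Assumed defaults (not read from this cluster); adjust to the real settings.
NODE_MONITOR_GRACE_PERIOD = timedelta(seconds=40)   # heartbeat age before NotReady
DEFAULT_TOLERATION = timedelta(seconds=300)         # extra wait before eviction


def node_state(last_heartbeat, now):
    """Classify a node from the age of its last heartbeat (simplified model)."""
    age = now - last_heartbeat
    if age <= NODE_MONITOR_GRACE_PERIOD:
        return "Ready"
    if age <= NODE_MONITOR_GRACE_PERIOD + DEFAULT_TOLERATION:
        return "NotReady (pods still bound)"
    return "NotReady (pods eligible for eviction)"


now = datetime(2020, 5, 5, 23, 35, 0)
for seconds_ago in (10, 120, 600):
    # Print the heartbeat age and the state this model would assign.
    print(seconds_ago, node_state(now - timedelta(seconds=seconds_ago), now))
```

Under this model a node stays Ready for as long as heartbeats keep reaching the API server by any path, so the fact that kubectl still works from the affected node may be relevant here.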

Please let me know if more details are required. Thank you!

Cluster information:

Kubernetes version: v1.18.1
Cloud being used: bare-metal
Installation method: kubeadm
Host OS: RHEL 7.7
CNI and version: 0.3.1
CRI and version: 18.06.2-ce
