I have a 4 node cluster which runs in HA mode ie 3 masters and 1 worker nodes. I am trying to test nic failure on a node and to simulate that I ran ifdown eth0 which is the public nic on which k8s is running on node 2.
I was expecting the node to be marked not ready and pods to evict and failover to other nodes. However, that isn’t the case. The pods and nodes remain in “ready” state. I have defined readiness probes for some of the pods but it isn’t causing pod eviction as I see Warning  Unhealthy  71m (x2 over 78m)  kubelet, nbso-2  Readiness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded on pod describe.
On the node which I ran ifdown:
- systemctl status kubelet reports healthy.
- kubectl from that node continues to work (get nodes, pods etc)
- node is being reported healthy
- pods on the node are being reported healthy
However when I try to exec into the pod or other pods try to communicate with the pod, it throws an error as expected.
Error from server: error dialing backend: dial tcp xx.xx.xx.xx:10250: connect: no route to host
I want to understand if this is the expected behavior. Shouldn’t the node be marked not ready in this case which would lead to eviction of the pods. And if that doesn’t happen for some reason, because of the readiness probe failing, shouldn’t the pod be marked “not ready”?
Please let me know if more details are required. Thank you!
Cluster information:
Kubernetes version: v1.18.1
Cloud being used: bare-metal
Installation method: kubeadm
Host OS: RHEL 7.7
CNI and version: 0.3.1
CRI and version: 18.06.2-ce