How To Test A Node For Failure [Force A Node To Fail]

Question

I have a few master and worker nodes and I would like to know the best practices for testing node failures. I want to trigger some event that causes the /livez or /readyz health check to report not ok, for example by turning off a service. Any suggestions?

Cluster information:

Kubernetes version: 1.20
Host OS: Ubuntu-18.04.5-Server-amd64

Can you reboot the box or initiate a hard power-off (for example, from the cloud provider portal or ESXi)?

Off the top of my head, I like to test power, network, and storage. For bonus points you can test deeper/weirder things like limits, DNS, and random entropy.
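Whichever test you run, it helps to watch the API server health endpoints from the question while you do it. Here is a rough Python sketch that splits the verbose report into passing and failing checks. It assumes you fetch the text yourself, for example with kubectl get --raw='/readyz?verbose', and the sample report below is illustrative, not captured from a real cluster.

```python
def parse_health_report(text):
    """Split a verbose /livez or /readyz report into passing and failing check names.

    Assumes each check is reported on its own line, prefixed with
    "[+]" when it passes and "[-]" when it fails.
    """
    passed, failed = [], []
    for line in text.splitlines():
        line = line.strip()
        words = line[3:].split()
        if not words:
            continue  # skip blank lines and bare markers
        if line.startswith("[+]"):
            passed.append(words[0])
        elif line.startswith("[-]"):
            failed.append(words[0])
    return passed, failed


# Illustrative sample only; real reports list many more checks.
sample = """[+]ping ok
[+]log ok
[-]etcd failed: reason withheld
readyz check failed"""

passed, failed = parse_health_report(sample)
print("passing:", passed)
print("failing:", failed)
```

Run it in a loop during a test and you can see which individual check flips first.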

Power:

Standard test: shutdown -r now

This tests whether your node can survive a graceful reboot. Consider this a must-pass.

Mean test: pull power

This tests whether your node can survive an unexpected, ungraceful shutdown. Consider this a must-pass for production.

Network:

Standard test: disable the interface/pull the network cable

You should see the node go NotReady and Kubernetes reschedule its pods onto other nodes. Consider this a must-pass.
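To confirm the node really went unready, you can read the Ready condition of each node. A minimal sketch in Python, assuming you feed it the output of kubectl get nodes -o json; the node names in the sample are made up. A node whose kubelet has stopped reporting normally shows a Ready status of "Unknown" rather than "False".

```python
import json


def node_ready_status(nodes_json):
    """Map each node name to the status of its Ready condition.

    The status is "True", "False", or "Unknown"; a node with no
    Ready condition at all is reported as "Unknown" here.
    """
    result = {}
    for item in json.loads(nodes_json)["items"]:
        conditions = item.get("status", {}).get("conditions", [])
        ready = next(
            (c["status"] for c in conditions if c["type"] == "Ready"),
            "Unknown",
        )
        result[item["metadata"]["name"]] = ready
    return result


# Hypothetical sample shaped like `kubectl get nodes -o json` output.
sample_nodes = json.dumps({
    "items": [
        {"metadata": {"name": "worker-1"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "worker-2"},
         "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
    ]
})

print(node_ready_status(sample_nodes))
```

Keep in mind that pods are not moved the instant the node goes unready. With default settings there is a delay of several minutes before they are evicted, so give the test time before calling it a failure.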

Diabolical test: insert jitter/packet drops/lower nic speed

Kubernetes will absolutely not handle this perfectly gracefully. Use this test to understand what happens when the network is bad rather than down.
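One common way to inject jitter and packet loss on Linux is tc with the netem qdisc. The sketch below only builds the command lines so you can review them before running anything as root. The interface name eth0 and the delay/loss numbers are placeholders I picked, not recommendations.

```python
def netem_add_command(device, delay_ms, jitter_ms, loss_pct):
    """Build a tc command that adds delay, jitter, and packet loss to a device."""
    return [
        "tc", "qdisc", "add", "dev", device, "root", "netem",
        "delay", f"{delay_ms}ms", f"{jitter_ms}ms",
        "loss", f"{loss_pct}%",
    ]


def netem_del_command(device):
    """Build the tc command that removes the root qdisc again."""
    return ["tc", "qdisc", "del", "dev", device, "root"]


# Placeholder values: 100ms delay with 20ms jitter and 5% loss on eth0.
print(" ".join(netem_add_command("eth0", 100, 20, 5)))
print(" ".join(netem_del_command("eth0")))
```

Have the removal command ready before you add the impairment, ideally from a console that does not depend on the interface you are degrading.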

Filesystem:

Standard test: use any method you like to fill up the Kubernetes filesystem

The kubelet should start evicting pods from the node so that they get rescheduled elsewhere. Consider this a must-pass.
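While you fill the disk, it is useful to know how close you are to the point where the kubelet reacts. Here is a small sketch that compares free space against a threshold. The 10% default in the code is my assumption, based on the kubelet's default hard eviction threshold for nodefs.available; your cluster may be configured differently, and /var/lib/kubelet is only the usual kubelet directory, so check both.

```python
import shutil


def below_threshold(total_bytes, free_bytes, threshold=0.10):
    """Return True when the free fraction of the filesystem is under the threshold."""
    return (free_bytes / total_bytes) < threshold


def path_below_threshold(path, threshold=0.10):
    """Check the filesystem that holds `path` against the threshold."""
    usage = shutil.disk_usage(path)
    return below_threshold(usage.total, usage.free, threshold)


# Example with made-up numbers: 100 GiB disk with 5 GiB free.
gib = 1024 ** 3
print(below_threshold(100 * gib, 5 * gib))
```

On a node you would call path_below_threshold("/var/lib/kubelet") in a loop while the disk fills, and compare the moment it flips with the moment the node reports disk pressure.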

Mean test: remount the Kubernetes filesystem read-only

Use this to observe what happens.

Diabolical test: unmount the Kubernetes filesystem.

Use this to observe what happens.

Kill the kubelet. Or try freezing a process.

https://www.kernel.org/doc/Documentation/cgroup-v1/freezer-subsystem.txt
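Going by the freezer document linked above, freezing comes down to putting a PID into a freezer cgroup and writing FROZEN to its freezer.state file, then THAWED to resume. A rough Python sketch follows. It assumes the cgroup v1 freezer hierarchy is mounted at /sys/fs/cgroup/freezer and that you have already created a cgroup directory there; the name chaos-test in the usage note is hypothetical.

```python
import os

VALID_STATES = ("FROZEN", "THAWED")


def add_task(cgroup_dir, pid):
    """Move a process into the freezer cgroup by writing its PID to `tasks`."""
    with open(os.path.join(cgroup_dir, "tasks"), "w") as f:
        f.write(str(pid))


def set_freezer_state(cgroup_dir, state):
    """Freeze or thaw every task in the cgroup via its freezer.state file."""
    if state not in VALID_STATES:
        raise ValueError(f"state must be one of {VALID_STATES}")
    with open(os.path.join(cgroup_dir, "freezer.state"), "w") as f:
        f.write(state)
```

As root on the node, usage would look something like add_task("/sys/fs/cgroup/freezer/chaos-test", pid) followed by set_freezer_state("/sys/fs/cgroup/freezer/chaos-test", "FROZEN"), and later "THAWED" to let the process continue. A frozen process stays alive but stops making progress, which is a different failure mode from killing it outright.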