Question
I have a few master and worker nodes, and I would like to know the best practices for testing node failures. I'd like to trigger some event that causes the health check /livez or /readyz to report not ok, for example by stopping a service. Any suggestions?
Cluster information:
Kubernetes version: 1.20
Host OS: Ubuntu-18.04.5-Server-amd64
Can you reboot the box or initiate a hard power-off (from a cloud provider portal or ESXi, for example)?
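To watch the /livez and /readyz endpoints mentioned in the question while any of these tests run, here is a minimal sketch. It assumes kubectl is configured against the cluster; the wrapper only prints the commands unless you export APPLY=1, so it is safe to read through first.

```shell
#!/bin/sh
# Dry-run wrapper: print each command instead of running it unless APPLY=1 is exported.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi; }

# Aggregated health checks served by the API server.
run kubectl get --raw='/livez?verbose'
run kubectl get --raw='/readyz?verbose'

# Watch node conditions flip while a test is in progress.
run kubectl get nodes -w
```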
Off the top of my head, I like to test power, network, and storage. For bonus points you can test deeper/weirder things like resource limits, DNS, and random entropy.
Power:
Standard test: shutdown -r now
This just tests whether your node survives a graceful reboot. Consider this a must-pass.
Mean test: pull power
This tests whether your node survives an unexpected, ungraceful shutdown. Consider this a must-pass for production.
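The two power tests above can be scripted. The sysrq trigger is my suggested software stand-in for pulling the plug when you have no out-of-band power control: it reboots immediately without syncing disks. The wrapper only prints commands unless APPLY=1 is exported.

```shell
#!/bin/sh
# Dry-run wrapper: print each command instead of running it unless APPLY=1 is exported.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi; }

# Standard test: graceful reboot.
run shutdown -r now

# Mean test: approximate a power pull in software (needs root).
# sysrq 'b' reboots the kernel immediately, without syncing or unmounting.
run sh -c 'echo 1 > /proc/sys/kernel/sysrq && echo b > /proc/sysrq-trigger'
```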
Network:
Standard test: disable the interface/pull the network cable
You should see the node go NotReady and Kubernetes reschedule its pods. Consider this a must-pass.
Diabolical test: insert jitter/packet drops/lower NIC speed
Kubernetes will absolutely not pass this perfectly gracefully; use it to understand what happens when networking is degraded rather than down.
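For the network tests, tc/netem can inject the latency, jitter, and loss the diabolical test calls for. The interface name eth0 is a placeholder for your node's real interface, and these commands need root; the wrapper only prints them unless APPLY=1 is exported.

```shell
#!/bin/sh
# Dry-run wrapper: print each command instead of running it unless APPLY=1 is exported.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi; }

IFACE="${IFACE:-eth0}"   # placeholder; substitute the node's real interface

# Standard test: take the interface down entirely.
run ip link set "$IFACE" down

# Diabolical test: 100ms delay with 20ms jitter, plus 5% packet loss.
run tc qdisc add dev "$IFACE" root netem delay 100ms 20ms loss 5%

# Clean up the qdisc when done.
run tc qdisc del dev "$IFACE" root
```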
Filesystem:
Standard test: use any method you like to fill up the filesystem backing the kubelet
The kubelet should report disk pressure and start evicting pods from the node. Consider this a must-pass.
Mean test: remount the kubelet's filesystem read-only
Use this to observe what happens.
Diabolical test: unmount the kubelet's filesystem
Use this to observe what happens.
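The three filesystem tests might look like this. /var/lib/kubelet is the default kubelet root, the 50G fill size is an arbitrary guess, and the remount steps assume that directory sits on its own mount; adjust all of these to your layout. As before, the wrapper only prints commands unless APPLY=1 is exported.

```shell
#!/bin/sh
# Dry-run wrapper: print each command instead of running it unless APPLY=1 is exported.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi; }

KUBE_FS="${KUBE_FS:-/var/lib/kubelet}"   # default kubelet root; adjust to your layout

# Standard test: fill the filesystem (pick a size that actually fills it).
run fallocate -l 50G "$KUBE_FS/balloon"

# Mean test: remount the backing filesystem read-only.
run mount -o remount,ro "$KUBE_FS"

# Diabolical test: unmount it outright (lazy, since it will be busy).
run umount -l "$KUBE_FS"

# Recovery: remove the balloon file and remount read-write.
run rm -f "$KUBE_FS/balloon"
run mount -o remount,rw "$KUBE_FS"
```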