Question
I have a few master and worker nodes, and I would like to know the best practices for testing node failures. I'd like to trigger some event that causes the health check /livez or /readyz to report not ok, for example by stopping a service. Any suggestions?
Cluster information:
Kubernetes version: 1.20
Host OS: Ubuntu-18.04.5-Server-amd64
Can you reboot the box or initiate a hard power-off (from a cloud provider portal or ESXi, for example)?
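To watch the /livez and /readyz endpoints mentioned in the question while any of these tests run, here is a minimal sketch. It assumes kubectl is configured against the cluster; the wrapper only prints the commands unless you export APPLY=1, so it is safe to read through first.

```shell
#!/bin/sh
# Dry-run wrapper: print each command instead of running it unless APPLY=1 is exported.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi; }

# Aggregated health checks served by the API server.
run kubectl get --raw='/livez?verbose'
run kubectl get --raw='/readyz?verbose'

# Watch node conditions flip while a test is in progress.
run kubectl get nodes -w
```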
Off the top of my head, I like to test power, network, and storage. For bonus points you can test deeper/weirder things like resource limits, DNS, and random entropy.
Power:
Standard test: shutdown -r now
This just tests whether your node survives a graceful reboot. Consider this a must-pass.
Mean test: pull power
This tests whether your node survives an unexpected, ungraceful shutdown. Consider this a must-pass for production.
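The two power tests above can be scripted. The sysrq trigger is my suggested software stand-in for pulling the plug when you have no out-of-band power control: it reboots immediately without syncing disks. The wrapper only prints commands unless APPLY=1 is exported.

```shell
#!/bin/sh
# Dry-run wrapper: print each command instead of running it unless APPLY=1 is exported.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi; }

# Standard test: graceful reboot.
run shutdown -r now

# Mean test: approximate a power pull in software (needs root).
# sysrq 'b' reboots the kernel immediately, without syncing or unmounting.
run sh -c 'echo 1 > /proc/sys/kernel/sysrq && echo b > /proc/sysrq-trigger'
```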
Network:
Standard test: disable the interface/pull the network cable
You should see the node go NotReady and Kubernetes reschedule its pods. Consider this a must-pass.
Diabolical test: insert jitter/packet drops/lower NIC speed
Kubernetes will absolutely not pass this perfectly gracefully; use it to understand what happens when networking is degraded rather than down.
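For the network tests, tc/netem can inject the latency, jitter, and loss the diabolical test calls for. The interface name eth0 is a placeholder for your node's real interface, and these commands need root; the wrapper only prints them unless APPLY=1 is exported.

```shell
#!/bin/sh
# Dry-run wrapper: print each command instead of running it unless APPLY=1 is exported.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi; }

IFACE="${IFACE:-eth0}"   # placeholder; substitute the node's real interface

# Standard test: take the interface down entirely.
run ip link set "$IFACE" down

# Diabolical test: 100ms delay with 20ms jitter, plus 5% packet loss.
run tc qdisc add dev "$IFACE" root netem delay 100ms 20ms loss 5%

# Clean up the qdisc when done.
run tc qdisc del dev "$IFACE" root
```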
Filesystem:
Standard test: use any method you like to fill up the filesystem backing the kubelet
The kubelet should report disk pressure and start evicting pods from the node. Consider this a must-pass.
Mean test: remount the kubelet's filesystem read-only
Use this to observe what happens.
Diabolical test: unmount the kubelet's filesystem
Use this to observe what happens.
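The three filesystem tests might look like this. /var/lib/kubelet is the default kubelet root, the 50G fill size is an arbitrary guess, and the remount steps assume that directory sits on its own mount; adjust all of these to your layout. As before, the wrapper only prints commands unless APPLY=1 is exported.

```shell
#!/bin/sh
# Dry-run wrapper: print each command instead of running it unless APPLY=1 is exported.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi; }

KUBE_FS="${KUBE_FS:-/var/lib/kubelet}"   # default kubelet root; adjust to your layout

# Standard test: fill the filesystem (pick a size that actually fills it).
run fallocate -l 50G "$KUBE_FS/balloon"

# Mean test: remount the backing filesystem read-only.
run mount -o remount,ro "$KUBE_FS"

# Diabolical test: unmount it outright (lazy, since it will be busy).
run umount -l "$KUBE_FS"

# Recovery: remove the balloon file and remount read-write.
run rm -f "$KUBE_FS/balloon"
run mount -o remount,rw "$KUBE_FS"
```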