We have a cluster where all the API server pods are failing regularly (around 400 restarts each).
When I get a description of the pods, I get something like:
State:          Running
  Started:      Mon, 03 Apr 2023 03:00:39 +0200
Last State:     Terminated
  Reason:       Error
  Exit Code:    137
  Started:      Mon, 06 Mar 2023 03:00:41 +0100
  Finished:     Mon, 03 Apr 2023 03:00:38 +0200
Ready:          True
Restart Count:  450
Requests:
  cpu:  250m
Liveness:   http-get https://*.*.*.*:6443/livez delay=10s timeout=15s period=10s #success=1 #failure=8
Readiness:  http-get https://*.*.*.*:6443/readyz delay=0s timeout=15s period=1s #success=1 #failure=3
Startup:    http-get https://*.*.*.*:6443/livez delay=10s timeout=15s period=10s #success=1 #failure=30
When looking on the web, it seems that this exit code means the container was killed (exit code 137 = 128 + 9, i.e. SIGKILL).
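To double-check that reading, the exit code can be decoded in any shell. The value 137 is taken from the pod status above; the 128-offset convention ("terminated by signal code - 128") is standard, nothing here is specific to our cluster:

```shell
# Exit codes above 128 mean the process was terminated by signal (code - 128).
code=137
sig=$((code - 128))
echo "$sig"        # signal number
kill -l "$sig"     # signal name: KILL, i.e. SIGKILL
```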
However, if the kernel's OOM killer were responsible, the reason should be OOMKilled, whereas in our case the reason reported is Error. Furthermore, there is no trace on the nodes of any process having been OOM-killed.
How can I troubleshoot that further?
Have the pods been killed because of the Liveness/Readiness probes?
Thanks in advance for any pointer.