We have a cluster where all the API server pods are restarting regularly (about 400 restarts each).
When I describe the pods, I get something like:
State:          Running
  Started:      Mon, 03 Apr 2023 03:00:39 +0200
Last State:     Terminated
  Reason:       Error
  Exit Code:    137
  Started:      Mon, 06 Mar 2023 03:00:41 +0100
  Finished:     Mon, 03 Apr 2023 03:00:38 +0200
Ready:          True
Restart Count:  450
Requests:
  cpu:  250m
Liveness:   http-get https://*.*.*.*:6443/livez delay=10s timeout=15s period=10s #success=1 #failure=8
Readiness:  http-get https://*.*.*.*:6443/readyz delay=0s timeout=15s period=1s #success=1 #failure=3
Startup:    http-get https://*.*.*.*:6443/livez delay=10s timeout=15s period=10s #success=1 #failure=30
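For reference, this is roughly how I'm pulling the restart counts and last-terminated state across all the API server pods (assuming the usual kubeadm-style component=kube-apiserver label in kube-system; adjust the label/namespace if your control plane is set up differently):

# Name, restart count, and reason/exit code of the last termination for each kube-apiserver pod
kubectl -n kube-system get pods -l component=kube-apiserver -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\t"}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}{end}'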
From what I've found on the web, it seems that this exit code means the process was killed (exit code 137 = 128 + 9, i.e. SIGKILL). In that case, however, the reason should be OOMKilled, whereas in our case the reason is Error. Furthermore, there is no trace on the nodes of any process having been OOM killed.
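For what it's worth, this is the kind of thing I looked at on the control-plane nodes and at the cluster level without finding any OOM activity (the date filter is just an example; the events check only works if the events haven't expired yet):

# Kernel log on the node: any OOM killer activity around the restart times?
journalctl -k --since "2023-04-02" | grep -iE "out of memory|oom"
# Same check straight from the kernel ring buffer, in case journald isn't keeping kernel messages
dmesg -T | grep -i oom
# Cluster events emitted by the kubelet when it observes an OOM kill on the node
kubectl get events -A --field-selector reason=OOMKilling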
How can I troubleshoot that further?
Have the pods been killed because of the Readiness / Liveness probes?
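For example, would grepping the kubelet logs on the affected control-plane nodes for probe failures be the right way to confirm this? Something along these lines (assuming a systemd-managed kubelet; the date is again just an example):

# Probe failures and the container kills they trigger, as logged by the kubelet
journalctl -u kubelet --since "2023-04-02" | grep -iE "liveness probe failed|startup probe failed|killing container"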
Thanks in advance for any pointers.