We have a cluster where all the API server pods are failing regularly (around 400 restarts each).
When I get a description of the pods, I get something like:
State:          Running
  Started:      Mon, 03 Apr 2023 03:00:39 +0200
Last State:     Terminated
  Reason:       Error
  Exit Code:    137
  Started:      Mon, 06 Mar 2023 03:00:41 +0100
  Finished:     Mon, 03 Apr 2023 03:00:38 +0200
Ready:          True
Restart Count:  450
Requests:
  cpu:  250m
Liveness:   http-get https://*.*.*.*:6443/livez delay=10s timeout=15s period=10s #success=1 #failure=8
Readiness:  http-get https://*.*.*.*:6443/readyz delay=0s timeout=15s period=1s #success=1 #failure=3
Startup:    http-get https://*.*.*.*:6443/livez delay=10s timeout=15s period=10s #success=1 #failure=30
When looking on the web, it seems that this exit code means the container was killed (exit code 137 = 128 + 9, i.e. SIGKILL).
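To double-check that reading, the exit code can be decoded in any shell. The value 137 is taken from the pod status above; the 128-offset convention ("terminated by signal code - 128") is standard, nothing here is specific to our cluster:

```shell
# Exit codes above 128 mean the process was terminated by signal (code - 128).
code=137
sig=$((code - 128))
echo "$sig"        # signal number
kill -l "$sig"     # signal name: KILL, i.e. SIGKILL
```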
However, if the kernel's OOM killer were responsible, the reason should be OOMKilled, whereas in our case the reason reported is Error. Furthermore, there is no trace on the nodes of any process having been OOM-killed.
How can I troubleshoot that further?
Have the pods been killed because of the Liveness/Readiness probes?
Thanks in advance for any pointer.