Forgive me for not including all the cluster information in this ticket; we have several clusters, so this is a general question. Any help would be much appreciated.
We have a Node.js pod that shells out some heavy lifting to a bash script. The problem is that the bash script is being killed with a 137 (out-of-memory) error.
The pod itself is not being killed, and the bash script's memory usage is well below the pod's resource limit.
Googling for an age yielded nothing, as everyone is talking about pods being OOM-killed, but in this case the pod stays up and only the script is killed.
Does anyone have any ideas where I can do some more digging?
Exit code 137 is signal 9 (128 + 9 = 137). Signal 9 is SIGKILL. It's not just OOM that can cause this; anything that can signal your process could be responsible.
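If you want to convince yourself of that mapping, a throwaway shell session shows it (nothing cluster-specific here, just the 128 + signal convention):
sleep 60 &
kill -9 $!    # SIGKILL the background sleep
wait $!       # wait reports the child's exit status
echo $?       # prints 137, i.e. 128 + 9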
If it was OOM, you would usually see something in dmesg, which might give you a clue:
[ +0.000002] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=...,mems_allowed=0,oom_memcg=/kubepods/burstable/...,task_memcg=/kubepods/burstable/pod...
[ +0.000022] Memory cgroup out of memory: Killed process 1463994 ...
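If you don't want to sit watching dmesg, you can also grep the kernel log after the fact (assuming you can get a shell on the node):
sudo dmesg -T | grep -iE 'oom|killed process'
sudo journalctl -k | grep -i oom    # equivalent on systemd-based nodes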
That is defo the issue! Now I have somewhere to dig!
Yeah, so this was an OOM kill: firing up the bash script (as a sub-process) tipped the pod over its total memory allocation, and the kernel's OOM killer on the node terminated the sub-process. I watched dmesg with
sudo dmesg -wH
and then varied the resource limits on our dev cluster and observed the following (a sketch of one way to flip the limit follows this list):
if the main process exceeds limits.memory, the pod is restarted
if a subprocess pushes usage over limits.memory, the subprocess is killed but the pod keeps running
if limits.memory is high enough, the pod is not restarted and the subprocess executes fine
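A quick way to flip the limit between runs, if you want to repeat the experiment; the deployment and container names (my-app, app) are placeholders for whatever you actually run:
kubectl set resources deployment my-app -c app --requests=memory=128Mi --limits=memory=256Mi    # tight limit: the subprocess gets OOM-killed
kubectl set resources deployment my-app -c app --limits=memory=1Gi    # generous limit: the subprocess completes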
The exit code of the process was 137 because that is how cgroup memory violations are handled in the cluster: the kernel's OOM killer sends SIGKILL, and 128 + 9 = 137. Doing a kubectl describe on the pod, or checking its status after it crashed, should have shown that it was killed due to OOM (reason OOMKilled). (You may see 137 for failed health checks too, since the kubelet falls back to SIGKILL if a container doesn't stop after SIGTERM.)
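To check that from kubectl, something like the following works (the pod name my-pod is a placeholder). Note the terminated reason is only populated when the main container process itself was killed; in the subprocess-only case above it stays empty, and dmesg on the node is your main clue:
kubectl describe pod my-pod | grep -A5 'Last State'
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'    # e.g. OOMKilled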
So exit code 137 can be deceptive: it can be an OOM kill, or it could be some foreign process sending a kill signal. You can enable audit logs to see if it's a foreign process.
But in my experience, if the process is being killed in an expected way by Kubernetes (OOM violation, failed health check, etc.), describing the pod should show a corresponding event.
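For completeness, the events for a specific pod can be pulled like this (namespace and pod name are placeholders):
kubectl get events -n my-namespace --field-selector involvedObject.name=my-pod --sort-by=.lastTimestamp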