Subprocess Killed with a 137 error

Hi All,

Forgive me for not including all the cluster information in this ticket, we have several clusters so this is a general question, any help would be much appreciated.

We have a node pod that shells out some heavy lifting to a bash script. The problem is that the bash script is being killed with a 137 (memory) error.

The node pod itself is not being killed, and the bash script's memory usage is well below the pod's resource limit.

Googling for an age yielded nothing, as everyone is talking about pods being OOM-killed, but in this case the pod stays up and only the script is killed.

Does anyone have any ideas where I can do some more digging?

Many thanks!

Exit code 137 is signal 9 (128 + 9 = 137). Signal 9 is SIGKILL. It's not just OOM that can cause this; anything that can signal your process could cause it.
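
You can see the 128 + 9 arithmetic with nothing but a bash shell, no cluster involved:

    # start a long-running child, SIGKILL it, and inspect the exit code
    sleep 60 &
    kill -9 $!
    wait $!
    echo $?    # prints 137, i.e. 128 + 9 (SIGKILL)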

If it was OOM, you would usually see something in dmesg, which might give you a clue.
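
If you can get a shell on the node, something along these lines (assuming dmesg is readable there) will surface recent OOM-killer activity:

    # look for OOM-killer entries in the kernel ring buffer, with human-readable timestamps
    sudo dmesg -T | grep -iE 'oom-kill|out of memory|killed process'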

Thanks for the info :slight_smile:

dmesg yielded:

[  +0.000002] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=...,mems_allowed=0,oom_memcg=/kubepods/burstable/...,task_memcg=/kubepods/burstable/pod...

[ +0.000022] Memory cgroup out of memory: Killed process 1463994 ...

That is defo the issue! Now I have somewhere to dig!

I will report back.

Thanks again!

Yeah, so this was an OOM: firing up the bash script (as a sub-process) tipped the pod over its total memory allocation, and the kernel's OOM killer terminated the sub-process. I watched dmesg with

sudo dmesg -wH

and then varied the resource limits on our dev cluster and observed that

  • if the main process exceeds limits.memory, the pod is restarted
  • if a sub-process pushes total memory usage over limits.memory, the sub-process is killed but the pod remains running (see the sketch after this list)
  • if limits.memory is high enough, the pod is not restarted and the sub-process executes OK
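
A rough, hypothetical way to reproduce the middle case (this is not our real workload, just an illustrative memory hog): the parent shell stays small while a child keeps allocating until the container's limits.memory is hit, at which point the kernel OOM-kills the child.

    #!/usr/bin/env bash
    # Hypothetical repro helper: the child grows until the cgroup memory
    # limit is hit and the kernel OOM-kills (SIGKILLs) it, while the
    # parent shell keeps running.
    (
      data=""
      while true; do
        data+=$(head -c 1M /dev/zero | tr '\0' x)   # grow by roughly 1 MiB per loop
      done
    ) &
    wait $!
    echo "sub-process exit code: $?"   # expect 137 once limits.memory is exceeded

Run inside a container with a deliberately small limits.memory and the middle bullet should reproduce: the child exits 137 and shows up in dmesg, while the pod stays Running. In practice the kernel picks the biggest consumer in the cgroup, which is the child here.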

Thanks again :slight_smile:

The exit code of the process was 137 because that is how cgroup memory violations are handled in the cluster: the offending process gets SIGKILLed. Doing a describe on the pod, or checking its status after it crashed, should have shown that it was killed due to OOM. (You may see 137 for failed health checks too.)
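
For example, with <pod-name> as a placeholder, either of these shows the last termination reason, which reads OOMKilled when the container itself was the one taken down:

    # inspect the last terminated state of the pod's containers
    kubectl describe pod <pod-name> | grep -iA3 'last state'

    # or pull the reason straight out of the status
    kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'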

So exit code 137 can be deceptive: it can be an OOM kill, or it could be some foreign process sending a kill signal. You can enable audit logs to see if it's a foreign process.
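
If that means auditing at the node level, the Linux audit subsystem (auditd) is what records kill() syscalls rather than the Kubernetes API audit log; assuming auditd is installed on the node, a rough sketch would be:

    # record every kill() syscall on the node, tagged for searching
    sudo auditctl -a always,exit -F arch=b64 -S kill -k kill_signals

    # later, see which processes sent signals and to which PIDs
    sudo ausearch -k kill_signals -i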

But in my experience, if the process is being killed in an expected way by Kubernetes (OOM violation, failed health check, etc.), describing the pod should show a corresponding event.
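
For example, with <pod-name> and <namespace> as placeholders, the events recorded for a specific pod can be listed with:

    # events for a specific pod, oldest first
    kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp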