Forgive me for not including all the cluster information in this ticket; we have several clusters, so this is a general question. Any help would be much appreciated.
We have a node pod that shells out some heavy lifting to a bash script. The problem is that the bash script is being killed with exit code 137 (out of memory).
The node pod itself is not being killed, and the bash script’s memory usage is well below the pod’s resource limit.
Googling for an age yielded nothing, as everyone is talking about pods being OOM-killed; in our case the pod stays up and only the script is killed.
Does anyone have any ideas where I can do some more digging?
Exit code 137 is signal 9 (128 + 9 = 137), and signal 9 is SIGKILL. It’s not just the OOM killer that can cause this - anything that can signal your process could be responsible.
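You can see the 128 + signal convention locally with plain bash, no cluster needed - kill a background process with SIGKILL and check the status that wait reports:

```shell
#!/usr/bin/env bash
# Start a long-running process in the background, SIGKILL it,
# and show that its exit status is 128 + 9 = 137.
sleep 30 &
pid=$!
kill -9 "$pid"
wait "$pid"
echo "exit status: $?"   # prints: exit status: 137
```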
If it was OOM, you would usually see something in
dmesg, which might give you a clue.
Thanks for the info
[ +0.000002] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=...,mems_allowed=0,oom_memcg=/kubepods/burstable/...,task_memcg=/kubepods/burstable/pod...
[ +0.000022] Memory cgroup out of memory: Killed process 1463994 ...
That is defo the issue! Now I have somewhere to dig!
I will report back.
Yeah, so this was an OOM kill: firing up the bash script (as a sub-process) tipped the pod over its total memory allocation, and the kernel OOM killer on the node terminated the sub-process. I watched dmesg with
sudo dmesg -wH
and then varied the resource limits on our dev cluster and observed that
- if the main process exceeds limits.memory, the pod is restarted
- if a subprocess pushes total usage over limits.memory, the subprocess is killed but the pod remains running
- if limits.memory is high enough, the pod is not restarted and the sub-process executes ok
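For anyone landing here later, the knob I was varying is the container’s memory limit in the pod spec. A sketch of where it lives (the names and values here are made up for illustration, not from our actual manifests):

```yaml
# Hypothetical pod spec sketch: the bash sub-process runs in this
# container's memory cgroup, so its usage counts against limits.memory.
apiVersion: v1
kind: Pod
metadata:
  name: heavy-lifter          # illustrative name
spec:
  containers:
    - name: app
      image: node:20          # illustrative image
      command: ["node", "server.js"]
      resources:
        requests:
          memory: "256Mi"
        limits:
          memory: "1Gi"       # main process + bash sub-process must both fit under this
```

The key point from the experiments above: the limit applies to the cgroup as a whole, not per process, so the sub-process can be the one killed when the combined usage crosses it.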