Forgive me for not including all the cluster information in this ticket; we have several clusters, so this is a general question. Any help would be much appreciated.
We have a node pod that shells out some heavy lifting to a bash script. The problem is that the bash script is being killed with exit code 137 (out of memory). The pod itself is not being killed, and the bash script's memory usage is well below the pod's resource limit.
Googling for an age yielded nothing, as everyone is talking about pods being OOM-killed; in this case the pod stays up and only the script is killed.
Does anyone have any ideas where I can do some more digging?
Exit code 137 is signal 9 (128 + 9 = 137), and signal 9 is SIGKILL. It's not just OOM that can cause this: anything that can signal your process could cause it.
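As a quick sanity check, you can reproduce the 128 + signal arithmetic locally without Kubernetes at all:

```shell
# A process killed by SIGKILL (signal 9) reports exit status 128 + 9 = 137,
# which is exactly what Kubernetes surfaces for OOM-killed processes.
bash -c 'kill -9 $$'   # the child sends SIGKILL to itself
echo $?                # prints 137
```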
If it was OOM, you would usually see something in dmesg, which might give you a clue:
[ +0.000002] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=...,mems_allowed=0,oom_memcg=/kubepods/burstable/...,task_memcg=/kubepods/burstable/pod...
[ +0.000022] Memory cgroup out of memory: Killed process 1463994 ...
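If you don't want to watch the log live, you can grep the kernel ring buffer for cgroup OOM kills after the fact (a sketch; the exact log wording can vary by kernel version):

```shell
# Look for memory-cgroup OOM kills in the kernel log; the
# "Memory cgroup out of memory: Killed process ..." line names the victim PID.
sudo dmesg | grep -iE 'oom-kill|out of memory'
```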
That is defo the issue! Now I have somewhere to dig!
Yeah, so this was an OOM: firing up bash as a sub-process tipped the pod over its total memory allocation, and the kernel's OOM killer terminated the sub-process. I watched dmesg with
sudo dmesg -wH
and then varied the resource limits on our dev cluster. I observed that:
- if the main process exceeds limits.memory, the pod is restarted
- if a sub-process pushes usage over limits.memory, the sub-process is killed but the pod keeps running
- if limits.memory is high enough, the pod is not restarted and the sub-process executes OK
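For anyone hitting the same thing, this is the kind of limit we were tuning. A minimal pod-spec sketch (the names, image, and sizes are illustrative, not our real manifest); the key point is that limits.memory bounds the whole container cgroup, so the main process and any sub-processes it forks share one budget:

```yaml
# Illustrative pod spec fragment: main process + forked bash script
# are accounted together against limits.memory.
apiVersion: v1
kind: Pod
metadata:
  name: worker            # hypothetical name
spec:
  containers:
    - name: app
      image: node:20      # hypothetical image
      resources:
        requests:
          memory: "512Mi"
        limits:
          memory: "1Gi"   # must cover main process + sub-process peak combined
```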