Forgive me for not including all the cluster information in this ticket; we have several clusters, so this is a general question. Any help would be much appreciated.
We have a Node.js pod that shells out some heavy lifting to a bash script. The problem is that the bash script is being killed with a 137 (out-of-memory) error.
The pod itself is not being killed, and the bash script's memory usage is well below the pod's resource limit.
Googling for an age yielded nothing, as everyone is talking about pods being OOM-killed, but in this case the pod stays up and only the script is killed.
Does anyone have any ideas where I can do some more digging?
Exit code 137 is signal 9 (128 + 9 = 137). Signal 9 is SIGKILL. It's not just OOM that can cause this; anything that can signal your process could be responsible.
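If you want to convince yourself of that mapping, a throwaway shell session shows it (nothing cluster-specific here, just the 128 + signal convention):
sleep 60 &
kill -9 $!    # SIGKILL the background sleep
wait $!       # wait reports the child's exit status
echo $?       # prints 137, i.e. 128 + 9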
If it was OOM, you would usually see something in dmesg, which might give you a clue:
[ +0.000002] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=...,mems_allowed=0,oom_memcg=/kubepods/burstable/...,task_memcg=/kubepods/burstable/pod...
[ +0.000022] Memory cgroup out of memory: Killed process 1463994 ...
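If you don't want to sit watching dmesg, you can also grep the kernel log after the fact (assuming you can get a shell on the node):
sudo dmesg -T | grep -iE 'oom|killed process'
sudo journalctl -k | grep -i oom    # equivalent on systemd-based nodes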
That is defo the issue! Now I have somewhere to dig!
Yeah, so this was an OOM kill: firing up the bash script (as a sub-process) tipped the pod over its total memory allocation, and the kernel's OOM killer on the node terminated the sub-process. I watched dmesg with
sudo dmesg -wH
and then varied the resource limits on our dev cluster and observed the following (a sketch of one way to flip the limit follows this list):
if the main process exceeds limits.memory, the pod is restarted
if a subprocess pushes usage over limits.memory, the subprocess is killed but the pod keeps running
if limits.memory is high enough, the pod is not restarted and the subprocess executes fine
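A quick way to flip the limit between runs, if you want to repeat the experiment; the deployment and container names (my-app, app) are placeholders for whatever you actually run:
kubectl set resources deployment my-app -c app --requests=memory=128Mi --limits=memory=256Mi    # tight limit: the subprocess gets OOM-killed
kubectl set resources deployment my-app -c app --limits=memory=1Gi    # generous limit: the subprocess completes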
The exit code of the process was 137 because that is how cgroup memory violations are handled in the cluster: the kernel's OOM killer sends SIGKILL, and 128 + 9 = 137. Doing a kubectl describe on the pod, or checking its status after it crashed, should have shown that it was killed due to OOM (reason OOMKilled). (You may see 137 for failed health checks too, since the kubelet falls back to SIGKILL if a container doesn't stop after SIGTERM.)
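To check that from kubectl, something like the following works (the pod name my-pod is a placeholder). Note the terminated reason is only populated when the main container process itself was killed; in the subprocess-only case above it stays empty, and dmesg on the node is your main clue:
kubectl describe pod my-pod | grep -A5 'Last State'
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'    # e.g. OOMKilled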
So exit code 137 can be deceptive: it can be an OOM kill, or it could be some foreign process sending a kill signal. You can enable audit logs to see if it's a foreign process.
But in my experience, if the process is being killed in an expected way by Kubernetes (OOM violation, failed health check, etc.), describing the pod should show a corresponding event.
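For completeness, the events for a specific pod can be pulled like this (namespace and pod name are placeholders):
kubectl get events -n my-namespace --field-selector involvedObject.name=my-pod --sort-by=.lastTimestamp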