Cluster information:
Kubernetes version: 1.24.9
Cloud being used: bare-metal
Installation method:
Host OS: Ubuntu
CNI and version:
CRI and version:
resources:
  limits:
    cpu: '6'
    memory: 30000Mi
  requests:
    cpu: '2'
    memory: 4000Mi
Hello everyone,
I’m on a cluster with two worker nodes, each with 500 GB of memory.
When I try to start a pod with a 40 GB memory request and a 100 GB limit, it exits with return code 137.
As far as I can see, the namespace has no limit, and the worker node has several hundred GB of free memory.
Most of the other Pods have no requests or limits, which makes me wonder about possible default values for requested memory. Is it possible that the defaults are a percentage of the host memory?
Where should I look to diagnose the issue?
Thanks in advance,
Alex
Exit code 137 is 128 + 9, i.e. signal 9 (SIGKILL), meaning a hard kill. It could be the OOM killer (the kernel's out-of-memory handler), but it could be something else, too.
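
As a small illustration of that arithmetic, here is a sketch that decodes an exit code using Python's standard `signal` module. The helper name `describe_exit_code` is made up for this example, and it assumes a Linux host and the usual "128 + signal number" convention for reported exit codes.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a readable description."""
    if code > 128:
        # By convention, codes above 128 mean "terminated by signal (code - 128)".
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "an unknown signal"
        return f"killed by {name} (signal {signum})"
    return f"exited on its own with status {code}"


print(describe_exit_code(137))
```

For 137 this reports SIGKILL, which tells you the process was killed from outside but not who sent the signal.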
Your example shows a 4000Mi (roughly 4 Gi) request, not 40.
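
To make the units concrete, here is a rough sketch that converts the two memory quantities from the posted spec into bytes and GiB. The helper `quantity_to_bytes` is hypothetical and only handles integer values with binary suffixes (Ki, Mi, Gi, Ti), not the full Kubernetes quantity syntax.

```python
# Binary (power-of-two) suffixes used by Kubernetes memory quantities.
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def quantity_to_bytes(quantity: str) -> int:
    """Convert a quantity such as '4000Mi' into a byte count."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already a plain byte count


# Values taken from the resources block in the original post.
for label, quantity in (("request", "4000Mi"), ("limit", "30000Mi")):
    size = quantity_to_bytes(quantity)
    print(f"{label}: {quantity} = {size} bytes = {size / 2**30:.2f} GiB")
```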
Yes, I had to reduce the request and limit of the deployment/pod to get it running.
Does the OOM handler being involved mean that my issue is most likely related to the OS (Ubuntu)?
Or do I have to dig deeper into k8s?
I’m a bit confused, since the OS has so much free memory, yet my process is killed or doesn’t start.
If it is OOM, there should be kernel logs (dmesg). An OOM kill happens when the container tries to use more than its LIMIT, and it is driven entirely by the kernel.
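
One way to search the kernel log for such kills is sketched below. It assumes the "Out of memory: Killed process ..." and "Memory cgroup out of memory: Killed process ..." wording printed by recent kernels; older kernels phrase these lines differently, so treat the pattern as a starting point. The function name `find_oom_kills` is made up for this example.

```python
import re

# Matches node-wide and cgroup-scoped OOM kill lines (wording assumed from recent kernels).
OOM_KILL = re.compile(
    r"(Memory cgroup out of memory|Out of memory): Killed process (\d+) \(([^)]+)\)"
)


def find_oom_kills(log_text: str) -> list:
    """Return (scope, pid, command) tuples for every OOM kill found in the log text."""
    kills = []
    for line in log_text.splitlines():
        match = OOM_KILL.search(line)
        if match:
            # "cgroup" means a container limit was hit; "node" means the host ran out.
            scope = "cgroup" if match.group(1).startswith("Memory cgroup") else "node"
            kills.append((scope, int(match.group(2)), match.group(3)))
    return kills
```

You could feed it the output of `dmesg` captured on the worker node, for example via `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`, which may require root. A "cgroup" hit points at the container's limit, while a "node" hit means the host itself ran out of memory.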
It’s POSSIBLE that the container runtime is setting up the cgroup wrong, but I would expect many such bug reports very quickly. You can look at the memory* files in the Pod’s cgroup.
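
A minimal sketch for dumping those files follows. It assumes you have already located the Pod's cgroup directory on the node (the path depends on the cgroup version, the cgroup driver, and the runtime, so it is left as a parameter), and it reads the `oom_kill` counter from `memory.events`, which exists only on cgroup v2. Both function names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def read_cgroup_memory(cgroup_dir: str) -> dict:
    """Return the contents of every readable memory* file in a cgroup directory."""
    values = {}
    for path in sorted(Path(cgroup_dir).glob("memory*")):
        if not path.is_file():
            continue
        try:
            values[path.name] = path.read_text().strip()
        except OSError:
            # Some control files cannot be read; skip them.
            continue
    return values


def oom_kill_count(values: dict) -> Optional[int]:
    """Extract the oom_kill counter from memory.events (cgroup v2), if present."""
    for line in values.get("memory.events", "").splitlines():
        key, _, count = line.partition(" ")
        if key == "oom_kill":
            return int(count)
    return None
```

On cgroup v2, comparing `memory.max` with the limit in the Pod spec shows whether the runtime applied the value you expect, and a non-zero `oom_kill` count confirms that the kernel killed something in that cgroup.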
It’s POSSIBLE that your process is spiking usage very quickly and so you are not seeing it.
Is it possible that something else is killing the pod? Signal 9 is also delivered when a Pod is deleted in the API and its containers do not exit within the termination grace period.