Pod shows return code 137 - what are the defaults?

Cluster information:

Kubernetes version: 1.24.9
Cloud being used: bare-metal
Installation method:
Host OS: Ubuntu
CNI and version:
CRI and version:

resources:
  limits:
    cpu: '6'
    memory: 30000Mi
  requests:
    cpu: '2'
    memory: 4000Mi

Hello everyone,

I’m on a cluster with two worker nodes, each with 500 GB of memory.
When I try to start a pod with a 40 GB memory request and a 100 GB limit, it fails with return code 137.

The namespace has no limit (LimitRange / ResourceQuota) - as far as I can see - and the worker node has several hundred GB of free memory.
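
For reference, I checked roughly like this (my-namespace is a placeholder for the real namespace):

# list any namespace-level defaults/limits; both come back empty in my case
kubectl get limitrange,resourcequota -n my-namespace

# if a LimitRange existed, this would show its default/defaultRequest values
kubectl describe limitrange -n my-namespace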

Most of the other Pods have no requests / limits, which makes me wonder about possible default values for requested memory. Is it possible that the defaults are a percentage of the host memory?

Where should I look, and for what, to evaluate this issue?

Thanks in advance,
Alex

Exit code 137 is 128 + 9, i.e. the process was killed with signal 9 (SIGKILL) - a hard kill. It could be the OOM killer, but it could be something else, too.
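
As a quick illustration of the 128 + signal convention (a throwaway shell sketch, runnable anywhere):

# kill a subshell with SIGKILL (signal 9) and print its exit status
sh -c 'kill -9 $$'
echo $?   # prints 137, i.e. 128 + 9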

Your example shows a ~4 Gi request (4000Mi), not 40.

Yes, I had to reduce the requests / limits on the deployment/pod to get it running.

Does “OOM killer” mean that my issue is most probably related to the OS (Ubuntu)?
Or do I have to dig deeper into Kubernetes?

I’m a bit confused, since the OS has so much free memory, yet my process is killed / doesn’t start.

If it is OOM, there should be kernel logs (dmesg). OOM kills happen when the container tries to use more than its LIMIT, and they are driven entirely by the kernel.
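
A minimal way to check on the node (assuming shell access to the worker; the grep patterns match typical OOM-killer messages):

# kernel ring buffer, human-readable timestamps
dmesg -T | grep -i -E 'out of memory|oom|killed process'

# or, on systemd hosts, the kernel journal
journalctl -k | grep -i oom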

It’s POSSIBLE that the container runtime is setting up the cgroup wrong, but I would expect many such bug reports very quickly. You can look at the memory* files in the Pod’s cgroup.
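
For example, from inside the container (a sketch; the path depends on whether the node runs cgroup v1 or v2, and mypod/mycontainer are placeholders):

# cgroup v2: the effective memory limit for this container
kubectl exec mypod -c mycontainer -- cat /sys/fs/cgroup/memory.max

# cgroup v1 equivalent
kubectl exec mypod -c mycontainer -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes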

It’s POSSIBLE that your process is spiking usage very quickly and so you are not seeing it.
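
One way to watch for that (assuming metrics-server is installed; its sampling may still be too coarse to catch a fast spike; mypod is a placeholder):

# poll per-container memory usage every second
watch -n1 kubectl top pod mypod --containers

# or sample the cgroup counter directly inside the container (cgroup v2)
kubectl exec mypod -- sh -c 'while true; do cat /sys/fs/cgroup/memory.current; sleep 0.2; done'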

Is it possible that something else is killing the pod? Signal 9 is also delivered when a Pod is deleted in the API.
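
You can usually distinguish the cases from the pod’s status and events (mypod is a placeholder):

# Last State / Reason will say OOMKilled if the kernel OOM killer was involved
kubectl describe pod mypod

# just the termination reason of each container
kubectl get pod mypod -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'

# recent events (evictions, probe failures, deletions)
kubectl get events --sort-by=.lastTimestamp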