Node "not ready" state when sum of all running pods exceed node capacity

I have 5 nodes running in a k8s cluster with around 30 pods.
Some of the pods usually take a lot of memory. At one stage we found that a node went to “not ready” state when the sum of memory of all running pods exceeded the node’s memory. I have since increased the memory request to a higher value for the high-memory pods, but shouldn’t the node controller kill and restart all the pods instead of putting the node into “not ready” state?
Suppose 4 pods were already running on a node and the scheduler allowed another pod onto that node because its memory request fit within the node’s remaining capacity. Over a period of time, for some reason, the memory of all the pods starts increasing; each pod is still under its individual memory limit, but the sum of all pods’ memory exceeds the node’s memory, and this puts the node into “not ready” state.
Is there any way to overcome this situation?
Because of this, all the pods get shifted to another node, or some go to Pending because they have a higher resource request value.

Please help me understand how to handle this.

Cluster information:

Kubernetes version: 1.10.6
Cloud being used: AWS
Installation method:
Host OS:
CNI and version:
CRI and version:

Not sure I follow. Pod mem usage is RSS mem usage?

You are running with swap disabled, right? In that case, I don’t see how processes (pods) can use more mem than the node has.

Am I missing something? Is it possible that some other thing happened?

Thanks Rata for the quick response.
Not sure I follow. Pod mem usage is RSS mem usage?

  • Yes, pod mem usage is RSS mem usage.

You are running with swap disabled, right? In that case, I don’t see how processes (pods) can use more mem than the node has.

  • Yes, running with swap disabled. Suppose 3 pods are running on a 32 GB node, each currently using 8 GB of RSS, with a memory limit of 10 GB each. Now a new pod is added with a memory request of 1 GB and a memory limit of 10 GB, and all 4 pods run fine for a while. After a while, the RSS of the new pod starts increasing and reaches 8 GB, so the total RSS of all pods is 32 GB, which equals the node’s memory capacity. At this stage, if any pod’s memory grows further (while still below its 10 GB limit), the node goes into “not ready” state. This is what I was seeing, and the scenario I am speculating about. If the pods’ memory tries to exceed the node’s memory, what should ideally happen? It looks to me like this situation puts the node into “not ready” state instead of restarting the pods. (The new pod’s resources are sketched below.)
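For illustration, the new pod’s resources look roughly like this (the names and image are placeholders, not our actual spec):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-burstable-pod      # placeholder name
spec:
  containers:
  - name: app
    image: example/app:latest      # placeholder image
    resources:
      requests:
        memory: "1Gi"              # what the scheduler uses to place the pod
      limits:
        memory: "10Gi"             # per-container cap, well above the request
```

The scheduler only looks at the requests when placing pods, so the sum of the limits (4 × 10 GB) is allowed to exceed the node’s 32 GB.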

Ohh, I see now. Thanks!

If that happens, the kernel OOM killer should kill processes to bring memory usage on the node back to normal. That is the “ideal” behaviour (because it is, of course, neither perfect nor very nice). Do you see output in the kernel logs (or dmesg, if the node hasn’t rebooted) that says OOM or similar?
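Something like this on the affected node should show it (the exact wording varies by kernel version, and -T depends on your dmesg build):

```sh
dmesg -T | grep -i -E 'out of memory|oom'
journalctl -k | grep -i oom    # if the node runs systemd-journald
```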

Of course, processes can have different scores and adjustments for the OOM killer, and you might be in a weird situation where that is not very effective? Sounds weird, though.
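If you want to check that, the scores the kernel uses are visible under /proc, and as far as I remember the kubelet sets oom_score_adj based on the pod’s QoS class (the PID here is a placeholder for a container’s main process):

```sh
cat /proc/<pid>/oom_score       # score the OOM killer would currently use
cat /proc/<pid>/oom_score_adj   # adjustment, set per QoS class by the kubelet
```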

What happens after the node has been not ready for a while? Does it recover? Maybe the OOM killer is starting late or something, but after a while it kills some processes, frees RAM, and the node recovers?

If those processes are killed, will they restart with a small amount of RAM usage, or grow big fast? If they grow big fast, they could be killed and restarted again very soon, and the node never recovers?

And just curious: can you reproduce this if you write a simple program that uses 32 GB and run it on the node? (Like, just a malloc, and then write something to every byte so it is actually allocated by the kernel.) You can try with 30/32/34 GB, something close to the node’s memory capacity.
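Something like this is what I have in mind (just a rough sketch, the GiB count is passed as an argument):

```c
/* memhog.c: allocate N GiB and touch every byte so the kernel really backs it.
 * Build: gcc -O2 memhog.c -o memhog    Run: ./memhog 30
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    size_t gib = (argc > 1) ? strtoul(argv[1], NULL, 10) : 32;
    size_t chunk = 1UL << 30;                /* allocate 1 GiB at a time */

    for (size_t done = 0; done < gib; done++) {
        char *p = malloc(chunk);
        if (!p) {
            fprintf(stderr, "malloc failed after %zu GiB\n", done);
            return 1;
        }
        memset(p, 0xAA, chunk);              /* write every byte so it is really allocated */
        /* intentionally never freed: the memory should stay resident */
    }
    printf("allocated and touched %zu GiB\n", gib);
    pause();                                 /* keep it resident until killed */
    return 0;
}
```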

Now, regarding the Kubernetes layer: why do you have such a big difference between request and limit? Is that on purpose?

I am trying to understand your situation better.

The node has 32 GB in total.

Pods 1 to 3 each have the following request & limit:
8 GB request & 10 GB limit

The 4th pod has:
1 GB request & 10 GB limit

When the 4th pod’s memory reaches 8 GB, the node goes into Not Ready state?

Questions:

  1. Is the above summary of your situation right?
  2. How do you know the node crashes when the 4th pod’s memory reaches 8 GB?

Yes, the situation as you described it is right.

I am seeing that the node is in not ready state and the pods are moved to another node, and now that node is under pressure.

  1. Can you do a describe for that node in both the working and the non-working state? (Example commands below.)
  2. How do you know that the node crashes when pod 4 reaches 8 GB?
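For example (the node name is a placeholder):

```sh
kubectl get nodes -o wide
kubectl describe node <node-name>
```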

The node is in running state as seen in the AWS console, but k8s is not running any pods on it.

I have a Datadog alert set up for each pod.

  1. Can you do a describe on the node and paste the output?
  2. Can you also shed some light on the metric you are alerting on?

Unfortunately I can’t provide the describe node output now, as we have restarted the k8s cluster with 2 additional nodes and increased the memory request to a higher value for some memory-intensive pods.

You can see this:

https://medium.com/retailmenot-engineering/what-happens-when-a-kubernetes-pod-uses-too-much-memory-or-too-much-cpu-82165022f489

Here he has only one pod running on a node, and it still puts the node into NotReady state.

Yes, I had seen the OOM error in the log, but it put the node into NotReady state. Ideally, it should have killed and restarted the pods.
I have now increased the memory request value for the memory-intensive pods, but I am trying to fix this at the Kubernetes config level.
Do you know how we can make the OOM killer trigger faster, so that pods get killed and the node at least does not go to NotReady state?

Oh, if you are seeing the OOM, it makes sense to me. But is the node able to become ready again after some time?

Please have a look at this documentation, which explains this a little more and lists some best practices: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#node-oom-behavior

I think that seems very useful, and maybe the kubelet flags can do the trick :slight_smile:
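For example, something along these lines in the kubelet arguments (the values here are made up and would need tuning for your nodes):

```
--eviction-hard=memory.available<500Mi
--eviction-soft=memory.available<1.5Gi
--eviction-soft-grace-period=memory.available=1m30s
--system-reserved=memory=1Gi
--kube-reserved=memory=1Gi
```

The reserved flags keep some memory aside for the system and Kubernetes daemons, and the eviction thresholds should let the kubelet evict pods before the kernel OOM killer has to step in.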

Thanks. This is a great help. I will also try to upgrade my k8s version from 1.10.6 to 1.16.0.

Please keep us posted on your findings!

If the situation is still not good, we can try to open an issue and even create some patches if we come up with ideas to improve it :slight_smile:

I’ve not seen that myself (the OOM killer takes out the big memory-consuming process in my clusters), but I’m definitely interested to see what is possible if your cluster setup has different requirements and things don’t work out so well there.