I have 5 nodes running in a k8s cluster, with around 30 pods.
Some of the pods usually use a lot of memory. At one stage we found that a node went to the “NotReady” state when the sum of the memory used by all running pods exceeded the node’s memory. I have since increased the memory request to a higher value for the memory-hungry pods, but shouldn’t the node controller kill and restart all the pods instead of putting the node into the “NotReady” state?
Suppose 4 pods were already running on a node and the scheduler allowed another pod onto that node because its memory request fit within the node’s remaining capacity. Over time, for some reason, the memory usage of every pod keeps growing; each pod is still under its individual memory limit, but the sum of all the pods’ memory exceeds the node’s memory, and this pushes the node into the “NotReady” state.
Is there any way to overcome this situation?
Because of this, all the pods get shifted to other nodes, or some pods stay Pending because they have a higher resource request value.
Please help me understand how to handle this.
Cluster information:
Kubernetes version: 1.10.6
Cloud being used: AWS
Installation method:
Host OS:
CNI and version:
CRI and version:
Thanks Rata for the quick response. Not sure I follow. Is pod mem usage the RSS mem usage?
Yes, pod mem usage is RSS mem usage.
You are running with swap disabled, right? In that case, I don’t see how processes (pods) can use more memory than the node has.
Yes, running with swap disabled. Suppose 3 pods are running on a 32 GB node, each currently using 8 GB of RSS memory, and each with a memory limit of 10 GB. Now a new pod is added with a memory request of 1 GB and a memory limit of 10 GB, and all 4 pods run fine for a while. After a while, the RSS memory of the new pod starts increasing and reaches 8 GB. The total RSS memory of the pods is now 32 GB, which equals the node’s memory capacity. At this stage, if any pod’s memory grows further (while still below its 10 GB limit), the node goes into the “NotReady” state. This is what I was seeing, and this is the scenario I am speculating about. If the pods’ memory tries to exceed the node’s capacity, what should ideally happen? It looks to me like this situation puts the node into the “NotReady” state instead of restarting the pods.
If that happens, the kernel OOM killer should kill processes to bring memory usage on the node back to normal. That is the “ideal” outcome (because it is, of course, neither perfect nor very nice). Do you see output in the kernel logs (or in dmesg, if you haven’t rebooted) that says OOM or similar?
Of course, processes can have different scores and adjustments for the OOM killer, and you might be in a strange situation where that makes it not very effective? Sounds odd, though.
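To make “scores and adjustments” concrete: the kernel exposes each process’s OOM score and its adjustment under /proc, and the kubelet sets the adjustment based on a pod’s QoS class. Below is a small diagnostic sketch of my own (not from this thread) that prints both values for a given PID.

```c
/* oom_inspect.c - print a process's OOM score and adjustment.
 * The kernel OOM killer prefers victims with a higher oom_score;
 * oom_score_adj shifts that score (kubelet sets it per pod QoS class).
 * Usage: ./oom_inspect <pid>   (defaults to "self")
 */
#include <stdio.h>

static void print_proc_value(const char *pid, const char *file)
{
    char path[128], buf[64];
    snprintf(path, sizeof path, "/proc/%s/%s", pid, file);
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        perror(path);
        return;
    }
    if (fgets(buf, sizeof buf, f) != NULL)
        printf("%s: %s", file, buf);   /* buf already ends with '\n' */
    fclose(f);
}

int main(int argc, char **argv)
{
    const char *pid = (argc > 1) ? argv[1] : "self";
    print_proc_value(pid, "oom_score");
    print_proc_value(pid, "oom_score_adj");
    return 0;
}
```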
What happens a while after the node goes NotReady? Does it recover? Maybe the OOM killer is starting late or something, but after a while it kills some processes, frees RAM, and the node recovers?
If those processes are killed, will they restart with a small amount of RAM usage, or grow big quickly? If they grow big quickly, they could be killed and restarted again very soon, and the node never recovers?
And just curious: can you reproduce this if you write a simple program that uses 32 GB and run it on the node? (Like, just a malloc, then write something to every byte so it is actually allocated by the kernel.) You can try with 30/32/34 GB, something close to the node’s memory capacity.
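For reference, here is a minimal sketch of the kind of test program described above (the file name, chunk size, and fill pattern are my own choices): it mallocs memory in 1 GiB chunks and writes to every byte so the kernel actually backs the allocation with physical pages.

```c
/* mem_fill.c - allocate N GiB and touch every byte so the pages are
 * actually committed by the kernel (not just reserved virtual memory).
 * Usage: ./mem_fill 32      (try values close to the node's capacity)
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    size_t gib   = (argc > 1) ? strtoull(argv[1], NULL, 10) : 32;
    size_t total = gib << 30;        /* total bytes to allocate */
    size_t chunk = (size_t)1 << 30;  /* allocate 1 GiB at a time */

    for (size_t done = 0; done < total; done += chunk) {
        char *p = malloc(chunk);
        if (p == NULL) {
            fprintf(stderr, "malloc failed after %zu GiB\n", done >> 30);
            return 1;
        }
        memset(p, 0xA5, chunk);      /* write every byte => pages get allocated */
        printf("allocated and touched %zu GiB\n", (done + chunk) >> 30);
        fflush(stdout);
        /* intentionally never freed: keep the memory resident */
    }

    puts("done; holding memory, watch dmesg for OOM killer activity");
    for (;;)
        pause();                     /* sleep until killed */
}
```

Running it on the node with 30, 32, and 34 as the argument should show whether the OOM killer kicks in or the node goes NotReady first.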
Now, regarding the Kubernetes layer: why do you have such a big difference between request and limit? Is that on purpose?
Unfortunately I can’t provide the describe node output now, as we have restarted the k8s cluster with 2 additional nodes and increased the memory request to a higher value for some memory-intensive pods.
Yes, I had seen OOM errors in the log, but the node still went to the NotReady state. Ideally, it should have killed the pods so they could restart.
I have now increased the memory request value for the memory-intensive pods, but I am trying to fix this at the Kubernetes configuration level.
Do you know how we can make the OOM killer trigger faster, so that pods get killed and the node at least does not go to the NotReady state?
If the situation is still not good, we can try to open an issue and even create some patches if we come up with ideas to improve it.
I’ve not seen that myself (the OOM killer kills the big memory-consuming process in my clusters), but I’m definitely interested to see what is possible if your cluster setup has different requirements and things don’t behave so well there.