I have always been confused about this, and it seems there are different opinions on the topic.
When we specify memory and CPU requests for a pod, Kubernetes uses them to schedule the pod onto a node that has that much memory and CPU available at that point.
Now, the question is: does the pod reserve the requested memory and CPU upfront even if it doesn't use that much at runtime?
Secondly, if the memory actually used is less than the requested memory, does the pod still hold the full request for itself, so that the margin between the request and actual usage is not made available to other pods on that node?
Example scenario: pod A has a memory request of 500MB and a limit of 600MB, and it is scheduled on a node with 1GB of available memory. Another pod B with a 500MB memory request is scheduled on the same node later.
Now pod A is using 500MB while pod B is using 300MB.
Later, pod A needs to use 600MB, while pod B is still using only 300MB of its requested 500MB. Will pod A be allowed to use 600MB because pod B is using less than it requested, or, since pod B requested and reserved 500MB upfront, will that memory not be released even though pod B is only using 300MB?
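For concreteness, the scenario above corresponds to resource settings roughly like the following sketch (the pod/container names and image are hypothetical, and Mi is used in place of MB since that is the usual unit in manifests):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-a                # hypothetical name for pod A in the example
spec:
  containers:
  - name: app                # hypothetical container name
    image: nginx             # placeholder image
    resources:
      requests:
        memory: "500Mi"      # the scheduler only places the pod on a node
                             # with at least this much unrequested memory
      limits:
        memory: "600Mi"      # the most the container may use before it is
                             # OOM-killed inside its own cgroup
```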
Cluster information:
Kubernetes version: 1.21.9
Cloud being used: Azure AKS
Installation method: AKS managed service on Azure cloud
Host OS: RHEL
No, they are not reserved in the sense you describe, and memory and CPU behave slightly differently from each other.
In the case you laid out, pod A can use the 600Mi. The problem comes when pod B eventually needs that memory and it isn't available. The OS will try to free up memory from anywhere (pod A is strictly under its own limit, so it is not specifically victimized), and if it can't find pages quickly enough it will OOM. So pod A caused an OOM for pod B. Hence the oft-repeated guidance to always set memory limit == request.
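As an illustration of that guidance (values are made up, not from the thread), the resources stanza would set the memory limit equal to the request; if the CPU limit and request are also equal, the pod additionally lands in the Guaranteed QoS class:

```yaml
resources:
  requests:
    memory: "500Mi"
    cpu: "250m"
  limits:
    memory: "500Mi"   # limit == request, so the pod never bursts past what
                      # the scheduler has already accounted for on the node
    cpu: "250m"       # equal CPU request/limit as well makes the pod Guaranteed QoS
```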
CPU is more easily reclaimed: pod A would just get less CPU time in the future if pod B wanted to use its request.
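To make the CPU difference concrete (again an illustrative stanza, not from the thread): the CPU request acts as a relative weight when the node's CPU is contended, while the CPU limit is enforced by throttling rather than by killing anything:

```yaml
resources:
  requests:
    cpu: "500m"   # becomes a cgroup CPU weight: under contention the container
                  # gets roughly this proportion of CPU time, never less
  limits:
    cpu: "1"      # enforced by CFS quota: exceeding it throttles the container;
                  # nothing is ever killed for using too much CPU
```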
Thanks @thockin for the reply… so as I understand it, when the OOM for pod B occurs, the used memory on the node should essentially show as 100% in the Kubernetes node metrics, is that correct?
That is a system OOM (as opposed to a local cgroup OOM).
@thockin If I understand you correctly, when an OOM happens at the system level, the OS scores the processes and decides which one should be killed based on that score, so in this case pod B is not necessarily the one that gets killed, right?
In the case of a system OOM, the kernel uses heuristics to decide what to do. Pod B is the one requesting memory, so it is the one that triggers the OOM condition. That doesn't mean it will be the one to die.
@thockin Yes, I agree with you. I think some pods will definitely be killed and their memory released for pod B to use, since pod B's request needs to be fulfilled.
PS: Thank you for your reply. Sorry I didn't reply sooner; I didn't receive the notification.