Reading through the Resource Management section of the docs, it states (emphasis mine) that
The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled containers is less than the capacity of the node
So, what happens in the following scenario?
I have a node with 32gb of memory and swap disabled. My pods have a request of 8gb of memory and no limit configured. There are 2 pods currently executing, each using 14gb of memory. So 28gb of memory is actually in use (leaving only 4gb free), but only 16gb has been requested (technically leaving 16gb of the 32gb budget unreserved).
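To make that concrete, the pod spec looks roughly like this (names and image are placeholders, not our real workload):

```yaml
# Sketch only: a pod with a memory request of 8Gi and no memory limit.
apiVersion: v1
kind: Pod
metadata:
  name: worker-example               # placeholder name
spec:
  containers:
    - name: worker
      image: example.com/worker:latest   # placeholder image
      resources:
        requests:
          memory: 8Gi                # the scheduler only counts this
        # no limits.memory, so actual usage can grow well past 8Gi
```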
- Can a third pod be scheduled on this node?
- What would the effect be of setting limits of 12gb per pod? (I’m assuming they would be terminated as soon as they went over 12gb)
- What would the effect be of setting limits of 16gb per pod? (I’m assuming nothing, but some documentation makes it sound like those pods will be terminated to make way for other pods)
- Can a third pod be scheduled on this node?
Yes, but it may not be able to get 8GB of memory until one of the other pods gives it up.
- What would the effect be of setting limits of 12gb per pod?
As soon as they hit 12GB usage, further allocations would cause the OS to try to reclaim memory from that pod, and if that fails, kill it as OOM.
- What would the effect be of setting limits of 16gb per pod?
Same as (2) but at a higher number. Any pod whose usage is greater than its request MAY be killed at any time, to service another pod whose usage is less than its request.
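To make the request/limit distinction concrete (placeholder names, values just for illustration): the request is what the scheduler budgets against, and the limit is the cgroup cap where reclaim and then OOM kick in.

```yaml
# Sketch: request stays at 8Gi for scheduling, limit caps actual usage at 12Gi.
# If the container's usage hits 12Gi and reclaim fails, it is OOM-killed.
apiVersion: v1
kind: Pod
metadata:
  name: worker-limited-example       # placeholder name
spec:
  containers:
    - name: worker
      image: example.com/worker:latest   # placeholder image
      resources:
        requests:
          memory: 8Gi
        limits:
          memory: 12Gi
```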
Thanks for that. Following on from Q1 then, is there a built-in way of preventing that third pod from being scheduled at that point?
We have quite a varied workload which we’re looking at processing using Jobs in Kubernetes with Keda. Some of the tasks will only take 1gb memory, some may take 14. They’re evictable and will happily restart, but if there’s less than 8gb of memory available then it’s probably not worth starting a new one.
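For context, the shape of what we’re planning is roughly this; everything here (names, image, trigger) is a placeholder rather than our real config:

```yaml
# Rough sketch of our setup, not the real manifest.
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: batch-worker-example         # placeholder
spec:
  pollingInterval: 30
  maxReplicaCount: 20
  jobTargetRef:
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: worker
            image: example.com/worker:latest   # placeholder
            resources:
              requests:
                memory: 8Gi          # most tasks fit well under this, a few need ~14Gi
  triggers:
    - type: rabbitmq                 # placeholder trigger; the real one depends on our queue
      metadata:
        queueName: tasks             # placeholder
        host: amqp://user:pass@rabbitmq.example:5672/vhost   # placeholder
```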
I’ve been looking into the memory pressure taint, but it doesn’t look like you can configure that with just a “don’t schedule any more pods” option. If you configure that, it looks like it will start evicting pods to get back below the threshold.
The tasks are usually quite short running (less than 15 minutes) and bursty in nature (2 peaks per day), so what I’m currently considering is using a soft-eviction threshold of (for example) <8gb and then setting an eviction-soft-grace-period value of 15 minutes. If I’m reading that right, there would have to be <8gb of memory available for 15 minutes before it starts considering evicting pods. What I don’t know is whether the taint (to prevent new pods being scheduled) will be applied immediately.
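If I’ve read the kubelet config docs correctly, those two settings would look something like this (untested sketch):

```yaml
# Sketch of the kubelet config I have in mind; untested.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionSoft:
  memory.available: "8Gi"            # soft threshold: less than 8Gi free
evictionSoftGracePeriod:
  memory.available: "15m"            # must hold for 15 minutes before soft eviction starts
```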
is there a built-in way of preventing that third pod from being scheduled at that point?
Not really. You overcommitted the machine. For CPU it would not be a problem, because CPU is a “time-based” resource rather than a “space-based” one (aka compressible vs. incompressible).
The more deterministic pattern might be to set a limit of 8GiB and, if a given task fails with OOM, ramp up the resource request on retry. If it fails at 8GiB, try it with 10. If it fails with 10, try it with 12.
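A rough sketch of that pattern; the escalation itself would live in whatever submits the Job, and all the names here are placeholders:

```yaml
# Sketch: first attempt at 8Gi. If the Job fails with an OOMKilled container,
# whatever submits the Job re-creates it with the next size up (10Gi, then 12Gi).
apiVersion: batch/v1
kind: Job
metadata:
  name: task-attempt-8gi             # placeholder
spec:
  backoffLimit: 0                    # don't retry at the same size; resubmit larger instead
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: task
          image: example.com/worker:latest   # placeholder
          resources:
            requests:
              memory: 8Gi
            limits:
              memory: 8Gi
```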
Alternatively, you can “right size” pods with the (currently alpha) in-place resize API. So you could write a controller which looks at ACTUAL usage over time and updates the requests in place. Start with request=8G and limit=16G, then watch. If actual usage is 12G, change the request to 12G (and maybe the limit). NOW the scheduler is aware of the truth.
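Very roughly, the pod side of that would look like this (the feature is alpha, behind the InPlacePodVerticalScaling feature gate, so treat the exact fields and patching mechanics as subject to change):

```yaml
# Sketch: a pod that opts in to in-place memory resizes (alpha feature).
apiVersion: v1
kind: Pod
metadata:
  name: worker-resizable-example     # placeholder name
spec:
  containers:
    - name: worker
      image: example.com/worker:latest   # placeholder image
      resizePolicy:
        - resourceName: memory
          restartPolicy: NotRequired # resize without restarting the container
      resources:
        requests:
          memory: 8Gi                # a controller would patch this up toward observed usage
        limits:
          memory: 16Gi
```

The controller part is then just a loop that watches actual usage and patches the container resources; how that patch is applied differs between releases while the feature is alpha.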
Interesting. I’ll keep my eye on that API. We’re a few months off being able to migrate to this, I reckon, so waiting for that API to GA might be our best bet.
I think a very rudimentary fallback would be to have a custom pod running on the node that watches the node’s available memory and adds/removes a taint as appropriate.
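i.e. something like this being added to and removed from the Node by that watcher (the taint key and node name are made up):

```yaml
# Sketch: the taint a node-local watcher might add when free memory drops below
# ~8Gi, and remove again once memory frees up.
apiVersion: v1
kind: Node
metadata:
  name: worker-node-1                # placeholder node name
spec:
  taints:
    - key: example.com/low-memory    # made-up key
      value: "true"
      effect: NoSchedule             # stop new pods landing here, don't evict existing ones
```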
I appreciate that technically I’ve overprovisioned the node, but the scheduler could be aware of that before scheduling the third pod.
Just found this post that is asking the same question. Not sure why I missed it when searching earlier