How can I correctly set Kubernetes pod eviction limits to avoid the system OOM killer?

Cluster information:

Kubernetes version: v1.18.3
Cloud being used: bare metal
Installation method: Rancher (RKE)
Host OS: Redhat 7
CRI and version: docker 19.3.12

I’m trying to set up eviction thresholds and resource reservations in such a way that there is always at least 1GiB of memory available.

Going on the documentation regarding resource reservations and out-of-resource handling, I figured setting the following eviction policy would suffice:

--eviction-hard=memory.available<1Gi

However, in practice this does not work at all: the kubelet’s computation of available memory seems to differ from the one the kernel uses to decide whether to invoke the OOM killer. For example, when I load up my system with a bunch of pods running an artificial memory hog, I get the following report from free -m:

total:      15866
used:       14628
free:       161
shared:     53
buff/cache: 1077
available:  859

According to the kernel, there’s 859 MiB of memory available. Yet the kubelet does not invoke its eviction policy. In fact, I’ve been able to trigger the system OOM killer before the kubelet eviction policy kicked in, even when ramping up memory usage incredibly slowly (to allow for the kubelet housekeeping control loop, which sleeps 10 seconds between runs in its default configuration).
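For comparison purposes, the "available" column of free is, to my knowledge, taken from the kernel's MemAvailable estimate in /proc/meminfo. A minimal sketch of reading that field follows. The helper name `meminfo_mib` is hypothetical, and the sample text is back-computed from the rounded free -m figures above, so it is illustrative rather than a real capture.

```python
def meminfo_mib(meminfo_text, field):
    """Return one /proc/meminfo field (reported in kB) converted to whole MiB."""
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name == field:
            return int(rest.split()[0]) // 1024
    raise KeyError(field)


# Illustrative sample only: values derived from the free -m output in the post.
SAMPLE_MEMINFO = """\
MemTotal:       16246784 kB
MemFree:          164864 kB
MemAvailable:     879616 kB
"""

kernel_available_mib = meminfo_mib(SAMPLE_MEMINFO, "MemAvailable")
```

On a real node, pass the contents of /proc/meminfo instead of the sample string.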

I’ve found this script, which used to be in the Kubernetes documentation and is supposed to calculate the available memory the same way the kubelet does. I ran it in parallel with free -m above and got the following result:

memory.available_in_mb 1833

That’s a difference of almost 1000 MiB!
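The calculation in that script can be sketched as follows: memory.available is the node capacity minus the root cgroup's working set, where the working set is the cgroup usage minus inactive file cache. MemAvailable is never consulted. This is a sketch under assumptions, not the kubelet's actual code: the function names are hypothetical, and the paths assume the cgroup v1 layout found on a RHEL 7 host.

```python
from pathlib import Path

# Assumed cgroup v1 location of the root memory cgroup; differs under cgroup v2.
CGROUP_MEMORY = Path("/sys/fs/cgroup/memory")


def parse_stat(text):
    """Parse 'key value' lines, as found in memory.stat, into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def kubelet_memory_available(capacity_bytes, usage_bytes, inactive_file_bytes):
    """capacity - working_set, with working_set = usage - inactive_file (floored at 0)."""
    working_set = max(usage_bytes - inactive_file_bytes, 0)
    return capacity_bytes - working_set


def read_from_node():
    """Gather the three inputs from the node itself (only works on a cgroup v1 host)."""
    meminfo = Path("/proc/meminfo").read_text()
    total_kb = next(
        int(line.split()[1])
        for line in meminfo.splitlines()
        if line.startswith("MemTotal:")
    )
    usage = int((CGROUP_MEMORY / "memory.usage_in_bytes").read_text())
    stat = parse_stat((CGROUP_MEMORY / "memory.stat").read_text())
    return kubelet_memory_available(total_kb * 1024, usage, stat["total_inactive_file"])
```

Running `read_from_node() // 1024**2` on the node should give a figure comparable to the script's memory.available_in_mb.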

Now, I understand that this calculation is by design, but that leaves me with the obvious question: how can I reliably manage system resource usage so that the system OOM killer does not get invoked? What eviction policy can I set so that the kubelet starts evicting pods when there’s less than a gigabyte of memory available?

I’ve been able to trigger the eviction mechanism by simply increasing the eviction-hard limit to 2Gi. Here too, though, the actual available memory at the time the eviction started was far lower than the limit set in the configuration.
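To make the comparison concrete, here is a small sketch of the threshold check, fed with the two figures measured earlier (859 MiB according to free, 1833 MiB according to the script). The helpers `parse_quantity` and `threshold_crossed` are hypothetical names, the parser only handles the Ki/Mi/Gi suffixes, and the measurements may not match the state of the node during the 2Gi test.

```python
MIB = 1024 ** 2
UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_quantity(quantity):
    """Convert a quantity such as '1Gi' or '500Mi' to bytes (binary suffixes only)."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)


def threshold_crossed(available_bytes, threshold):
    """True when available memory has dropped below the eviction threshold."""
    return available_bytes < parse_quantity(threshold)


kubelet_view = 1833 * MIB  # memory.available as computed by the script
kernel_view = 859 * MIB    # "available" as reported by free -m

evicts_at_1gi = threshold_crossed(kubelet_view, "1Gi")   # kubelet's figure, 1Gi limit
evicts_at_2gi = threshold_crossed(kubelet_view, "2Gi")   # kubelet's figure, 2Gi limit
kernel_below_1gi = threshold_crossed(kernel_view, "1Gi")  # kernel's figure, 1Gi limit
```

With these inputs the kubelet's figure stays above a 1Gi threshold while the kernel's figure is already below it, and only the 2Gi threshold is crossed, which matches the behaviour described.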

Additionally, I tried the example from the out-of-resource handling doc page, which states that setting eviction-hard to 500Mi and system-reserved to 1.5Gi would result in pods being evicted when there’s half a gig of memory left. That has the same issues as what I described above.
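For context, the resource reservation docs define node allocatable as capacity minus kube-reserved, system-reserved, and the hard eviction threshold. A minimal sketch of that arithmetic for this node, assuming the 15866 MiB total from free -m and no kube-reserved setting (the function name is hypothetical):

```python
MIB = 1024 ** 2


def node_allocatable(capacity, system_reserved=0, kube_reserved=0, eviction_hard=0):
    """Allocatable = capacity - kube-reserved - system-reserved - hard eviction threshold."""
    return capacity - kube_reserved - system_reserved - eviction_hard


allocatable = node_allocatable(
    capacity=15866 * MIB,         # total memory reported by free -m
    system_reserved=1536 * MIB,   # --system-reserved=memory=1.5Gi
    eviction_hard=500 * MIB,      # --eviction-hard=memory.available<500Mi
)
allocatable_mib = allocatable // MIB
```

This only describes how much memory the scheduler may hand out to pods; it does not change how the kubelet measures memory.available.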
