Nodes become unstable with high memory usage

ajfriesen · December 22, 2020, 10:19am

Problem

I work at a cloud provider that offers managed Kubernetes.
In rare cases, worker nodes become overloaded and the host OS becomes completely unresponsive.
The host still answers ping, but SSH is often no longer possible, and the node status changes to NotReady because the kubelet can no longer respond.
During that time k8s does not reschedule the pods from that host, which nobody wants. When a node goes down, its workload should be rescheduled.

We narrowed this down to memory consumption by the workload running on that host.
The typical behaviour we see is that pods consume far more memory than is available. After some time this leads to high disk I/O.
Once disk I/O reaches 100% on that specific storage device, the node goes NotReady and stops responding.

Solution after some investigation

After reading documentation and searching the internet on this topic, we made some changes to the worker nodes.

Changes

Switched the cgroup driver from cgroupfs to systemd for docker and kubelet
- Reserve Compute Resources for System Daemons | Kubernetes
Created a podruntime.slice containing docker, containerd and kubelet, according to
- https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/node-allocatable.md#recommended-cgroups-setup

/podruntime.slice                     
/podruntime.slice/containerd.service  
/podruntime.slice/docker.service      
/podruntime.slice/kubelet.service
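
One way a layout like this can be produced is a slice unit plus a small drop-in per service. This is a minimal sketch and not necessarily the exact setup used here: the file paths and the drop-in name `10-podruntime.conf` are assumptions, while `Slice=` is the standard systemd setting for assigning a service to a slice.

```ini
# /etc/systemd/system/podruntime.slice   (path assumed)
[Unit]
Description=Container runtime and kubelet

# /etc/systemd/system/kubelet.service.d/10-podruntime.conf   (drop-in name assumed)
# Repeat the same drop-in for docker.service and containerd.service.
[Service]
Slice=podruntime.slice
```

After a `systemctl daemon-reload` and a restart of the three services, `systemd-cgls` should list them under podruntime.slice.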

Enabled resource accounting in systemd

/etc/systemd/system.conf.d/accounting.conf
[Manager]
DefaultCPUAccounting=yes
DefaultMemoryAccounting=yes
DefaultBlockIOAccounting=yes
DefaultTasksAccounting=yes
DefaultIOAccounting=yes
DefaultIPAccounting=yes

Set eviction-hard to 1GB of memory
- https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#eviction-thresholds
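
For reference, a hard eviction threshold like this can be passed to the kubelet with the `--eviction-hard` flag. The following is a minimal sketch, assuming a kubeadm install on Ubuntu where the kubelet unit reads extra flags from `/etc/default/kubelet`; the value `1Gi` is an assumption standing in for the 1GB mentioned above.

```ini
# /etc/default/kubelet   (assumed to be read by the kubeadm kubelet drop-in on deb-based systems)
# Evict pods once available node memory drops below 1Gi (value assumed).
KUBELET_EXTRA_ARGS=--eviction-hard=memory.available<1Gi
```

The kubelet has to be restarted for the new threshold to take effect.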

We did not want to touch system-reserved and kube-reserved.
We do not know how many resources the different clusters currently use, and we do not want to kill processes.

Tests with these changes show more stable workers when applying pods that use lots of memory, compared to the setup with cgroupfs.
But it is still possible to get a worker into the unresponsive state; it is just less likely.

Question

Are there any other measures one could take to make the system more stable?

We were also thinking of a userspace OOM killer like earlyoom. The problem is that this is not transparent to the k8s user, since a kill would not be visible as a system OOM or an eviction.

Cluster information:

Kubernetes version:

1.16
1.17
1.18
1.19

Cloud being used:

bare metal

Installation method:

kubeadm
docker

Host OS:

ubuntu 18.04
