Node gets unstable with high memory usage

Problem

I work at a cloud provider where we offer managed Kubernetes.
In rare cases a worker node gets overloaded and the host OS becomes completely unresponsive.
The host still answers pings, but SSH is often no longer possible, and the node goes NotReady because the kubelet cannot answer anymore.
During that time Kubernetes will not reschedule pods away from that host, which is something nobody wants. When a node goes down, its workload should be rescheduled.

We nailed this down to memory consumption by the workload running on that host.
The typical behaviour we see is that pods consume far more memory than is available. After some time this leads to high disk I/O, presumably because the kernel starts thrashing, constantly evicting and re-reading page-cache pages.
Once disk I/O sits at 100% on that storage device, the node goes NotReady and no longer responds.

Solution after some investigation

After reading some documentation and searching the internet on this topic, we made some changes to the worker nodes.

Changes

  • switched the kubelet cgroup driver from cgroupfs to systemd
  • moved the container runtime and the kubelet into a dedicated slice, so the relevant part of the cgroup tree looks like this:

      /podruntime.slice
      /podruntime.slice/containerd.service
      /podruntime.slice/docker.service
      /podruntime.slice/kubelet.service

  • enabled resource accounting in systemd via /etc/systemd/system.conf.d/accounting.conf:

      [Manager]
      DefaultCPUAccounting=yes
      DefaultMemoryAccounting=yes
      DefaultBlockIOAccounting=yes
      DefaultTasksAccounting=yes
      DefaultIOAccounting=yes
      DefaultIPAccounting=yes
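
For anyone who wants to reproduce the slice setup, a minimal sketch of the units involved (file names and contents are illustrative, not our exact production files; the same drop-in pattern applies to docker.service and containerd.service):

      # /etc/systemd/system/podruntime.slice
      [Unit]
      Description=Slice for the container runtime and the kubelet

      # /etc/systemd/system/kubelet.service.d/10-podruntime-slice.conf
      [Service]
      Slice=podruntime.slice

systemctl daemon-reload plus a restart of the moved services applies the drop-ins; the accounting defaults in system.conf.d are picked up by systemctl daemon-reexec (or a reboot).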

We did not want to mess with system-reserved and kube-reserved:
we do not know how many resources the different clusters currently consume, and we did not want processes to be killed because of a guessed reservation.
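
For completeness, if we did set reservations it would look roughly like this in the kubelet configuration (the fields are from KubeletConfiguration v1beta1; the values are made-up placeholders, not recommendations):

      apiVersion: kubelet.config.k8s.io/v1beta1
      kind: KubeletConfiguration
      cgroupDriver: systemd
      # hypothetical values; they would have to be measured per cluster
      systemReserved:
        cpu: "500m"
        memory: "1Gi"
      kubeReserved:
        cpu: "500m"
        memory: "1Gi"
      # evict pods before the node itself starts thrashing
      evictionHard:
        memory.available: "500Mi"
      enforceNodeAllocatable:
        - pods

Unlike a userspace OOM killer, pods removed this way show up as regular kubelet evictions, so the mechanism stays visible to the Kubernetes user.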

Tests with these changes show noticeably more stable workers when deploying pods that allocate lots of memory, compared to the previous cgroupfs setup.
But it is still possible to get a worker into the unresponsive state; it is just less likely.

Question

Are there any other measures one could take to make the system more stable?

We were also thinking about a userspace OOM killer like earlyoom. The problem is that its kills are not transparent to the Kubernetes user: they show up neither as a system OOM nor as an eviction.
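
If we went that route anyway, the configuration itself would be small. A sketch, assuming the Debian/Ubuntu packaging of earlyoom (the thresholds are guesses, and --avoid keeps the node-critical daemons from being chosen as victims):

      # /etc/default/earlyoom
      # -m 4: act when less than 4% of memory is available
      # -s 100: the swap threshold is then always satisfied, so only the
      #         memory threshold matters (our nodes run without swap)
      EARLYOOM_ARGS="-m 4 -s 100 --avoid '^(kubelet|dockerd|containerd|sshd)$'"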

Cluster information:

Kubernetes version:

  • 1.16
  • 1.17
  • 1.18
  • 1.19

Cloud being used:

  • bare metal

Installation method:

  • kubeadm
  • docker

Host OS:

  • Ubuntu 18.04