I work at a cloud provider where we offer managed Kubernetes.
In rare cases we have worker nodes that get overloaded to the point where the host OS is completely unresponsive.
The host still answers pings, but SSH is often no longer possible, and the node status goes to
NotReady because the kubelet can no longer respond.
During that time Kubernetes will not reschedule pods away from that host, which is something nobody wants. When a node goes down, its workload should be rescheduled.
We nailed this down to memory consumption by the workload running on that host.
The typical behaviour we see is that pods consume far more memory than is available. After some time this leads to high disk I/O (the machine starts thrashing).
Once disk I/O for that specific storage sits at 100%, the node goes
NotReady and no longer responds.
Solution after some investigation
After reading some documentation and searching the internet on this topic, we made the following changes to the worker nodes:
- switched the cgroup driver from cgroupfs to systemd for Docker and the kubelet
- created a podruntime.slice and moved docker, containerd and kubelet into it:

      /podruntime.slice
      /podruntime.slice/containerd.service
      /podruntime.slice/docker.service
      /podruntime.slice/kubelet.service
- enabled resource accounting in systemd:

      # /etc/systemd/system.conf.d/accounting.conf
      [Manager]
      DefaultCPUAccounting=yes
      DefaultMemoryAccounting=yes
      DefaultBlockIOAccounting=yes
      DefaultTasksAccounting=yes
      DefaultIOAccounting=yes
      DefaultIPAccounting=yes
- set the kubelet's eviction-hard threshold to 1 GiB of memory (memory.available<1Gi)
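To make the list above concrete, here is a sketch of a script that writes the drop-ins we are describing. The paths, the drop-in name `10-podruntime.conf`, and the `./staging` default are illustrative choices, not something from our actual setup; adapt them to your distribution, and merge `daemon.json` with any options you already have.

```shell
#!/bin/sh
# Sketch: stage the systemd/docker config changes described above.
# By default everything is written under ./staging so it can be inspected
# first; set ROOT= (empty) to write to the real filesystem.
set -eu
ROOT="${ROOT:-./staging}"

# Run docker, containerd and kubelet under podruntime.slice
for unit in docker containerd kubelet; do
  mkdir -p "$ROOT/etc/systemd/system/$unit.service.d"
  cat > "$ROOT/etc/systemd/system/$unit.service.d/10-podruntime.conf" <<'EOF'
[Service]
Slice=podruntime.slice
EOF
done

# Switch docker to the systemd cgroup driver
# (merge this with any existing daemon.json options)
mkdir -p "$ROOT/etc/docker"
cat > "$ROOT/etc/docker/daemon.json" <<'EOF'
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF

# Afterwards: systemctl daemon-reload, restart the services, and start the
# kubelet with --cgroup-driver=systemd and
# --eviction-hard='memory.available<1Gi' (quote the '<' in shell).
```
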
We did not want to mess with system-reserved and kube-reserved:
we do not know how many resources the various clusters currently consume, and we do not want processes to be killed.
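For reference, if one did want to reserve resources, it would look roughly like this in the kubelet config file; the values below are made-up placeholders, not a recommendation:

```yaml
# KubeletConfiguration fragment, e.g. /var/lib/kubelet/config.yaml
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
systemReserved:
  cpu: "500m"
  memory: "1Gi"
kubeReserved:
  cpu: "500m"
  memory: "1Gi"
evictionHard:
  memory.available: "1Gi"
```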
Tests with these changes show more stable workers when deploying pods with high memory consumption, compared with the cgroupfs setup.
But it is still possible to drive a worker into the unresponsive state, just less likely.
Are there any other measures one could take to make the system more stable?
We also considered a userspace OOM killer such as earlyoom. The problem is that this is not transparent to the Kubernetes user, since it would show up neither as a system OOM kill nor as an eviction.
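The idea would be something like the following sketch of an earlyoom-style watcher, kept hypothetical on purpose: it only reports low memory, whereas a real tool like earlyoom would pick the process with the largest RSS and SIGKILL it at that point (which is exactly the step Kubernetes would not see).

```shell
#!/bin/sh
# Hypothetical sketch of a userspace low-memory watcher.
# A real earlyoom-style tool would kill the biggest process instead of
# just printing a message.

# Print MemAvailable (in KiB) from a meminfo-format file
# (defaults to /proc/meminfo).
mem_available_kib() {
  awk '/^MemAvailable:/ {print $2}' "${1:-/proc/meminfo}"
}

# Poll once per second; the 1 GiB default mirrors our eviction-hard
# setting. Run with WATCH=1 to actually start the loop.
if [ "${WATCH:-0}" = "1" ]; then
  threshold_kib="${THRESHOLD_KIB:-1048576}"
  while :; do
    avail="$(mem_available_kib)"
    if [ "$avail" -lt "$threshold_kib" ]; then
      echo "low memory: ${avail} KiB available; would kill biggest process"
    fi
    sleep 1
  done
fi
```
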
Cloud being used:
- bare metal
- Ubuntu 18.04