I’ve run out of ideas on this subject and am looking for new ones.
I don’t believe the specifics of the cluster are all that important, but we run a custom Kubernetes 1.12.2 cluster on GCE that uses cluster-autoscaler to scale GCE MIGs when we need node capacity. We observe the following when a new node is needed:
- The cluster-autoscaler (CA) boots a new node in the appropriate MIG
- The VM boots and starts the kubelet
- As the kubelet starts up, it receives a set of DaemonSet (DS) pods to run that provide node-level services such as logging
- While those DS pods are still in the process of starting up, the kubelet marks the Node as Ready
- The scheduler notices the Ready condition and schedules “regular” non-DS pods on the node (the commands below show how this typically looks)
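For illustration, this is roughly how the race looks from the outside while a fresh node is coming up; the node name is a placeholder and these are just ordinary kubectl queries, not part of any fix:

```sh
# The node already reports Ready...
kubectl get node gke-worker-abc123 \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'

# ...while the DS pods scheduled to it are still ContainerCreating, and the
# scheduler has already started placing regular pods alongside them.
kubectl get pods --all-namespaces -o wide \
  --field-selector spec.nodeName=gke-worker-abc123
```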
I believe this is normal, expected behavior, but it causes problems for us. Those node-level services are not done starting by the time the non-DS pods start up, and some of the non-DS pods rely on the node-level services, which causes them to get into bad states and ultimately fail. We also have a deployment that will scale up hundreds of pods at a time, and as soon as a node becomes Ready there is a thundering herd that causes resource contention, slowing down the startup of the DS pods even more.
We do run in a cloud and should expect some level of entropy, and I do think there is a little opportunity to optimize the DS pods themselves. However, I would really like a more certain, less racy way to block non-DS pods while the DS pods are starting.
A couple of ideas I came up with were adding a “shadow” taint, and artificially setting MaxPods equal to the number of DS pods when the node boots. I call the first idea a shadow taint because the taint was added to the list of taints in the kubelet configuration but was not exposed in the MIG metadata, so the autoscaler didn’t know about it. The MaxPods manipulation worked in a similar way, booting the node with a kubelet configuration whose MaxPods equaled the number of DS pods. Both solutions worked roughly the same way: a simple shell script in the init of the booting node waited for the DS pods to finish starting (or for a timeout) and then removed the blocking condition, either by removing the shadow taint or by setting MaxPods back to the default of 200. Both of these solutions actually worked for the most part and did block non-DS pods, even after the node went Ready, until the condition was removed.

The problem is that in each case, as soon as the node became Ready and the non-DS pods were not scheduled to it, the cluster-autoscaler would double down and boot another node. For the deployment I mentioned earlier that can spawn hundreds of pods, it would WAY over-allocate nodes, costing us considerable money.
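For concreteness, here is a minimal sketch of what the shadow-taint variant looks like. The taint key, the label I use to select our DS pods, and the timeout are placeholders rather than our exact values, and it assumes kubectl on the node has credentials to list pods and untaint the node:

```sh
#!/usr/bin/env bash
# Node-init sketch for the shadow-taint approach.
#
# The kubelet is started with a taint that only the instance template knows
# about (it is never added to the MIG metadata the autoscaler reads), e.g.:
#   --register-with-taints=example.com/node-not-ready=:NoSchedule
#
# Placeholders: the taint key, the "tier=node-services" label on the DS pods,
# and the 5 minute timeout.

NODE="$(hostname)"
TAINT_KEY="example.com/node-not-ready"
TIMEOUT=300
INTERVAL=5
elapsed=0

# Wait until every DS pod scheduled to this node reports Ready (and at least
# one exists), or until the timeout expires.
while [ "${elapsed}" -lt "${TIMEOUT}" ]; do
  statuses=$(kubectl get pods --all-namespaces \
    -l tier=node-services \
    --field-selector spec.nodeName="${NODE}" \
    -o jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}')
  total=$(printf '%s\n' "${statuses}" | grep -c .)
  ready=$(printf '%s\n' "${statuses}" | grep -c '^True$' || true)
  if [ "${total}" -gt 0 ] && [ "${ready}" -eq "${total}" ]; then
    break
  fi
  sleep "${INTERVAL}"
  elapsed=$((elapsed + INTERVAL))
done

# Lift the blocking condition so the scheduler can place regular pods here.
kubectl taint nodes "${NODE}" "${TAINT_KEY}:NoSchedule-"
```

The MaxPods variant is essentially the same script, except that instead of removing a taint at the end it rewrites the kubelet’s max-pods setting from the DS pod count back to 200 and restarts the kubelet. From the scheduler’s point of view both behave the same, and neither avoids the autoscaler problem described above.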
I could lengthen the cluster-autoscaler’s scan interval, but we would lose the reactiveness of booting nodes to meet demand, and it would still not be deterministic.
There is a small project I discovered after writing my solutions: a Kubernetes controller that uses a node taint very similarly to my shadow taint, but it has the same cluster-autoscaler pitfalls as my solution. It can be found here: https://github.com/mikkeloscar/kube-node-ready-controller
I really hope something can be added to Kubernetes to help address this problem; I am surprised more people have not run into it.