Start regular pods after all daemonset pods are Running

I’ve run out of ideas on this subject and am looking for fresh ones.

I don’t believe the specifics of the cluster are all that important, but we run a custom Kubernetes 1.12.2 cluster on GCE that uses the cluster-autoscaler to scale GCE MIGs when we need node capacity. We observe the following when a new node is needed:

  • CA boots a new node in the appropriate MIG
  • The VM boots and starts the kubelet
  • As the kubelet starts up, it is given a set of DaemonSet pods to run that provide node-level services, such as logging
  • While those DS pods are still starting up, the kubelet marks the Node as Ready
  • The scheduler notices the Ready condition and schedules “regular” non-DS pods onto the node

I believe this is normal, expected behavior, but it causes problems for us. Those node-level services are not done starting by the time the non-DS pods start up, and some of the non-DS pods rely on the node-level services, which causes them to get into bad states and ultimately fail. We also have a deployment that will scale up hundreds of pods at a time; as soon as a node becomes Ready there is a thundering herd that causes resource contention, slowing down the startup of the DS pods even more.

We do run in a cloud and should expect some level of entropy, and I do think there is a little opportunity to optimize the DS pods themselves. However, I would really like a more certain, less racy way to block non-DS pods while the DS pods are starting.

A couple of ideas I came up with were adding a “shadow” taint, and artificially setting MaxPods equal to the number of DS pods when the node boots. I call the first idea a shadow taint because the taint was added to the list of taints in the kubelet configuration but was not exposed in the MIG metadata, so the autoscaler didn’t know about it. The MaxPods manipulation worked in a similar way, by booting the node with a kubelet configuration whose MaxPods equaled the number of DS pods. Both solutions worked roughly the same way: a simple shell script in the init of the booting node waited for the DS pods to finish starting (or for a timeout) and then removed the blocking condition, either by removing the shadow taint or by setting MaxPods back to the default of 200. Both of these solutions actually worked for the most part and did block non-DS pods, even after the node went Ready, until the condition was removed.

The problem is that in each case, as soon as the node became Ready and the non-DS pods were not scheduled to it, the cluster-autoscaler would double down and boot another node. For the deployment I mentioned earlier that can spawn hundreds of pods, it would WAY over-allocate nodes, costing us considerable money.
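
To make that concrete, here is a rough sketch of the kind of init script I mean, not the exact one we run. The taint key, expected pod count, and timeout are illustrative, and it assumes the node registered with the shadow taint (e.g. via the kubelet’s --register-with-taints flag) and that kubectl on the node has credentials to list pods and untaint its own Node:

```bash
#!/usr/bin/env bash
# Rough sketch only. Assumes the node was registered with a hypothetical
# shadow taint that the MIG metadata does not expose to the cluster-autoscaler.
set -euo pipefail

NODE="$(hostname)"
TAINT_KEY="example.com/node-initializing"   # illustrative taint key
EXPECTED_DS_PODS=7                          # number of DS pods we expect here
TIMEOUT_SECONDS=300                         # unblock anyway after 5 minutes

deadline=$(( $(date +%s) + TIMEOUT_SECONDS ))
while [ "$(date +%s)" -lt "$deadline" ]; do
  # While the taint is in place, only DS (and static) pods can land on the
  # node, so counting Running pods bound to this node is a good enough proxy.
  running=$(kubectl get pods --all-namespaces -o name \
    --field-selector "spec.nodeName=${NODE},status.phase=Running" | wc -l)
  if [ "${running}" -ge "${EXPECTED_DS_PODS}" ]; then
    break
  fi
  sleep 5
done

# Remove the shadow taint (the trailing '-' deletes it) so the scheduler can
# start placing regular pods on the node.
kubectl taint nodes "${NODE}" "${TAINT_KEY}:NoSchedule-"
```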

I could lengthen the cluster-autoscaler’s scan interval, but we would lose the responsiveness of booting nodes to meet demand, and it would still not be deterministic.
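
For reference, the knob I mean is the autoscaler’s scan interval (10s by default, assuming the stock flag names). Stretching it would only dampen the problem, something like:

```bash
# Slower scans mean fewer spurious scale-ups while the node is still blocked,
# but every legitimate scale-up is delayed by the same amount.
cluster-autoscaler --cloud-provider=gce --scan-interval=2m
```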

There is a small project I discovered after writing my solutions: a Kubernetes controller that uses a node taint very much like my shadow taint, but it has the same cluster-autoscaler pitfalls as my solution. It can be found here: https://github.com/mikkeloscar/kube-node-ready-controller

I really hope something can be added to Kubernetes to help address this problem; I am surprised more people have not run into it.

Thanks
Ryan

Several questions:

  1. What happens if everything is running and the DS pods crash for some minutes? Would the applications recover automatically when the DS pods recover?

I’m sure you know this, but it is worth mentioning that, in that regard, it can cause problems if the applications need some other application (the DS) to always be running and don’t handle its failure gracefully.

Is it an option to modify the applications to handle failures of this other app (the one running in the DS) gracefully?

  2. The taint idea seems reasonable to me. But why does it take so much time to start all the pods? Are you running with more pods per node than the verified number? Are you changing that, maybe? Would it be an option not to (if you are)?

  3. I think that while some determinism might be needed sometimes, if you can get away without it, it’s better. Probably most people using Kubernetes are getting away without it?

  4. Have you checked whether kube-proxy, for example, gets something special? It’s kind of needed for the node to be ready. I’m not sure it has any special treatment, but it might be worth checking.

Good luck and please share what you end up doing! :slight_smile:

rata, Thanks for the questions!

  1. Let me first say: yes, we do have apps that rely on the DS pods for some services. Usually those things can just die and get respawned, so I’m not really that concerned with them; I mainly mentioned it because it does affect us. My main problem is the thundering herd of non-DS pods; it is an analytics workload that swamps a node when it has a spike in scaling.

  2. It’s about 7 DS pods. One of them just takes forever, maybe 30 seconds, for the app to init, and a couple of the others have pretty large container images that I should try to shrink, which is part of the DS pod optimization I mentioned earlier. As for your other questions, I’m sorry but I’m not sure what you are asking. Verified?

  3. Yeah, for sure almost everyone is getting away without it, but that doesn’t mean it’s not a problem for us. Just like with an old init system, you wouldn’t start your webserver before your DB was up, would you?

  4. I have not checked the kube-proxy angle; just off the top of my head, I’m not sure what I’d be looking for in there.

My overall point is that having a Node in a known-good state before starting non-DS pods would be beneficial both to me as an operator and to my users who run workloads on my cluster.

Ryan

  1. Not sure I follow. So pods restart if the DS pods fail? And the problem is that several pods restarting make the DS pods take more time to start?

  2. 30 seconds? Are you using CPU limits?

And yes, there is a flag for the maximum number of pods per node. Are you changing it? That default number is what is verified in the scaling benchmarks, not other values.

  3. Sure. I was just suggesting why maybe no one else has hit it.

  4. Check whether kube-proxy has some mechanism to be scheduled before the node is Ready or before other pods start. Maybe it does something special that you could reuse; a quick way to check is sketched after this list.

  5. Have you checked https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/ ? Would it work for you?
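
Something like this should show whether kube-proxy’s DaemonSet carries any special tolerations or priority class you could copy (assuming it runs in kube-system under that name):

```bash
# Print kube-proxy's tolerations and priorityClassName; anything special here
# is a mechanism you might be able to reuse for your own DS pods.
kubectl -n kube-system get ds kube-proxy \
  -o jsonpath='{.spec.template.spec.tolerations}{"\n"}{.spec.template.spec.priorityClassName}{"\n"}'
```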

Hi

  1. Yes, they do fail and get restarted. There are 2 distinct problems:
    A) (low priority) Some pods fail because they rely on the DS pods being up and running.
    B) (very high priority) There is a workload that spikes the number of pods it creates, which forces our cluster to scale out horizontally. The non-DS pods get scheduled onto new nodes as soon as the kubelet reports Ready, even though the DS pods are not yet fully running. This rush of pods can overwhelm the node, starving it of resources and further delaying the startup of the DS pods.

  2. Some of these pods rely on other services, which can sometimes be slow to respond. We do use limits as a guarantee to give the DS pods resources.

Yes, as one of my remediation strategies I attempted to modify the max-pods-per-node value.

  3. Well, someone has hit it before, given the existence of the GitHub project I linked, ‘kube-node-ready-controller’.

  4. kube-proxy is a DS in my cluster; it also suffers at startup when there is a rush of pods landing on a new node.

  5. Pod priority and preemption is something I looked into but didn’t mention before. The reason is that it only really affects the scheduler; in a simplified view, it only affects the order in which the scheduler considers placing pods on a node. So even if the DS pods had a crazy-high priority they would land there first, which they already do, but that has nothing to do with guaranteeing that those pods have fully started before lower-priority pods get scheduled onto the node. Yes, it would help get the DS pods onto the node first, but they are already scheduled to the node before it even goes Ready, so in this case it doesn’t help.
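
For concreteness, this is roughly what giving the DS pods a high priority would look like (the class name and value here are made up, and the DS pod spec would reference it via priorityClassName). It changes the order in which the scheduler considers pods, not whether the DS pods have finished starting:

```bash
# Illustrative only: a high PriorityClass for node-level DS pods. Even with
# this, nothing waits for the DS pods to actually be Running before regular
# pods get scheduled onto the node.
cat <<'EOF' | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: node-services
value: 1000000
globalDefault: false
description: "High priority for node-level DaemonSet pods"
EOF
```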

There have been some proposals to formalize a “node readiness” extension.
Check out https://github.com/kubernetes/community/pull/2640

  1. Ohh, now I understand. Thanks!
  2. Regarding that, be aware of the problems with setting CPU limits and the workarounds (with more improvements in k8s 1.12 as well). See this issue: https://github.com/kubernetes/kubernetes/issues/51135. It may not affect you much, as it’s more important for latency-sensitive apps, and I don’t think it has anything to do with this.
  3. Good point.
  4. Oh, so it sucks! I see Tim’s link for a proposal. That should be the path going forward, but it might take a while to land in a Kubernetes release.
  5. Oh, sorry it doesn’t help :frowning:

Meanwhile, if you need a workaround until the proposal is implemented…

Why do these DS pods take so long to start? Is pulling the images for so many pods what makes it slow? In that case, you can probably pre-install the DS images on the node image and they will start faster. Not sure if it will be “fast enough”, but it can REALLY help if that is the problem, or part of it.
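
For example, something along these lines in the node image build (or the node startup script) could pre-pull every image referenced by your DaemonSets; this assumes a Docker runtime and that the DS objects live in kube-system, so adjust as needed:

```bash
# Collect every container image referenced by kube-system DaemonSets and pull
# them ahead of time so a new node doesn't pay the pull cost at boot.
kubectl -n kube-system get daemonsets \
  -o jsonpath='{.items[*].spec.template.spec.containers[*].image}' \
  | tr ' ' '\n' | sort -u \
  | while read -r image; do
      docker pull "${image}"
    done
```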

If it’s a CPU limit making them slow to start, maybe you can avoid CPU limits on the DS (only CPU requests) and it might help them burst? Especially if all the rest are limited.
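
As a sketch, removing just the CPU limit (keeping the request) on one DS could be a JSON patch like this; the DaemonSet name “logging-agent” and container index 0 are made up:

```bash
# Drop only the CPU limit on the first container of a hypothetical DaemonSet;
# the CPU request and any memory limit stay as they are.
kubectl -n kube-system patch daemonset logging-agent --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/resources/limits/cpu"}]'
```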

Not sure I can think of anything else, sorry :frowning: