Cluster reboot leaves kube-controller-manager in CrashLoopBackOff, apparently during master election

on-prem

#1

When we reboot our 1.12.1, 1.12.2, or 1.12.3 cluster with 3 nodes, where each node is a master, it never seems able to start back up on its own. The kube-controller-manager pods cycle in a constant loop: each one comes up for a few seconds, tries to elect a primary master, and fails.

The logs don’t look very helpful, but maybe someone can suggest something to help pinpoint and figure this one out.
kube-system kube-proxy-k8s-n02 0/1 Pending 0 1s
kube-system kube-proxy-k8s-n02 1/1 Running 0 2s
kube-system kube-controller-manager-k8s-n02 0/1 Error 103 170m
kube-system kube-controller-manager-k8s-n02 0/1 CrashLoopBackOff 103 170m
kube-system kube-controller-manager-k8s-n01 1/1 Running 36 171m
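
When the watch output scrolls quickly, it can help to reduce it to the pods that are actually unstable. The following is a minimal Python sketch, not a definitive tool: it assumes each row has exactly the six whitespace-separated columns shown above (namespace, name, ready, status, restarts, age), and the names `PodRow`, `parse_pod_row`, `unstable_pods`, and the restart threshold of 10 are hypothetical choices made for illustration.

```python
from typing import List, NamedTuple


class PodRow(NamedTuple):
    namespace: str
    name: str
    ready: str
    status: str
    restarts: int
    age: str


def parse_pod_row(line: str) -> PodRow:
    # Assumes exactly six whitespace-separated columns, as in the rows above.
    namespace, name, ready, status, restarts, age = line.split()
    return PodRow(namespace, name, ready, status, int(restarts), age)


def unstable_pods(lines: List[str], restart_threshold: int = 10) -> List[PodRow]:
    """Return the latest row of each pod that is failing or restarting a lot."""
    latest = {}
    for line in lines:
        if line.strip():
            row = parse_pod_row(line)
            latest[row.name] = row  # later rows overwrite earlier ones
    return [
        row
        for row in latest.values()
        if row.status in ("CrashLoopBackOff", "Error")
        or row.restarts >= restart_threshold
    ]


# The rows quoted in this post, copied verbatim.
SAMPLE_ROWS = """\
kube-system kube-proxy-k8s-n02 0/1 Pending 0 1s
kube-system kube-proxy-k8s-n02 1/1 Running 0 2s
kube-system kube-controller-manager-k8s-n02 0/1 Error 103 170m
kube-system kube-controller-manager-k8s-n02 0/1 CrashLoopBackOff 103 170m
kube-system kube-controller-manager-k8s-n01 1/1 Running 36 171m
"""

if __name__ == "__main__":
    for pod in unstable_pods(SAMPLE_ROWS.splitlines()):
        print(pod.name, pod.status, pod.restarts)
```

Note that the sketch also flags kube-controller-manager-k8s-n01: it is Running at the moment, but 36 restarts suggests it has been cycling as well.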

A sample of the pod’s log can be seen here:

sample of the logs of crashing kube-controller-manager

Any help or direction would be much appreciated. Thank you.


#2

Have you tried forcing the pod to restart or just removing it and letting the deployment spin up a new one?

It looks like it’s trying to find the service account token and failing; letting it start fresh might kick off the token creation.

I0205 17:41:03.027555       1 serving.go:293] Generated self-signed cert (/var/run/kubernetes/kube-controller-manager.crt, /var/run/kubernetes/kube-controller-manager.key)
W0205 17:41:16.228175       1 authentication.go:371] failed to read in-cluster kubeconfig for delegated authentication: failed to read token file "/var/run/secrets/kubernetes.io/serviceaccount/token": open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
W0205 17:41:16.228278       1 authentication.go:233] No authentication-kubeconfig provided in order to lookup client-ca-file in configmap/extension-apiserver-authentication in kube-system, so client certificate authentication won't work.
W0205 17:41:16.228320       1 authentication.go:236] No authentication-kubeconfig provided in order to lookup requestheader-client-ca-file in configmap/extension-apiserver-authentication in kube-system, so request-header client certificate authentication won't work.
W0205 17:41:16.228405       1 authorization.go:158] failed to read in-cluster kubeconfig for delegated authorization: failed to read token file "/var/run/secrets/kubernetes.io/serviceaccount/token": open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
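
Every line quoted above starts with I or W, which in this log format stands for info and warning. The line that actually ends the process would more typically be marked E (error) or F (fatal), so it may be worth filtering the full log by severity. Below is a minimal Python sketch under stated assumptions: the regular expression is written only against the header layout visible in these lines (severity letter, MMDD date, time, a numeric ID, then `file:line]`), and `KLOG_LINE` and `parse_klog` are hypothetical names, not part of any Kubernetes tooling.

```python
import re
from collections import Counter
from typing import Dict, List

# Header layout assumed from the lines above:
# <severity><MMDD> <HH:MM:SS.micros> <id> <file:line>] <message>
KLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<source>[^\s\]]+)\] (?P<message>.*)$"
)


def parse_klog(lines: List[str]) -> List[Dict[str, str]]:
    """Parse log lines into dicts; lines that do not match are skipped."""
    entries = []
    for line in lines:
        match = KLOG_LINE.match(line.strip())
        if match:
            entries.append(match.groupdict())
    return entries


# Three of the lines quoted in this post, copied verbatim.
SAMPLE_LOG = """\
I0205 17:41:03.027555       1 serving.go:293] Generated self-signed cert (/var/run/kubernetes/kube-controller-manager.crt, /var/run/kubernetes/kube-controller-manager.key)
W0205 17:41:16.228175       1 authentication.go:371] failed to read in-cluster kubeconfig for delegated authentication: failed to read token file "/var/run/secrets/kubernetes.io/serviceaccount/token": open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
W0205 17:41:16.228405       1 authorization.go:158] failed to read in-cluster kubeconfig for delegated authorization: failed to read token file "/var/run/secrets/kubernetes.io/serviceaccount/token": open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
"""

entries = parse_klog(SAMPLE_LOG.splitlines())
by_severity = Counter(e["severity"] for e in entries)
serious = [e for e in entries if e["severity"] in ("E", "F")]
missing_token = [
    e["source"] for e in entries if "failed to read token file" in e["message"]
]

if __name__ == "__main__":
    print("lines per severity:", dict(by_severity))
    print("error/fatal lines:", len(serious))
    print("token file warnings from:", missing_token)
```

Run against the complete log of a crashing container rather than this excerpt, the `serious` list is where the message immediately preceding the exit would be expected to show up, if one is logged.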

#3

Thank you. Yes, we deleted the pods a bunch of times in the hope that it would begin to “see” itself.
This is absolutely reproducible behavior on our cluster: every time we fully turn off all nodes and start them up again, we end up with that constant cycle of start/CrashLoopBackOff with the exact same message.

We then begin deleting random pods that we know are not strictly needed, followed by repeated deletion of the kube-controller-manager pods, followed by one-server-at-a-time reboots. Eventually, after a few hours of battling with it, it just begins to work on its own. We are getting ready to move our major apps to production, and this is a major problem that we wouldn’t want to face in a production environment without knowing how to fix it quickly.


#4

Yeah, I’ve had that issue too in the past. We’ve taken to doing rolling updates so we only bring down one node at a time, and that seems to work OK. At least I haven’t seen it happen in the past few months.

What are you using to manage k8s?


#5

Unfortunately, I just found out that after we reboot just 1 node, kube-controller-manager goes down on all nodes once that node boots back up. So rebooting one at a time is not super helpful, though it sounded very promising.

Deployed with kubespray, currently on 1.12.3.


#6

When kube-controller-manager goes down, are the logs any different from what you posted earlier?


#7

Still the same errors. It eventually came up. We ended up rebooting all servers several times until they all magically got kube-controller-manager running again. It is still super scary to switch any production workload over until we know how to troubleshoot issues like this.

Everything else we’ve been able to figure out on our own until now, but not this one. Is there any detail you can think of that could be useful in trying to pinpoint the issue, or any way to rectify it when this happens?


#8

Off the top of my head, I can’t think of any single issue that would be causing this; perhaps some network configs or even something in the kubespray configs.

You said all the nodes were masters; do they host workloads as well? If so, you may want to try separating those processes onto their own nodes, i.e. 3 masters (control-plane, etcd) and 3 workers. That might help isolate where the issue is.