Cluster reboot leaves kube-controller-manager in CrashLoopBackOff, apparently during master election

virtuman · February 14, 2019, 3:19am

When we reboot our 3-node cluster (1.12.1, 1.12.2, or 1.12.3), where each node is a master, it never seems to be able to start back up on its own. The kube-controller-manager pods cycle in a constant loop: they come up for a few seconds, try to elect a primary master, and fail.

The logs don’t look very helpful, but maybe someone can suggest something to help pinpoint and figure this one out.

A sample of the crashing pod’s log can be seen here:

https://gist.github.com/virtuman/c5110b5f8b9e44d85dc8b49e64b19649

Failing kube-controller-manager pod logs:

[root@k8s-n01 ~]# kubectl -n kube-system logs -f kube-controller-manager-k8s-n01
I0205 17:40:57.646786       1 feature_gate.go:206] feature gates: &{map[PersistentLocalVolumes:true VolumeScheduling:true]}
I0205 17:40:57.646988       1 flags.go:33] FLAG: --address="0.0.0.0"
I0205 17:40:57.647005       1 flags.go:33] FLAG: --allocate-node-cidrs="false"
I0205 17:40:57.647015       1 flags.go:33] FLAG: --allow-untagged-cloud="false"
I0205 17:40:57.647022       1 flags.go:33] FLAG: --allow-verification-with-non-compliant-keys="false"
I0205 17:40:57.647031       1 flags.go:33] FLAG: --alsologtostderr="false"
I0205 17:40:57.647038       1 flags.go:33] FLAG: --application-metrics-count-limit="100"
I0205 17:40:57.647046       1 flags.go:33] FLAG: --attach-detach-reconcile-sync-period="1m0s"
I0205 17:40:57.647069       1 flags.go:33] FLAG: --authentication-kubeconfig=""

(The log is truncated here; the full output is in the gist.)

journalctl -f -u kubelet

Feb 05 17:22:02 k8s-n03 kubelet[10525]: W0205 17:22:02.820128   10525 reflector.go:270] object-"kubedb-postgres"/"pg-custom-config": watch of *v1.ConfigMap ended with: too old resource version: 23674758 (23676539)

kubectl -n kube-system get po

kube-system   kube-proxy-k8s-n02   0/1   Pending   0     1s
kube-system   kube-proxy-k8s-n02   1/1   Running   0     2s
kube-system   kube-controller-manager-k8s-n02   0/1   Error   103   170m
kube-system   kube-controller-manager-k8s-n02   0/1   CrashLoopBackOff   103   170m
kube-system   kube-controller-manager-k8s-n01   1/1   Running   36    171m
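
With many pods in the namespace, a small filter makes the restart counts easier to compare. This is only a sketch: it assumes the six-column layout shown above (namespace, name, ready, status, restarts, age), and `restarting_pods` is a hypothetical helper name.

```shell
# Print "name status restarts" for pods whose restart count exceeds a threshold.
# Assumes the columns NAMESPACE NAME READY STATUS RESTARTS AGE on stdin.
# restarting_pods is a hypothetical helper, not part of kubectl.
restarting_pods() {
  threshold="${1:-10}"   # default threshold of 10 restarts
  # Skip any row whose fifth field is not a plain number (for example a header row).
  awk -v t="$threshold" '$5 ~ /^[0-9]+$/ && $5 + 0 > t { print $2, $4, $5 }'
}
```

Piping the listing above through `restarting_pods 10` leaves only the kube-controller-manager rows. Note that the one on k8s-n01 has 36 restarts even though it currently shows Running.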

Any help or direction would be much appreciated. Thank you.

macintoshprime · February 14, 2019, 2:30pm

Have you tried forcing the pod to restart or just removing it and letting the deployment spin up a new one?

Looks like it’s trying to find the service account token and failing; letting it start fresh might kick in the token creation.

I0205 17:41:03.027555       1 serving.go:293] Generated self-signed cert (/var/run/kubernetes/kube-controller-manager.crt, /var/run/kubernetes/kube-controller-manager.key)
W0205 17:41:16.228175       1 authentication.go:371] failed to read in-cluster kubeconfig for delegated authentication: failed to read token file "/var/run/secrets/kubernetes.io/serviceaccount/token": open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
W0205 17:41:16.228278       1 authentication.go:233] No authentication-kubeconfig provided in order to lookup client-ca-file in configmap/extension-apiserver-authentication in kube-system, so client certificate authentication won't work.
W0205 17:41:16.228320       1 authentication.go:236] No authentication-kubeconfig provided in order to lookup requestheader-client-ca-file in configmap/extension-apiserver-authentication in kube-system, so request-header client certificate authentication won't work.
W0205 17:41:16.228405       1 authorization.go:158] failed to read in-cluster kubeconfig for delegated authorization: failed to read token file "/var/run/secrets/kubernetes.io/serviceaccount/token": open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
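
Most of the pasted log is info-level `FLAG:` output, and the token messages quoted above are warnings (`W`), which by themselves may not be what ends the process. A minimal sketch for cutting a klog-style log down to warning, error, and fatal lines, assuming the usual prefix of a severity letter followed by `MMDD` and a timestamp; `klog_problems` is a made-up helper name, not a kubectl feature:

```shell
# Keep only warning (W), error (E) and fatal (F) lines from klog-formatted input.
# Assumes each line starts with the severity letter, MMDD, a space, then hh:mm:ss.
# klog_problems is a hypothetical helper name.
klog_problems() {
  grep -E '^[WEF][0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}\.'
}
```

Something like `kubectl -n kube-system logs --previous kube-controller-manager-k8s-n02 | klog_problems` would then show what the last crashed container logged. Any `E` or `F` line near the end is probably the better lead.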

virtuman · February 14, 2019, 2:45pm

Thank you. Yes, we deleted the pods a bunch of times in hopes that it would begin to “see” itself.
This is absolutely reproducible behavior on our cluster: every time we fully turn off all nodes and start them up, we end up with that constant cycle of start/CrashLoopBackOff with the exact same message.

We then begin to delete random pods that we know are not necessarily needed, followed by constant deletion of the kube-controller-manager pods, followed by one-server-at-a-time reboots, and eventually, after a few hours of battling with it, it just begins to work on its own. We are getting ready to move our major apps to production, and this is a major problem that we wouldn’t want to face in a production environment without knowing how to fix it quickly.

macintoshprime · February 14, 2019, 3:03pm

Yeah, I’ve had that issue too in the past. We’ve taken to doing rolling updates so we only bring down one node at a time, and that seems to work OK. At least I haven’t seen it happen in the past few months.

What are you using to manage k8s?

virtuman · February 18, 2019, 5:59pm

Unfortunately, I just found out that after we reboot just one node, kube-controller-manager goes down on all nodes once that node boots back up. So rebooting one at a time is not super helpful, though it sounded very promising.

Deployed with kubespray, currently on 1.12.3.

macintoshprime · February 19, 2019, 1:58am

When the kube-controller-manager goes down, are the logs any different from what you posted earlier?

virtuman · February 19, 2019, 2:18am

Still the same errors. It eventually came up. We ended up rebooting all servers several times until they all magically got kube-controller-manager running again. It is still super scary to switch any production workload over until we know how to troubleshoot issues like this.

Everything else we’ve been able to figure out on our own until now, but not this one. Is there any detail you can think of that could be useful in trying to pinpoint the issue, or ways to rectify it when this happens?

macintoshprime · February 19, 2019, 1:29pm

Off the top of my head I can’t think of any single issue that would be causing this; perhaps some network configs or even something in the kubespray configs.

You said all the nodes were masters; do they host workloads as well? If so, you may want to try separating those processes onto their own nodes, i.e. 3 masters (control plane, etcd) and 3 workers. That might help isolate where the issue is.
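
Since the crashes look tied to master election, one detail that might help is which node currently holds the controller-manager lock and how often it has changed hands. A rough sketch, assuming the cluster uses the endpoints-based leader lock that was the default around 1.12, where the record is stored as JSON in the `control-plane.alpha.kubernetes.io/leader` annotation on the `kube-controller-manager` Endpoints object in `kube-system`; `leader_summary` is a hypothetical helper name:

```shell
# Summarize a leader-election record read from stdin.
# Expects the JSON value of the control-plane.alpha.kubernetes.io/leader annotation,
# with holderIdentity appearing before leaderTransitions (an assumption about field order).
# leader_summary is a hypothetical helper name.
leader_summary() {
  sed -n 's/.*"holderIdentity":"\([^"]*\)".*"leaderTransitions":\([0-9]*\).*/holder=\1 transitions=\2/p'
}
```

Feeding it the annotation value, e.g. from `kubectl -n kube-system get endpoints kube-controller-manager -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'`, a few times during a crash loop should show whether leadership keeps flapping between nodes or whether no node ever acquires the lock.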
