Cluster information:
Kubernetes version: v1.22.5
Cloud being used: bare metal
Installation method: kubespray
Host OS: Ubuntu 20.04 LTS
CNI and version: Cilium v1.10.5
CRI and version: Containerd
Hey folks, yesterday my 6-node on-prem k8s cluster (deployed with kubespray) went through an accidental forced reboot, and I haven't been able to restore the control plane components since, especially the apiserver. I have traced through most of the logs of kube-apiserver / kube-controller-manager / kube-scheduler / etcd, and I notice that whenever the apiserver tries to store pod info into etcd, I always see very similar timeouts like the ones below:
I0406 12:38:41.181671 1 trace.go:205] Trace[1740655399]: "Create" url:/api/v1/namespaces/spark/pods,user-agent:kube-controller-manager/v1.22.5 (linux/amd64) kubernetes/5c99e2a/system:serviceaccount:kube-system:statefulset-controller,audit-id:9a78c800-4d82-4c8c-acd2-7dec593cc9b4,client:10.195.137.30,accept:application/vnd.kubernetes.protobuf, */*,protocol:HTTP/2.0 (06-Apr-2022 12:38:07.180) (total time: 34000ms):
Trace[1740655399]: ---"About to convert to expected version" 5ms (12:38:07.186)
Trace[1740655399]: ---"Conversion done" 0ms (12:38:07.186)
Trace[1740655399]: ---"About to store object in database" 0ms (12:38:07.186)
Trace[1740655399]: [34.000920383s] [34.000920383s] END
I0406 12:38:41.239336 1 trace.go:205] Trace[1444081903]: "Create" url:/api/v1/namespaces/gitlab-kubernetes-agent/pods,user-agent:kube-controller-manager/v1.22.5 (linux/amd64) kubernetes/5c99e2a/system:serviceaccount:kube-system:replicaset-controller,audit-id:e05bf10a-6320-4be9-aeac-169b9e377687,client:10.195.137.30,accept:application/vnd.kubernetes.protobuf, */*,protocol:HTTP/2.0 (06-Apr-2022 12:38:07.238) (total time: 34000ms):
Trace[1444081903]: ---"About to convert to expected version" 0ms (12:38:07.238)
Trace[1444081903]: ---"Conversion done" 0ms (12:38:07.238)
Trace[1444081903]: ---"About to store object in database" 0ms (12:38:07.238)
Trace[1444081903]: [34.000522847s] [34.000522847s] END
As you can see, the apiserver is always stuck for exactly 34s, after which I get a Timeout: request did not complete within requested timeout - context deadline exceeded
error and the pods end up not getting reconciled. The strange thing is, I cannot find these requests from the apiserver in the etcd logs, even at debug log level. The apiserver can communicate with etcd, though, since it is inserting events into etcd continuously.
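For reference, this is roughly how I've been poking at etcd directly to rule out cluster-level problems. The cert paths are what kubespray lays down by default on my nodes; adjust them if your layout differs:

```shell
#!/bin/sh
# Query etcd cluster health and per-member status directly,
# bypassing the apiserver. Endpoint and cert paths below are
# assumptions based on a default kubespray install -- adjust
# to match your environment.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/node-"$(hostname)".pem \
  --key=/etc/ssl/etcd/ssl/node-"$(hostname)"-key.pem \
  endpoint status --cluster -w table

# etcd logs a "took too long" warning when applying a request
# exceeds its warn threshold (100ms by default), so slow-disk or
# quorum issues usually show up here:
journalctl -u etcd --no-pager | grep -i "took too long" | tail -n 20
```

In my case these checks come back clean, which is why I'm confused that the Create requests above never seem to reach etcd at all.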
I really need some hints on how to proceed. What is this 34s timeout? Why do these requests not show up in the etcd logs? What is happening during those 34s?
I would appreciate any answers or guesses! Please let me know if I should create an issue on GitHub instead.