Etcd and kube-apiserver pods in CrashLoopBackOff state after node reboot

Hi all!

I am starting my journey with Kubernetes and ran into an issue after rebooting a control-plane node.

After the node rebooted (it had been cordoned and drained first), two pods stopped working, and I cannot figure out what to do about it…

If you can guide me through this troubleshooting, I'd be delighted :slight_smile:

Here is the info I could gather on my own:

root@cp1:~# k describe po etcd-cp1 -n kube-system
Name:                 etcd-cp1
[...]
Annotations:          kubeadm.kubernetes.io/etcd.advertise-client-urls: https://192.168.1.211:2379
                      kubernetes.io/config.hash: b945c554cd159abab172304751cd173f
                      kubernetes.io/config.mirror: b945c554cd159abab172304751cd173f
                      kubernetes.io/config.seen: 2022-11-25T08:59:42.181749638+01:00
                      kubernetes.io/config.source: file
                      seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:               Running
IP:                   192.168.1.211
IPs:
  IP:           192.168.1.211
Controlled By:  Node/cp1
[...]
      etcd
      --advertise-client-urls=https://192.168.1.211:2379
      --cert-file=/etc/kubernetes/pki/etcd/server.crt
      --client-cert-auth=true
      --data-dir=/var/lib/etcd
      --experimental-initial-corrupt-check=true
      --initial-advertise-peer-urls=https://192.168.1.211:2380
      --initial-cluster=cp1=https://192.168.1.211:2380
      --key-file=/etc/kubernetes/pki/etcd/server.key
      --listen-client-urls=https://127.0.0.1:2379,https://192.168.1.211:2379
      --listen-metrics-urls=http://127.0.0.1:2381
      --listen-peer-urls=https://192.168.1.211:2380
      --name=cp1
      --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
      --peer-client-cert-auth=true
      --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
      --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
      --snapshot-count=10000
      --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1

    Liveness:     http-get http://127.0.0.1:2381/health delay=10s timeout=15s period=10s #success=1 #failure=8
    Startup:      http-get http://127.0.0.1:2381/health delay=10s timeout=15s period=10s #success=1 #failure=24
[...]
Events:
  Type     Reason   Age                    From     Message
  ----     ------   ----                   ----     -------
  Warning  BackOff  2m15s (x494 over 97m)  kubelet  Back-off restarting failed container

root@cp1:~# systemctl status kubelet.service
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Tue 2022-11-29 22:48:23 CET; 11s ago
     Docs: https://kubernetes.io/docs/home/
 Main PID: 13301 (kubelet)
    Tasks: 15 (limit: 2357)
   Memory: 40.3M
   CGroup: /system.slice/kubelet.service
           └─13301 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remote --container-runtime-endpoint=uni

nov. 29 22:48:26 cp1 kubelet[13301]: E1129 22:48:26.638040   13301 configmap.go:193] Couldn't get configMap kube-system/kube-proxy: failed to sync configmap cache: timed out waiting for the condition
nov. 29 22:48:26 cp1 kubelet[13301]: E1129 22:48:26.638489   13301 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/configmap/024b1c25-446e-4771-8a47-b451efd060dd-kube-proxy podName:024b1c25-446e-4771-8a47-b451efd
nov. 29 22:48:27 cp1 kubelet[13301]: I1129 22:48:27.343969   13301 scope.go:110] "RemoveContainer" containerID="043d16ad5496fd25a3a7c3d0707cbf686e8406ecb5537623656c25ad8324ec28"
nov. 29 22:48:27 cp1 kubelet[13301]: E1129 22:48:27.344335   13301 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"etcd\" with CrashLoopBackOff: \"back-off 10s restarting failed container=etcd po
[...]
root@cp1:~# k logs etcd-cp1 -n kube-system
{"level":"fatal","ts":"2022-11-29T21:50:06.662Z","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"wal: crc mismatch","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
root@cp1:~# k logs kube-apiserver-cp1 -n kube-system
I1129 21:51:04.641798       1 server.go:558] external host was not specified, using 192.168.1.211
I1129 21:51:04.642602       1 server.go:158] Version: v1.24.8
I1129 21:51:04.643047       1 server.go:160] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
I1129 21:51:05.215110       1 shared_informer.go:255] Waiting for caches to sync for node_authorizer
I1129 21:51:05.216612       1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
I1129 21:51:05.216638       1 plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
[...]
W1129 21:51:05.222142       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
[...]
E1129 21:51:25.222973       1 run.go:74] "command failed" err="context deadline exceeded"

netstat tells me that the node is not listening on ports 2379, 2380, 2381, or 6443.
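Since the apiserver is down, `kubectl` can stop working at any moment, so it helps to inspect the static-pod containers directly through the CRI. A minimal sketch (the container ID is a placeholder you'd take from the first command's output):

```shell
# List all etcd containers, including exited ones
crictl ps -a --name etcd

# Dump the logs of the failing etcd container
# (replace <container-id> with an ID from the previous command)
crictl logs <container-id>

# Confirm which of the expected ports are actually being listened on
ss -tlnp | grep -E '2379|2380|2381|6443'
```

These commands talk to containerd directly, so they keep working even when the apiserver is unreachable.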

Cluster information:

Kubernetes version: 1.24.6
Cloud being used: bare-metal
Installation method: repo + kubeadm init --control-plane-endpoint="192.168.1.250:8090" --upload-certs --apiserver-advertise-address=192.168.1.211 --pod-network-cidr=10.50.0.0/16

Host OS: Debian 10

CNI and version: Weave (?)

CRI and version:

root@cp1:~# crictl version
Version:  0.1.0
RuntimeName:  containerd
RuntimeVersion:  1.6.8
RuntimeApiVersion:  v1

Thanks in advance for any hint that helps me understand what happened and what to do now!

Kindly,
Krys

Hello,

Has your issue been solved? If yes, please tell me what actions you performed.
Apparently, I have the same issue :grinning:

Thanks

Nope, I still have the issue and no answer so far :pleading_face:

:frowning_face:
Okay, I'll dig deeper!

Nothing else will start until etcd is up, and looking at the error in its log (`wal: crc mismatch`):

It looks like the etcd data directory has been corrupted, likely during the reboot.
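One way out of a corrupted data directory is restoring from an etcd snapshot, if you have one. A hedged sketch, assuming the kubeadm defaults shown in your `describe` output (single etcd member `cp1`, data dir `/var/lib/etcd`); `backup.db` is a hypothetical snapshot file you would have taken earlier:

```shell
# 1. Stop the kubelet so it stops restarting the static pods
systemctl stop kubelet

# 2. Move the corrupted data dir aside (keep it as evidence/backup)
mv /var/lib/etcd /var/lib/etcd.corrupted

# 3. Restore the snapshot into a fresh data dir
#    (backup.db is a hypothetical, previously-taken snapshot)
ETCDCTL_API=3 etcdctl snapshot restore backup.db \
  --name cp1 \
  --initial-cluster cp1=https://192.168.1.211:2380 \
  --initial-advertise-peer-urls https://192.168.1.211:2380 \
  --data-dir /var/lib/etcd

# 4. Restart the kubelet; it recreates the etcd static pod
systemctl start kubelet
```

If there is no snapshot to restore, be aware that step 2 alone makes etcd start empty, which loses the cluster state; on a single control-plane lab cluster, rebuilding with `kubeadm reset` and `kubeadm init` may then be the simplest path.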

These pages might have some useful info:

Ooooh! Thanks for the hints, I'll take a look!