Hi all!
I am starting my journey with Kubernetes and ran into an issue after rebooting a control-plane node.
When the node came back up (after being cordoned and drained), two pods (etcd-cp1 and kube-apiserver-cp1) stopped working, and I can't figure out what to do about it…
If you can guide me through this troubleshooting, I'd be delighted.
Here is the info I could gather on my own:
root@cp1:~# k describe po etcd-cp1 -n kube-system
Name: etcd-cp1
[...]
Annotations: kubeadm.kubernetes.io/etcd.advertise-client-urls: https://192.168.1.211:2379
kubernetes.io/config.hash: b945c554cd159abab172304751cd173f
kubernetes.io/config.mirror: b945c554cd159abab172304751cd173f
kubernetes.io/config.seen: 2022-11-25T08:59:42.181749638+01:00
kubernetes.io/config.source: file
seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Running
IP: 192.168.1.211
IPs:
IP: 192.168.1.211
Controlled By: Node/cp1
[...]
etcd
--advertise-client-urls=https://192.168.1.211:2379
--cert-file=/etc/kubernetes/pki/etcd/server.crt
--client-cert-auth=true
--data-dir=/var/lib/etcd
--experimental-initial-corrupt-check=true
--initial-advertise-peer-urls=https://192.168.1.211:2380
--initial-cluster=cp1=https://192.168.1.211:2380
--key-file=/etc/kubernetes/pki/etcd/server.key
--listen-client-urls=https://127.0.0.1:2379,https://192.168.1.211:2379
--listen-metrics-urls=http://127.0.0.1:2381
--listen-peer-urls=https://192.168.1.211:2380
--name=cp1
--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
--peer-client-cert-auth=true
--peer-key-file=/etc/kubernetes/pki/etcd/peer.key
--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
--snapshot-count=10000
--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Liveness: http-get http://127.0.0.1:2381/health delay=10s timeout=15s period=10s #success=1 #failure=8
Startup: http-get http://127.0.0.1:2381/health delay=10s timeout=15s period=10s #success=1 #failure=24
[...]
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 2m15s (x494 over 97m) kubelet Back-off restarting failed container
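Since the pod keeps restarting, I believe the same error should also be visible straight from the container runtime; if it helps, I can pull it with crictl too (the container ID would come from the first command):

root@cp1:~# crictl ps -a --name etcd   # find the last exited etcd container
root@cp1:~# crictl logs <container-id>

Next, the kubelet itself: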
root@cp1:~# systemctl status kubelet.service
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Tue 2022-11-29 22:48:23 CET; 11s ago
Docs: https://kubernetes.io/docs/home/
Main PID: 13301 (kubelet)
Tasks: 15 (limit: 2357)
Memory: 40.3M
CGroup: /system.slice/kubelet.service
└─13301 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remote --container-runtime-endpoint=uni
nov. 29 22:48:26 cp1 kubelet[13301]: E1129 22:48:26.638040 13301 configmap.go:193] Couldn't get configMap kube-system/kube-proxy: failed to sync configmap cache: timed out waiting for the condition
nov. 29 22:48:26 cp1 kubelet[13301]: E1129 22:48:26.638489 13301 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/configmap/024b1c25-446e-4771-8a47-b451efd060dd-kube-proxy podName:024b1c25-446e-4771-8a47-b451efd
nov. 29 22:48:27 cp1 kubelet[13301]: I1129 22:48:27.343969 13301 scope.go:110] "RemoveContainer" containerID="043d16ad5496fd25a3a7c3d0707cbf686e8406ecb5537623656c25ad8324ec28"
nov. 29 22:48:27 cp1 kubelet[13301]: E1129 22:48:27.344335 13301 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"etcd\" with CrashLoopBackOff: \"back-off 10s restarting failed container=etcd po
[...]
root@cp1:~# k logs etcd-cp1 -n kube-system
{"level":"fatal","ts":"2022-11-29T21:50:06.662Z","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"wal: crc mismatch","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
root@cp1:~# k logs kube-apiserver-cp1 -n kube-system
I1129 21:51:04.641798 1 server.go:558] external host was not specified, using 192.168.1.211
I1129 21:51:04.642602 1 server.go:158] Version: v1.24.8
I1129 21:51:04.643047 1 server.go:160] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
I1129 21:51:05.215110 1 shared_informer.go:255] Waiting for caches to sync for node_authorizer
I1129 21:51:05.216612 1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
I1129 21:51:05.216638 1 plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
[...]
W1129 21:51:05.222142 1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
[...]
E1129 21:51:25.222973 1 run.go:74] "command failed" err="context deadline exceeded"
So if I understand correctly, kube-apiserver fails only because it cannot reach etcd on 127.0.0.1:2379, which makes the etcd WAL corruption the root cause. netstat also tells me the node is not listening on ports 2379, 2380, 2381, or 6443.
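For reference, the check was essentially this (reconstructed from memory):

root@cp1:~# netstat -tlnp | grep -E '2379|2380|2381|6443'
(no output: nothing is listening on any of those ports)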
Cluster information:
Kubernetes version: 1.24.6
Cloud being used: bare-metal
Installation method: repo + kubeadm init --control-plane-endpoint="192.168.1.250:8090" --upload-certs --apiserver-advertise-address=192.168.1.211 --pod-network-cidr=10.50.0.0/16
Host OS: Debian 10
CNI and version: Weave (?)
CRI and version:
root@cp1:~# crictl version
Version: 0.1.0
RuntimeName: containerd
RuntimeVersion: 1.6.8
RuntimeApiVersion: v1
Thanks in advance for any hint you can give to help me understand what happened and what to do now!
Kind regards,
Krys