Kubernetes version: 1.17.11
Cloud being used: Baremetal, AWS
Installation method: Kubernetes Yum Repo
Host OS: CentOS 7
CNI and version: Calico 3.16.0
CRI and version: Docker 19.03.12
Situation
One of our Kubernetes etcd pods is stuck in CrashLoopBackOff. The logs show the error “state.commit is out of range”. Investigation suggests the data is likely corrupted, and the SOP would be to remove the etcd member, delete its data, and rejoin it to the cluster.
However, that’s made difficult by the fact that these pods are created by kubeadm. We’re unable to find details about the pods such as where the persistent volume is, how to re-add a member within the pod, and so on.
Has anyone had to do this before? I can’t find any documentation on this. The closest I’ve found was kubeadm reset with the delete etcd pod flag.
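For reference, the part of kubeadm reset that seemed relevant is its remove-etcd-member phase. I haven’t verified this against 1.17, so check `kubeadm reset phase --help` first; it also needs a kubeconfig that can reach a healthy API server (e.g. on another control-plane node):

```bash
# List the reset phases available in this kubeadm version first.
kubeadm reset phase --help

# Remove this node's etcd member from the cluster. Requires a kubeconfig
# that reaches a still-healthy API server, not the broken local one.
kubeadm reset phase remove-etcd-member --kubeconfig /etc/kubernetes/admin.conf
```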
EDIT: It’s worth noting that because the etcd pod won’t come up, the Kubernetes API server pod for this control-plane node won’t come up either. We’re currently trying to point the API server pod at a different etcd member.
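What we’re attempting looks roughly like this; the flag names are the standard kube-apiserver ones and the paths are kubeadm defaults, but the IP is a placeholder:

```bash
# The API server's etcd endpoint is a flag in its static pod manifest.
grep etcd /etc/kubernetes/manifests/kube-apiserver.yaml
#   --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt
#   --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt
#   --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key
#   --etcd-servers=https://127.0.0.1:2379

# Changing --etcd-servers to a healthy member's client URL and saving the
# file makes the kubelet restart the static pod against that member, e.g.:
#   --etcd-servers=https://<healthy-member-ip>:2379
```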
Do you by chance have an etcd backup? If so, you can restore to the previous known state, or see if you can recover the etcd data from that data directory.
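On a kubeadm control plane the etcd data directory is a hostPath on the node itself (check the etcd static pod manifest), so a restore would look roughly like the sketch below. Paths, member names, and URLs are placeholders; take the real values from your /etc/kubernetes/manifests/etcd.yaml:

```bash
# Find where the etcd static pod keeps its data and certs (kubeadm
# defaults are /var/lib/etcd and /etc/kubernetes/pki/etcd).
grep -A3 hostPath /etc/kubernetes/manifests/etcd.yaml

# Sketch of an etcdctl v3 restore into a fresh directory; values are
# placeholders, use the ones from your own manifest.
export ETCDCTL_API=3
etcdctl snapshot restore /path/to/snapshot.db \
  --name <member-name> \
  --initial-cluster "<member-name>=https://<member-ip>:2380,<other>=https://<other-ip>:2380" \
  --initial-advertise-peer-urls https://<member-ip>:2380 \
  --data-dir /var/lib/etcd-restored

# Then swap the restored directory into the path the etcd pod mounts
# (or point the manifest's hostPath at it) and let the kubelet restart the pod.
```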
On pointing the API server at a different etcd pod: if the etcd pod comes up fresh (with no data), the entire cluster will have empty resources, like a new cluster altogether.
Thanks for that. You’re right, the data is stored on the host itself. We do have backups, and I’m considering two options:
Restore the data on the host from a backup.
Remove the corrupted etcd member, delete its data, and add a new member (roughly as sketched below).
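Here’s roughly what I have in mind for the second option; the endpoint, member IDs, and cert paths are placeholders based on kubeadm defaults, and I haven’t run this yet:

```bash
# Run against a healthy member. Cert paths are kubeadm defaults; adjust
# the endpoint, member IDs, and URLs for your cluster.
export ETCDCTL_API=3
ETCD_FLAGS="--endpoints=https://<healthy-member-ip>:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key"

etcdctl $ETCD_FLAGS member list                    # find the corrupted member's ID
etcdctl $ETCD_FLAGS member remove <bad-member-id>  # drop it from the cluster

# On the broken node: wipe the corrupted data directory.
rm -rf /var/lib/etcd/*

# Re-add the node as a new member; etcdctl prints the ETCD_INITIAL_* values to use.
etcdctl $ETCD_FLAGS member add <member-name> --peer-urls=https://<bad-member-ip>:2380
```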
But I have a few questions: if I were to restore the data dirs from a backup taken before the corruption, would the non-corrupted members be affected when I brought up the restored pod?
Alternatively, if I removed the corrupted data and did not do a restore, but instead brought up a new pod with the example env variables:
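(For context, by “example env variables” I mean the standard ones etcd expects when a member joins an existing cluster, along the lines of the placeholders below, not values from our actual environment:)

```bash
# Standard etcd env vars for a member joining an existing cluster; real
# values come from the `etcdctl member add` output. These are placeholders.
ETCD_NAME="<member-name>"
ETCD_INITIAL_CLUSTER="<member-name>=https://<member-ip>:2380,<other>=https://<other-ip>:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://<member-ip>:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
```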