K8s single etcd node in CrashLoopBackOff

Cluster information:

Kubernetes version: 1.17.11
Cloud being used: Baremetal, AWS
Installation method: Kubernetes Yum Repo
Host OS: CentOS 7
CNI and version: Calico 3.16.0
CRI and version: Docker 19.03.12

Situation

One of our Kubernetes etcd pods is stuck in CrashLoopBackOff. The logs show the error “state.commit is out of range”. Some investigation suggests the data is likely corrupted, and the SOP would be to remove the etcd member, delete its data, and rejoin it to the cluster.
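
For reference, the usual etcdctl member-replacement flow looks roughly like the sketch below; the endpoint, member name, and certificate paths are assumptions based on kubeadm defaults and would need to be adjusted for your cluster:

  # Run from a control-plane node that still has a healthy etcd member.
  # etcdctl (v3) reads these ETCDCTL_* environment variables as flag defaults.
  export ETCDCTL_API=3
  export ETCDCTL_ENDPOINTS=https://10.0.0.2:2379
  export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
  export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/peer.crt
  export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/peer.key

  # 1. Find the ID of the broken member.
  etcdctl member list

  # 2. Remove it from the cluster.
  etcdctl member remove <member-id>

  # 3. Wipe the broken member's data directory on its host, then register it again.
  etcdctl member add member4 --peer-urls=https://10.0.0.4:2380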

However, that’s made difficult by the fact that these pods are created by kubeadm. We’re unable to find details about the pods, such as where the data is persisted, how to re-add a member from within the pod, and so on.

Has anyone had to do this before? I can’t find any documentation on this. The closest I’ve found was kubeadm reset with the delete etcd pod flag.

EDIT: It’s worth noting that because the etcd pod won’t come up, the kubernetes API pod for this control node won’t come up either. We’re currently trying to point the API pod to a different etcd pod.
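
For anyone following along, the change we’re attempting looks roughly like this; the healthy-peer address is just an example:

  # On the affected control-plane node; kubelet restarts the static pod
  # whenever the manifest file changes.
  grep -- '--etcd-servers' /etc/kubernetes/manifests/kube-apiserver.yaml
  # Default on a stacked kubeadm cluster:
  #   - --etcd-servers=https://127.0.0.1:2379
  # Changed to point at a healthy member (example address):
  #   - --etcd-servers=https://10.0.0.2:2379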

It seems that the kubeadm setup is using a hostPath volume for the etcd pod:

  - hostPath:
      path: /var/lib/etcd
      type: DirectoryOrCreate
    name: etcd-data
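
That means both the pod definition and the data should be visible directly on the host of that control-plane node, for example:

  # Static pod manifest that kubelet runs directly (no Deployment/StatefulSet manages it):
  cat /etc/kubernetes/manifests/etcd.yaml

  # The etcd data sits under the hostPath above:
  ls /var/lib/etcd/member/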

Do you by chance have an etcd backup? If so, you can restore to the previous known good state, or see if you can recover the etcd data using that data directory.
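
If you do have a snapshot, the restore mechanics for a single member look roughly like this; the snapshot path is hypothetical, and the member name/URLs must match your cluster:

  # Restore a snapshot into a fresh data directory (paths and member details are examples).
  export ETCDCTL_API=3
  etcdctl snapshot restore /backup/etcd-snapshot.db \
    --name member4 \
    --initial-cluster member2=http://10.0.0.2:2380,member3=http://10.0.0.3:2380,member4=http://10.0.0.4:2380 \
    --initial-advertise-peer-urls http://10.0.0.4:2380 \
    --data-dir /var/lib/etcd-restored

  # With the etcd static pod stopped (e.g. etcd.yaml moved out of /etc/kubernetes/manifests),
  # swap the restored data into the hostPath directory.
  mv /var/lib/etcd /var/lib/etcd.bad && mv /var/lib/etcd-restored /var/lib/etcd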

Regarding pointing the API server to a different etcd pod: if the etcd pod comes up fresh (with no data), then the entire cluster will have empty resources, like a new cluster altogether.

Thanks for that. You’re right, the data is stored on the host itself. We do have backups, and I’m considering two options:

  1. Restore the data on the host from a backup.
  2. Remove the corrupted etcd member, delete the data, and add a new member.

But I have a few questions. If I were to restore the data dirs from a backup taken before the corruption, would the non-corrupted members be affected when I brought up the restored pod?

Alternatively, if I removed the corrupted data and, instead of restoring, brought up a new pod with the example env variables:

 export ETCD_NAME="member4"
 export ETCD_INITIAL_CLUSTER="member2=http://10.0.0.2:2380,member3=http://10.0.0.3:2380,member4=http://10.0.0.4:2380"
 export ETCD_INITIAL_CLUSTER_STATE=existing

Would the pod be added back into the cluster? (Following the instructions in “Operating etcd clusters for Kubernetes” from the Kubernetes docs.)
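
For context, in a kubeadm setup those settings end up as flags on the etcd command in the static pod manifest rather than as env variables; a roughly equivalent invocation (reusing the example names and IPs above) would be:

  # Each ETCD_* environment variable maps to the corresponding --lower-case-flag.
  etcd --name member4 \
    --initial-cluster member2=http://10.0.0.2:2380,member3=http://10.0.0.3:2380,member4=http://10.0.0.4:2380 \
    --initial-cluster-state existing \
    --data-dir /var/lib/etcd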

I’m trying to determine the safest plan of action.
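
In the meantime, we’re checking the state of the remaining members with something like the following (same ETCDCTL_* environment as in the snippet above):

  # Overall member health (the broken endpoint is expected to report unhealthy):
  etcdctl endpoint health --cluster

  # Per-member revision and leader status:
  etcdctl endpoint status --cluster -w table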