Kubernetes version: 1.17.11
Cloud being used: Baremetal, AWS
Installation method: Kubernetes Yum Repo
Host OS: CentOS 7
CNI and version: Calico 3.16.0
CRI and version: Docker 19.03.12
Situation
One of our Kubernetes etcd pods is stuck in CrashLoopBackOff. The logs show the error “state.commit is out of range”. Investigation suggests the data is likely corrupted, and the SOP would be to remove the etcd member, delete its data, and rejoin it to the cluster.
However, that’s made difficult by the fact that these pods are created by kubeadm. We’re unable to find details about the pods such as where the persistent volume is, how to re-add a member within the pod, and so on.
Has anyone had to do this before? I can’t find any documentation on this. The closest I’ve found was kubeadm reset with the delete etcd pod flag.
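For reference, the part of kubeadm reset that seemed relevant is its remove-etcd-member phase. I haven’t verified this against 1.17, so check `kubeadm reset phase --help` first; it also needs a kubeconfig that can reach a healthy API server (e.g. on another control-plane node):

```bash
# List the reset phases available in this kubeadm version first.
kubeadm reset phase --help

# Remove this node's etcd member from the cluster. Requires a kubeconfig
# that reaches a still-healthy API server, not the broken local one.
kubeadm reset phase remove-etcd-member --kubeconfig /etc/kubernetes/admin.conf
```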
EDIT: It’s worth noting that because the etcd pod won’t come up, the Kubernetes API server pod for this control-plane node won’t come up either. We’re currently trying to point the API server pod at a different etcd member.
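What we’re attempting looks roughly like this; the flag names are the standard kube-apiserver ones and the paths are kubeadm defaults, but the IP is a placeholder:

```bash
# The API server's etcd endpoint is a flag in its static pod manifest.
grep etcd /etc/kubernetes/manifests/kube-apiserver.yaml
#   --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt
#   --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt
#   --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key
#   --etcd-servers=https://127.0.0.1:2379

# Changing --etcd-servers to a healthy member's client URL and saving the
# file makes the kubelet restart the static pod against that member, e.g.:
#   --etcd-servers=https://<healthy-member-ip>:2379
```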
Do you by chance have an etcd backup? If so, you can restore to the previous known state, or see if you can recover the etcd data from that data directory.
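On a kubeadm control plane the etcd data directory is a hostPath on the node itself (check the etcd static pod manifest), so a restore would look roughly like the sketch below. Paths, member names, and URLs are placeholders; take the real values from your /etc/kubernetes/manifests/etcd.yaml:

```bash
# Find where the etcd static pod keeps its data and certs (kubeadm
# defaults are /var/lib/etcd and /etc/kubernetes/pki/etcd).
grep -A3 hostPath /etc/kubernetes/manifests/etcd.yaml

# Sketch of an etcdctl v3 restore into a fresh directory; values are
# placeholders, use the ones from your own manifest.
export ETCDCTL_API=3
etcdctl snapshot restore /path/to/snapshot.db \
  --name <member-name> \
  --initial-cluster "<member-name>=https://<member-ip>:2380,<other>=https://<other-ip>:2380" \
  --initial-advertise-peer-urls https://<member-ip>:2380 \
  --data-dir /var/lib/etcd-restored

# Then swap the restored directory into the path the etcd pod mounts
# (or point the manifest's hostPath at it) and let the kubelet restart the pod.
```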
On pointing the API server at a different etcd pod: if the etcd pod comes up fresh (with no data), the entire cluster will have empty resources, like a new cluster altogether.
Thanks for that. You’re right, the data is stored on the host itself. We do have backups, and I’m considering two options:
Restore the data on the host from a backup.
Remove the corrupted etcd member, delete its data, and add a new member (roughly as sketched below).
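Here’s roughly what I have in mind for the second option; the endpoint, member IDs, and cert paths are placeholders based on kubeadm defaults, and I haven’t run this yet:

```bash
# Run against a healthy member. Cert paths are kubeadm defaults; adjust
# the endpoint, member IDs, and URLs for your cluster.
export ETCDCTL_API=3
ETCD_FLAGS="--endpoints=https://<healthy-member-ip>:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key"

etcdctl $ETCD_FLAGS member list                    # find the corrupted member's ID
etcdctl $ETCD_FLAGS member remove <bad-member-id>  # drop it from the cluster

# On the broken node: wipe the corrupted data directory.
rm -rf /var/lib/etcd/*

# Re-add the node as a new member; etcdctl prints the ETCD_INITIAL_* values to use.
etcdctl $ETCD_FLAGS member add <member-name> --peer-urls=https://<bad-member-ip>:2380
```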
But I have a few questions: if I were to restore the data dirs from a backup taken before the corruption, would the non-corrupted members be affected when I brought up the restored pod?
Alternatively, if I removed the corrupted data and did not do a restore, but instead brought up a new pod with the example env variables:
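(For context, by “example env variables” I mean the standard ones etcd expects when a member joins an existing cluster, along the lines of the placeholders below, not values from our actual environment:)

```bash
# Standard etcd env vars for a member joining an existing cluster; real
# values come from the `etcdctl member add` output. These are placeholders.
ETCD_NAME="<member-name>"
ETCD_INITIAL_CLUSTER="<member-name>=https://<member-ip>:2380,<other>=https://<other-ip>:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://<member-ip>:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
```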