Cluster information:
Kubernetes version: 1.20
Cloud being used: bare-metal
Installation method: kubeadm
Host OS: ubuntu 18
CNI and version: weave-net
CRI and version: docker 20
Context
Just trying to break-and-fix to play around with disaster recovery:
- Using kubeadm, I created a cluster with 1 master and 2 workers (3 VirtualBox headless VMs in total), and the cluster worked fine: `kubectl get nodes` and `kubectl get pods -n kube-system` showed everything ready.
- I created a secret (as a sentinel for later).
- I backed up the etcd db and saved the etcd static pod YAML to the host (see the backup sketch after this list), then completely deleted the master.
- Then I recreated the master node, ran `kubeadm init`, and installed weave-net into the cluster. `kubectl get pods -n kube-system` showed all pods running, and `kubectl get nodes` showed the master ready.
- Then I restored the etcd db snapshot and edited the etcd static pod YAML (data dir and initial cluster token; see the restore sketch after this list). A `kubectl get secrets` showed the secret I had created. HOWEVER, I noticed that weave-net (the CNI plugin) was getting authorization problems, so I deleted it and re-applied it. It ran fine after that.
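
For reference, the backup step was roughly the following (assuming `etcdctl` is available on the master; the `/backup` destination is just illustrative, and the PKI paths are the kubeadm defaults):

```bash
# Snapshot etcd over its client endpoint, authenticating with the
# kubeadm-generated etcd PKI.
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Keep a copy of the etcd static pod manifest for editing after the restore.
cp /etc/kubernetes/manifests/etcd.yaml /backup/etcd.yaml
```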
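The restore was along these lines (the data dir and cluster token names here are illustrative; the point is that the static pod manifest has to be edited to match them):

```bash
# Restore the snapshot into a fresh data dir with a new initial cluster
# token, so the restored member starts a new single-node cluster instead
# of trying to rejoin the old one.
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restored \
  --initial-cluster-token=etcd-cluster-restored

# Then edit the static pod manifest: point the etcd-data hostPath volume
# at /var/lib/etcd-restored and set the matching --initial-cluster-token.
vi /etc/kubernetes/manifests/etcd.yaml
```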
Question
Two problems remain:
- the coredns pods are non-ready because, based on their logs, they appear to be unauthorized to access the cluster
- the worker nodes are not authorized to rejoin the cluster
For #1, I’m guessing it’s because the etcd I restored does not contain a service account token that matches what is mounted in the coredns deployment. How do I fix that?
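
One idea that occurred to me, which I haven’t verified, is to delete the stale token secret so the token controller re-mints it with the current signing key, and then recreate the coredns pods so they mount the fresh token (the `-token-<hash>` suffix is illustrative):

```bash
# Find the secret-based service account token for coredns.
kubectl -n kube-system get secrets | grep coredns

# Delete the stale token secret; the token controller should recreate it,
# signed with the new sa.key. The <hash> suffix is illustrative.
kubectl -n kube-system delete secret coredns-token-<hash>

# Recreate the coredns pods so they pick up the newly minted token.
kubectl -n kube-system rollout restart deployment coredns
```

Is that the right approach, or is there a cleaner way (e.g. restoring the original sa.key so the old tokens still verify)?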
I’m hoping that once I fix #1, fixing #2 will just be a matter of executing `kubeadm join` on each node, but I suspect the workers will all have the same issue as coredns, because their service account tokens will be obsolete too.
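
For #2, I assume the rejoin itself would be something like this (the join line is the shape of what `--print-join-command` outputs; the placeholders are mine):

```bash
# On the rebuilt master: mint a fresh bootstrap token and print the
# matching join command (the old tokens and CA hash died with the old master).
kubeadm token create --print-join-command

# On each worker: clear the old node state, then run the printed command.
kubeadm reset
kubeadm join <master-ip>:6443 --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash>
```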