etcd has its own set of certificates for members to communicate securely. According to the documentation, these expire after 3 years, so they are unlikely to be expired, but it may be worth checking them (just in case).
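A quick way to check their expiry dates could be something like this (the exact certificate paths depend on your install, so treat /etc/kubernetes below as a starting point rather than the definitive location):

```bash
# Print the expiry date of every certificate found under /etc/kubernetes.
# Adjust the search path if your etcd/apiserver certs live elsewhere.
find /etc/kubernetes -name '*.crt' 2>/dev/null | while read -r cert; do
  printf '%s: ' "$cert"
  openssl x509 -noout -enddate -in "$cert"
done
```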
Red Hat recommends configuring an NTP server to avoid clock drift between the nodes. If you were able to “turn back the clock”, you should check whether your servers’ clocks are in sync. If the difference is above a certain threshold, it may cause issues trusting the certificates.
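A simple check on each node (assuming chrony, which RHEL/CoreOS ship by default):

```bash
timedatectl status   # look for "System clock synchronized: yes"
chronyc tracking     # shows the current offset against the time source
```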
The apiserver and etcd are static pods and I believe they should be started by the kubelet: kubeadm - Implementation details - Constants and well-known values and paths. (Later in that page there is a list of the steps taken by the kubeadm preflight check, followed by the list of certificates it creates and their locations; this may be useful for debugging your cluster.)
The kubelet reads the manifests for the static pods from the filesystem, not from etcd, so failing to connect to the etcd database cannot be the cause of the kubelet not starting. I’ve been able to find this document on how the kubelet starts in the kops documentation: kubelet start
Kubelet starts up, starts (and restarts) all the containers in /etc/kubernetes/manifests.
It also tries to contact the API server (which the master kubelet will itself eventually start), register the node. Once a node is registered, kube-controller-manager will allocate it a PodCIDR, which is an allocation of the k8s-network IP range. kube-controller-manager updates the node, setting the PodCIDR field. Once kubelet sees this allocation, it will set up the local bridge with this CIDR, which allows docker to start. Before this happens, only pods that have hostNetwork will work - so all the “core” containers run with hostNetwork=true.
This is interesting: it seems that if the node is not able to register, the kube-controller-manager will not allocate the PodCIDR, which I guess will leave the pods unable to get an IP and communicate with each other.
It’s not clear what happens on a control plane node, but according to
[the kubelet] tries to contact the API server (which the master kubelet will itself eventually start)
it seems that the kubelet that starts the api-server on a “master” node should be able to register the “local node” with the apiserver it started itself.
If that’s not happening, something may be preventing the kubelet from communicating with the API Server… I think the kubelet authenticates to the apiserver using a certificate signed by the cluster’s CA; maybe when the cluster started (with the clock set in the present, aka 2022), it “flagged” the kubelet’s certificate as expired and kept a record of it being expired. If that’s the way it works, setting the clock in the past will not “un-expire” the certificate…
As I don’t know how the apiserver manages expired certificates, the best course of action would be to assume the certificate is expired and (try to) follow the official documentation: Troubleshooting - Kubelet client certificate rotation fails. You could check whether that’s the way to go by observing the logs of the kube-apiserver:
you might see errors such as x509: certificate has expired or is not yet valid
in kube-apiserver logs.
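Since the control plane is down, oc/kubectl won’t help with that; a way to read the container logs directly on the node might be (assuming crictl is available there, as it usually is on OpenShift/CoreOS hosts):

```bash
# Find the kube-apiserver container (including exited ones) and grep its logs
# for certificate errors; the container ID below is a placeholder.
crictl ps -a | grep kube-apiserver
crictl logs <container-id> 2>&1 | grep -iE 'x509|certificate'
```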
The problem is that the process described in the documentation requires launching commands on a “healthy” control plane node, and you don’t have one. But you should be able to manually create a certificate for the kubelet, sign it with the CA certificate and start the kubelet using this “trusted” certificate…
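A rough sketch of what that could look like with openssl; the CA paths below are the kubeadm defaults and the node name is just taken from your logs, so adjust both to match your cluster:

```bash
NODE=rhcpm01.osc01.nix.mds.xyz   # placeholder: your node name

# New key and a CSR with the subject the apiserver expects from kubelets:
# CN=system:node:<node>, O=system:nodes
openssl genrsa -out kubelet-client.key 2048
openssl req -new -key kubelet-client.key \
  -subj "/O=system:nodes/CN=system:node:${NODE}" \
  -out kubelet-client.csr

# Sign the CSR with the cluster CA (kubeadm default paths; adjust if needed)
openssl x509 -req -in kubelet-client.csr \
  -CA /etc/kubernetes/pki/ca.crt -CAkey /etc/kubernetes/pki/ca.key \
  -CAcreateserial -days 365 -out kubelet-client.crt

# Finally, point the kubelet's kubeconfig (client-certificate / client-key)
# at the new pair, or replace the files under /var/lib/kubelet/pki, and
# restart the kubelet.
```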
From the logs you provide, the etcd member seems to recognize that it’s part of an existing cluster and it tries to connect to the other members of the cluster:
failed to create etcd client, but the server is already initialized as member "rhcpm01.osc01.nix.mds.xyz"
…but connecting to the other etcd cluster members fails:
Error while dialing dial tcp 10.0.0.107:2379: connect: connection refused
...
ALL_ETCD_ENDPOINTS=https://10.0.0.106:2379,https://10.0.0.108:2379,https://10.0.0.107:2379
The message Waiting for ports 2379, 2380 and 9978 to be released comes from etcd/pod.yaml in the cluster-etcd-operator source code.
It is consistent with the pod network being down (although it’s not conclusive).
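A quick way to see whether something on the node is actually still holding those ports:

```bash
# Listening TCP sockets and the processes owning them, filtered to the ports
# mentioned in the log message.
ss -tlnp | grep -E ':(2379|2380|9978)\b'
```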
I would focus on checking the kubelet and making sure it starts (or tries to) and look into the kubelet’s logs to see why it fails.
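For example (standard systemd commands; the static pod manifest path is the usual default and may differ on your install):

```bash
systemctl status kubelet                              # running or crash-looping?
journalctl -u kubelet --no-pager --since "1 hour ago" | tail -n 100
ls -l /etc/kubernetes/manifests                       # static pods the kubelet should start
```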
At the end of the log you provide there’s a message about a CRC mismatch from etcd:
etcdmain: walpb: crc mismatch
This may be BAD NEWS, according to Discussion: etcd: walpb: crc mismatch or ETCD data gets corrupted with error "read wal error (walpb: crc mismatch)"… There is a similar message in etcd fails with error “C | etcdserver: read wal error (wal: crc mismatch) and cannot be repaired” (requires access to the Red Hat Support Site), which points in the same direction:
- This issue might occur due to a CRC mismatch that can happen if there is bit rot on disk or a file-system corruption. From the etcd logs, it was seen that the WAL file is broken, which means that the WAL file has been corrupted by filesystem issues or similar.
You mentioned that the cluster you are trying to recover failed some time ago; do you remember if there was an outage or some other unclean shutdown that could affect the database?
Likewise, interested in how fragile the Kubernetes w/ OpenShift is under things like these events causing possible corruption so interested where and how things could break and what fixes are possible to get things back up.
Have you ever heard of this guy, Murphy? That’s why backups and disaster recovery plans are not optional. Things fail; as they say, it’s not a matter of “if”, but “when”.
As you have access to the ETCD_DATA_DIR=/var/lib/etcd, you may be able to copy the contents of the folder and try to recreate (or recover) the manifests of everything you had deployed on the cluster. Maybe you could use Backing up etcd data as a reference.
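Before changing anything else, I would take a raw, offline copy of that directory first, something along these lines (assuming etcd is stopped and /var/lib/etcd is indeed the data dir, as in your environment):

```bash
# Raw copy of the etcd data directory (run while etcd is stopped).
tar -czf /root/etcd-data-backup-$(date +%F).tar.gz -C /var/lib/etcd .

# If you later manage to get a healthy member running, a proper snapshot is
# preferable (needs working endpoints/certs):
# etcdctl snapshot save /root/etcd-snapshot-$(date +%F).db
```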
Applications running in your cluster store data in volumes. Depending on your storage backend, your data might already be safe somewhere outside the cluster. If your volumes were “local”, it will depend on how data is stored in your application’s volumes. If you deployed a containerized NFS server pinned to a “storage node”, for example, your NFS-exported folders will be sitting as regular folders on that “storage node” filesystem.
With CoreOS this might be a little more complicated than that, but the point is that it MAY be possible to get data back just by copying folders; containers are just “regular” Linux processes isolated from other processes. If you had an application that needed a license.key file to run and the only place where it was “saved” was in one of the containers deployed on the “failed” cluster, you may be able to get it back by deep-diving into your worker’s filesystem, “navigating” to the folder where the “container” files live and copying it to another Linux box.
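Just as an illustration (license.key is the hypothetical file from the example above), pod volumes are usually mounted under /var/lib/kubelet/pods on the node, so something like this might locate it:

```bash
# Search the kubelet's pod volume directories for the hypothetical file.
find /var/lib/kubelet/pods -type f -name 'license.key' 2>/dev/null

# Local/hostPath volumes may live elsewhere (wherever your storage class or
# hostPath pointed to); adjust the search root accordingly.
```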
To be safe, I would start backing up the etcd database and all your applications.
I would recommend using Velero for backing up your applications. It was developed by Joe Beda (one of the “creators” of Kubernetes) at Heptio (acquired by VMware), but it’s free and open source. Red Hat uses it to back up and migrate clusters as part of the Migration Toolkit for Containers, and it’s an easy-to-use and solid solution. Most tutorials and guides on how to configure Velero use AWS S3 buckets, although other options are available; if you want to try it, the easiest way would be using MinIO to have S3-API-compatible on-prem “buckets”.
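As a reference of what that could look like (the bucket name, credentials and MinIO URL below are placeholders, and the plugin version is just an example; check the Velero docs for the current one):

```bash
# Credentials for the S3-compatible backend (MinIO)
cat > credentials-velero <<'EOF'
[default]
aws_access_key_id = minio
aws_secret_access_key = minio123
EOF

velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket velero \
  --secret-file ./credentials-velero \
  --use-volume-snapshots=false \
  --backup-location-config region=minio,s3ForcePathStyle="true",s3Url=http://minio.example.com:9000

# Then create and inspect a backup:
velero backup create full-backup
velero backup describe full-backup
```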
Best regards,
Xavi