Problem after renewing certificates with kubeadm in an on-premise bare-metal Kubernetes cluster


Cluster information:

Kubernetes version: 1.22
Cloud being used: bare-metal
Installation method: OnPremise
Host OS: Red Hat Enterprise Linux release 8.10
CNI and version:
CRI and version:

Good morning. I currently have a problem after renewing the certificates of my Kubernetes cluster with the `kubeadm certs renew all` command. We have an on-premise Kubernetes cluster with 2 masters and 6 worker nodes, and after the renewal we lost management of the cluster. What we did was run the aforementioned command to renew the certificates on one of the masters and then replicate the /etc/kubernetes folder, with the certificates, to the other, so that both masters had their certificates renewed:

Renewal of the primary master TMT102 (10.164.5.236) and renewal of the secondary master TCOLD013 (10.161.169.26).
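Roughly, the procedure was equivalent to the following sketch (the `rsync` command is illustrative; we replicated the folder by hand, and the host addresses are as listed above):

```bash
# On the primary master (TMT102): renew all kubeadm-managed certificates
kubeadm certs renew all

# Verify the new expiry dates
kubeadm certs check-expiration

# Replicate the renewed certificates to the secondary master (TCOLD013);
# rsync is just one way to copy the folder
rsync -av /etc/kubernetes/ root@10.161.169.26:/etc/kubernetes/
```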

However, the issue is that the apiserver pod does not start, and on both masters it logs the following error:

```
I0909 15:46:36.537724       1 server.go:553] external host was not specified, using 10.164.5.236
I0909 15:46:36.538897       1 server.go:161] Version: v1.22.0
I0909 15:46:37.156242       1 shared_informer.go:240] Waiting for caches to sync for node_authorizer
I0909 15:46:37.158840       1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
I0909 15:46:37.158879       1 plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
I0909 15:46:37.161155       1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
I0909 15:46:37.161190       1 plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
Error: context deadline exceeded
```
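The final `Error: context deadline exceeded` at this point in the startup sequence typically means the apiserver timed out talking to etcd. One way to test etcd health directly is with `etcdctl`, reusing the certificate paths visible in the etcd unit below (the `ca.pem` path is an assumption; adjust it to your layout):

```bash
# Query etcd health directly over TLS
# (cert/key paths taken from the etcd.service command line;
#  the CA file path is a guess and may differ on your hosts)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.164.5.236:2379 \
  --cacert=/etc/etcd/ca.pem \
  --cert=/etc/etcd/kubernetes.pem \
  --key=/etc/etcd/kubernetes-key.pem \
  endpoint health
```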

If we check the status of etcd on both masters, the primary master reports the following:

```
[root@TMT102 jenkinsqa]# systemctl status etcd
● etcd.service - etcd
   Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2024-09-06 08:41:51 -04; 3 days ago
     Docs: https://github.com/coreos
 Main PID: 921 (etcd)
    Tasks: 10 (limit: 23184)
   Memory: 70.3M
   CGroup: /system.slice/etcd.service
           └─921 /usr/local/bin/etcd --name TMT102 --cert-file=/etc/etcd/kubernetes.pem --key-file=/etc/etcd/kubernetes-key.pem --peer-cert-file=/etc/e>

Sep 09 11:49:08 TMT102 etcd[921]: health check for peer 38b126bffa9e7ff7 could not connect: x509: certificate has expired or is not yet valid
Sep 09 11:49:08 TMT102 etcd[921]: rejected connection from "10.161.169.26:36578" (error "remote error: tls: bad certificate", ServerName "")
Sep 09 11:49:08 TMT102 etcd[921]: rejected connection from "10.161.169.26:36588" (error "remote error: tls: bad certificate", ServerName "")
Sep 09 11:49:08 TMT102 etcd[921]: rejected connection from "10.161.169.26:36590" (error "remote error: tls: bad certificate", ServerName "")
Sep 09 11:49:08 TMT102 etcd[921]: rejected connection from "10.161.169.26:36600" (error "remote error: tls: bad certificate", ServerName "")
Sep 09 11:49:08 TMT102 etcd[921]: rejected connection from "10.161.169.26:36616" (error "remote error: tls: bad certificate", ServerName "")
Sep 09 11:49:08 TMT102 etcd[921]: rejected connection from "10.161.169.26:36632" (error "remote error: tls: bad certificate", ServerName "")
Sep 09 11:49:08 TMT102 etcd[921]: rejected connection from "10.161.169.26:36644" (error "remote error: tls: bad certificate", ServerName "")
Sep 09 11:49:08 TMT102 etcd[921]: health check for peer 38b126bffa9e7ff7 could not connect: x509: certificate has expired or is not yet valid
Sep 09 11:49:08 TMT102 etcd[921]: rejected connection from "10.161.169.26:36648" (error "remote error: tls: bad certificate", ServerName "")
```
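Judging from the unit file, this etcd is an external, systemd-managed instance with its certificates under /etc/etcd. As far as I know, `kubeadm certs renew all` only renews certificates under /etc/kubernetes/pki (including stacked-etcd certs under /etc/kubernetes/pki/etcd), so the certificates of an externally managed etcd would not be renewed by it. A quick way to check their validity dates (the peer certificate path is truncated in the unit output above, hence the glob; adjust to your actual file names):

```bash
# Print subject and validity window of the external etcd certificates
for cert in /etc/etcd/kubernetes.pem /etc/etcd/*peer*.pem; do
  echo "== $cert"
  openssl x509 -in "$cert" -noout -subject -startdate -enddate
done
```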

If we check the status of the kubelet, it fails to recognize the node:

```
[root@TMT102 jenkinsqa]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Fri 2024-09-06 08:41:52 -04; 3 days ago
     Docs: https://kubernetes.io/docs/
 Main PID: 1055 (kubelet)
    Tasks: 17 (limit: 23184)
   Memory: 117.9M
   CGroup: /system.slice/kubelet.service
           └─1055 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/va>

Sep 09 11:50:02 TMT102 kubelet[1055]: E0909 11:50:02.966092    1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
Sep 09 11:50:03 TMT102 kubelet[1055]: E0909 11:50:03.066193    1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
Sep 09 11:50:03 TMT102 kubelet[1055]: E0909 11:50:03.167212    1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
Sep 09 11:50:03 TMT102 kubelet[1055]: E0909 11:50:03.267684    1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
Sep 09 11:50:03 TMT102 kubelet[1055]: E0909 11:50:03.368502    1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
Sep 09 11:50:03 TMT102 kubelet[1055]: E0909 11:50:03.468755    1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
Sep 09 11:50:03 TMT102 kubelet[1055]: E0909 11:50:03.569086    1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
Sep 09 11:50:03 TMT102 kubelet[1055]: E0909 11:50:03.670261    1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
Sep 09 11:50:03 TMT102 kubelet[1055]: E0909 11:50:03.771753    1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
Sep 09 11:50:03 TMT102 kubelet[1055]: E0909 11:50:03.872367    1055 kubelet.go:2407] "Error getting node" err="node \"tmt102\" not found"
```
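The `node not found` loop is a downstream symptom: with the apiserver unreachable, the kubelet cannot look up its own Node object. It is still worth confirming that the kubelet's client credential is valid and that kubelet.conf points at the rotated certificate (this is exactly the issue described in the reply below). For example:

```bash
# See which client credential kubelet.conf references
grep -E 'client-certificate|client-key' /etc/kubernetes/kubelet.conf

# Check the expiry of the rotated kubelet client certificate
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem \
  -noout -subject -enddate
```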

With these errors we cannot connect to manage the Kubernetes cluster. What could be happening? It is an on-premise bare-metal installation.

Hello, I had a similar problem last week on a similar setup (a 3-node bare-metal cluster).
Scenario: the certs expired and we had to renew them to regain admin access (the cluster itself was still running fine, but we had lost administrative access).
The certs were managed by kubeadm and we ran `kubeadm certs renew all` like you did; after that the apiserver and etcd stopped running with a TLS error in the logs, and our whole cluster was down, all nodes.
What we tried first was renewing the certs manually by copying everything in the k8s pki folder to all nodes. This did not work: the apiserver started but etcd was still down. The actual problem in our case was the following; I will copy-paste from the Kubernetes docs page and provide the URL below:

> Warning: On nodes created with kubeadm init, prior to kubeadm version 1.17, there is a [bug](https://github.com/kubernetes/kubeadm/issues/1753) where you manually have to modify the contents of kubelet.conf. After `kubeadm init` finishes, you should update `kubelet.conf` to point to the rotated kubelet client certificates, by replacing `client-certificate-data` and `client-key-data` with:
```
client-certificate: /var/lib/kubelet/pki/kubelet-client-current.pem
client-key: /var/lib/kubelet/pki/kubelet-client-current.pem
```
[Certificate Management with kubeadm | Kubernetes](https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/)
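Concretely (my reconstruction of the resulting file; kubeadm's default user name `default-auth` may differ in your kubelet.conf), the edit and restart on each node look like:

```bash
# In /etc/kubernetes/kubelet.conf, replace the embedded
# client-certificate-data / client-key-data fields of the user entry with
# file references, so the user block ends up as:
#
#   users:
#   - name: default-auth        # kubeadm default; may differ in your file
#     user:
#       client-certificate: /var/lib/kubelet/pki/kubelet-client-current.pem
#       client-key: /var/lib/kubelet/pki/kubelet-client-current.pem

# Then restart the kubelet so it picks up the rotated credential
systemctl restart kubelet
```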
This fixed the issue. We then upgraded from 1.27 → 1.28 → 1.29 → 1.30, and now on v1.30 I have tested automatic renewal with kubeadm and it works fine, no more bugs.
Hope this helps.