2 Masters and 1 Worker Left, but the Cluster Is Unhealthy and Broken

Cluster information:

Kubernetes version: OpenShift 4.17
Cloud being used: bare-metal
Installation method: UPI
Host OS: RHCOS
CNI and version: OVN
CRI and version: CRI-O

A friend reached out to me because their bare-metal cluster wasn't reachable anymore. Nothing worked, neither via the console nor via the CLI.

I reset the kubeconfig and was able to log in to the cluster again via SSH.
I checked the certificates (pending CSRs) and approved them manually.
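
For reference, the certificate approval was the standard pending-CSR approval, roughly like this (the go-template one-liner is the usual way to approve everything that is still pending; <csr-name> is a placeholder):

oc get csr
oc adm certificate approve <csr-name>
# or approve all pending CSRs in one go:
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve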

But when I run

oc get co

I see that a lot of the core operators, such as authentication, kube-apiserver, and kube-controller-manager, have problems.
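
To dig into the individual operators I have mainly been using describe and a look at the operator namespaces, roughly like this (authentication is just the example here, the same pattern applies to the others):

oc describe co authentication
oc get co authentication -o yaml
oc get pods -n openshift-authentication -o wide
oc get events -n openshift-authentication --sort-by=.lastTimestamp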

Currently the cluster has 2 healthy masters and 1 healthy worker; the other master and the other two workers are offline (oc get nodes output below).

I don't know how to fix the problem so that the cluster can repair itself. How do I bring the broken operators back up?

NAME                          STATUS                        ROLES                  AGE    VERSION
osbm01d.art.mycluster.com   Ready                         control-plane,master   140d   v1.30.7
osbm02d.art.mycluster.com   NotReady,SchedulingDisabled   control-plane,master   140d   v1.30.7
osbm03d.art.mycluster.com   Ready                         control-plane,master   140d   v1.30.7
osbm04d.art.mycluster.com   NotReady                      worker                 140d   v1.30.7
osbm05d.art.mycluster.com   Ready                         worker                 140d   v1.30.7
osbm06d.art.mycluster.com   NotReady                      worker                 112d   v1.30.7
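
Since I have SSH access, these are the node-level checks I can run on the NotReady nodes and post here if it helps (standard kubelet/CRI-O checks on RHCOS, assuming the default core user):

ssh core@osbm02d.art.mycluster.com
sudo systemctl status kubelet
sudo journalctl -u kubelet --since "1 hour ago" --no-pager | tail -n 100
sudo systemctl status crio
sudo crictl ps -a | head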

[root@osbm01d ~]# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-dc5dcd52e205239689f59ae9e61869c6   False     True       False      3              0                   0                     0                      140d
worker   rendered-worker-25d67158507b510644b6f61cda311c26   False     True       False      3              0                   0                     0                      140d
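
Both pools show UPDATED=False with 0 ready machines, so I have also been looking at the machine-config side, roughly like this (<mcd-pod> is a placeholder for one of the machine-config-daemon pods):

oc describe mcp master
oc describe mcp worker
oc get pods -n openshift-machine-config-operator -o wide
oc logs -n openshift-machine-config-operator <mcd-pod> -c machine-config-daemon --tail=50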




[root@osbm01d ~]# oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.17.12   False       True          True       3d18h   OAuthServerDeploymentAvailable: no oauth-openshift.openshift-authentication pods available on any node....
baremetal                                  4.17.12   True        False         False      140d
cloud-controller-manager                   4.17.12   True        False         False      140d
cloud-credential                           4.17.12   True        False         False      140d
cluster-autoscaler                         4.17.12   True        False         False      140d
config-operator                            4.17.12   True        False         False      140d
console                                    4.17.12   False       True          True       3d17h   DeploymentAvailable: 0 replicas available for console deployment...
control-plane-machine-set                  4.17.12   True        False         False      140d
csi-snapshot-controller                    4.17.12   True        False         False      140d
dns                                        4.17.12   True        True          False      140d    DNS "default" reports Progressing=True: "Have 3 available node-resolver pods, want 6."
etcd                                       4.17.12   True        False         True       140d    EtcdCertSignerControllerDegraded: EtcdCertSignerController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 and 2 healthy members which is not fault tolerant: [{Member:ID:3823696859783888709 name:"osbm01d.art.mycluster.com" peerURLs:"https://10.16.1.11:2380" clientURLs:"https://10.16.1.11:2379"  Healthy:true Took:1.364858ms Error:<nil>} {Member:ID:10653968431121011491 name:"osbm03d.art.mycluster.com" peerURLs:"https://10.16.1.13:2380" clientURLs:"https://10.16.1.13:2379"  Healthy:true Took:1.821787ms Error:<nil>} {Member:ID:17548280362638621453 name:"osbm02d.art.mycluster.com" peerURLs:"https://10.16.1.12:2380" clientURLs:"https://10.16.1.12:2379"  Healthy:false Took:29.996881963s Error:health check failed: context deadline exceeded}]...
image-registry                             4.17.12   True        True          False      3d22h   Progressing: The registry is ready...
ingress                                    4.17.12   True        True          True       3d19h   The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)
insights                                   4.17.12   True        False         False      140d
kube-apiserver                             4.17.12   True        True          True       140d    NodeControllerDegraded: The master nodes not ready: node "osbm02d.art.mycluster.com" not ready since 2025-02-07 13:56:10 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-controller-manager                    4.17.12   True        True          True       140d    GarbageCollectorDegraded: error fetching rules: client_error: client error: 401...
kube-scheduler                             4.17.12   True        True          True       140d    NodeControllerDegraded: The master nodes not ready: node "osbm02d.art.mycluster.com" not ready since 2025-02-07 13:56:10 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-storage-version-migrator              4.17.12   True        False         False      3d22h
machine-api                                4.17.12   True        False         False      140d
machine-approver                           4.17.12   True        False         False      140d
machine-config                             4.17.12   True        False         True       140d    Failed to resync 4.17.12 because: error during waitForDaemonsetRollout: [context deadline exceeded, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 3, unavailable: 3)]
marketplace                                4.17.12   True        False         False      140d
monitoring                                 4.17.12   False       True          True       3d21h   UpdatingPrometheusOperator: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: context deadline exceeded: got 1 unavailable replicas
network                                    4.17.12   True        True          True       140d    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2025-03-03T12:51:56Z...
openshift-apiserver                        4.17.12   False       False         True       3d17h   APIServerDeploymentAvailable: no apiserver.openshift-apiserver pods available on any node....
openshift-controller-manager               4.17.12   True        True          False      3d19h   Progressing: deployment/controller-manager: updated replicas is 1, desired replicas is 3...
openshift-samples                          4.17.12   True        False         False      41d
operator-lifecycle-manager                 4.17.12   True        False         False      140d
operator-lifecycle-manager-catalog         4.17.12   True        False         False      140d
operator-lifecycle-manager-packageserver   4.17.12   True        False         False      3d19h
service-ca                                 4.17.12   True        True          False      140d    Progressing: ...
storage                                    4.17.12   True        False         False      140d
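
If it helps, I can also check etcd from inside one of the running etcd pods (standard etcd health check; the pod name below is just the static pod on osbm01d, and etcdctl inside that pod is already configured with the right certificates):

oc get pods -n openshift-etcd -l app=etcd
oc rsh -n openshift-etcd etcd-osbm01d.art.mycluster.com
# inside the pod:
etcdctl member list -w table
etcdctl endpoint health --cluster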
