Cluster information:
Kubernetes version: OpenShift 4.17
Cloud being used: bare-metal
Installation method: UPI
Host OS: RHCOS
CNI and version: OVN-Kubernetes
CRI and version: CRI-O
A friend reached out to me because their bare-metal cluster was no longer reachable; neither the web console nor the CLI worked.
I reset the kubeconfig and was then able to log in to the cluster again via SSH.
I checked the certificates and approved the pending CSRs manually.
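For reference, this is roughly what I used to approve them (the standard commands as far as I know; the go-template just selects every CSR that has no status yet, i.e. the pending ones):
[root@osbm01d ~]# oc get csr
[root@osbm01d ~]# oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve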
But when I run
oc get co
I see that a lot of the core operators (authentication, kube-apiserver, kube-controller-manager, ...) are reporting problems, and etcd in particular says its quorum is currently not fault tolerant (see the outputs and the etcd health check further down).
Right now the cluster has 2 healthy masters and 1 healthy worker; the remaining master and the other two workers are offline (NotReady).
I don't know how to fix this so that the cluster can repair itself. How do I bring the broken operators back up?
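In case it matters, the only debugging I have done on the operators so far is describing them and looking at their pods and events, e.g. for authentication (I assume the same pattern applies to the other degraded ones):
[root@osbm01d ~]# oc describe co authentication
[root@osbm01d ~]# oc get pods -n openshift-authentication -o wide
[root@osbm01d ~]# oc get events -n openshift-authentication --sort-by=.lastTimestamp | tail -n 20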
[root@osbm01d ~]# oc get nodes
NAME                        STATUS                        ROLES                  AGE    VERSION
osbm01d.art.mycluster.com   Ready                         control-plane,master   140d   v1.30.7
osbm02d.art.mycluster.com   NotReady,SchedulingDisabled   control-plane,master   140d   v1.30.7
osbm03d.art.mycluster.com   Ready                         control-plane,master   140d   v1.30.7
osbm04d.art.mycluster.com   NotReady                      worker                 140d   v1.30.7
osbm05d.art.mycluster.com   Ready                         worker                 140d   v1.30.7
osbm06d.art.mycluster.com   NotReady                      worker                 112d   v1.30.7
[root@osbm01d ~]# oc get mcp
NAME     CONFIG                                              UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-dc5dcd52e205239689f59ae9e61869c6    False     True       False      3              0                   0                     0                      140d
worker   rendered-worker-25d67158507b510644b6f61cda311c26    False     True       False      3              0                   0                     0                      140d
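Both machine config pools look stuck as well. The only checks I know for that are the machine-config-daemon pods and the pool conditions (output omitted here):
[root@osbm01d ~]# oc get pods -n openshift-machine-config-operator -o wide
[root@osbm01d ~]# oc describe mcp master
[root@osbm01d ~]# oc describe mcp worker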
[root@osbm01d ~]# oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.17.12   False       True          True       3d18h   OAuthServerDeploymentAvailable: no oauth-openshift.openshift-authentication pods available on any node....
baremetal                                  4.17.12   True        False         False      140d
cloud-controller-manager                   4.17.12   True        False         False      140d
cloud-credential                           4.17.12   True        False         False      140d
cluster-autoscaler                         4.17.12   True        False         False      140d
config-operator                            4.17.12   True        False         False      140d
console                                    4.17.12   False       True          True       3d17h   DeploymentAvailable: 0 replicas available for console deployment...
control-plane-machine-set                  4.17.12   True        False         False      140d
csi-snapshot-controller                    4.17.12   True        False         False      140d
dns                                        4.17.12   True        True          False      140d    DNS "default" reports Progressing=True: "Have 3 available node-resolver pods, want 6."
etcd                                       4.17.12   True        False         True       140d    EtcdCertSignerControllerDegraded: EtcdCertSignerController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 and 2 healthy members which is not fault tolerant: [{Member:ID:3823696859783888709 name:"osbm01d.art.mycluster.com" peerURLs:"https://10.16.1.11:2380" clientURLs:"https://10.16.1.11:2379" Healthy:true Took:1.364858ms Error:<nil>} {Member:ID:10653968431121011491 name:"osbm03d.art.mycluster.com" peerURLs:"https://10.16.1.13:2380" clientURLs:"https://10.16.1.13:2379" Healthy:true Took:1.821787ms Error:<nil>} {Member:ID:17548280362638621453 name:"osbm02d.art.mycluster.com" peerURLs:"https://10.16.1.12:2380" clientURLs:"https://10.16.1.12:2379" Healthy:false Took:29.996881963s Error:health check failed: context deadline exceeded}]...
image-registry                             4.17.12   True        True          False      3d22h   Progressing: The registry is ready...
ingress                                    4.17.12   True        True          True       3d19h   The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)
insights                                   4.17.12   True        False         False      140d
kube-apiserver                             4.17.12   True        True          True       140d    NodeControllerDegraded: The master nodes not ready: node "osbm02d.art.mycluster.com" not ready since 2025-02-07 13:56:10 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-controller-manager                    4.17.12   True        True          True       140d    GarbageCollectorDegraded: error fetching rules: client_error: client error: 401...
kube-scheduler                             4.17.12   True        True          True       140d    NodeControllerDegraded: The master nodes not ready: node "osbm02d.art.mycluster.com" not ready since 2025-02-07 13:56:10 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-storage-version-migrator              4.17.12   True        False         False      3d22h
machine-api                                4.17.12   True        False         False      140d
machine-approver                           4.17.12   True        False         False      140d
machine-config                             4.17.12   True        False         True       140d    Failed to resync 4.17.12 because: error during waitForDaemonsetRollout: [context deadline exceeded, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 3, unavailable: 3)]
marketplace                                4.17.12   True        False         False      140d
monitoring                                 4.17.12   False       True          True       3d21h   UpdatingPrometheusOperator: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: context deadline exceeded: got 1 unavailable replicas
network                                    4.17.12   True        True          True       140d    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2025-03-03T12:51:56Z...
openshift-apiserver                        4.17.12   False       False         True       3d17h   APIServerDeploymentAvailable: no apiserver.openshift-apiserver pods available on any node....
openshift-controller-manager               4.17.12   True        True          False      3d19h   Progressing: deployment/controller-manager: updated replicas is 1, desired replicas is 3...
openshift-samples                          4.17.12   True        False         False      41d
operator-lifecycle-manager                 4.17.12   True        False         False      140d
operator-lifecycle-manager-catalog         4.17.12   True        False         False      140d
operator-lifecycle-manager-packageserver   4.17.12   True        False         False      3d19h
service-ca                                 4.17.12   True        True          False      140d    Progressing: ...
storage                                    4.17.12   True        False         False      140d
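Because of the etcd message above (quorum of 2, not fault tolerant), I also looked at etcd directly from one of the healthy members. As far as I know this is the documented way to check member health from inside an etcd pod (the pod name is from my cluster; I only read the status, I did not change anything):
[root@osbm01d ~]# oc rsh -n openshift-etcd etcd-osbm01d.art.mycluster.com
sh-5.1# etcdctl member list -w table
sh-5.1# etcdctl endpoint health --cluster
Both remaining members report healthy, just like in the operator message, so I have left etcd itself alone. What I'm unsure about is the right order to bring the NotReady nodes and the degraded operators back without making things worse.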