I also asked this question on Stack Overflow, but so far I have received no answers; this forum is probably better suited for it.
Kubernetes Version: 1.23.17-gke.5600
Recently, our Kubernetes cluster experienced an issue with cert-manager after an automatic update by Google. We were using v0.15.0, which required an update anyway.
I initially thought it wouldn’t be a big problem. My plan was to uninstall the old cert-manager; manually delete its service accounts, roles, and other related components; back up the certificates and issuers; edit them to use the new API version; delete the old CRDs; install v1.9.1 using Helm; and finally restore the certificates and issuers (sketched below). Most of these steps went smoothly, except for the certificate restoration. However, since new certificates could be created, that didn’t concern me too much at the time.
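For reference, the migration looked roughly like this. This is a sketch from memory rather than the exact commands, and the list of CRDs is abbreviated (the real install also has certificaterequests, orders, and challenges):

```bash
# Back up the existing certificates and issuers
kubectl get certificates --all-namespaces -o yaml > certificates-backup.yaml
kubectl get issuers --all-namespaces -o yaml > issuers-backup.yaml
kubectl get clusterissuers -o yaml > clusterissuers-backup.yaml

# Manually edit the backups so every resource uses the new API version,
# i.e. change apiVersion: cert-manager.io/v1alpha2 -> cert-manager.io/v1

# Remove the old CRDs (list abbreviated)
kubectl delete crd certificates.cert-manager.io issuers.cert-manager.io \
  clusterissuers.cert-manager.io

# Install cert-manager v1.9.1 via Helm, including the new CRDs
helm repo add jetstack https://charts.jetstack.io && helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace \
  --version v1.9.1 --set installCRDs=true

# Restore the edited issuers, then the certificates
kubectl apply -f clusterissuers-backup.yaml -f issuers-backup.yaml
kubectl apply -f certificates-backup.yaml
```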
Unfortunately, something during this process caused a significant issue. Approximately one minute after restoring the issuers, I was unexpectedly disconnected from the Kubernetes UI and couldn’t reconnect. Furthermore, I lost all ability to interact with the cluster. Both kubectl and Helm commands now time out without any response. I am unable to deploy, delete, or edit anything. The Google Cloud Console also fails to display any workloads or resources within the cluster. While basic details about the cluster can still be shown, any changes result in timeouts.
On the bright side, everything on the cluster is still running fine, although no new certificates are being created. Additionally, I can view logs using the Google Logs Explorer, and I can still SSH into the VM of the cluster’s node, where I can see that the Docker containers are running. However, that’s the extent of my access.
I would greatly appreciate assistance from experts in GKE or Kubernetes. I’m wondering what might have caused this issue. Is it possible that I accidentally deleted something crucial from the cluster? Could the new cert-manager be blocking something, perhaps while attempting to create new certificates? What steps can I take to restore the connection to the cluster without access to kubectl?
I have attempted the following troubleshooting steps:
- Executing various `kubectl` and `helm` commands to obtain information about the cluster.
  - None of these commands get a response; they all time out (see the command sketch after this list).
- Accessing Kubernetes workloads through the Google Cloud Console.
  - This fails with the error message: “Missing data from clusters ** (deadline exceeded).”
- Upscaling the cluster using the Cloud Console.
  - This fails with the error message: “All cluster resources were brought up, but only 1 node out of 2 has registered; the cluster may be unhealthy.”
- Modifying the node image type via the Cloud Console.
  - This fails with the error: “K8s resource was not found: k8sclient: 7 - failed waiting for node registration of **: 404 status code returned. Requested resource not found.”
- Analyzing the logs.
  - Cert-manager repeatedly logs the following entries approximately every 5 minutes:

    ```
    k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1.Challenge: request to convert CR from an invalid group/version: acme.cert-manager.io/v1alpha2
    k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: Failed to watch *v1.Challenge: failed to list *v1.Challenge: request to convert CR from an invalid group/version: acme.cert-manager.io/v1alpha2
    ```

  - This behavior is perplexing, since I have already updated the API versions of the cluster issuers to `cert-manager.io/v1`.
  - Additionally, the konnectivity agent repeatedly logs the following entries:

    ```
    "conn write failure" err="write tcp ----:41618->----:10250: use of closed network connection" connectionID=3436
    "Exiting remoteToProxy" connectionID=3437
    "Exiting proxyToRemote" connectionID=3437
    ```

  - However, I believe these entries already appeared before the problem arose.
  - If there are any other locations where I should look, please let me know.
- Accessing the node VM through SSH and pausing the cert-manager Docker containers.
  - Pausing these containers did not have any impact; some of them were automatically restarted, which I assume is normal Kubernetes behavior.
- Using the `top` command via SSH to identify any processes consuming excessive resources.
  - I did not observe anything unusual in the resource utilization.
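As mentioned in the list above, here is roughly what I have been running (the exact resources queried don’t matter, since everything times out), plus a check of the Challenge CRD’s stored versions that I would like to run once the API server responds again. The jsonpath query is my assumption about where a leftover `acme.cert-manager.io/v1alpha2` version would still be recorded, based on the “invalid group/version” errors above:

```bash
# These all hang and eventually time out (the flag just makes them fail fast):
kubectl get nodes --request-timeout=30s
kubectl get pods --all-namespaces --request-timeout=30s
helm list --all-namespaces

# Once kubectl works again: check whether the Challenge CRD still lists the
# old v1alpha2 API among its stored versions (assumption, see lead-in above)
kubectl get crd challenges.acme.cert-manager.io \
  -o jsonpath='{.status.storedVersions}'
```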
If you have any further suggestions or areas where I should investigate, please advise.
I suppose it should be possible to restart whatever component is acting up from the node’s VM; a sketch of what I can still do there is below.
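For completeness, this is the kind of thing I still have access to over SSH. A sketch only: I’m assuming the GKE node runs the kubelet as a systemd unit, and the container names and IDs will of course differ per deployment:

```bash
# See which of the relevant system containers are running on the node
sudo docker ps | grep -E 'cert-manager|konnectivity'

# Check and, if needed, restart the kubelet
# (assumption: kubelet is systemd-managed on GKE nodes)
sudo systemctl status kubelet
sudo systemctl restart kubelet

# Restart an individual container by ID (placeholder ID)
sudo docker restart <container-id>
```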
Thank you for any help.