Cluster information:
Kubernetes version: 1.25
Cloud being used: AWS
Installation method: Console
Host OS: Amazon Linux 2
CNI and version: Amazon VPC CNI v1.12.5-eksbuild.2
CRI and version: containerd 1.6.6
Hey folks, I’m running an EKS cluster with a few worker nodes (scaling is handled by Karpenter). From time to time, a node gets stuck in a NotReady state, leaving all of its pods stuck in a Terminating state. To fix it, someone has to manually delete the worker node (removing it is the only option, since the node is completely unreachable), which is operationally heavy & very inefficient.
Is there a tool or configuration option that can automatically delete a node if its kubelet has not responded for N minutes?
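To make it concrete, this is roughly the behavior I’m after. A minimal sketch using the official Python kubernetes client, not an existing tool; the 10-minute threshold, the in-cluster config, and the RBAC permissions to list/delete nodes are all my own assumptions:

```python
# Sketch only: delete any node whose Ready condition has been
# False/Unknown for longer than N minutes (kubelet unreachable).
from datetime import datetime, timedelta, timezone

from kubernetes import client, config

NOT_READY_MINUTES = 10  # the "N" from my question; arbitrary example value

def reap_stuck_nodes():
    config.load_incluster_config()  # config.load_kube_config() when run locally
    v1 = client.CoreV1Api()
    now = datetime.now(timezone.utc)

    for node in v1.list_node().items:
        ready = next((c for c in node.status.conditions or []
                      if c.type == "Ready"), None)
        # status != "True" covers both "False" and "Unknown"
        if ready is None or ready.status == "True":
            continue
        stuck_for = now - ready.last_transition_time
        if stuck_for > timedelta(minutes=NOT_READY_MINUTES):
            print(f"deleting {node.metadata.name}: NotReady for {stuck_for}")
            v1.delete_node(node.metadata.name)

if __name__ == "__main__":
    reap_stuck_nodes()
```

I could run something like this as an in-cluster CronJob, but I’d much rather rely on an existing, battle-tested tool than maintain this myself.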
From what I’ve seen, neither aws-node-termination-handler nor Karpenter offers such an option.
Would appreciate any feedback & help on this. Thanks!