I have taken over two clusters (stage & prod) and am learning Kubernetes as I go. A problem has manifested on the stage cluster and I’m not sure how to even begin troubleshooting; any pointers would be appreciated.
I can’t reliably communicate with the cluster.
Some commands are working ok:
$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system aws-spot-handler-k8s-spot-termination-handler-4njjl 1/1 Running 0 48d
kube-system aws-spot-handler-k8s-spot-termination-handler-5dmfq 1/1 Running 0 49d
kube-system aws-spot-handler-k8s-spot-termination-handler-5kpck 1/1 Running 0 69d
...
While other commands always return an error:
$ kubectl get nodes
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
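In case the failure mode is useful, this is a sketch of how I’ve been trying to see where the request stalls (standard kubectl flags; /api/v1/nodes is just the nodes endpoint on the API server):

$ kubectl get nodes -v=8   # log the full request/response cycle
$ kubectl get --raw /api/v1/nodes --request-timeout=30s   # hit the API server endpoint directly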
From what I can tell, any command involving nodes will fail/time out. Prior to this problem I had noticed a few calico pods failing their readiness probes, so my vague assumption is that the node queries are breaking because cluster networking is degraded by the unhealthy calico pods. Being a Kubernetes newbie, I’m not sure where to even start.
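For reference, this is roughly how I’ve been inspecting the unhealthy calico pods (a sketch; the k8s-app=calico-node label and the calico-node container name are assumptions based on the stock Calico manifests):

$ kubectl get pods -n kube-system -l k8s-app=calico-node -o wide   # readiness per pod, plus the node each runs on
$ kubectl describe pod -n kube-system <calico-pod-name>            # events include the readiness probe failure message
$ kubectl logs -n kube-system <calico-pod-name> -c calico-node --tail=100   # recent log lines for errors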
Since it’s the stage cluster, I was happy to drain the nodes with unhealthy calico pods and let the cluster spin up new nodes, but the drain command keeps timing out as well.
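The drain invocation I’ve been using looks roughly like this (<node-name> is a placeholder; --ignore-daemonsets is needed because calico-node itself runs as a DaemonSet and can’t be evicted):

$ kubectl drain <node-name> --ignore-daemonsets --timeout=120s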
As a side note, I had also noticed that some of the calico pods are using a “lot” of memory (over 4 GB each), and I’m wondering if that’s related to them failing their readiness probes.
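For the record, I spotted the memory usage with something like this (a sketch; kubectl top requires metrics-server, so it may not work on every cluster):

$ kubectl top pod -n kube-system --sort-by=memory | grep calico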
Suggestions on how to troubleshoot??