Kubernetes troubleshooting worst case: offline, behind a web proxy, on locked-down networks


Cluster information:

Kubernetes version: 1.28.2
Cloud being used: AWS GovCloud
Installation method: CloudFormation / Chef / manual
Host OS: Red Hat Enterprise Linux 8.8
CNI and version: Calico v3.24.5
CRI and version: containerd.io 1.6.31-3.1.el8

I have a handful of nodes in AWS that are not connected to the internet, and are behind a web proxy for the limited web services available on the network.

I have pods that remain stuck in "ContainerCreating" with very little feedback. The one common error with all of them is:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up network for sandbox "(imagine a 32 bit uuid here)": plugin type "calico" failed (add): netplugin failed but error parsing its diagnostic message "No valid options provided. Usage:\n": invalid character 'N' looking for beginning of value

Kubelet and containerd both throw that message repeatedly for any pods stuck in creating.

The nature of the message makes me think that something is trying to reach something else and is getting a proxy response instead, OR that there's some calico command that isn't completing. The error is un-Google-able; no one appears to have ever recorded it anywhere Google can find.

I've been staring at it for three weeks, and in that time have tried pretty much everything you can think of, but I know I haven't tried everything. I do know that this exact network was working with Kubernetes 1.14 on Red Hat 7.

Is there any way for me to get better debug logs from, say, metrics-server while it tries to spawn? Are there any (essentially) manual methods for picking through a pod's initialization? Does Kubernetes have any troubleshooting or debugging tooling, or is it just: start over, again and again?
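(For reference, the "manual method" I'd like to step through by hand is small: per the CNI spec, the runtime just execs the plugin binary with a handful of CNI_* environment variables, feeds the network config JSON on stdin, and parses stdout as JSON. A runnable sketch of that handshake, using /bin/cat as a stand-in for /opt/cni/bin/calico so it works off-cluster; the container ID and netns path here are hypothetical placeholders:)

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// invokePlugin mimics the CNI exec protocol: run the plugin binary with
// CNI_* environment variables set, feed the network config JSON on stdin,
// and parse whatever comes back on stdout as JSON -- the step that fails
// with "invalid character ... looking for beginning of value" when the
// plugin prints usage text instead of a JSON result.
func invokePlugin(bin, netconf string) (map[string]interface{}, error) {
	cmd := exec.Command(bin)
	cmd.Env = append(os.Environ(),
		"CNI_COMMAND=ADD",
		"CNI_CONTAINERID=0123456789abcdef", // hypothetical sandbox ID
		"CNI_NETNS=/var/run/netns/example", // hypothetical netns path
		"CNI_IFNAME=eth0",
		"CNI_PATH=/opt/cni/bin",
	)
	cmd.Stdin = strings.NewReader(netconf)
	var out bytes.Buffer
	cmd.Stdout = &out
	if err := cmd.Run(); err != nil {
		return nil, err
	}
	var result map[string]interface{}
	if err := json.Unmarshal(out.Bytes(), &result); err != nil {
		return nil, fmt.Errorf("error parsing plugin output %q: %w", out.String(), err)
	}
	return result, nil
}

func main() {
	// /bin/cat simply echoes the config back, which is valid JSON,
	// so the parse succeeds; a real run would exec the calico binary.
	netconf := `{"cniVersion":"0.3.1","name":"k8s-pod-network","type":"calico"}`
	result, err := invokePlugin("/bin/cat", netconf)
	fmt.Println(result, err)
}
```

On a real node you'd swap /bin/cat for the actual plugin path and the real conflist from /etc/cni/net.d, which lets you watch exactly what the binary does with the inputs kubelet would give it.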

That error comes from here:

So it looks like the calico plugin is possibly not being passed the correct parameters it's expecting.
"Invalid character" sounds like a JSON parsing error.
Have you checked the ConfigMaps being passed to Calico to make sure they are valid?
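For what it's worth, the trailing part of the message is exactly what Go's encoding/json emits when handed non-JSON text: the runtime tried to parse the plugin's diagnostic output ("No valid options provided. Usage:\n") as a JSON payload and choked on the leading 'N'. A minimal reproduction:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseDiag does what the runtime does with the plugin's output:
// attempt to decode it as JSON.
func parseDiag(s string) error {
	var v map[string]interface{}
	return json.Unmarshal([]byte(s), &v)
}

func main() {
	// The plugin printed usage text instead of a JSON result.
	diag := "No valid options provided. Usage:\n"
	fmt.Println(parseDiag(diag))
	// prints: invalid character 'N' looking for beginning of value
}
```

So the inner error ("No valid options provided") is the plugin complaining about its inputs, and the outer error is just the runtime failing to decode that complaint.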

After slogging through some things over the weekend, I found the actual root cause thanks to your note kick-starting my brain: the calico binary on my nodes wasn't correct; it was a mis-marked 3.24 version. I reinstalled Calico 3.27 and all is well. It's a very, very tough environment to troubleshoot in, and this one was particularly silly since I could not easily narrow down where the problem originated.