Kubernetes troubleshooting worst case: offline, behind a web proxy, on locked-down networks


Cluster information:

Kubernetes version: 1.28.2
Cloud being used: AWS GovCloud
Installation method: CloudFormation / Chef / manual
Host OS: Red Hat Enterprise Linux 8.8
CNI and version: Calico v3.24.5
CRI and version: containerd.io 1.6.31-3.1.el8

I have a handful of nodes in AWS that are not connected to the internet, and are behind a web proxy for the limited web services available on the network.

I have pods that remain stuck in "ContainerCreating" with very little feedback. The one common error with all of them is:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up network for sandbox "(imagine a 32 bit uuid here)": plugin type "calico" failed (add): netplugin failed but error parsing its diagnostic message "No valid options provided. Usage:\n": invalid character 'N' looking for beginning of value

Kubelet and containerd both throw that message repeatedly for any pods stuck in creating.

The nature of the message makes me think that something is trying to reach something else and is getting a proxy response instead, OR that there's some calico command that isn't completing. The error is un-Google-able; no one appears to have ever recorded it anywhere Google can find.

I've been staring at it for three weeks, and in that time have tried pretty much everything you can think of, but I know I haven't tried everything. I do know that this exact network was working with Kubernetes 1.14 on Red Hat 7.

Is there any way for me to get better debug logs from, say, metrics-server while it tries to spawn? Are there any (essentially) manual methods for picking through a pod's initialization? Does Kubernetes have any troubleshooting or debugging tooling, or is it just: start over, again and again?
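(For reference, the "manual method" I'd like to step through by hand is small: per the CNI spec, the runtime just execs the plugin binary with a handful of CNI_* environment variables, feeds the network config JSON on stdin, and parses stdout as JSON. A runnable sketch of that handshake, using /bin/cat as a stand-in for /opt/cni/bin/calico so it works off-cluster; the container ID and netns path here are hypothetical placeholders:)

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// invokePlugin mimics the CNI exec protocol: run the plugin binary with
// CNI_* environment variables set, feed the network config JSON on stdin,
// and parse whatever comes back on stdout as JSON -- the step that fails
// with "invalid character ... looking for beginning of value" when the
// plugin prints usage text instead of a JSON result.
func invokePlugin(bin, netconf string) (map[string]interface{}, error) {
	cmd := exec.Command(bin)
	cmd.Env = append(os.Environ(),
		"CNI_COMMAND=ADD",
		"CNI_CONTAINERID=0123456789abcdef", // hypothetical sandbox ID
		"CNI_NETNS=/var/run/netns/example", // hypothetical netns path
		"CNI_IFNAME=eth0",
		"CNI_PATH=/opt/cni/bin",
	)
	cmd.Stdin = strings.NewReader(netconf)
	var out bytes.Buffer
	cmd.Stdout = &out
	if err := cmd.Run(); err != nil {
		return nil, err
	}
	var result map[string]interface{}
	if err := json.Unmarshal(out.Bytes(), &result); err != nil {
		return nil, fmt.Errorf("error parsing plugin output %q: %w", out.String(), err)
	}
	return result, nil
}

func main() {
	// /bin/cat simply echoes the config back, which is valid JSON,
	// so the parse succeeds; a real run would exec the calico binary.
	netconf := `{"cniVersion":"0.3.1","name":"k8s-pod-network","type":"calico"}`
	result, err := invokePlugin("/bin/cat", netconf)
	fmt.Println(result, err)
}
```

On a real node you'd swap /bin/cat for the actual plugin path and the real conflist from /etc/cni/net.d, which lets you watch exactly what the binary does with the inputs kubelet would give it.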

That error comes from here:

So it looks like the calico plugin is possibly not being passed the correct parameters it's expecting.
"Invalid character" sounds like a JSON parsing error.
Have you checked the ConfigMaps being passed to Calico to make sure they are valid?
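For what it's worth, the trailing part of the message is exactly what Go's encoding/json emits when handed non-JSON text: the runtime tried to parse the plugin's diagnostic output ("No valid options provided. Usage:\n") as a JSON payload and choked on the leading 'N'. A minimal reproduction:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseDiag does what the runtime does with the plugin's output:
// attempt to decode it as JSON.
func parseDiag(s string) error {
	var v map[string]interface{}
	return json.Unmarshal([]byte(s), &v)
}

func main() {
	// The plugin printed usage text instead of a JSON result.
	diag := "No valid options provided. Usage:\n"
	fmt.Println(parseDiag(diag))
	// prints: invalid character 'N' looking for beginning of value
}
```

So the inner error ("No valid options provided") is the plugin complaining about its inputs, and the outer error is just the runtime failing to decode that complaint.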

After slogging through some things over the weekend, I found the actual root cause thanks to your note kick-starting my brain: the calico binary on my nodes wasn't correct; it was a mis-marked 3.24 version. I reinstalled Calico 3.27 and all is well. It's a very, very tough environment to troubleshoot in, and this one was particularly silly since I could not easily narrow down where the problem originated.