I’ve got a strange situation with my microk8s setup and it’s stumped every engineer I’ve spoken to. Exhaustive searches all over the web (including this forum) turned up no instances of the same issue.
I’m documenting the issue here in the hope that someone can help, and also so that, since this seems to be a novel situation, whatever fix arises is recorded for future victims.
I have a single-node microk8s cluster running 1.30.1 on an Ubuntu 24.04 LTS system.
The core issue is: after rebooting the server, the microk8s cluster appears online but does not work properly. Here are the symptoms:

- Our application deployments and pods do not spin up.
- Running `kubectl top nodes` shows either “error: Metrics API not available” or “The connection to the server 1.2.3.4:16443 was refused - did you specify the right host or port?” depending on when it is run. This suggests that whichever microk8s component answers `kubectl top nodes` is crash-looping, so the result changes depending on exactly when the command is run (a quick probe for this is sketched below the list).
- The output of `ifconfig` shows no `vxlan.calico` interface present.
- Using tools like `k9s` to look at pods/deployments/services/anything is hit-or-miss. Sometimes what you request loads fine; other times you get various errors involving connections, permissions, etc. It behaves as if microk8s is crash-looping: while it is up for a moment it can answer requests, but then it dies and any request made in that window produces a connection error because the service is down. You can run `kubectl describe` ten times and two of them will yield results while the other eight produce connection errors.
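For anyone who wants to see the flapping directly, something like the probe below should make it visible (a minimal sketch, assuming the standard snap service names used by recent microk8s releases; run the journalctl command in a second terminal):

```bash
# Minimal sketch: check whether the microk8s API server (kubelite) is flapping
# after a reboot. Service names assume a recent microk8s snap.

# Are any of the microk8s daemons stopped or restarting?
snap services microk8s

# In a second terminal: watch kubelite (apiserver/kubelet/etc.) for restarts or panics.
sudo journalctl -u snap.microk8s.daemon-kubelite -f

# Poll the API server's readiness endpoint and timestamp each result.
while true; do
  printf '%s ' "$(date '+%H:%M:%S')"
  microk8s kubectl get --raw='/readyz' >/dev/null 2>&1 \
    && echo "apiserver OK" \
    || echo "apiserver unreachable"
  sleep 2
done
```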
Here’s the strange part:
Running `sudo microk8s inspect` to produce an inspection report fixes the issue every time.
Immediately after `microk8s inspect` runs, our application deployments spin up successfully, `kubectl top nodes` shows proper node metrics, and `ifconfig` shows that the `vxlan.calico` interface is both present and online.
My team and I have gone over each step in the inspect script, and as far as we can tell `microk8s inspect` simply reads various things and dumps them to files; it should not be making changes.
However, it very clearly is making some kind of change, since it reliably resolves the issue every time it is run after a reboot.
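One way to catch whatever it is changing would be to snapshot the relevant state immediately before and after an inspect run and diff the two. A rough sketch (the set of things captured is only a guess at what matters here: links, routes, iptables, sysctls, kernel modules, service states):

```bash
#!/usr/bin/env bash
# Sketch: snapshot networking/service state before and after `microk8s inspect`
# and diff the results to see what actually changed. The list of things captured
# is a guess at what is relevant to this problem.
set -euo pipefail

snapshot() {
  local dir="$1"
  mkdir -p "$dir"
  ip -d link show        > "$dir/links.txt"
  ip route show          > "$dir/routes.txt"
  sudo iptables-save     > "$dir/iptables.txt"
  sysctl -a 2>/dev/null  > "$dir/sysctl.txt" || true
  lsmod                  > "$dir/modules.txt"
  snap services microk8s > "$dir/snap-services.txt"
}

snapshot /tmp/state-before
sudo microk8s inspect
snapshot /tmp/state-after

diff -ru /tmp/state-before /tmp/state-after
```

If, say, `vxlan` appears in modules.txt only in the “after” snapshot, or the ip link / iptables diff is non-empty, that would narrow down which part of inspect actually matters.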
After a week of debugging, we’re almost at the point of adding `microk8s inspect` to a startup script so that it “fixes” whatever the problem is on every reboot, but that feels like such a dirty hack.
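For completeness, the hack we’re trying to avoid would look roughly like this (a sketch only; the unit name and the `After=` ordering are guesses, not something we have deployed):

```bash
# Sketch of the workaround we'd rather not ship: a oneshot systemd unit that runs
# `microk8s inspect` once the microk8s daemons have started on boot.
# The unit name and the After= ordering are guesses.
sudo tee /etc/systemd/system/microk8s-inspect-workaround.service >/dev/null <<'EOF'
[Unit]
Description=Temporary workaround: run microk8s inspect after boot
After=snap.microk8s.daemon-kubelite.service

[Service]
Type=oneshot
ExecStart=/snap/bin/microk8s inspect

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable microk8s-inspect-workaround.service
```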
We would really like to understand WHAT `microk8s inspect` is fixing so we can make whatever configuration change is necessary to resolve the underlying root cause directly.
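Another angle would be to trace the inspect run itself rather than just reading the script, and see what it writes or executes (a sketch; the syscall filter is only a starting point, and snap confinement may interfere with ptrace):

```bash
# Sketch: trace what `microk8s inspect` actually executes and writes, rather than
# only reading the script. The syscall filter is just a starting point; snap
# confinement may block ptrace, in which case the trace needs to be run differently.
sudo strace -f -e trace=execve,openat,write,unlinkat,renameat2 \
  -o /tmp/inspect.strace \
  microk8s inspect

# Anything opened for writing outside the inspection-report output is interesting.
grep -E 'execve|O_WRONLY|O_RDWR' /tmp/inspect.strace | grep -v inspection-report | less
```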
I am at the mercy of you, the curious and helpful members of the Kubernetes community, to help diagnose this. What info/data would YOU look at to understand what’s wrong?
Thank you for your help.