Microk8s broken after reboot, but "microk8s inspect" fixes it every time. How?

I’ve got a strange situation with my microk8s setup, and it has stumped every engineer I’ve spoken to. Exhaustive searches across the web (including this forum) turned up no instances of the same issue.
I’m hoping to document the issue here so that not only can someone help, but also, since this seems to be a novel situation, whatever fix emerges is recorded for future victims.

I have a single-node microk8s cluster running 1.30.1 on an Ubuntu 24.04 LTS system.

The core issue: after rebooting the server, the microk8s cluster appears to be online but does not work properly. Here are the symptoms:

  • Our application deployments and pods do not spin up
  • Running kubectl top nodes shows either “error: Metrics API not available” or “The connection to the server 1.2.3.4:16443 was refused - did you specify the right host or port?” depending on exactly when it is run, which suggests that whatever component answers kubectl top nodes is crash-looping
  • Viewing the output of ifconfig shows there is no vxlan.calico interface present.
  • Using tools like k9s to look at pods/deployments/services/anything is hit-or-miss. Sometimes what you’re requesting loads fine; other times you get various errors involving connections, permissions, etc. It behaves as if microk8s is crash-looping: while it is up for a moment it answers requests, then it dies, and anything asked in that moment fails with a connection error because the service is down. You can run kubectl describe 10 times and 2 of them will yield results while the other 8 produce connection errors. (One way to watch for this suspected crash loop is sketched right after this list.)
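
If you want to watch this happen live, one way to see whether the API server really is bouncing is to follow the kubelite daemon log (assuming the standard snap service name on a stock microk8s install; adjust if yours differs):

# follow the kubelite (API server) daemon log for restarts and crash messages
sudo journalctl -f -u snap.microk8s.daemon-kubelite

# or use snapd’s own log viewer
sudo snap logs microk8s.daemon-kubelite -f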

Here’s the strange part:

Running sudo microk8s inspect to produce an inspection report fixes the issue every time.

Immediately after running microk8s inspect, our application deployments spin up successfully, kubectl top nodes shows proper node metrics, and ifconfig shows the vxlan.calico interface present and up.

My team and I have gone over each step in the inspect script, and as far as we can tell the microk8s inspect command simply reads various things and dumps them to files; it should not be making changes.
However, it very clearly is making some kind of change, since it reliably resolves the issue every time it is run after a reboot.
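
One thing we are considering next, to catch exactly what an inspect run changes on the host, is to snapshot the loaded kernel modules and sysctls before and after and diff them (nothing microk8s-specific here, just standard tools):

# snapshot kernel state, run inspect, snapshot again, then diff
lsmod | sort > /tmp/lsmod.before
sudo sysctl -a 2>/dev/null | sort > /tmp/sysctl.before
sudo microk8s inspect
lsmod | sort > /tmp/lsmod.after
sudo sysctl -a 2>/dev/null | sort > /tmp/sysctl.after
diff /tmp/lsmod.before /tmp/lsmod.after
diff /tmp/sysctl.before /tmp/sysctl.after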

After a week of debugging, we’re almost at the point of adding microk8s inspect to a startup script so that it “fixes” whatever the problem is on each reboot, but that feels like such a dirty hack.

We really would like to understand WHAT microk8s inspect is fixing so we can make whatever configuration change is necessary to resolve the underlying root cause directly.

I am at the mercy of you, the curious and helpful members of the Kubernetes community, to help diagnose this. What info/data would YOU look at to understand what’s wrong?

Thank you for your help.


It’s been a few more days of hacking and we finally identified the root cause for the above.

Some other symptoms:

  • In the microk8s.daemon-kubelite log, you see Error: open /proc/sys/net/netfilter/nf_conntrack_max: no such file or directory

The root cause was identified as the lack of the nf_conntrack kernel module.

I found this: microk8s - kube-proxy errors and quits with nf_conntrack_max: no such file - Ask Ubuntu

and I agree with the poster there.
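
On an affected box (before running inspect), you can confirm the module is missing with something like:

# the module should show up in lsmod once loaded, and the file kube-proxy
# complains about only exists while the module is loaded
lsmod | grep nf_conntrack
cat /proc/sys/net/netfilter/nf_conntrack_max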

If microk8s depends on the nf_conntrack module, why doesn’t it ensure the module is loaded?

Moreover, why does the microk8s inspect command seemingly load the nf_conntrack module when all it is supposed to be doing is inspecting? Loading a kernel module is an incredibly sensitive action, and doing it silently is pretty crazy.

Once the nf_conntrack module was loaded and configured to load again on every boot, all of the above problems disappeared and microk8s came online successfully after each reboot.
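
In case it helps, the “load again on every boot” part can also be done with the standard systemd drop-ins (a sketch; pick whatever conntrack_max value suits your node):

# load nf_conntrack automatically at every boot
echo nf_conntrack | sudo tee /etc/modules-load.d/nf_conntrack.conf

# optionally pin the conntrack table size via a sysctl drop-in
echo "net.netfilter.nf_conntrack_max=131072" | sudo tee /etc/sysctl.d/99-conntrack.conf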

I hope this solves your problem, future dude.

Confirmed, I have this exact problem with the same setup as you. I went the GPT route to help resolve this for me and below is what it came up with:

# Load the module now and install the conntrack userspace tool
sudo modprobe nf_conntrack
sudo apt-get update
sudo apt-get install conntrack

# Set the conntrack table size and persist it across reboots
sudo sysctl -w net.netfilter.nf_conntrack_max=131072
echo "net.netfilter.nf_conntrack_max=131072" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Make sure the extra kernel modules package for the running kernel is installed
sudo apt-get install linux-modules-extra-$(uname -r)

# Verify the sysctl entry was written
grep nf_conntrack_max /etc/sysctl.conf

# Create a oneshot systemd unit that loads the module and sets the sysctl at boot
sudo nano /etc/systemd/system/load-conntrack.service

Add the following to the file:

[Unit]
Description=Load nf_conntrack module and set nf_conntrack_max
After=network.target

[Service]
Type=oneshot
ExecStart=/sbin/modprobe nf_conntrack
ExecStart=/sbin/sysctl -w net.netfilter.nf_conntrack_max=131072
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

sudo systemctl enable load-conntrack.service

sudo reboot
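
After the reboot, it’s worth confirming that the unit actually ran and the module is loaded:

systemctl status load-conntrack.service
lsmod | grep nf_conntrack
sysctl net.netfilter.nf_conntrack_max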