Changing CIDR broke my cluster. Need help to recover

I have a 3-node MicroK8s 1.26 HA cluster that had been running perfectly for a few months. I wanted to change the pod CIDR from 10.1.x.x to 100.1.x.x so that pods could communicate with devices on the nodes' internal network, which also uses 10.1.x.x.

So I followed this guide: https://microk8s.io/docs/change-cidr. I performed each step on every node before moving on to the next step (of course only stopping one node at a time).
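
For context, the change itself boiled down to roughly the following on each node. This is reconstructed from memory, so treat the exact file paths, the /16 mask and the sed patterns as assumptions and double-check them against the guide above:

microk8s stop

# point kube-proxy at the new pod CIDR (mask assumed; use whatever you picked)
sudo sed -i 's|--cluster-cidr=10.1.0.0/16|--cluster-cidr=100.1.0.0/16|' /var/snap/microk8s/current/args/kube-proxy

# update the Calico pool (CALICO_IPV4POOL_CIDR) in the bundled CNI manifest
sudo sed -i 's|10.1.0.0/16|100.1.0.0/16|' /var/snap/microk8s/current/args/cni-network/cni.yaml

microk8s start

# re-apply the CNI manifest so Calico picks up the new pool
microk8s kubectl apply -f /var/snap/microk8s/current/args/cni-network/cni.yaml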

Looking at the pods afterwards, most were stuck in either “Terminating” or “ContainerCreating”.
I then realized this was because the nodes themselves were stuck in “NotReady”. Only node 1 came back online:

kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
jldocker-2 NotReady 175d v1.26.1 10.255.0.8 Ubuntu 20.04.5 LTS 5.4.0-125-generic containerd://1.6.8
jldocker-3 NotReady 175d v1.26.1 10.255.0.9 Ubuntu 20.04.5 LTS 5.4.0-125-generic containerd://Unknown
jldocker-1 Ready 175d v1.26.1 10.255.0.7 Ubuntu 20.04.5 LTS 5.4.0-125-generic containerd://1.6.8

But when I SSH into the nodes, their status says they are fine:
user@jldocker-2:~$ microk8s status
microk8s is running
high-availability: yes
datastore master nodes: 10.255.0.7:19001 10.255.0.8:19001 10.255.0.9:19001
datastore standby nodes: none

I then tried a “microk8s inspect”, but it hung after “Copy disk usage information to the final report tarball”. After 10 minutes I had to cancel it.

Describe shows me that the kubelet has stopped posting status information:

kubectl describe node jldocker-2
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
NetworkUnavailable False Tue, 07 Feb 2023 22:52:12 -0500 Tue, 07 Feb 2023 22:52:12 -0500 CalicoIsUp Calico is running on this node
MemoryPressure Unknown Wed, 22 Feb 2023 14:55:26 -0500 Wed, 22 Feb 2023 15:01:44 -0500 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Wed, 22 Feb 2023 14:55:26 -0500 Wed, 22 Feb 2023 15:01:44 -0500 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Wed, 22 Feb 2023 14:55:26 -0500 Wed, 22 Feb 2023 15:01:44 -0500 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Wed, 22 Feb 2023 14:55:26 -0500 Wed, 22 Feb 2023 15:01:44 -0500 NodeStatusUnknown Kubelet stopped posting node status.

I checked, and the kubelite service (is that the right one?) seems to be running:

snap.microk8s.daemon-kubelite.service - Service for snap application microk8s.daemon-kubelite
Loaded: loaded (/etc/systemd/system/snap.microk8s.daemon-kubelite.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2023-02-22 20:13:59 UTC; 47min ago
Main PID: 428589 (kubelite)
Tasks: 13 (limit: 4611)
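
For reference, this is how I checked it (standard systemd commands, nothing MicroK8s-specific; the journalctl call is just what I would run next to see why the kubelet stopped reporting):

sudo systemctl status snap.microk8s.daemon-kubelite.service
sudo journalctl -u snap.microk8s.daemon-kubelite.service -n 100 --no-pager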

So after a while I just tried to rebuild everything. I ran microk8s forget on nodes 2 and 3, and then microk8s remove-node. The cluster came back healthy with only one node.
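
If it helps anyone reading later, the standard teardown flow per the MicroK8s clustering docs is roughly the following (I believe this is effectively what I ran, but I'm quoting the commands from memory):

# on each departing node (2 and 3):
microk8s leave

# then on the remaining node (1), drop the departed nodes from the cluster:
microk8s remove-node jldocker-2
microk8s remove-node jldocker-3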

I then tried to rejoin node 2 with the add-node / join commands. It completed successfully, but kubectl get nodes still shows just one node?!?
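
Concretely, that was the usual flow (the join URL below is illustrative; the real token comes from the add-node output):

# on node 1: generate a join token
microk8s add-node

# on node 2: run the join command printed by add-node, something like
microk8s join 10.255.0.7:25000/<token>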

Also, I noticed that if I run any command using microk8s.kubectl from the nodes that do not appear in the list, I get a bunch of “E0222 23:06:10.442663 479768 memcache.go:255] couldn’t get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request” errors, followed by the actual correct output.
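
That error looks like the metrics-server aggregated API being unavailable rather than kubectl itself being broken. If someone wants to confirm, this is the check I would run (plain kubectl, nothing exotic):

microk8s kubectl get apiservice v1beta1.metrics.k8s.io
microk8s kubectl -n kube-system get pods | grep metrics-server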

Can anyone help me get out of this mess?

My only remaining working node (node 1) seemed to have nearly filled its hard drive during the operation and was down to 2 GB free (since it took all the load). I gave it additional disk space, rebooted all three Ubuntu machines, and the nodes reconnected and all the errors disappeared.
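
For anyone hitting the same thing, the disk pressure is easy to confirm before it gets that far; the MicroK8s state lives under /var/snap/microk8s, so something like this shows where the space went (paths assume a default snap install):

df -h /var/snap/microk8s/common
sudo du -sh /var/snap/microk8s/common/* | sort -h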

Can’t say I like how all this happened, and it doesn’t inspire much confidence in the stability, but at least the problem is resolved.

My experience has been that a MicroK8s restart takes much longer than advertised, i.e. you have to wait a long time between commands, manually restart the systemd services, or do a full reboot.

So I’m not surprised the config cleared up after a reboot.
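
What has helped me is explicitly waiting for readiness between steps instead of assuming the restart has finished, e.g.:

microk8s stop
microk8s start
microk8s status --wait-ready

# or restart the whole snap and then wait:
sudo snap restart microk8s
microk8s status --wait-ready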