Changing CIDR broke my cluster. Need help to recover

I have a 3-node MicroK8s 1.26 HA cluster that had been running perfectly for a few months. I wanted to change the pod CIDR from 10.1.x.x to 100.1.x.x so that pods could communicate with devices on the nodes' internal network, which also uses 10.1.x.x.

So I followed this guide: https://microk8s.io/docs/change-cidr. I performed each step on every node before moving on to the next step (of course only stopping one node at a time).
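
For context, the change itself boiled down to roughly the following on each node. This is reconstructed from memory, so treat the exact file paths, the /16 mask and the sed patterns as assumptions and double-check them against the guide above:

microk8s stop

# point kube-proxy at the new pod CIDR (mask assumed; use whatever you picked)
sudo sed -i 's|--cluster-cidr=10.1.0.0/16|--cluster-cidr=100.1.0.0/16|' /var/snap/microk8s/current/args/kube-proxy

# update the Calico pool (CALICO_IPV4POOL_CIDR) in the bundled CNI manifest
sudo sed -i 's|10.1.0.0/16|100.1.0.0/16|' /var/snap/microk8s/current/args/cni-network/cni.yaml

microk8s start

# re-apply the CNI manifest so Calico picks up the new pool
microk8s kubectl apply -f /var/snap/microk8s/current/args/cni-network/cni.yaml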

Looking at the pods afterwards, most were stuck in either “Terminating” or “ContainerCreating”.
I then realized this was because the nodes themselves were stuck in “NotReady”. Only node 1 came back online:

kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
jldocker-2 NotReady 175d v1.26.1 10.255.0.8 Ubuntu 20.04.5 LTS 5.4.0-125-generic containerd://1.6.8
jldocker-3 NotReady 175d v1.26.1 10.255.0.9 Ubuntu 20.04.5 LTS 5.4.0-125-generic containerd://Unknown
jldocker-1 Ready 175d v1.26.1 10.255.0.7 Ubuntu 20.04.5 LTS 5.4.0-125-generic containerd://1.6.8

But when I SSH into the nodes, their status says they are fine:
user@jldocker-2:~$ microk8s status
microk8s is running
high-availability: yes
datastore master nodes: 10.255.0.7:19001 10.255.0.8:19001 10.255.0.9:19001
datastore standby nodes: none

I then tried a “microk8s inspect”, but it hung after “Copy disk usage information to the final report tarball”. After 10 minutes I had to cancel it.

Describe shows me that the kubelet has stopped posting status information:

kubectl describe node jldocker-2
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
NetworkUnavailable False Tue, 07 Feb 2023 22:52:12 -0500 Tue, 07 Feb 2023 22:52:12 -0500 CalicoIsUp Calico is running on this node
MemoryPressure Unknown Wed, 22 Feb 2023 14:55:26 -0500 Wed, 22 Feb 2023 15:01:44 -0500 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Wed, 22 Feb 2023 14:55:26 -0500 Wed, 22 Feb 2023 15:01:44 -0500 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Wed, 22 Feb 2023 14:55:26 -0500 Wed, 22 Feb 2023 15:01:44 -0500 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Wed, 22 Feb 2023 14:55:26 -0500 Wed, 22 Feb 2023 15:01:44 -0500 NodeStatusUnknown Kubelet stopped posting node status.

I checked, and the kubelite service (is that the right one?) seems to be running:

snap.microk8s.daemon-kubelite.service - Service for snap application microk8s.daemon-kubelite
Loaded: loaded (/etc/systemd/system/snap.microk8s.daemon-kubelite.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2023-02-22 20:13:59 UTC; 47min ago
Main PID: 428589 (kubelite)
Tasks: 13 (limit: 4611)
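
For reference, this is how I checked it (standard systemd commands, nothing MicroK8s-specific; the journalctl call is just what I would run next to see why the kubelet stopped reporting):

sudo systemctl status snap.microk8s.daemon-kubelite.service
sudo journalctl -u snap.microk8s.daemon-kubelite.service -n 100 --no-pager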

So after a while I just tried to rebuild everything. I ran microk8s forget on nodes 2 and 3, and then microk8s remove-node. The cluster came back healthy with only one node.
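
If it helps anyone reading later, the standard teardown flow per the MicroK8s clustering docs is roughly the following (I believe this is effectively what I ran, but I'm quoting the commands from memory):

# on each departing node (2 and 3):
microk8s leave

# then on the remaining node (1), drop the departed nodes from the cluster:
microk8s remove-node jldocker-2
microk8s remove-node jldocker-3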

I then tried to rejoin node 2 with the add-node / join commands. It completed successfully, but kubectl get nodes still shows just one node?!?
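
Concretely, that was the usual flow (the join URL below is illustrative; the real token comes from the add-node output):

# on node 1: generate a join token
microk8s add-node

# on node 2: run the join command printed by add-node, something like
microk8s join 10.255.0.7:25000/<token>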

Also, I noticed that if I run any command using microk8s.kubectl from the nodes that do not appear in the list, I get a bunch of “E0222 23:06:10.442663 479768 memcache.go:255] couldn’t get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request” errors, followed by the actual correct output.
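
That error looks like the metrics-server aggregated API being unavailable rather than kubectl itself being broken. If someone wants to confirm, this is the check I would run (plain kubectl, nothing exotic):

microk8s kubectl get apiservice v1beta1.metrics.k8s.io
microk8s kubectl -n kube-system get pods | grep metrics-server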

Can anyone help me get out of this mess?

My only remaining working node (node 1) seemed to have nearly filled its hard drive during the operation and was down to 2 GB free (since it took all the load). I gave it additional disk space, rebooted all three Ubuntu machines, and the nodes reconnected and all the errors disappeared.
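
For anyone hitting the same thing, the disk pressure is easy to confirm before it gets that far; the MicroK8s state lives under /var/snap/microk8s, so something like this shows where the space went (paths assume a default snap install):

df -h /var/snap/microk8s/common
sudo du -sh /var/snap/microk8s/common/* | sort -h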

Can’t say I like how all this happened, and it doesn’t inspire much confidence in the stability, but at least the problem is resolved.

My experience has been that a MicroK8s restart takes much longer than advertised, i.e. you have to wait a long time between commands, manually restart the systemd services, or do a full reboot.

So I’m not surprised the config cleared up after a reboot.
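
What has helped me is explicitly waiting for readiness between steps instead of assuming the restart has finished, e.g.:

microk8s stop
microk8s start
microk8s status --wait-ready

# or restart the whole snap and then wait:
sudo snap restart microk8s
microk8s status --wait-ready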