I have a 3-node MicroK8s 1.26 HA cluster that has been running perfectly for a few months. I wanted to change the pod CIDR from 10.1.x.x to 100.1.x.x so that the pods can communicate with devices on the nodes' internal network, which also uses 10.1.x.x.
So I followed this guide: https://microk8s.io/docs/change-cidr. I completed each step on every node before moving on to the next step (stopping only one node at a time, of course).
Looking at the pods afterwards, most were stuck in either “Terminating” or “ContainerCreating”.
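To illustrate the overlap that motivated the change, here is a minimal sketch in POSIX shell. It assumes the pod range is 10.1.0.0/16 (I only know it as 10.1.x.x) and uses a made-up device address, 10.1.20.5. The helper names ip_to_int and in_cidr are my own, not part of MicroK8s.

```shell
# Convert a dotted-quad IPv4 address to an integer.
# The subshell keeps the IFS change and positional parameters contained.
ip_to_int() {
    ( IFS=.; set -- $1; echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 )) )
}

# Succeed if address $1 falls inside CIDR $2 (for example 10.1.0.0/16).
in_cidr() {
    ip_int=$(ip_to_int "$1")
    net_int=$(ip_to_int "${2%/*}")
    bits=${2#*/}
    mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))
    [ $(( $ip_int & $mask )) -eq $(( $net_int & $mask )) ]
}

# Hypothetical device address on the internal network.
device=10.1.20.5
in_cidr "$device" 10.1.0.0/16  && echo "old pod range: $device collides"
in_cidr "$device" 100.1.0.0/16 || echo "new pod range: $device does not collide"
```

With the old range, traffic to such a device would be treated as pod traffic, which is why I wanted to move the pods elsewhere.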
I then realized this was because the nodes themselves were stuck in “NotReady”. Only node 1 came back online:
kubectl get nodes -o wide
jldocker-2 NotReady 175d v1.26.1 Ubuntu 20.04.5 LTS 5.4.0-125-generic containerd://1.6.8
jldocker-3 NotReady 175d v1.26.1 Ubuntu 20.04.5 LTS 5.4.0-125-generic containerd://Unknown
jldocker-1 Ready 175d v1.26.1 Ubuntu 20.04.5 LTS 5.4.0-125-generic containerd://1.6.8
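For anyone who wants to pull just the affected nodes out of that output, a small sketch follows. The function name not_ready_nodes is my own, and it assumes the first two columns are the node name and status, as they are with kubectl get nodes --no-headers.

```shell
# Print the names of nodes whose STATUS column does not start with "Ready".
# Reads `kubectl get nodes --no-headers` style output on stdin.
not_ready_nodes() {
    awk '$2 !~ /^Ready/ { print $1 }'
}

# Example usage (not run here):
#   microk8s kubectl get nodes --no-headers | not_ready_nodes
```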
But when I SSH into the nodes, their status says they are fine:
user@jldocker-2:~$ microk8s status
microk8s is running
high-availability: yes
datastore master nodes:
datastore standby nodes: none
I then tried “microk8s inspect”, but it hung after “Copy disk usage information to the final report tarball”. After 10 minutes I had to cancel it.
Describing the node shows that the kubelet has stopped posting node status:
kubectl describe node jldocker-2
Type Status LastHeartbeatTime LastTransitionTime Reason Message
NetworkUnavailable False Tue, 07 Feb 2023 22:52:12 -0500 Tue, 07 Feb 2023 22:52:12 -0500 CalicoIsUp Calico is running on this node
MemoryPressure Unknown Wed, 22 Feb 2023 14:55:26 -0500 Wed, 22 Feb 2023 15:01:44 -0500 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Wed, 22 Feb 2023 14:55:26 -0500 Wed, 22 Feb 2023 15:01:44 -0500 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Wed, 22 Feb 2023 14:55:26 -0500 Wed, 22 Feb 2023 15:01:44 -0500 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Wed, 22 Feb 2023 14:55:26 -0500 Wed, 22 Feb 2023 15:01:44 -0500 NodeStatusUnknown Kubelet stopped posting node status.
I checked, and the kubelite (is that it?) service seems to be running:
snap.microk8s.daemon-kubelite.service - Service for snap application microk8s.daemon-kubelite
Loaded: loaded (/etc/systemd/system/snap.microk8s.daemon-kubelite.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2023-02-22 20:13:59 UTC; 47min ago
Main PID: 428589 (kubelite)
Tasks: 13 (limit: 4611)
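Since the service is active, the next place to look is probably its journal. Below is a rough sketch for filtering it down to warnings and errors. It assumes kubelite logs in the usual Kubernetes klog format, where each line starts with a severity letter and the date (like the E0222 line quoted further down). The function name klog_errors is my own.

```shell
# Keep only klog lines with severity E (error), W (warning) or F (fatal).
# Expects raw log messages on stdin, one per line.
klog_errors() {
    grep -E '^[EWF][0-9]{4} '
}

# Example usage (not run here); -o cat prints only the message text,
# so each line starts with the klog prefix:
#   sudo journalctl -u snap.microk8s.daemon-kubelite -o cat --since "1 hour ago" --no-pager | klog_errors
```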
So after a while I just tried to rebuild everything. I ran microk8s leave on nodes 2 and 3, and then microk8s remove-node. The cluster came back healthy with only one node.
I then tried to rejoin node 2 with the add-node / join commands. The join completed successfully, but kubectl get nodes still shows just one node?!?
I also noticed that if I run any command with microk8s.kubectl on the nodes that do not appear in the list, I get a bunch of “E0222 23:06:10.442663 479768 memcache.go:255] couldn’t get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request” errors, followed by the actual correct output.
Can anyone help me get out of this mess?