Nodes crashed. node.kubernetes.io/unreachable:NoSchedule taint

Ubuntu 20.04.5 LTS
microk8s 1.26.1 installed via snap
3-node HA cluster

I came back one morning and my cluster, which was working fine the previous day, was completely down. Unlike a similar incident I had before, this time the disks still have free space.

microk8s status
microk8s is not running. Use microk8s inspect for a deeper inspection.

microk8s start
Nothing happens for a few minutes, then it exits with no message; the status still says not running.
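To see which of the MicroK8s daemons actually came up, I assume listing the snap services on each node would help (I can run this and post the output if useful):

sudo snap services microk8s
sudo systemctl status snap.microk8s.daemon-kubelite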

I tried to check the logs using sudo journalctl -u snap.microk8s.daemon-kubelite, but there's too much output and I couldn't find anything relevant.
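If a narrower slice of that log would help, I can run something like this and share the result (the grep keywords are just my guess at what to look for):

sudo journalctl -u snap.microk8s.daemon-kubelite --since "1 hour ago" --no-pager | grep -iE "error|fatal|panic" | tail -n 50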

I then rebooted the nodes (sudo reboot) and they actually came back online with status Ready.

The conditions look good:

Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Thu, 23 Feb 2023 04:36:12 +0000   Thu, 23 Feb 2023 04:36:12 +0000   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Mon, 20 Mar 2023 20:38:48 +0000   Mon, 20 Mar 2023 20:10:26 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Mon, 20 Mar 2023 20:38:48 +0000   Mon, 20 Mar 2023 20:10:26 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Mon, 20 Mar 2023 20:38:48 +0000   Mon, 20 Mar 2023 20:10:26 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Mon, 20 Mar 2023 20:38:48 +0000   Mon, 20 Mar 2023 20:10:26 +0000   KubeletReady                 kubelet is posting ready status. AppArmor enabled

But there is a taint preventing any workload from being scheduled on them:

Taints: node.kubernetes.io/unreachable:NoSchedule

Events on pods:

0/3 nodes are available: 3 node(s) had untolerated taint {node.kubernetes.io/unreachable: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling…
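From what I understand, the taint could be removed by hand with kubectl (jldocker-1 is one of my nodes, used here as an example), but I would rather not do that blindly if whatever put it there just keeps re-adding it:

microk8s kubectl taint nodes jldocker-1 node.kubernetes.io/unreachable:NoSchedule-
# the trailing "-" removes the taint; it would need to be repeated for each node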

The events seem to show the kubelet service crashing in a loop:

  Type     Reason                   Age    From     Message
  ----     ------                   ----   ----     -------
  Warning  InvalidDiskCapacity      9m38s  kubelet  invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientPID     9m38s  kubelet  Node jldocker-1 status is now: NodeHasSufficientPID
  Normal   NodeAllocatableEnforced  9m38s  kubelet  Updated Node Allocatable limit across pods
  Normal   NodeHasNoDiskPressure    9m38s  kubelet  Node jldocker-1 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientMemory  9m38s  kubelet  Node jldocker-1 status is now: NodeHasSufficientMemory
  Normal   Starting                 9m38s  kubelet  Starting kubelet.
  Normal   Starting                 7m7s   kubelet  Starting kubelet.
  Warning  InvalidDiskCapacity      7m7s   kubelet  invalid capacity 0 on image filesystem
  Normal   NodeHasNoDiskPressure    7m6s   kubelet  Node jldocker-1 status is now: NodeHasNoDiskPressure
  Normal   NodeAllocatableEnforced  7m6s   kubelet  Updated Node Allocatable limit across pods
  Normal   NodeHasSufficientPID     7m6s   kubelet  Node jldocker-1 status is now: NodeHasSufficientPID
  Normal   NodeHasSufficientMemory  7m6s   kubelet  Node jldocker-1 status is now: NodeHasSufficientMemory
  Normal   Starting                 2m57s  kubelet  Starting kubelet.
  Warning  InvalidDiskCapacity      2m57s  kubelet  invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientMemory  2m57s  kubelet  Node jldocker-1 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    2m57s  kubelet  Node jldocker-1 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     2m57s  kubelet  Node jldocker-1 status is now: NodeHasSufficientPID
  Normal   NodeAllocatableEnforced  2m57s  kubelet  Updated Node Allocatable limit across pods
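To double-check that the service really is restarting rather than the kubelet just logging "Starting" repeatedly, I assume the systemd restart counter and the unit log would show it (again, happy to post this output):

sudo systemctl show snap.microk8s.daemon-kubelite --property=NRestarts
sudo journalctl -u snap.microk8s.daemon-kubelite | grep -i "started\|stopped" | tail -n 20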

Any command I send to microk8s kubectl prints error messages about memcache.go:

E0320 20:41:19.730002 50241 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0320 20:41:19.733484 50241 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0320 20:41:19.735017 50241 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0320 20:41:19.738322 50241 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0320 20:41:20.827654 50241 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0320 20:41:20.830214 50241 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
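If I read those errors right, they are about the aggregated metrics API rather than the core API server itself. I assume checking the corresponding APIService would confirm whether metrics-server is the piece that is unavailable (v1beta1.metrics.k8s.io is the name I expect it to have):

microk8s kubectl get apiservice v1beta1.metrics.k8s.io
# if AVAILABLE shows False here, that would at least explain the memcache.go noise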

Any idea what is causing this? From my point of view, the nodes seem healthy: RAM, CPU, and disk are fine, and networking between them is also functional.

I uploaded the tarball generated by microk8s inspect to my Google Drive:

Please help me recover my cluster!

Can you log this in the MicroK8s GitHub issues?

Sure, here it is:

I wasn't sure where to post; since this is not really a bug report but more of a request for help, I assumed it was preferable to post on the community forum, but as you please! :slight_smile:

@balchua1 Did you get a chance to take a look? I'm still stuck!