Kubernetes v1.20.15: worker nodes suddenly going into NotReady and Unknown states


Cluster information:

Kubernetes version: v1.20.15
Cloud being used: bare-metal
Installation method: k
Host OS: Red Hat Linux 7
CNI and version: Calico - v3.16.3
CRI and version:

The issue is that the cluster has suddenly started behaving strangely. We have 17 worker nodes and 3 master nodes, and specific worker nodes keep going into NotReady / Unknown status. Even after rebooting a server, memory utilization climbs again after some time and the node goes back into NotReady.

When we check with kubectl top no and kubectl top po, no utilization figures are shown at all for the affected nodes.

#kubectl get no
NAME        STATUS                        ROLES                  AGE    VERSION
vm3704lnx   NotReady                                             685d   v1.20.15
vm3705lnx   Ready                         control-plane,master   685d   v1.20.15
vm3706lnx   Ready                         control-plane,master   685d   v1.20.15
vm3707lnx   Ready,SchedulingDisabled                             685d   v1.20.15
vm3708lnx   Ready                         control-plane,master   685d   v1.20.15
vm6385lnx   NotReady                                             367d   v1.20.15
vm6387lnx   Ready                                                366d   v1.20.15
vm6394lnx   NotReady                                             366d   v1.20.15
vm6395lnx   Ready                                                366d   v1.20.15
vm6420lnx   NotReady                                             366d   v1.20.15
vm6473lnx   Ready                                                365d   v1.20.15
vm6474lnx   Ready                                                163d   v1.20.15
vm6476lnx   NotReady                                             365d   v1.20.15
vm6477lnx   Ready                                                365d   v1.20.15
vm6478lnx   NotReady                                             365d   v1.20.15
vm6479lnx   NotReady                                             365d   v1.20.15
vm6480lnx   Ready,SchedulingDisabled                             163d   v1.20.15
vm6694lnx   Ready                                                163d   v1.20.15
vm6701lnx   NotReady                                             163d   v1.20.15
vm6702lnx   NotReady,SchedulingDisabled                          163d   v1.20.15

Output of kubectl top node:

kubectl top no

NAME        CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
vm3705lnx   545m         9%     12544Mi         39%
vm3706lnx   567m         9%     9400Mi          29%
vm3707lnx   183m         2%     1786Mi          2%
vm3708lnx   665m         11%    21521Mi         67%
vm6387lnx   2586m        32%    38105Mi         59%
vm6395lnx   901m         11%    41162Mi         64%
vm6473lnx   1868m        23%    26797Mi         41%
vm6477lnx   781m         9%     8330Mi          12%
vm6480lnx   186m         2%     32056Mi         49%
vm6385lnx
vm6476lnx
vm6478lnx
vm6479lnx
vm6701lnx
vm6702lnx
vm3704lnx
vm6394lnx
vm6474lnx
vm6694lnx
vm6420lnx
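
The nodes with no figures in the output above are exactly the ones whose kubelet metrics can no longer be scraped. As a quick check of what the kubelet last reported, the node conditions can be pulled like this (a sketch; vm3704lnx is just one of the NotReady nodes from the list above, substitute any affected node):

# full Conditions block from describe (Ready, MemoryPressure, DiskPressure, PIDPressure)
kubectl describe node vm3704lnx | grep -A 12 "Conditions:"

# or only the Ready condition's message
kubectl get node vm3704lnx -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'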

Here is the Allocated resources / Events section of kubectl describe node for the affected nodes:
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                600m (7%)      500m (6%)
  memory             22696Mi (35%)  4556Mi (7%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-1Gi      0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
Events:
  Type     Reason         Age                  From     Message
  ----     ------         ----                 ----     -------
  Warning  SystemOOM      13h                  kubelet  System OOM encountered
  Warning  ImageGCFailed  141m (x24 over 11h)  kubelet  failed to get image stats: rpc error: code = DeadlineExceeded desc = context deadline exceeded

==================================
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                1150m (14%)    1 (12%)
  memory             25086Mi (39%)  9624Mi (15%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-1Gi      0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
Events:
  Type     Reason                   Age                   From     Message
  ----     ------                   ----                  ----     -------
  Warning  ContainerGCFailed        58m (x13 over 5h15m)  kubelet  rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Normal   NodeHasSufficientMemory  57m (x11 over 13h)    kubelet  Node de6385yr status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    57m (x11 over 13h)    kubelet  Node de6385yr status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     57m (x11 over 13h)    kubelet  Node de6385yr status is now: NodeHasSufficientPID
  Normal   NodeNotReady             57m (x8 over 13h)     kubelet  Node de6385yr status is now: NodeNotReady
  Warning  ImageGCFailed            56m (x29 over 5h27m)  kubelet  failed to get image stats: rpc error: code = DeadlineExceeded desc = context deadline exceeded

=========================================
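
The SystemOOM warning above means the kernel OOM killer fired on that node, and the DeadlineExceeded errors are kubelet calls to the container runtime timing out. To see what the kernel actually killed, something like the following can be run on the affected node itself (a sketch, not specific to this cluster):

# recent kernel OOM kills, with human-readable timestamps
dmesg -T | grep -iE "out of memory|killed process"

# the same information from the kernel journal, last 24 hours
journalctl -k --since "24 hours ago" | grep -i oom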

We are not getting any clue as to which pods or resources are behaving like this; the cluster keeps ending up in a defunct state and no deployments run stably.
Please help if there is any way to find the reason for this issue; I have found nothing so far.
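
One way to narrow down what is eating memory on a suspect node while it is still Ready is a sketch like the one below (vm6385lnx is just an example name taken from the node list above):

# everything scheduled on the suspect node
kubectl get pods -A -o wide --field-selector spec.nodeName=vm6385lnx

# per-pod usage across the cluster, sorted by memory
# (the --sort-by flag is not available on older kubectl builds; drop it if unsupported)
kubectl top pod -A --sort-by=memory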


Can you check whether the container engine (docker, cri-o, containerd) is running on those nodes? Also, is kubelet restarting?

It seems like, for some reason, kubelet cannot communicate with the container engine.
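
A minimal set of checks along those lines, run directly on one of the NotReady nodes (the exact unit and CLI names depend on which runtime is installed, since the CRI field above was left blank):

# is the runtime running? (only one of these units will exist)
systemctl status docker containerd crio

# is kubelet running, and since when? (frequent restarts show up as a short uptime)
systemctl status kubelet

# kubelet-side errors about the runtime, OOM events and PLEG health
journalctl -u kubelet --since "1 hour ago" | grep -iE "deadline|oom|pleg|runtime"

# does the runtime answer on its socket? (crictl for containerd/cri-o, docker info for docker)
crictl info
docker info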