I am having an issue with my microk8s cluster. API calls are incredibly slow, and I can't figure out the root cause. The system had been running fine for almost a year until this issue started.
Calls sometimes complete, but take a long time; other times they time out entirely. journalctl on all of the machines also shows lots of timeouts (the commands I'm watching with are listed after the example below). I also get random "couldn't get resource list for…" errors; whether they appear, and which API group they are for, changes a lot from run to run.
For example:
jarrod@storage01:~$ time microk8s kubectl get ns
E1219 06:38:25.397755 73512 memcache.go:255] couldn't get resource list for cert-manager.io/v1: the server could not find the requested resource
E1219 06:38:25.397889 73512 memcache.go:255] couldn't get resource list for acme.cert-manager.io/v1: the server could not find the requested resource
E1219 06:38:25.399638 73512 memcache.go:255] couldn't get resource list for traefik.containo.us/v1alpha1: the server could not find the requested resource
E1219 06:38:30.393967 73512 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E1219 06:38:35.396371 73512 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E1219 06:38:40.400431 73512 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E1219 06:38:45.403532 73512 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
NAME                 STATUS   AGE
kube-system          Active   329d
kube-public          Active   329d
kube-node-lease      Active   329d
default              Active   329d
rook-ceph            Active   329d
monitoring           Active   329d
metallb-system       Active   324d
traefik              Active   324d
ssh                  Active   303d
samba                Active   316d
container-registry   Active   238d
cert-manager         Active   231d
hello-world          Active   231d
honeycomb            Active   136d
devenv               Active   2d3h
real 0m35.609s
user 0m0.136s
sys 0m0.051s
It doesn’t matter which host I run this on. I have not deployed a load balancer in front of the API yet.
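For reference, these are roughly the journalctl commands I've been using to watch for those timeouts (the unit names assume the stock MicroK8s snap services; adjust if yours differ):

# API server / kubelet logs (MicroK8s bundles them into the kubelite daemon)
sudo journalctl -f -u snap.microk8s.daemon-kubelite

# Datastore (dqlite) logs
sudo journalctl -f -u snap.microk8s.daemon-k8s-dqlite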
DNS is working fine; all of the host names resolve reliably.
I'm not sure what to look at next to figure this out; pointers would be appreciated.
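In case extra output is useful for pointers: the errors above mention the metrics.k8s.io aggregated API being unavailable, so the checks below are what I can run and share (I'm guessing the APIService name v1beta1.metrics.k8s.io from the error text):

# Show registered aggregated APIs and whether they report as Available
microk8s kubectl get apiservices

# Details for the metrics APIService from the errors above
microk8s kubectl describe apiservice v1beta1.metrics.k8s.io

# Verbose client output shows per-request latency
time microk8s kubectl get ns -v=6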
I believe I have the same issue, but much worse. It was fast, and then the next day everything crashed; nothing can be done on the cluster any more. I'm also getting a lot of these memcache.go "the server is currently unable to handle the request" errors.
I believe this is due to the k8s-dqlite process using 100% CPU, but why?
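For what it's worth, this is roughly how I'm looking at it (assuming the stock MicroK8s snap service names; the pgrep pattern is just my way of finding the process):

# Watch CPU usage of the dqlite process
top -p "$(pgrep -f k8s-dqlite | head -n1)"

# Recent dqlite logs
sudo journalctl -u snap.microk8s.daemon-k8s-dqlite --since "1 hour ago"

# Full diagnostics tarball from MicroK8s
microk8s inspect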