API very slow, often times out

Hi,

I am having an issue with my microk8s cluster: API calls are incredibly slow, and I can't figure out the root cause. The cluster had been running fine for almost a year before this started.

Calls sometimes complete, but take a long time; other times they time out. journalctl on all of the machines also shows lots of timeouts. I also get random "couldn't get resource list for…" errors; whether they appear, and which API group they mention, varies from run to run.

For example:

jarrod@storage01:~$ time microk8s kubectl get ns
E1219 06:38:25.397755   73512 memcache.go:255] couldn't get resource list for cert-manager.io/v1: the server could not find the requested resource
E1219 06:38:25.397889   73512 memcache.go:255] couldn't get resource list for acme.cert-manager.io/v1: the server could not find the requested resource
E1219 06:38:25.399638   73512 memcache.go:255] couldn't get resource list for traefik.containo.us/v1alpha1: the server could not find the requested resource
E1219 06:38:30.393967   73512 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E1219 06:38:35.396371   73512 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E1219 06:38:40.400431   73512 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E1219 06:38:45.403532   73512 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
NAME                 STATUS   AGE
kube-system          Active   329d
kube-public          Active   329d
kube-node-lease      Active   329d
default              Active   329d
rook-ceph            Active   329d
monitoring           Active   329d
metallb-system       Active   324d
traefik              Active   324d
ssh                  Active   303d
samba                Active   316d
container-registry   Active   238d
cert-manager         Active   231d
hello-world          Active   231d
honeycomb            Active   136d
devenv               Active   2d3h

real	0m35.609s
user	0m0.136s
sys	0m0.051s
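The metrics.k8s.io/v1beta1 errors above point at an aggregated API (metrics-server) that is not responding, which can be checked independently of the slow namespace listing. This is only a sketch, assuming the standard APIService name and metrics-server labels; they may differ on this cluster:

# List all APIServices; anything not showing Available=True is a group that
# kubectl keeps retrying during discovery, which adds to the delay seen above.
microk8s kubectl get apiservice

# Look at the metrics-server APIService specifically and its failure reason.
microk8s kubectl describe apiservice v1beta1.metrics.k8s.io

# Check whether the pods backing that API are actually running.
microk8s kubectl get pods -n kube-system -l k8s-app=metrics-server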

It doesn’t matter which host I run this on. I have not deployed a load balancer in front of the API yet.

DNS is working fine; all the names resolve reliably.
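To double-check that this really is node-independent and not DNS-related, one rough test is to time the same request against each node's API server by IP, bypassing names entirely. A sketch only: the IPs below are placeholders, MicroK8s serves the API on port 16443 by default, and if anonymous access to the health endpoints is disabled the calls will return 401/403, but the response time is still informative.

# Time a simple health request against each node's kube-apiserver directly by IP.
# 10.0.0.11/12/13 are placeholders; substitute the real node addresses.
for node in 10.0.0.11 10.0.0.12 10.0.0.13; do
  echo "--- $node ---"
  time curl -sk "https://$node:16443/readyz?verbose" | tail -n 3
done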

I’m not sure what to look at to figure this out, pointers would be appreciated.

I can’t see how to attach files here, so I have uploaded the inspect reports to Google Drive: microk8s-inspect-reports - Google Drive

Those are reports from four of the six nodes; the other two haven’t finished generating theirs yet.

Did you ever find a solution to this issue?

No, it’s still an issue, and I have no idea where else to look.

I have the same issue, very annoying!

I believe I have the same issue, but much worse. The cluster was fast, and then the next day everything crashed; nothing can be done on the cluster any more. I’m also getting a lot of these memcache.go "the server is currently unable to handle the request" errors.

I believe this is due to the k8s-dqlite process using 100% CPU, but why?
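If k8s-dqlite really is pinned at 100% CPU, that alone would explain slow and timed-out API calls on every node, since all apiserver reads and writes go through the dqlite datastore. A rough way to confirm it, assuming a standard snap-based MicroK8s install (the service names and datastore path below are the usual defaults and may differ by version):

# Confirm which process is burning CPU and for how long.
ps -C k8s-dqlite -o pid,%cpu,%mem,etime,cmd

# Watch the datastore and apiserver logs for leader elections, timeouts, or slow queries.
sudo journalctl -fu snap.microk8s.daemon-k8s-dqlite
sudo journalctl -fu snap.microk8s.daemon-kubelite

# See how large the dqlite backend has grown; a bloated datastore is a common cause of high CPU.
sudo du -sh /var/snap/microk8s/current/var/kubernetes/backend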