Default serviceaccount in each namespace is very slow to create

We have a solution we’re building where we provision an instance of microk8s on demand for our users. As part of that process we import a lot of resources (namespaces, persistent data stores, etc.), but we’re seeing that microk8s takes 1–3 minutes to create the “default” serviceaccount in each namespace, which means we can’t create Pods until it exists.

For example, our kafka StatefulSet shows events like:

Events:
  Type     Reason            Age                    From                    Message
  ----     ------            ----                   ----                    -------
  Normal   SuccessfulCreate  2m39s                  statefulset-controller  create Claim kafka-kafka-0 Pod kafka-0 in StatefulSet kafka success
  Warning  FailedCreate      2m33s (x4 over 2m38s)  statefulset-controller  create Pod kafka-0 in StatefulSet kafka failed error: pods "kafka-0" is forbidden: error looking up service account kafka/default: serviceaccount "default" not found
  Normal   SuccessfulCreate  2m32s                  statefulset-controller  create Pod kafka-0 in StatefulSet kafka successful

I don’t immediately see anything in the kubelite logs that might explain this delay – where should I look next?
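In the meantime, a crude workaround sketch we’ve considered is to poll until the serviceaccount exists before applying workloads. This assumes kubectl is configured for the cluster, and the namespace/manifest names are illustrative:

```shell
# Poll a command until it succeeds or the timeout (in seconds) elapses.
wait_for() {
  timeout="$1"; shift
  elapsed=0
  until "$@" >/dev/null 2>&1; do
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep 1
    elapsed=$((elapsed + 1))
  done
}

# Example against a live cluster (assumes kubectl is configured):
# wait_for 180 kubectl -n kafka get serviceaccount default \
#   && kubectl apply -f kafka-statefulset.yaml
```

It avoids the FailedCreate churn in the StatefulSet controller, but it obviously doesn’t address the underlying slowness.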

It might be load on dqlite – but then again it might be another issue.

Can you provide some more info about your deployment?
No. of Pods?
No. and size of nodes?

One thing I would try is to not use the default service account, but rather a dedicated service account for the pods. That way the serviceaccount gets created along with the pods’ other resources. Not much of a solution, I know, but it might shed some light on whether this is an object-creation slowness issue or a new-namespace provisioning issue.
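As a minimal sketch of that (namespace, account name, and image here are all illustrative), the dedicated account can ship in the same manifest as the workload, so Pod admission never has to wait on the auto-created “default” account:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kafka-sa          # illustrative name
  namespace: kafka
---
apiVersion: v1
kind: Pod
metadata:
  name: kafka-0
  namespace: kafka
spec:
  serviceAccountName: kafka-sa   # pod no longer references "default"
  containers:
    - name: kafka
      image: example.registry/kafka:latest   # placeholder image
```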

Is there somewhere I can pull metrics for something like dqlite?

We’re provisioning a single m5.4xl in AWS, but we are deploying quite a lot of resources rather quickly, including ~40 namespaces and, later, hundreds of pods. I would not be surprised if something like dqlite is thrashing; I just want to be able to isolate the cause and don’t know where to look.
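One place I plan to poke at is the apiserver’s /metrics endpoint – my understanding (unverified) is that storage request latencies show up under the apiserver’s standard etcd_* metric names even with dqlite behind it, since microk8s fronts dqlite with an etcd-compatible layer. A sketch, with the grep wrapped in a helper:

```shell
# Illustrative helper: extract storage-layer latency series from a
# Prometheus-format metrics dump (names follow the apiserver's
# standard etcd_* instrumentation).
storage_latency_lines() {
  grep -E '^etcd_request_duration_seconds'
}

# Against a live microk8s node (assumes kubectl access works):
# microk8s kubectl get --raw /metrics | storage_latency_lines | head
```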

I like your suggestion about provisioning service accounts. The irony is that for 99.9% of the pods we don’t even need a service account; it’s just that the default setting is to use ‘default’, which is where this multi-minute provisioning delay bites.


IMHO, if you’re deploying to a single-node server, you can disable HA, which will internally switch from dqlite to etcd. This might give you better performance for your use case.