"failed to sync secret cache" and "Unable to attach or mount volumes" for many minutes

Cluster information:

Kubernetes version: 1.21.1
Cloud being used: bare-metal
Installation method: kubeadm
Host OS: SLES 15 SP1
CNI and version: Calico v3.19.1
CRI and version: Containerd v1.4.11

We have a fairly complex product, and we have observed that installing it with helm can take more than an hour. We started investigating because increasing the helm timeout did not help. Some of our Spark job Pods are installed in post-install hooks. During the installation we saw many errors like these:

A1) failed to sync configmap cache: timed out waiting for the condition
A2) failed to sync secret cache: timed out waiting for the condition
B) Unable to attach or mount volumes: unmounted volumes … timed out waiting for the condition

These errors are visible on some of the Pods for 5, 10 or even 20 minutes, and then the Pods all start up. The problem is that the required ConfigMaps, Secrets and PVCs are all available much earlier, so the Pods are not actually waiting for them to be created. It also seems random: we have ~50 such Spark driver Pods, and usually around 5-10 of them have issues, sometimes none, and the waiting time varies as well.
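To put numbers on "some of the Pods", the events can be tallied per error kind. Below is a minimal sketch, assuming the events were exported with `kubectl get events --all-namespaces -o json > events.json`. The function names `load_events` and `tally_events` and the labels A1/A2/B are only illustrative; the message fragments are copied from the errors listed above.

```python
import json
from collections import defaultdict

# Message fragments copied from the errors listed above.
PATTERNS = {
    "A1": "failed to sync configmap cache",
    "A2": "failed to sync secret cache",
    "B": "Unable to attach or mount volumes",
}


def load_events(path):
    """Read a file produced by `kubectl get events -o json`."""
    with open(path) as handle:
        return json.load(handle)


def tally_events(event_list):
    """Group the names of affected objects (Pods) by error kind.

    `event_list` is the parsed JSON list object: a dict whose "items"
    entry holds core/v1 Event objects.
    """
    affected = defaultdict(set)
    for event in event_list.get("items", []):
        message = event.get("message") or ""
        name = event.get("involvedObject", {}).get("name", "<unknown>")
        for kind, fragment in PATTERNS.items():
            if fragment in message:
                affected[kind].add(name)
    # Sorted lists make the result stable and easy to compare between runs.
    return {kind: sorted(names) for kind, names in affected.items()}
```

Calling `tally_events(load_events("events.json"))` after each installation would show whether the same Pods are hit every time or whether it really is random.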

It is also worth mentioning that each of these 50 Pods has 3 PVCs, 3 ConfigMaps and around 300 Secrets! Yes, that is a lot, but I do not think it should cause issues, and certainly not for 20 minutes.
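For a sense of scale, here is a rough back-of-the-envelope sketch of the figures above. It assumes every reference is distinct, so it is an upper bound. The rate passed to `minutes_to_process` is purely hypothetical and was not measured on our cluster; it only shows how strongly the total time depends on how fast the references are processed.

```python
PODS = 50
PVCS_PER_POD = 3
CONFIGMAPS_PER_POD = 3
SECRETS_PER_POD = 300  # "around 300", so treat the results as approximate


def total_references(pods, per_pod):
    """Upper bound on volume references, assuming none are shared."""
    return pods * per_pod


def minutes_to_process(references, references_per_second):
    """Time needed if references were handled one by one at a fixed rate.

    The rate is a hypothetical parameter, not a measured value.
    """
    return references / references_per_second / 60


secret_refs = total_references(PODS, SECRETS_PER_POD)
all_refs = total_references(
    PODS, PVCS_PER_POD + CONFIGMAPS_PER_POD + SECRETS_PER_POD
)

print(secret_refs)  # 15000
print(all_refs)     # 15300
print(minutes_to_process(secret_refs, 10))  # 25.0 (at a hypothetical 10/s)
```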

I would like to ask for some help on WHERE to look. What we have tried so far:

  1. Checking for API Server throttling. I found nothing there (I followed some blogs and collected metrics during an installation): everything was calm, nowhere near the breaking point of the API Server, with only around 5-10 queries in flight at the same time. I do not think the API Server should block for 20 minutes anyway.
  2. Re-architecting the Secrets, which eventually worked: the installation now takes a normal 40 minutes (yes, we have a BIG application). So the issue seems to be caused by THAT many Secret mounts going through the system in such a short time. But I have no idea which component breaks down, or why.

Where to look? What to check? How to debug this?
