How to deal with a large number of Pods

First, our “fuckup” story :slight_smile: (you can skip it if you want :slight_smile: )
We have an app that consists of about 90 repositories. We use RabbitMQ heavily for queueing, with jobs kept as small as possible (for better scalability). In the past, we didn’t use any orchestrator. Each repository contains a docker-compose.yml that our CI server uploads to the production server and then runs docker-compose down && docker-compose up. Our production server was a single bare-metal server with only Docker installed (everything ran in containers). We outsourced management of the server to our provider.
Then we decided to move everything into Kubernetes. Our provider also offers managed Kubernetes clusters, so we bought 3 bare-metal servers as worker nodes, each with 64 threads and 192 GB RAM. The master nodes are VMs, and everything is managed by our provider; we only have access to our Namespace. So we started moving our application to Kubernetes. Everything looked great, but a few days ago we hit a “magical point” and everything went totally shitty. The magical point was the number of Pods per node - a limit we didn’t know about. After almost everything had been moved to Kubernetes, there were about 1300 running Pods (≈433 Pods per node), roughly 4x the recommended value of 110 Pods per node. We had to hotfix it by merging all the Pods of each Deployment into a single Pod running supervisord (the solution we had used earlier on the single-node server).
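For reference, the 110-Pods-per-node default comes from the kubelet’s maxPods setting; on nodes you control yourself it can be raised via the kubelet configuration file. A sketch only - the value is illustrative, and very high values put real extra load on the kubelet and the control plane:

```yaml
# /var/lib/kubelet/config.yaml (KubeletConfiguration)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 250   # default is 110; illustrative value, raising it has real costs
```

Note that the per-node Pod CIDR (often a /24, i.e. ~256 addresses) caps the usable Pod count regardless of maxPods.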

My question is: How do you deal with a very large number of small Pods?

I have a few solutions in mind, but none is ideal for our use case:

  1. merge Deployments into single Pods with supervisord
    • numprocs (supervisord) = replicas (Deployment)
  2. merge Deployments into 3 Pods with supervisord
    • numprocs (supervisord) = replicas (Deployment) / 3
    • this at least lets us benefit from high availability
  3. merge Deployments into N Pods with supervisord
    • N = for example 5, so we have 5x fewer Pods
  4. split the Nodes into 4 (or more) VMs each, so we have 12 Nodes instead of 3
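For options 1-3, the supervisord side could look roughly like this (a sketch; the program name and command are hypothetical):

```ini
; supervisord.conf fragment: one [program] section per former Deployment
[program:queue-worker]                 ; hypothetical worker name
command=/app/bin/queue-worker          ; hypothetical command
numprocs=3                             ; = replicas of the former Deployment
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=true
```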

Each solution has disadvantages:
1-3) all the supervisord solutions totally kill the idea of liveness/readiness probes
1-2) this kills the ability to use the Horizontal Pod Autoscaler effectively; there is also the problem that a lot of our Deployments have only 3 replicas (fast workers on small queues)
3) this is a little more scalable, but still a lot of hacking
4) this is mostly a financial problem, because our provider charges us for:

  • managing Kubernetes
  • housing our bare-metal servers
  • managing the virtualization
  • managing each additional Kubernetes Node (the first 5 nodes are included in the “basic” management)
  • if we add up the last 3 points, it’s more expensive than the management of the whole Kubernetes cluster (which itself is not cheap or very cost-effective, but Kubernetes seemed to have so many technical advantages that we decided to pay for it)
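A quick sanity check on option 4’s numbers (assuming the Pod count stays around 1300):

```python
# Option 4: split each of the 3 bare-metal servers into 4 VMs.
total_pods = 1300
nodes = 3 * 4                      # 12 Nodes instead of 3
pods_per_node = total_pods / nodes
print(round(pods_per_node))        # ~108, just under the default 110 maxPods
```

So option 4 lands just under the default limit, with no headroom for growth - more or smaller VMs would be needed if the Pod count keeps rising.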

Hi, it’s very interesting.

What exactly have you observed? Instead of using supervisord you could also create N containers within one Pod, but the disadvantages remain.
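One advantage of the multi-container variant over supervisord is that each container keeps its own liveness/readiness probes. A sketch (all names, images, and probe commands are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: merged-workers                             # hypothetical
spec:
  containers:
    - name: worker-a                               # hypothetical name/image
      image: registry.example.com/worker-a:latest
      livenessProbe:
        exec:
          command: ["pgrep", "-f", "worker-a"]     # illustrative check
    - name: worker-b
      image: registry.example.com/worker-b:latest
      readinessProbe:
        exec:
          command: ["pgrep", "-f", "worker-b"]
```

The HPA limitation still applies, though: the Pod remains the scaling unit, so the merged workers can only scale together.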

New Pods won’t start. There were various errors, like volumes not being mounted (from ConfigMaps, Secrets, PVs, …). Also some networking trouble: from time to time, some Pods don’t see some Services. I can’t see much “under the hood” (because our cluster is managed by our provider), but the provider says that a lot of resources are consumed by the kubelet itself.

You can’t really run Kubernetes without specific monitoring - observability is a problem. From the sound of it, you overran the masters and it went pear-shaped, but you can’t see it. The kubelet is the canary. Who does the kubelet talk to?