First, let me tell you our “fuckup” story (you can skip it if you want).
We have an app that consists of about 90 repositories. We use RabbitMQ a lot, queueing jobs as small as possible (for better scalability). In the past, we didn't use any orchestrator. Each repository contains a
docker-compose.yml that our CI server uploads to the production server, then runs
docker-compose down && docker-compose up. Our production server was a single baremetal server with only Docker installed (everything ran in containers). We outsourced management of the server to our provider.
Then we decided to move everything to Kubernetes. Our provider also offers managed Kubernetes clusters, so we bought 3 baremetal servers as worker nodes, each with 64 threads and 192 GB RAM. The master nodes are VMs and everything is managed by our provider; we only have access to our namespace. So we started moving our application to Kubernetes. Everything looked great, but a few days ago we hit a “magical point” and everything went totally shitty. The magical point was the number of Pods per node; it was a limit we didn't know about. After almost everything was moved to Kubernetes, there were about 1300 running Pods (= 433 Pods per node), which is about 4x the recommended value of 110 Pods per node. We had to hotfix it by merging all the Pods from each Deployment into a single Pod running supervisord (the solution we used earlier on the single-node server).
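For context: as far as I understand, the 110-Pod default comes from the kubelet's maxPods setting, so in a managed cluster it's the provider who would have to raise it. A sketch of the relevant KubeletConfiguration fragment (the value 250 is just an example, not a recommendation):

```yaml
# /var/lib/kubelet/config.yaml on each worker node
# (node-level setting; in our case only the provider can change it)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 250   # default is 110; raising it also needs a large enough Pod CIDR per node
```

Note that raising maxPods only helps if each node's Pod CIDR actually has enough IPs for that many Pods.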
My question is: how do you deal with a very large number of small Pods?
I have a few solutions in mind, but none of them is ideal for our use case:
- merge Deployments into single Pods with supervisord
- numprocs (supervisord) = replicas (Deployment)
- merge Deployments into 3 Pods with supervisord
- numprocs (supervisord) = replicas (Deployment) / 3
- this at least lets us benefit from high availability
- merge Deployments into N Pods with supervisord
- N = for example 5, so we have 5x fewer Pods
- split each node into 4 (or more) VMs, so we don't have 3 nodes but 12
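To illustrate options 1-3, the supervisord config inside a merged Pod would look roughly like this (program name, command, and paths are hypothetical; numprocs follows option 3 with N Pods):

```ini
; supervisord.conf inside one merged worker Pod (names/paths are made up)
[supervisord]
nodaemon=true

[program:queue-worker]
command=/app/bin/worker --queue small-jobs
process_name=%(program_name)s_%(process_num)02d
numprocs=3            ; = Deployment replicas / N (option 3)
autorestart=true
```

With option 1, numprocs would simply equal the original Deployment's replica count.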
Each solution has some disadvantages.
1-3) all the supervisord solutions totally kill the idea of liveness/readiness probes
1-2) This kills the possibility of effectively using the Horizontal Pod Autoscaler. There is also the problem that a lot of our Deployments have only 3 replicas (fast workers on small queues)
3) This is a little bit more scalable, but still a lot of hacking
4) This is mostly a financial problem, because our provider charges us for:
- managing Kubernetes
- housing our baremetal servers
- managing the virtualization
- managing each additional Kubernetes node (the first 5 nodes are included in the “basic” management)
- if we add up the last 3 points, it's more expensive than the management of the whole Kubernetes cluster (which itself is not very cheap or cost-effective, but Kubernetes looked like it had so many technical advantages that we decided to pay for it)
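For comparison, this is roughly what each small Deployment gives us today and what the supervisord options would sacrifice: a per-process liveness probe and HPA scaling. All names, images, and thresholds below are hypothetical:

```yaml
# One of our many small worker Deployments (everything here is illustrative)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: small-queue-worker
spec:
  replicas: 3
  selector:
    matchLabels: {app: small-queue-worker}
  template:
    metadata:
      labels: {app: small-queue-worker}
    spec:
      containers:
      - name: worker
        image: registry.example.com/worker:latest
        livenessProbe:                      # per-process health check; lost behind supervisord
          exec: {command: ["/app/bin/healthcheck"]}
          periodSeconds: 30
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: small-queue-worker
spec:
  scaleTargetRef: {apiVersion: apps/v1, kind: Deployment, name: small-queue-worker}
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource: {name: cpu, target: {type: Utilization, averageUtilization: 70}}
```

Once several workers share one supervisord Pod, Kubernetes can only probe and scale the whole Pod, not the individual processes.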