Currently we use Cluster Autoscaler (autoscaler/FAQ.md at cluster-autoscaler-release-1.16 · kubernetes/autoscaler · GitHub) on our EKS clusters to scale dynamically as traffic or processing load increases.
However, due to a recent increase in customer traffic, we have raised the maximum size of our Auto Scaling groups so that the clusters can grow beyond 1000 nodes; 1500 is the upper limit.
(Per the FAQ linked above, Cluster Autoscaler has only been tested on clusters of up to 1000 nodes.)
However, while running performance tests we observed that the cluster took a long time to grow beyond 800 nodes: the autoscaler seemed to slow down its scaling calls even though pending pods were piling up. Does anybody have experience running Cluster Autoscaler on clusters with 1000+ nodes, and what are the best practices? Is it able to handle clusters this large?
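To put a number on the slowdown, here is a minimal sketch of how the scale-up rate could be computed from periodic node-count samples. The function name `scale_up_rates` and the sample values are hypothetical and made up for illustration; they are not measurements from our performance test.

```python
from typing import List, Tuple


def scale_up_rates(samples: List[Tuple[float, int]]) -> List[Tuple[int, float]]:
    """For each consecutive pair of (seconds, node_count) samples, return
    (node_count_at_start_of_interval, nodes_added_per_minute)."""
    rates = []
    for (t0, n0), (t1, n1) in zip(samples, samples[1:]):
        minutes = (t1 - t0) / 60.0
        if minutes <= 0:
            # Skip duplicate or out-of-order timestamps.
            continue
        rates.append((n0, (n1 - n0) / minutes))
    return rates


# Illustrative numbers only (assumed, not real test data):
# one sample every 5 minutes as (elapsed seconds, node count).
samples = [(0, 600), (300, 700), (600, 800), (900, 830), (1200, 850)]

for start_nodes, rate in scale_up_rates(samples):
    print(f"from {start_nodes} nodes: {rate:.1f} nodes/min")
```

Comparing the rate below and above 800 nodes in this way would show whether the autoscaler itself is slowing down or whether the growth is simply capped elsewhere.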
Right now we are running 3 replicas of cluster-autoscaler with an 8 GB memory limit.
We would appreciate any advice.
Cluster information:
Kubernetes version: 1.19.13
Cloud being used: AWS EKS
Host OS: Amazon Linux
CNI and version: AWS CNI 1.9.0
CRI and version: Docker 20.10.7