Cluster Autoscaler issue

We have Cluster Autoscaler, overprovisioning, and the descheduler set up in our cluster.
Overprovisioning is set with a replica count of 44 (11 pods give a buffer of 1 EC2 instance).
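
For reference, this is the general shape of the overprovisioning setup assumed in this thread: a Deployment of pause pods with a negative-priority PriorityClass, so real workloads preempt them and Cluster Autoscaler keeps spare capacity for them. The names, namespace, and resource requests below are illustrative sketches, not our actual values.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning            # illustrative name
value: -10                          # negative priority: real workloads preempt these pods
globalDefault: false
description: "Placeholder pods that reserve spare capacity for Cluster Autoscaler"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
  namespace: kube-system            # assumption; any namespace works
spec:
  replicas: 44                      # 11 pods ~ 1 EC2 instance, so 44 ~ 4 instances of buffer
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: reserve-resources
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: 200m               # illustrative; sized so ~11 pods fill one node
            memory: 256Mi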

We are seeing issues where nodes are scaled down and then, about 20 minutes later, new nodes are added; this sometimes repeats. When Cluster Autoscaler finds that a node has been unneeded for 10 minutes, it scales it down, but within the next 20 minutes a new node has to be created.

Not sure how to tune this. Is this caused by rebalancing from the descheduler CronJob, which runs every 30 minutes? Or by the overprovisioning replica count of 44? Or is there anything else we should consider?
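
One way to correlate the two is to compare the descheduler CronJob schedule with the scale-down flags Cluster Autoscaler is actually running with. A sketch, assuming the CronJob is named descheduler in kube-system and the Deployment name matches the pod in the logs below; the flags may sit under command or args depending on how it was deployed:

# When does the descheduler run? (name/namespace are assumptions)
kubectl -n kube-system get cronjob descheduler -o jsonpath='{.spec.schedule}'

# Which scale-down flags is Cluster Autoscaler running with?
kubectl -n kube-system get deploy cluster-autoscaler-aws-cluster-autoscaler \
  -o jsonpath='{.spec.template.spec.containers[0].command}' | tr ',' '\n' | grep scale-down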

k logs -n kube-system cluster-autoscaler-aws-cluster-autoscaler-774bbb4cf-9mq4z aws-cluster-autoscaler | ag "(scale-up plan)|(removing empty node)"

I0505 22:42:14.659857       1 scale_down.go:938] Scale-down: removing empty node ip-10-120-36-56.ec2.internal
I0505 22:42:14.660428       1 event.go:258] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"28c7ffa0-de2e-4a5f-856f-69684065056b", APIVersion:"v1", ResourceVersion:"46611687", FieldPath:""}): type: 'Normal' reason: 'ScaleDownEmpty' Scale-down: removing empty node ip-10-120-36-56.ec2.internal
I0505 23:00:18.881836       1 scale_up.go:533] Final scale-up plan: [{nodes.us-east-1.apicentral.axwaydev.net 26->27 (max: 30)}]
I0505 23:13:01.741943       1 scale_down.go:938] Scale-down: removing empty node ip-10-120-42-5.ec2.internal
I0505 23:13:01.742378       1 event.go:258] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"28c7ffa0-de2e-4a5f-856f-69684065056b", APIVersion:"v1", ResourceVersion:"46624852", FieldPath:""}): type: 'Normal' reason: 'ScaleDownEmpty' Scale-down: removing empty node ip-10-120-42-5.ec2.internal
I0505 23:30:05.422452       1 scale_up.go:533] Final scale-up plan: [{nodes.us-east-1.apicentral.axwaydev.net 26->27 (max: 30)}]
I0505 23:42:08.867123       1 scale_down.go:938] Scale-down: removing empty node ip-10-120-36-45.ec2.internal
I0505 23:42:08.868311       1 event.go:258] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"28c7ffa0-de2e-4a5f-856f-69684065056b", APIVersion:"v1", ResourceVersion:"46637400", FieldPath:""}): type: 'Normal' reason: 'ScaleDownEmpty' Scale-down: removing empty node ip-10-120-36-45.ec2.internal
I0506 00:00:13.286783       1 scale_up.go:533] Final scale-up plan: [{nodes.us-east-1.apicentral.axwaydev.net 26->27 (max: 30)}]
I0506 00:12:26.796684       1 scale_down.go:938] Scale-down: removing empty node ip-10-120-59-39.ec2.internal
I0506 00:12:26.796974       1 event.go:258] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"28c7ffa0-de2e-4a5f-856f-69684065056b", APIVersion:"v1", ResourceVersion:"46650479", FieldPath:""}): type: 'Normal' reason: 'ScaleDownEmpty' Scale-down: removing empty node ip-10-120-59-39.ec2.internal
I0506 00:30:22.431082       1 scale_up.go:533] Final scale-up plan: [{nodes.us-east-1.apicentral.axwaydev.net 26->27 (max: 30)}]

You are looking for the scale-down-unneeded-time flag; see the Cluster Autoscaler FAQ.

If you are using Helm:

scale-down-unneeded-time: 30m

Or append it directly to the command:

- --scale-down-unneeded-time=30m
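
For context, if you install via the cluster-autoscaler Helm chart, the value above is typically passed through the chart's extraArgs map; a minimal sketch, assuming the standard chart layout:

# values.yaml
extraArgs:
  scale-down-unneeded-time: 30m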

That's so wonderful, thank you! 🙂

How does the descheduler interact with the overprovisioner, and is there anything to tune there to improve this autoscaling?

Maybe scale-down-utilization-threshold: the node utilization level, defined as the sum of requested resources divided by capacity, below which a node can be considered for scale-down.
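
A sketch of how that flag could be set next to the previous one; 0.3 is an illustrative value (the autoscaler's default is 0.5), and a lower value means a node must be emptier before it becomes a scale-down candidate:

# values.yaml, same extraArgs map as above
# (or as a command-line flag: --scale-down-utilization-threshold=0.3)
extraArgs:
  scale-down-unneeded-time: 30m
  scale-down-utilization-threshold: "0.3"   # node must be below 30% requested to be scaled down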

We have not experienced the issues you are having.

OK, thank you. I will look into it. Do you have any thoughts on how the descheduler impacts this autoscaling and overprovisioning?