Cluster Autoscalar issue

radhar4 · May 6, 2020, 3:12am

We have cluster Autoscalar ,overprovisioning,descheduler setup in our cluster.
Overprovisiong is set as replicacount 44. (11 pods gives a buffer of 1 EC2 instance)

We are seeing some issues like nodes are scaled down in 10 mins and in 20 mins new nodes are added.This is sometimes repeating.When cluster autoscalar find that the node is not used for 10 mins,then it tries to scale down…But in next 20 mins,there is a need to create a new node.
Not sure how to tune this…Is this because of rebalancing happening due to descheduler cronjob which is running every 30 mins? Or is this because of Overprovisioning replicacount 44? Or any other things we should consider?

k logs -n kube-system cluster-autoscaler-aws-cluster-autoscaler-774bbb4cf-9mq4z aws-cluster-autoscaler | ag “(scale-up plan)|(removing empty node)”

I0505 22:42:14.659857       1 scale_down.go:938] Scale-down: removing empty node ip-10-120-36-56.ec2.internal
I0505 22:42:14.660428       1 event.go:258] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"28c7ffa0-de2e-4a5f-856f-69684065056b", APIVersion:"v1", ResourceVersion:"46611687", FieldPath:""}): type: 'Normal' reason: 'ScaleDownEmpty' Scale-down: removing empty node ip-10-120-36-56.ec2.internal
I0505 23:00:18.881836       1 scale_up.go:533] Final scale-up plan: [{nodes.us-east-1.apicentral.axwaydev.net 26->27 (max: 30)}]
I0505 23:13:01.741943       1 scale_down.go:938] Scale-down: removing empty node ip-10-120-42-5.ec2.internal
I0505 23:13:01.742378       1 event.go:258] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"28c7ffa0-de2e-4a5f-856f-69684065056b", APIVersion:"v1", ResourceVersion:"46624852", FieldPath:""}): type: 'Normal' reason: 'ScaleDownEmpty' Scale-down: removing empty node ip-10-120-42-5.ec2.internal
I0505 23:30:05.422452       1 scale_up.go:533] Final scale-up plan: [{nodes.us-east-1.apicentral.axwaydev.net 26->27 (max: 30)}]
I0505 23:42:08.867123       1 scale_down.go:938] Scale-down: removing empty node ip-10-120-36-45.ec2.internal
I0505 23:42:08.868311       1 event.go:258] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"28c7ffa0-de2e-4a5f-856f-69684065056b", APIVersion:"v1", ResourceVersion:"46637400", FieldPath:""}): type: 'Normal' reason: 'ScaleDownEmpty' Scale-down: removing empty node ip-10-120-36-45.ec2.internal
I0506 00:00:13.286783       1 scale_up.go:533] Final scale-up plan: [{nodes.us-east-1.apicentral.axwaydev.net 26->27 (max: 30)}]
I0506 00:12:26.796684       1 scale_down.go:938] Scale-down: removing empty node ip-10-120-59-39.ec2.internal
I0506 00:12:26.796974       1 event.go:258] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"28c7ffa0-de2e-4a5f-856f-69684065056b", APIVersion:"v1", ResourceVersion:"46650479", FieldPath:""}): type: 'Normal' reason: 'ScaleDownEmpty' Scale-down: removing empty node ip-10-120-59-39.ec2.internal
I0506 00:30:22.431082       1 scale_up.go:533] Final scale-up plan: [{nodes.us-east-1.apicentral.axwaydev.net 26->27 (max: 30)}]

michaelpporter · May 6, 2020, 10:51am

You are looking for the scale-down-unneeded-time flag. FAQ

If you are using helm

scale-down-unneeded-time: 30m

Direct append this to the command

- --scale-down-unneeded-time=30m

radhar4 · May 6, 2020, 2:40pm

Thats so wonderful…Thankyou

radhar4 · May 6, 2020, 3:50pm

How the descheduler interacts with the overprovisioner, and is there anything to be tuned there to improve this autoscalaing?

michaelpporter · May 6, 2020, 4:10pm

Maybe scale-down-utilization-threshold Node utilization level, defined as sum of requested resources divided by capacity, below which a node can be considered for scale down.

We have not experienced the issues you are having.

radhar4 · May 6, 2020, 4:50pm

Ok Thankyou…I will look into it…do you have any thoughts on how deschedular impacts this autoscaling and overprovisioning?

Topic		Replies	Views
GKE won't scale down General Discussions	1	1622	January 4, 2020
Issue during Cluster autoscaling when nodes are out of capacity General Discussions	1	533	July 25, 2021
GKE scaled-down nodes won't terminate General Discussions	6	2683	March 18, 2021
Experiences with using Cluster Autoscaler in 1000+ nodes cluster on AWS EKS General Discussions	0	600	December 1, 2021
Cluster Autoscaler and Horizontal Pod Autoscaler working together General Discussions	0	711	March 2, 2021

Cluster Autoscalar issue

Related topics