How does HPA make sense?

So I think I have a misunderstanding of the Horizontal Pod Autoscaler.
Or let's say, I don't yet understand the exact/most common reason for it to exist.

Pods run on Nodes. Nodes are VMs with, let's say, fixed hardware, e.g. 4 CPUs and 16 GB RAM.

When my app receives too much traffic, the HPA scales up by replicating my app in the form of another Pod. The only case where HPA makes sense, then, is when my app has a "fixed" number of requests it can handle, right?
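Just so we're talking about the same thing, here is a minimal sketch of the kind of HPA I have in mind (the Deployment name my-app, the replica bounds and the 80% target are made-up placeholders):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app          # placeholder Deployment to scale
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80   # add Pods when average CPU exceeds 80% of the requests
```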

So my thoughts are these:

If the app uses threading for concurrency, enabling HPA seems useless, since no more than 4 requests can be handled at a time (with 4 CPU cores). When the HPA scales out, there are no additional CPU cores available, so it is arguably even worse: scaling out itself consumes resources, while 2 Pods together still only have 4 cores to share.

So what use cases / benefits are there (apart from redundancy in case one Pod fails)?

The prometheus-adapter docs state:

…except you’re not sure that just one instance will handle all the traffic once it goes viral. Thankfully, you’ve got Kubernetes

And then they refer to the HPA. It sounds like adding another Pod would easily fix it.
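As far as I understand, prometheus-adapter exists so the HPA can scale on a custom metric instead of CPU, roughly like this sketch (the metric name http_requests_per_second and the target value are placeholders I made up, assuming the adapter exposes such a per-Pod metric):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # placeholder: a custom metric served via prometheus-adapter
        target:
          type: AverageValue
          averageValue: "100"              # placeholder: aim for ~100 requests/s per Pod
```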

Isn't the ultimate reason for a scale-out always a resource shortage?
Too many requests => too little CPU/RAM.

I currently have a cluster for only one kind of application.
I could imagine that HPA is useful if there is a very powerful underlying VM, e.g. 36 CPU cores and 128 GB RAM, which acts like a "reservoir" of computing power: if the app receives too much traffic, the HPA adds a Pod on the node, which gets some of the leftover compute power on the VM (each Pod could have resource limits).
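Something like this sketch is what I imagine, assuming each Pod requests 2 cores / 4 GiB and is limited to 4 cores / 8 GiB, so a 36-core node could fit many replicas (the names and numbers are made up):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:latest      # placeholder image
          resources:
            requests:
              cpu: "2"              # the scheduler reserves 2 cores per Pod
              memory: 4Gi
            limits:
              cpu: "4"              # each Pod is capped at 4 cores
              memory: 8Gi
```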

But then there would always be "unused" CPU/RAM resources, which would be costly…

As this is not a real question, but more me throwing my thoughts out there asking for information, any food for thought is appreciated.

You are sort of right, but you're missing a few points here:

  • The alternative to horizontal scaling is vertical scaling, which currently (unless you've enabled the alpha feature for in-place pod resource updates) involves potential downtime, as the pods need to be restarted (not the best thing to do when you're trying to scale up), so there's a big plus here for horizontal scaling.
  • The other thing to take into account is cluster autoscaling: at idle times I could have a small node running a single pod, but when I'm getting a lot more traffic and the HPA kicks in, it can trigger the cluster autoscaler to spin up more nodes as needed and take them down when they are no longer needed.

Regarding "if the app uses threading to make use of concurrency": if your pod requests 4 CPUs (for the 4 cores it requires) and the node it is on only has 4 cores, then when the HPA scales up another pod (and you don't have the cluster autoscaler to provide a new node), that pod will not be scheduled, as the Kubernetes scheduler requires there to be available resources on the node to satisfy the pod's requests, meaning it will be stuck in the Pending state.
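Here is a minimal sketch of that situation (the pod name and image are placeholders): a Pod requesting 4 CPUs can only be scheduled on a node with 4 free allocatable cores, and on a 4-core node some of that is already taken by system daemons and by the first replica, so the second replica stays Pending until the cluster autoscaler (if you have one) adds a node.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app-replica-2            # placeholder: the kind of Pod a second replica would be
spec:
  containers:
    - name: my-app
      image: my-app:latest          # placeholder image
      resources:
        requests:
          cpu: "4"                  # the scheduler needs 4 free allocatable cores for this Pod
          memory: 8Gi
```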

For your use case, it depends on where you are running the cluster. Ideally, in a cloud environment, as I've mentioned, you could combine the HPA with a cluster autoscaler that scales the cluster out and in depending on demand, so you can handle peak load without having to pay for the peak resources all the time.

Would love to get in touch and hear more about your use case, feel free to shoot me an email: shir@slimstack.io