Kubernetes scaling with very large requests

I have a problem with a Java microservice we are running in AKS, and I would like to get the opinion of others in the Kube community on how they would handle it.
I think the design of the app is wrong and that scaling cannot overcome this.

The basic operation of the app…

ONE request for data
Fetch the PAGED data from the source (MS Dynamics)
Assemble the paged data into ONE single payload
Return ONE payload.
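As I understand it, the app boils down to the pattern below. This is an illustrative sketch, not the actual code; the class and interface names are mine, and the real service calls the Dynamics API rather than this stand-in interface.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the pattern described above: every page is buffered in memory
// until the whole result set has been assembled into one payload.
public class AssembleAll {
    // Stand-in for the Dynamics client; the real code pages through an API.
    interface PageSource {
        List<String> fetchPage(int pageNumber); // empty list = no more pages
    }

    // Memory grows with the TOTAL row count, which is unknown up front.
    static List<String> assemble(PageSource source) {
        List<String> payload = new ArrayList<>();
        int page = 0;
        List<String> rows;
        while (!(rows = source.fetchPage(page++)).isEmpty()) {
            payload.addAll(rows);
        }
        return payload; // ONE payload, however big it turned out to be
    }
}
```

The key point is that `payload` is unbounded: the heap needed per request is proportional to a row count nobody knows at request time.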

The problem lies in that the number of rows returned by the data source is NOT known, and not limited.

Could be one, could be a million.

In this example, assume the pod has multiple replicas, each experiencing the same thing.

The scaling problem is that, at the time of request, the request is a black box (could be 1 or 1M rows).
The pod starts processing the request. In the meantime, because CPU and MEM usage are still low, and there is no limit on the number of incoming requests, other requests are received, each also a black box.
Some requests are small and quick, others are not. Eventually, as the pod is assembling all the pages from the source for each request, it finds itself overloaded with many large requests.

By the time the average CPU or memory usage exceeds the threshold and the HPA kicks in to scale out, the in-flight requests cannot be processed quickly, and some pods eventually throw OOM heap errors.
I have tried really low scaling thresholds, like 30% or even less, to detect load before it becomes a problem. But I think this defeats the purpose of scaling.
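For reference, the low-threshold HPA I've been experimenting with looks roughly like this (names and replica counts are placeholders, not our actual config):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: dynamics-proxy        # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: dynamics-proxy
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 30   # scale early, before a big request lands
```

Even at 30%, the new replicas only help FUTURE requests; the black-box request already being assembled on an existing pod gains nothing from scale-out.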

The developers say to give it more memory, and they are given between 4G and 6G (which I think is huge for a microservice!!). They tend to test on their local machines without memory limits, not in Docker or minikube.
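Part of the disconnect is that a container's memory limit is a hard ceiling the kernel enforces, unlike a dev laptop. A fragment like the one below is roughly what the pod spec involves; the name is a placeholder, and the `MaxRAMPercentage` env var is just one common way to make the JVM heap respect the container limit, not necessarily what we run:

```yaml
containers:
- name: app                  # placeholder
  resources:
    requests:
      memory: "4Gi"
    limits:
      memory: "4Gi"          # pod is OOMKilled above this, full stop
  env:
  - name: JAVA_TOOL_OPTIONS
    value: "-XX:MaxRAMPercentage=75.0"   # size the heap from the container limit
```

None of this changes the underlying issue: if one request can need more rows than the heap can hold, no limit is "enough".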

Managers say autoscaling will overcome this and that we are doing it wrong. It's like they expect HPA to add memory to an overloaded pod in real time, and that HPA will kick in the instant one pod exceeds the threshold.

I keep stating that PAGING is used to ensure that the infrastructure can handle a known amount of data with a given amount of hardware (CPU and MEM limits), and that scaling out will not prevent the system from possibly being overloaded.

Yes, I can give the pods huge amounts of MEM and CPU, and many replicas, but that is not efficient.

My position is that MANY small pods will give better throughput.
If the app implemented paging, it would not need a HUGE amount of memory, and its behaviour would be more predictable.
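What I mean by "implemented paging": write each page downstream as it arrives instead of buffering the whole result set. A minimal sketch, with illustrative names (in the real service the consumer would be the HTTP response stream):

```java
import java.util.List;
import java.util.function.Consumer;

// Sketch of the alternative: forward rows page by page, so memory use per
// request is bounded by ONE page, regardless of the total row count.
public class StreamPages {
    interface PageSource {
        List<String> fetchPage(int pageNumber); // empty list = done
    }

    static void stream(PageSource source, Consumer<String> out) {
        int page = 0;
        List<String> rows;
        while (!(rows = source.fetchPage(page++)).isEmpty()) {
            // Each page is out the door before the next is fetched;
            // nothing accumulates on the heap.
            rows.forEach(out);
        }
    }
}
```

With this shape, a pod's worst case is (concurrent requests x one page), which is exactly the kind of known quantity you can size CPU/MEM limits against.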

Thanks for reading this far, and any opinion is appreciated.

Also, how do you protect your pods from taking on too many requests? The app itself doesn't indicate when it is TOO busy. I have implemented readiness and liveness probes based on the number of busy Tomcat threads, and it does work well. But eventually all the pods are taken offline because they are all too busy.
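For anyone curious, the probe is wired up roughly like this; the endpoint path and port here are placeholders, and the hypothetical endpoint returns 200 while busy Tomcat threads are under the limit, non-200 otherwise:

```yaml
readinessProbe:
  httpGet:
    path: /internal/busy-check   # hypothetical busy-thread check endpoint
    port: 8080
  periodSeconds: 5
  failureThreshold: 2   # two failures = stop routing new requests here
```

Readiness (unlike liveness) only removes the pod from the Service endpoints, so in-flight work still completes; the problem is what happens when every pod trips it at once.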

If the busy-thread limit is too high, the above will still happen even if the probe does eventually cut out the pod; too many large requests have already been received.

If the limit is too low, the smaller requests will cause too many of the pods to be taken offline. But the pods do handle the larger requests without going OOM, because no individual pod can take on too many large requests at once.
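The other place this back-pressure could live is in the app itself: admit only N concurrent requests and reject the overflow with a 503, so the client (or a retrying ingress) sends it to another replica. A hypothetical bulkhead sketch, not anything we currently run:

```java
import java.util.Optional;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Bulkhead sketch: cap concurrent in-flight requests and shed the
// overflow immediately, instead of queueing work until the heap blows up.
public class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Returns empty when at capacity; the caller maps that to HTTP 503.
    <T> Optional<T> tryRun(Supplier<T> work) {
        if (!permits.tryAcquire()) {
            return Optional.empty(); // shed load, don't queue it
        }
        try {
            return Optional.of(work.get());
        } finally {
            permits.release();
        }
    }
}
```

A rejected request is cheap; an accepted-then-OOMing request takes the whole pod (and every other request on it) down with it.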