How to load balance gRPC connections when HPA autoscaling is enabled?


I'm not sure whether this is the right place to ask this. I'm new to Kubernetes and need some guidance on achieving the requirement below.

In my Azure cluster, I have two gRPC services running. The first is exposed to the outside world through a load balancer (LB). When I call the first one, it sends multiple requests to the second gRPC service. I have enabled HPA with a 50% CPU utilization target for scaling. The process running in the second gRPC server is long-running and memory-intensive.

After a lot of requests, a second pod is autoscaled as I expect, but all my requests are still sent to the initial pod, so the load is not balanced. I tried sending the requests with some delay (2s) between them, but the behavior is the same.

When I researched this, I found that gRPC keeps the connection alive, so the requests keep going to the same pod. How can I overcome this?
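
To make the behavior concrete, here is a toy simulation in Python. It does not use gRPC at all: `ConnectionLevelBalancer` and `send_requests` are hypothetical names, and the round-robin choice at connect time is only an assumption standing in for a balancer that picks a pod when a connection is opened rather than per request.

```python
from collections import Counter


class ConnectionLevelBalancer:
    """Toy balancer: chooses a pod only when a new connection is opened."""

    def __init__(self, pods):
        self.pods = list(pods)
        self._opened = 0  # number of connections opened so far

    def add_pod(self, pod):
        # Stands in for the HPA adding a replica.
        self.pods.append(pod)

    def connect(self):
        # Round-robin over the pods that exist at connect time.
        pod = self.pods[self._opened % len(self.pods)]
        self._opened += 1
        return pod


def send_requests(balancer, total, requests_per_connection=None,
                  scale_up_at=None, new_pod="pod-2"):
    """Send `total` requests and count how many each pod receives.

    requests_per_connection=None models one long-lived connection;
    a number models a client that reconnects after that many requests.
    """
    hits = Counter()
    pod = balancer.connect()
    used = 0
    for i in range(total):
        if scale_up_at is not None and i == scale_up_at:
            balancer.add_pod(new_pod)
        if requests_per_connection is not None and used >= requests_per_connection:
            pod = balancer.connect()  # a new connection may land on a new pod
            used = 0
        hits[pod] += 1
        used += 1
    return hits


# One long-lived connection: the new pod never sees traffic.
sticky = send_requests(ConnectionLevelBalancer(["pod-1"]), 100, scale_up_at=10)

# Reconnecting every 10 requests: traffic reaches the new pod too.
recycled = send_requests(ConnectionLevelBalancer(["pod-1"]), 100,
                         requests_per_connection=10, scale_up_at=10)
```

With a single long-lived connection, every request stays on `pod-1` even after `pod-2` appears, which matches what I am seeing. Delaying requests does not help because the delay does not close the connection.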

I have also tried Linkerd, but the issue is the same. Also, is it possible to scale pods based on the number of requests received? I need to scale for each request, so that only one request runs at a time in a pod.


You probably want to control the keep alive values.
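
One related knob is the maximum connection age on the server: if the second service closes connections after a while, the client has to reconnect, and the new connection gets a chance to land on a newly scaled pod. Below is a minimal sketch, assuming the second service is written in Python with the `grpcio` package (the question does not say which language is used). The function names and the 30s/10s values are illustrative assumptions, not recommendations.

```python
def server_options(max_age_s=30, grace_s=10):
    """Build gRPC channel arguments that limit how long a connection lives."""
    return [
        # Ask clients to go away once a connection is this old (milliseconds).
        ("grpc.max_connection_age_ms", max_age_s * 1000),
        # Extra time for in-flight RPCs to finish before the connection is
        # forcibly closed (milliseconds). For long-running calls this should
        # probably be at least as long as the slowest expected request.
        ("grpc.max_connection_age_grace_ms", grace_s * 1000),
    ]


def build_server(max_workers=4):
    """Create a gRPC server with the options above (requires grpcio)."""
    # Imported here so the options helper can be used without grpcio installed.
    from concurrent import futures
    import grpc

    return grpc.server(
        futures.ThreadPoolExecutor(max_workers=max_workers),
        options=server_options(),
    )
```

You would still register your servicer, add a port, and call `start()` on the returned server as usual. Note that this only spreads load at reconnect time, so it is likely to help most when the first service makes many short calls rather than a few very long ones.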