Hello All,
I am hosting some REST APIs in AKS, and in my use case I expect bursts of load, such as a REST endpoint getting thousands of hits (5K to 10K) in a very short span of time. I am using the iptables-based approach (kube-proxy in iptables mode) to route traffic from a ClusterIP service to the pods. In iptables mode, however, pods are chosen at random, unlike the userspace approach, where pods are chosen in round-robin order.
Each pod has 1 Gi of RAM and 1 CPU core.
What I have been noticing is that the pods do not get an even share of the load. That is expected given the random choice of pods behind the ClusterIP, but what is worse is that some pods hit 100% resource utilization (CPU and/or RAM) and later crash.
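To illustrate the uneven distribution I am describing, here is a minimal Python sketch (not taken from my cluster) that compares random backend selection with round-robin selection. The pod count, request count, and function names are assumptions chosen for illustration. It also treats every request as a new connection, whereas iptables-mode selection happens per connection, so keep-alive clients could skew the real numbers further.

```python
import random
from collections import Counter


def simulate(num_pods, num_requests, seed=0):
    """Return per-pod request counts for random and round-robin selection."""
    rng = random.Random(seed)  # seeded so repeated runs are comparable
    # Random selection: each request independently picks one pod.
    random_counts = Counter(rng.randrange(num_pods) for _ in range(num_requests))
    # Round-robin selection: requests cycle through the pods in order.
    round_robin_counts = Counter(i % num_pods for i in range(num_requests))
    return random_counts, round_robin_counts


def spread(counts, num_pods):
    """Difference between the busiest and the least busy pod."""
    per_pod = [counts.get(pod, 0) for pod in range(num_pods)]
    return max(per_pod) - min(per_pod)


if __name__ == "__main__":
    pods, requests = 5, 10_000  # hypothetical burst against 5 pods
    rand_counts, rr_counts = simulate(pods, requests)
    print("random      :", sorted(rand_counts.items()), "spread", spread(rand_counts, pods))
    print("round robin :", sorted(rr_counts.items()), "spread", spread(rr_counts, pods))
```

Over a large number of requests the random split evens out on average, but within a short burst individual pods can still receive noticeably more than their share, which matches what I am seeing.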
I thought of using a readinessProbe, but my concern is whether it would be an efficient solution given that my scenario is a load burst. For example, suppose the readinessProbe tells the Endpoints controller that a particular pod is reaching its resource limits. Will the Endpoints controller be able to update the iptables rules fast enough that there are no undesired effects in the cluster? Specifically, if there are 100 entries in the iptables rules on a node, how many milliseconds or microseconds will the Endpoints controller take to update them?
Is there any documentation on how long the Endpoints controller takes to update a fairly large set of iptables rules?
My cluster does not have a huge number of pods, just a few fairly large pods on 4 to 5 large VMs.
I was also thinking of reducing the scan interval of the metrics server, which scans pods every 15 seconds by default. Is there any documentation or guidance on customizing the scan interval for the metrics server?
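For context, this is roughly the kind of readiness endpoint I have in mind, as a minimal Python sketch rather than my actual service. The `/ready` path, port 8080, the `LoadTracker` class, and the `MAX_IN_FLIGHT` threshold are all hypothetical. The idea is that the pod reports itself as not ready once too many requests are in flight. As far as I understand, the probe is still only polled on a `periodSeconds` interval of at least one second, which is part of my concern for sub-second bursts.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

MAX_IN_FLIGHT = 50  # hypothetical threshold, would need tuning per pod size


class LoadTracker:
    """Counts in-flight requests so the readiness endpoint can shed load."""

    def __init__(self, max_in_flight):
        self._lock = threading.Lock()
        self._in_flight = 0
        self.max_in_flight = max_in_flight

    def start(self):
        # Called by the application when it begins handling a request.
        with self._lock:
            self._in_flight += 1

    def finish(self):
        # Called by the application when a request completes.
        with self._lock:
            self._in_flight = max(0, self._in_flight - 1)

    def is_ready(self):
        with self._lock:
            return self._in_flight < self.max_in_flight


tracker = LoadTracker(MAX_IN_FLIGHT)


class ReadinessHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/ready":
            # 200 means "send me traffic"; 503 fails the readiness probe.
            self.send_response(200 if tracker.is_ready() else 503)
        else:
            self.send_response(404)
        self.end_headers()


if __name__ == "__main__":
    # The real API handlers (not shown) would call tracker.start()/finish().
    ThreadingHTTPServer(("", 8080), ReadinessHandler).serve_forever()
```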
Thanks,
Himanshu.
Cluster information:
Kubernetes version: 1.20.9
Cloud being used: Azure
Installation method: Azure Kubernetes Service (AKS) cluster created using Azure CLI
Host OS: Ubuntu 18.04
CNI and version: Azure CNI v1.4.14
CRI and version: containerd v1.4.9+azure