Time taken by Endpoints controller to update iptables

Hello All,

I am hosting some REST APIs in AKS, and in my use case I expect bursts of load, such as a REST endpoint getting thousands of hits (5K to 10K) in a very short span of time. I am using the iptables-based approach to route traffic from a ClusterIP service to Pods. In the iptables-based approach, however, Pods are chosen at random, unlike the userspace-based approach, where Pods are chosen in a round-robin manner.

Each of my Pods has 1Gi of RAM and 1 CPU core.

What I have been noticing is that not all Pods get an even load, which is fine given the random choice of Pods by the ClusterIP service. What is worse is that some Pods hit 100% resource utilization (CPU and/or RAM) and then crash.

I thought of using a readinessProbe, but my concern is whether it will be an efficient solution, considering that my scenario is a load burst. For example, let’s say the readinessProbe tells the Endpoints controller that a particular Pod is reaching its resource limits. Will the Endpoints controller be able to update iptables fast enough that there are no undesired effects in the cluster? Specifically, if there are 100 entries in iptables on a node, in how many milliseconds or microseconds will the Endpoints controller be able to update them?

Is there any documentation on how much time the Endpoints controller takes to update a fairly big iptables ruleset?
My cluster does not have a huge number of Pods, just a few fairly large Pods on 4 to 5 large VMs.

I was also thinking of reducing the scan interval of the metrics server, which scans Pods every 15 seconds by default. Is there any documentation or guidance on customizing the scan interval for the metrics server?


Cluster information:

Kubernetes version: 1.20.9
Cloud being used: Azure
Installation method: Azure Kubernetes Cluster using Azure CLI
Host OS: Ubuntu 18.04
CNI and version: Azure CNI v1.4.14
CRI and version: containerd v1.4.9+azure

Bear in mind that “round robin” means nothing when you have N nodes each making
independent decisions. It devolves into random.
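
To make that concrete, here is a toy simulation, not a model of kube-proxy itself. It assumes each node keeps its own private round-robin counter and that whatever sits in front picks a node uniformly at random per request; the pod, node, and request counts are made-up numbers. It measures the most hits any single pod receives within a short run of consecutive requests.

```python
import random
from collections import Counter


def max_hits_in_window(picks, window):
    """Most hits any one pod gets in any run of `window` consecutive requests."""
    worst = 0
    for start in range(len(picks) - window + 1):
        counts = Counter(picks[start:start + window])
        worst = max(worst, max(counts.values()))
    return worst


def global_round_robin(num_pods, num_requests):
    """One shared counter: the textbook round robin."""
    return [i % num_pods for i in range(num_requests)]


def per_node_round_robin(num_pods, num_nodes, num_requests, rng):
    """Every node runs its own round robin and knows nothing about the others."""
    next_pod = [0] * num_nodes  # one private counter per node
    picks = []
    for _ in range(num_requests):
        node = rng.randrange(num_nodes)  # ASSUMPTION: node chosen uniformly at random
        picks.append(next_pod[node])
        next_pod[node] = (next_pod[node] + 1) % num_pods
    return picks


# Hypothetical sizes, chosen only for illustration.
PODS, NODES, REQUESTS = 5, 4, 1000
rng = random.Random(0)

single = max_hits_in_window(global_round_robin(PODS, REQUESTS), PODS)
independent = max_hits_in_window(
    per_node_round_robin(PODS, NODES, REQUESTS, rng), PODS
)
print("shared counter, worst pod in a window:", single)
print("per-node counters, worst pod in a window:", independent)
```

With one shared counter, any window of `PODS` consecutive requests touches each pod exactly once. With per-node counters, nothing stops several nodes from pointing at the same pod at the same moment, which is the sense in which it behaves like random selection during a burst.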

Is this internal traffic (from the cluster) or external (from some LB)? Is it
HTTP or something else? When you layer LBs you also need to be aware of how
those are configured. Some will re-use connections (leading to “hot” nodes),
for example.

The round trip will be impacted by things like the probe period, the kubelet
reporting to the apiserver, and ultimately kube-proxy's “max frequency” control. Worst case
should be O(seconds). The actual iptables write is proportional to how big
your table is (how many total endpoints). For 4-5 nodes this should not be a
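
A back-of-the-envelope way to add those stages up is sketched below. The probe values in the first budget are the Kubernetes defaults for `periodSeconds`, `timeoutSeconds`, and `failureThreshold`; every other number (status reporting delay, kube-proxy minimum sync period, iptables write time) is an assumed placeholder, not a measurement, so substitute values from your own cluster.

```python
from dataclasses import dataclass


@dataclass
class ReadinessBudget:
    probe_period_s: float    # readinessProbe periodSeconds
    probe_timeout_s: float   # readinessProbe timeoutSeconds
    failure_threshold: int   # readinessProbe failureThreshold
    status_report_s: float   # ASSUMED: kubelet -> apiserver -> endpoints update
    proxy_min_sync_s: float  # ASSUMED: kube-proxy "max frequency" (min sync period)
    iptables_write_s: float  # ASSUMED: time to rewrite the rules on one node

    def worst_case_s(self) -> float:
        # Worst case: the pod degrades right after a passing probe, so each of
        # the failure_threshold probes costs a full period, and the last one
        # may run until its timeout before it counts as failed.
        detection = (
            self.failure_threshold * self.probe_period_s + self.probe_timeout_s
        )
        return (
            detection
            + self.status_report_s
            + self.proxy_min_sync_s
            + self.iptables_write_s
        )


# Default probe settings with placeholder values for the remaining stages.
default_probe = ReadinessBudget(10, 1, 3, 1.0, 1.0, 0.05)
# A much tighter (hypothetical) probe with the same placeholder stages.
aggressive_probe = ReadinessBudget(1, 1, 1, 1.0, 1.0, 0.05)

print(f"default probe:    ~{default_probe.worst_case_s():.2f} s")
print(f"aggressive probe: ~{aggressive_probe.worst_case_s():.2f} s")
```

Under these assumptions the probe settings dominate the total and the iptables write is the smallest term, so tuning the probe matters far more than the table size for a cluster of this scale.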

This is from an external LB, and it is HTTP. Thanks for the pointer about hot nodes chosen by LBs.

If the LB chooses a node (or pod) and then keeps the connection alive, it doesn’t matter how good the k8s balancing algorithm is - the upstream is choosing to pound on the same node until it decides to close that connection. This is a common enough problem that it keeps coming up.
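
A small sketch of that effect, under loudly stated assumptions: the LB is modeled as either picking a node per request, or as opening a fixed number of keep-alive connections, pinning each to one node, and sending every request down them. The node, connection, and request counts are hypothetical, and real LBs are more nuanced than this.

```python
import random
from collections import Counter


def per_request_choice(num_nodes, num_requests, rng):
    """LB picks a node independently for every request."""
    return Counter(rng.randrange(num_nodes) for _ in range(num_requests))


def pinned_connections(num_nodes, num_connections, num_requests, rng):
    """LB opens a few keep-alive connections and reuses them for all requests."""
    # Each connection is pinned to whichever node it first landed on.
    pinned = [rng.randrange(num_nodes) for _ in range(num_connections)]
    # Requests are spread over the connections, never over the nodes directly.
    return Counter(pinned[i % num_connections] for i in range(num_requests))


# Hypothetical sizes, chosen only for illustration.
NODES, CONNECTIONS, REQUESTS = 5, 2, 10_000
rng = random.Random(0)

fresh = per_request_choice(NODES, REQUESTS, rng)
reused = pinned_connections(NODES, CONNECTIONS, REQUESTS, rng)

print("per-request choice, hits per node:", dict(sorted(fresh.items())))
print("reused connections, hits per node:", dict(sorted(reused.items())))
```

In the pinned case the whole burst can only ever reach as many nodes as there are connections, regardless of how the cluster balances behind them. Shorter keep-alive lifetimes or more upstream connections on the LB side are the usual levers to look at.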