Kubernetes version: v1.35.0
Host OS: Red Hat Enterprise Linux 9.6
CNI and version: Calico
CRI and version: containerd://2.2.1
==========
I have a Kubernetes cluster stretched across two data centers.
- In DC1 there are 4 worker nodes and 2 master nodes.
- In DC2 there are 4 worker nodes and 1 master node.
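This 2+1 split of the masters is what makes the failure mode below possible. A minimal sketch of the Raft/etcd quorum arithmetic for this layout (assuming a standard stacked-etcd control plane, where quorum for an n-member cluster is floor(n/2) + 1):

```python
# Quorum arithmetic for a 3-member etcd cluster split 2 (DC1) + 1 (DC2).
def quorum(members: int) -> int:
    """Minimum healthy members needed for etcd to commit writes."""
    return members // 2 + 1

total_masters = 3                 # 2 in DC1 + 1 in DC2
needed = quorum(total_masters)    # -> 2
survivors_after_dc1_loss = 1      # only the DC2 master remains

print(f"quorum needed: {needed}")
print(f"survivors after DC1 loss: {survivors_after_dc1_loss}")
print(f"cluster can still write: {survivors_after_dc1_loss >= needed}")  # False
```

So losing DC1 leaves 1 of 3 members, which is below quorum: the surviving apiserver can no longer persist any state changes.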
We performed an HA test. During a simulated failure of DC1, the external load balancer correctly redirected all traffic to DC2.
However, we observed that roughly 50% of the traffic handled by ClusterIP services on the DC2 worker nodes was still being forwarded to pods on nodes in DC1, which were already unavailable.
As a result, the application became unstable because roughly half of the backend traffic was impacted.
This seems to happen because the Kubernetes control plane loses etcd quorum (only one of the three master nodes remains in DC2), which prevents any updates to the cluster state: the failed DC1 nodes are never marked NotReady, their pods are never removed from the Service endpoints, and kube-proxy on the DC2 nodes therefore keeps routing traffic to endpoints that belong to DC1.
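A toy model of that mechanism, assuming kube-proxy's usual behavior of programming dataplane rules from the last EndpointSlice state received over its apiserver watch and simply keeping those rules when the watch breaks (the class and endpoint names here are hypothetical, for illustration only):

```python
class StaleProxySketch:
    """Toy model: a proxy keeps routing to last-known endpoints
    once its apiserver watch stops delivering updates."""

    def __init__(self):
        self.endpoints = []        # last state received from the watch
        self.watch_healthy = True

    def on_watch_event(self, endpoints):
        # Updates only arrive while the watch is healthy.
        if self.watch_healthy:
            self.endpoints = list(endpoints)

    def apiserver_unreachable(self):
        # Quorum lost: the watch breaks and no further updates arrive.
        self.watch_healthy = False

    def route(self):
        # Rules stay programmed from the stale cache.
        return self.endpoints


proxy = StaleProxySketch()
proxy.on_watch_event(["dc1-pod-a", "dc1-pod-b", "dc2-pod-a", "dc2-pod-b"])
proxy.apiserver_unreachable()  # DC1 fails and quorum is lost at the same time
print(proxy.route())           # still lists the dead DC1 pods
```

With half of the cached endpoints living in DC1, round-robin-style selection over this stale list matches the observed ~50% of failed backend requests.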
How can this behavior be eliminated? Specifically, how can we ensure that traffic is not routed to unavailable nodes/pods while the kube-apiserver is unreachable due to loss of quorum (only one master remaining in DC2)?