Traffic Still Routed to a Hung Node for Minutes — Is This a Kubernetes Design Limitation?


Cluster information:

Kubernetes version: 1.18.4
Cloud being used: Tencent cloud
Installation method: managed (cloud provider)
Host OS: CentOS Linux 7 (Core)
CNI and version: Global Router (version unknown)
CRI and version: unknown
kube-proxy mode: IPVS


Hi Kubernetes community,

I’d like to share a real production incident and start a discussion around node failure semantics in Kubernetes, especially what happens when a node becomes effectively hung (not crashed, not rebooted), but still receives traffic for an extended period of time.

This incident involved a cloud Load Balancer + NodePort + kube-proxy (IPVS) setup, and it exposed behavior that is surprising to application owners.


TL;DR

In our incident:

  • A Kubernetes node became severely hung (I/O blocked, kubelet frozen, containers in D state).

  • The cloud provider Load Balancer correctly detected the node as unhealthy and stopped sending traffic directly to that node.

  • However, traffic was still forwarded to the broken node via other healthy nodes’ NodePort services, for ~11–15 minutes.

  • The root cause is that Endpoint updates depend on kubelet, which was already unavailable on the faulty node.

This raises the discussion:

Is it expected that traffic can still reach a hung node even after the LB has excluded it?
Is this a Kubernetes design trade-off or a gap that could be improved?


Architecture Context (Important)

Our setup is very common in managed Kubernetes environments:

Client
  ↓
Cloud Load Balancer (Ingress)
  ↓
NodePort Service (on each node)
  ↓
kube-proxy / IPVS
  ↓
Pods

Key characteristics:

  • The cloud LB forwards traffic to all nodes via NodePort

  • kube-proxy (IPVS mode) forwards traffic from NodePort to Pod endpoints

  • Endpoint membership is driven by kubelet health updates
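
To make the fan-in concrete, here is a minimal sketch of this kind of Service (names, ports and labels are hypothetical, not our real manifests). With the default `externalTrafficPolicy: Cluster`, every node opens the NodePort and may forward an incoming packet to a pod on any other node:

```yaml
# Illustrative sketch only -- names, ports and labels are made up.
apiVersion: v1
kind: Service
metadata:
  name: web-frontend            # hypothetical Service name
spec:
  type: NodePort
  selector:
    app: web-frontend
  ports:
    - port: 80                  # ClusterIP port
      targetPort: 8080          # container port
      nodePort: 30080           # opened on EVERY node; the cloud LB targets this
  # Default behaviour: any node that receives traffic on the NodePort may
  # forward it to a pod on a different node. "Local" would restrict forwarding
  # to pods on the receiving node, at the cost of uneven load spreading and
  # the need for per-node health checking on the LB side.
  externalTrafficPolicy: Cluster
```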


Incident Summary (Simplified Timeline)

  • Node memory usage reached ~98% (no swap enabled).

  • Kernel started heavy page cache reclaim.

  • ext4 journal I/O became blocked.

  • System-wide impact:

    • Containers entered TASK_UNINTERRUPTIBLE (D) state (alive but non-responsive)

    • kubelet became frozen:

      • PLEG unhealthy for ~8 minutes

      • Pod eviction and Endpoint updates stopped

    • The node eventually became NotReady

  • OOM killer finally terminated processes and the system recovered.

Total impact: ~12–15 minutes
Traffic still hitting the broken node: ~11+ minutes


Key Question: Why Was Traffic Still Sent to the Broken Node?

What the Cloud Load Balancer Did (Correctly)

  • The cloud LB probes node health

  • The faulty node failed health checks

  • :white_check_mark: The LB removed the node from its backend pool

  • :white_check_mark: New traffic was no longer sent directly to that node

So far, everything worked as expected.


The Key Detail: NodePort Makes Every Node a Forwarder

Even after the bad node was removed from the LB backend pool:

  • Other healthy nodes were still receiving traffic from the LB

  • Those healthy nodes’ NodePort services continued forwarding traffic

  • kube-proxy on the healthy nodes still treated the pods on the broken node as valid endpoints

This is the critical point.
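
As an illustration of what kube-proxy on the healthy nodes was acting on: an Endpoints object along the following lines (IPs, node names and the Service name are hypothetical) would still list the pod on the hung node as a ready address, so IPVS kept it as a backend:

```yaml
# Hypothetical snapshot -- IPs, names and nodeName values are made up.
apiVersion: v1
kind: Endpoints
metadata:
  name: web-frontend
subsets:
  - addresses:
      - ip: 10.0.1.15
        nodeName: healthy-node-a     # pod on a healthy node
      - ip: 10.0.2.23
        nodeName: hung-node-b        # pod on the hung node: still listed as ready,
                                     # because its kubelet could no longer report otherwise
    ports:
      - port: 8080
        protocol: TCP
```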


Why Endpoints Were Not Updated

Endpoints depend on kubelet reporting Pod / Node status back to the control plane.

In this incident:

  • The kubelet on the faulty node:

    • Was stuck in D state

    • Could not update Pod readiness

    • Could not report endpoint changes

  • As a result:

    • The control plane did not immediately remove the node’s Pod endpoints

    • kube-proxy on other nodes kept forwarding traffic to those endpoints

So the traffic path became:

Client
  ↓
Cloud LB (healthy nodes only)
  ↓
Healthy Node A (NodePort)
  ↓
IPVS selects backend Pod
  ↓
❌ Broken Node B (hung, non-responsive)
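
To make the readiness dependency concrete: as far as we understand, the endpoints controller keeps a pod's IP in the Endpoints object as long as the pod's `Ready` condition is `True`, and that condition is normally flipped only by kubelet (or by the node lifecycle controller once the node is finally marked NotReady). With kubelet frozen, the pod status stays stuck in something like this illustrative excerpt (values are made up):

```yaml
# Illustrative excerpt of a pod's status on the hung node -- values are made up.
status:
  phase: Running
  conditions:
    - type: Ready
      status: "True"                 # stale: the frozen kubelet can no longer update it
      lastTransitionTime: "2024-01-01T10:00:00Z"
# Until this condition flips to "False", the endpoints controller keeps the
# pod's IP in the Endpoints object, and kube-proxy on every node keeps it
# as an IPVS backend.
```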


Why Kube-Proxy (IPVS) Keeps Forwarding Traffic

Two design aspects combined here:

1. Endpoint Membership Depends on kubelet

  • Endpoint updates are not independent of the failing node: they rely on that node’s own kubelet

  • If kubelet is frozen:

    • Pod readiness does not change

    • Endpoints are not removed quickly


2. IPVS Behavior for Existing Connections

  • New connections:

    • Respect current endpoint list

  • Existing connections:

    • Matched purely via IPVS connection table

    • Do not re-check backend health

    • Continue forwarding until:

      • TCP is closed

      • Process dies

      • Or TCP keepalive gives up (first probe only after ~2 hours of idle time by default)

This is intentional for performance reasons.
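
For reference, the ~2-hour figure is the Linux default `net.ipv4.tcp_keepalive_time` (7200 s). One way to shorten it per workload (mentioned again under mitigations below) is pod-level sysctls; note this only affects sockets that actually enable `SO_KEEPALIVE`, and as far as we can tell these are still “unsafe” sysctls in 1.18, so they must be allow-listed on each kubelet. A minimal sketch with made-up names and example values:

```yaml
# Sketch only -- pod name, image and values are illustrative.
# On 1.18 this requires kubelet --allowed-unsafe-sysctls=net.ipv4.tcp_keepalive_*
apiVersion: v1
kind: Pod
metadata:
  name: keepalive-tuned-app          # hypothetical
spec:
  securityContext:
    sysctls:
      - name: net.ipv4.tcp_keepalive_time
        value: "60"                  # first probe after 60 s idle (kernel default: 7200)
      - name: net.ipv4.tcp_keepalive_intvl
        value: "10"                  # interval between probes
      - name: net.ipv4.tcp_keepalive_probes
        value: "3"                   # give up on the connection after 3 failed probes
  containers:
    - name: app
      image: example/app:latest      # hypothetical image
```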


Resulting Failure Mode (Summary)

| Component | View of the node |
| --- | --- |
| Cloud LB | :cross_mark: Removed |
| Control Plane | :warning: NotReady (late) |
| kube-proxy (other nodes) | :white_check_mark: Endpoint still valid |
| TCP | :white_check_mark: Connection ESTABLISHED |
| Application | :cross_mark: Hung |

This is a split-brain-style situation between control plane, dataplane, and cloud LB.


Is This Expected by Design?

From what we understand so far:

  • Kubernetes prioritizes:

    • Performance

    • Stable connections

    • Avoiding aggressive connection drops

  • Node failure handling focuses more on crashes and disconnects than on silent freezes

  • NodePort + IPVS is not tightly coupled with Node readiness signals

However, from an operator’s point of view, this behavior is surprising:

“Once a node is removed from LB and marked NotReady, I expect traffic to stop reaching it shortly.”

That expectation does not hold in this scenario.


Discussion Topics for the Community

I’d love feedback from SIG Node / SIG Network / platform operators:

  1. Is this behavior considered fully expected with NodePort + IPVS?

  2. Should Node NotReady have a stronger or faster impact on dataplane forwarding?

  3. Would it make sense to actively terminate IPVS connections when a node transitions to NotReady?

  4. Are there known best practices specifically for avoiding this NodePort forwarding issue?

    • Especially when kubelet itself is unhealthy

Potential Mitigations We Are Considering

  • Avoid NodePort-based LB fan-in; prefer:

    • Pod-level LB

    • Direct-to-pod cloud LB integrations

  • Aggressive TCP keepalive tuning

  • Shorter application-level timeouts

  • Earlier eviction via memory pressure thresholds (see the kubelet config sketch after this list)

  • Enabling swap to prevent catastrophic I/O stalls

  • Additional readiness gates beyond kubelet signals
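
For the eviction point specifically, the idea is to have kubelet start shedding pods well before memory reaches the ~98% level that froze the node. A minimal sketch of the relevant KubeletConfiguration knobs (threshold values are examples, not recommendations; on a managed cluster they may only be adjustable through the provider’s node pool settings):

```yaml
# Sketch of kubelet eviction / reservation tuning -- values are illustrative only.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionSoft:
  memory.available: "1Gi"            # start evicting early, after a grace period
evictionSoftGracePeriod:
  memory.available: "1m"
evictionHard:
  memory.available: "500Mi"          # evict immediately below this
evictionMaxPodGracePeriod: 30
# Reserving memory for system daemons also helps keep kubelet itself responsive:
systemReserved:
  memory: "512Mi"
kubeReserved:
  memory: "512Mi"
```

The caveat, as the incident itself showed, is that eviction only works while kubelet is still responsive.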

Still, these feel more like mitigations than solutions.


Closing Thoughts

This incident highlights a subtle but important reality:

A node can be logically removed, yet still physically in the traffic path.

Understanding this gap between control plane, dataplane, and cloud LB is crucial for designing resilient Kubernetes systems.

Is this simply the cost of flexibility and performance?
Or an area where Kubernetes could evolve?

Looking forward to the community’s insights.


An operator who learned that “NotReady” does not mean “No Traffic”