Traffic Still Routed to a Hung Node for Minutes — Is This a Kubernetes Design Limitation?


Cluster information:

Kubernetes version: 1.18.4
Cloud being used: Tencent cloud
Installation method: managed (cloud provider)
Host OS: CentOS Linux 7 (Core)
CNI and version: Global Router (version unknown)
CRI and version: unknown
kube-proxy mode: IPVS


Hi Kubernetes community,

I’d like to share a real production incident and start a discussion around node failure semantics in Kubernetes, especially what happens when a node becomes effectively hung (not crashed, not rebooted), but still receives traffic for an extended period of time.

This incident involved a cloud Load Balancer + NodePort + kube-proxy (IPVS) setup, and it exposed behavior that is surprising to application owners.


TL;DR

In our incident:

  • A Kubernetes node became severely hung (I/O blocked, kubelet frozen, containers in D state).

  • The cloud provider Load Balancer correctly detected the node as unhealthy and stopped sending traffic directly to that node.

  • However, traffic was still forwarded to the broken node via other healthy nodes’ NodePort services, for ~11–15 minutes.

  • The root cause is that Endpoint updates depend on kubelet, which was already unavailable on the faulty node.

This raises the discussion:

Is it expected that traffic can still reach a hung node even after the LB has excluded it?
Is this a Kubernetes design trade-off or a gap that could be improved?


Architecture Context (Important)

Our setup is very common in managed Kubernetes environments:

Client
  ↓
Cloud Load Balancer (Ingress)
  ↓
NodePort Service (on each node)
  ↓
kube-proxy / IPVS
  ↓
Pods

Key characteristics:

  • The cloud LB forwards traffic to all nodes via NodePort

  • kube-proxy (IPVS mode) forwards traffic from NodePort to Pod endpoints

  • Endpoint membership is driven by kubelet health updates
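
To make the fan-in concrete, here is a minimal sketch of this kind of Service (names, ports and labels are hypothetical, not our real manifests). With the default `externalTrafficPolicy: Cluster`, every node opens the NodePort and may forward an incoming packet to a pod on any other node:

```yaml
# Illustrative sketch only -- names, ports and labels are made up.
apiVersion: v1
kind: Service
metadata:
  name: web-frontend            # hypothetical Service name
spec:
  type: NodePort
  selector:
    app: web-frontend
  ports:
    - port: 80                  # ClusterIP port
      targetPort: 8080          # container port
      nodePort: 30080           # opened on EVERY node; the cloud LB targets this
  # Default behaviour: any node that receives traffic on the NodePort may
  # forward it to a pod on a different node. "Local" would restrict forwarding
  # to pods on the receiving node, at the cost of uneven load spreading and
  # the need for per-node health checking on the LB side.
  externalTrafficPolicy: Cluster
```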


Incident Summary (Simplified Timeline)

  • Node memory usage reached ~98% (no swap enabled).

  • Kernel started heavy page cache reclaim.

  • ext4 journal I/O became blocked.

  • System-wide impact:

    • Containers entered TASK_UNINTERRUPTIBLE (D) state (alive but non-responsive)

    • kubelet became frozen:

      • PLEG unhealthy for ~8 minutes

      • Pod eviction and Endpoint updates stopped

    • The node eventually became NotReady

  • OOM killer finally terminated processes and the system recovered.

Total impact: ~12–15 minutes
Traffic still hitting the broken node: ~11+ minutes


Key Question: Why Was Traffic Still Sent to the Broken Node?

What the Cloud Load Balancer Did (Correctly)

  • The cloud LB probes node health

  • The faulty node failed health checks

  • :white_check_mark: The LB removed the node from its backend pool

  • :white_check_mark: New traffic was no longer sent directly to that node

So far, everything worked as expected.


The Key Detail: NodePort Makes Every Node a Forwarder

Even after the bad node was removed from the LB backend pool:

  • Other healthy nodes were still receiving traffic from the LB

  • Those healthy nodes’ NodePort services continued forwarding traffic

  • kube-proxy on the healthy nodes still treated the pods on the broken node as valid endpoints

This is the critical point.
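
As an illustration of what kube-proxy on the healthy nodes was acting on: an Endpoints object along the following lines (IPs, node names and the Service name are hypothetical) would still list the pod on the hung node as a ready address, so IPVS kept it as a backend:

```yaml
# Hypothetical snapshot -- IPs, names and nodeName values are made up.
apiVersion: v1
kind: Endpoints
metadata:
  name: web-frontend
subsets:
  - addresses:
      - ip: 10.0.1.15
        nodeName: healthy-node-a     # pod on a healthy node
      - ip: 10.0.2.23
        nodeName: hung-node-b        # pod on the hung node: still listed as ready,
                                     # because its kubelet could no longer report otherwise
    ports:
      - port: 8080
        protocol: TCP
```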


Why Endpoints Were Not Updated

Endpoints depend on kubelet reporting Pod / Node status back to the control plane.

In this incident:

  • The kubelet on the faulty node:

    • Was stuck in D state

    • Could not update Pod readiness

    • Could not report endpoint changes

  • As a result:

    • The control plane did not immediately remove the node’s Pod endpoints

    • kube-proxy on other nodes kept forwarding traffic to those endpoints

So the traffic path became:

Client
  ↓
Cloud LB (healthy nodes only)
  ↓
Healthy Node A (NodePort)
  ↓
IPVS selects backend Pod
  ↓
❌ Broken Node B (hung, non-responsive)
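
To make the readiness dependency concrete: as far as we understand, the endpoints controller keeps a pod's IP in the Endpoints object as long as the pod's `Ready` condition is `True`, and that condition is normally flipped only by kubelet (or by the node lifecycle controller once the node is finally marked NotReady). With kubelet frozen, the pod status stays stuck in something like this illustrative excerpt (values are made up):

```yaml
# Illustrative excerpt of a pod's status on the hung node -- values are made up.
status:
  phase: Running
  conditions:
    - type: Ready
      status: "True"                 # stale: the frozen kubelet can no longer update it
      lastTransitionTime: "2024-01-01T10:00:00Z"
# Until this condition flips to "False", the endpoints controller keeps the
# pod's IP in the Endpoints object, and kube-proxy on every node keeps it
# as an IPVS backend.
```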


Why Kube-Proxy (IPVS) Keeps Forwarding Traffic

Two design aspects combined here:

1. Endpoint Membership Depends on kubelet

  • Endpoint updates are not independent of the failing node: they rely on that node’s own kubelet

  • If kubelet is frozen:

    • Pod readiness does not change

    • Endpoints are not removed quickly


2. IPVS Behavior for Existing Connections

  • New connections:

    • Respect current endpoint list

  • Existing connections:

    • Matched purely via IPVS connection table

    • Do not re-check backend health

    • Continue forwarding until:

      • TCP is closed

      • Process dies

      • Or TCP keepalive gives up (first probe only after ~2 hours of idle time by default)

This is intentional for performance reasons.
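
For reference, the ~2-hour figure is the Linux default `net.ipv4.tcp_keepalive_time` (7200 s). One way to shorten it per workload (mentioned again under mitigations below) is pod-level sysctls; note this only affects sockets that actually enable `SO_KEEPALIVE`, and as far as we can tell these are still “unsafe” sysctls in 1.18, so they must be allow-listed on each kubelet. A minimal sketch with made-up names and example values:

```yaml
# Sketch only -- pod name, image and values are illustrative.
# On 1.18 this requires kubelet --allowed-unsafe-sysctls=net.ipv4.tcp_keepalive_*
apiVersion: v1
kind: Pod
metadata:
  name: keepalive-tuned-app          # hypothetical
spec:
  securityContext:
    sysctls:
      - name: net.ipv4.tcp_keepalive_time
        value: "60"                  # first probe after 60 s idle (kernel default: 7200)
      - name: net.ipv4.tcp_keepalive_intvl
        value: "10"                  # interval between probes
      - name: net.ipv4.tcp_keepalive_probes
        value: "3"                   # give up on the connection after 3 failed probes
  containers:
    - name: app
      image: example/app:latest      # hypothetical image
```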


Resulting Failure Mode (Summary)

| Component | View of the node |
| --- | --- |
| Cloud LB | :cross_mark: Removed |
| Control Plane | :warning: NotReady (late) |
| kube-proxy (other nodes) | :white_check_mark: Endpoint still valid |
| TCP | :white_check_mark: Connection ESTABLISHED |
| Application | :cross_mark: Hung |

This is a split-brain-style situation between control plane, dataplane, and cloud LB.


Is This Expected by Design?

From what we understand so far:

  • Kubernetes prioritizes:

    • Performance

    • Stable connections

    • Avoiding aggressive connection drops

  • Node failure handling focuses more on crashes and disconnects than on silent freezes

  • NodePort + IPVS is not tightly coupled with Node readiness signals

However, from an operator’s point of view, this behavior is surprising:

“Once a node is removed from LB and marked NotReady, I expect traffic to stop reaching it shortly.”

That expectation does not hold in this scenario.


Discussion Topics for the Community

I’d love feedback from SIG Node / SIG Network / platform operators:

  1. Is this behavior considered fully expected with NodePort + IPVS?

  2. Should Node NotReady have a stronger or faster impact on dataplane forwarding?

  3. Would it make sense to actively terminate IPVS connections when a node transitions to NotReady?

  4. Are there known best practices specifically for avoiding this NodePort forwarding issue?

    • Especially when kubelet itself is unhealthy

Potential Mitigations We Are Considering

  • Avoid NodePort-based LB fan-in; prefer:

    • Pod-level LB

    • Direct-to-pod cloud LB integrations

  • Aggressive TCP keepalive tuning

  • Shorter application-level timeouts

  • Earlier eviction via memory pressure thresholds (see the kubelet config sketch after this list)

  • Enabling swap to prevent catastrophic I/O stalls

  • Additional readiness gates beyond kubelet signals
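
For the eviction point specifically, the idea is to have kubelet start shedding pods well before memory reaches the ~98% level that froze the node. A minimal sketch of the relevant KubeletConfiguration knobs (threshold values are examples, not recommendations; on a managed cluster they may only be adjustable through the provider’s node pool settings):

```yaml
# Sketch of kubelet eviction / reservation tuning -- values are illustrative only.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionSoft:
  memory.available: "1Gi"            # start evicting early, after a grace period
evictionSoftGracePeriod:
  memory.available: "1m"
evictionHard:
  memory.available: "500Mi"          # evict immediately below this
evictionMaxPodGracePeriod: 30
# Reserving memory for system daemons also helps keep kubelet itself responsive:
systemReserved:
  memory: "512Mi"
kubeReserved:
  memory: "512Mi"
```

The caveat, as the incident itself showed, is that eviction only works while kubelet is still responsive.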

Still, these feel more like mitigations than solutions.


Closing Thoughts

This incident highlights a subtle but important reality:

A node can be logically removed, yet still physically in the traffic path.

Understanding this gap between control plane, dataplane, and cloud LB is crucial for designing resilient Kubernetes systems.

Is this simply the cost of flexibility and performance?
Or an area where Kubernetes could evolve?

Looking forward to the community’s insights.


An operator who learned that “NotReady” does not mean “No Traffic”