TCP SYN_SENT conntrack flow later becomes ESTABLISHED after backend Pod termination / IP reuse — expected kube-proxy behavior or gap?

Cluster information:

Kubernetes version: 1.33
Cloud being used: AWS
Installation method: EKS
Host OS: Bottlerocket
CNI and version: VPC CNI, v1.21.1-eksbuild.5

Hi all,

I’d like to sanity-check a networking behavior we reproduced in Kubernetes and understand whether this is expected Linux conntrack behavior that must be mitigated only with graceful termination, or whether it points to a kube-proxy / Service-routing gap worth deeper discussion.

Environment

  • Kubernetes on Amazon EKS

  • kube-proxy mode: iptables

  • AWS VPC CNI

  • Bottlerocket nodes

  • Service access pattern: pod-to-pod via svc.cluster.local

  • Clients use HTTP connection pooling / keep-alive

  • Relevant node sysctls observed on affected nodes (see the read command after this list):

    • net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 120

    • net.netfilter.nf_conntrack_tcp_timeout_established = 86400

    • net.netfilter.nf_conntrack_tcp_timeout_close_wait = 3600

    • net.ipv4.tcp_syn_retries = 2

    • net.ipv4.tcp_tw_reuse = 1
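
For reference, these timers can be read directly on a node (on Bottlerocket, e.g. from the admin container); this is only the read command for the values listed above, not a suggestion to change them:

```
# Read the conntrack/TCP timers listed above on an affected node.
sysctl net.netfilter.nf_conntrack_tcp_timeout_syn_sent \
       net.netfilter.nf_conntrack_tcp_timeout_established \
       net.netfilter.nf_conntrack_tcp_timeout_close_wait \
       net.ipv4.tcp_syn_retries \
       net.ipv4.tcp_tw_reuse
```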

Problem statement

We reproduced a case where traffic intended for Service A ended up as a valid TCP connection to a different Pod that, due to rollout timing, reused the originally selected backend IP after the original Pod terminated. Once that valid connection existed, the client’s pooled HTTP transport kept reusing it, which produced repeated wrong-backend HTTP 404s.

The user-visible symptom was not a TCP error. It was sustained application-level 404s from the wrong workload after a rollout/delete event.

What we know from Kubernetes docs

My understanding from the docs is:

  • kube-proxy programs node forwarding from Service + EndpointSlice state. New traffic should follow current kube-proxy state. Existing flows may continue while terminating endpoints drain.

  • terminating endpoints remain published, but are marked ready=false for normal load balancing, and may be marked terminating=true.

  • endpoint propagation and kube-proxy convergence are asynchronous relative to Pod termination.

That all makes sense and matches what we see.
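
For anyone following along, the endpoint conditions can be watched during a rollout straight from the EndpointSlices (“service-a” below is an illustrative Service name):

```
# Dump per-endpoint conditions (ready/serving/terminating) for the Service.
kubectl get endpointslices -l kubernetes.io/service-name=service-a -o yaml \
  | grep -B2 -A4 'conditions:'
```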

The specific question

The part I want to validate is this:

Can the following sequence happen as expected behavior?

  1. Client sends a TCP SYN to a ClusterIP Service during a small stale-routing window after backend Pod termination was initiated (e.g. force kill, OOM kill, or termination without a preStop hook).

  2. kube-proxy/iptables still DNATs that first packet to backend Pod IP X.

  3. Linux conntrack creates a tracked TCP flow in SYN_SENT [UNREPLIED] (an illustrative entry is shown after this list).

  4. That flow remains alive for some time.

  5. The original backend Pod is gone, so the SYN_SENT entry sits unanswered, counting down its 120-second conntrack timeout.

  6. Later, another Pod comes up and reuses IP X (the VPC CNI allows reassignment of a freed IP after 30 seconds by default).

  7. A later SYN retransmit (or another packet on the same tracked flow) is answered, and the same conntrack/NAT mapping becomes ESTABLISHED.

  8. From there, the client’s HTTP pool keeps reusing that valid TCP connection to the wrong backend.
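
To make steps 3 and 7 concrete, here is the shape of the conntrack entry involved (all IPs/ports below are illustrative): the original tuple targets the Service VIP, while the reply tuple already carries the DNATed backend, so the backend choice is frozen into the entry before anything has answered:

```
# List tracked-but-unanswered TCP flows on the client node (conntrack-tools).
conntrack -L -p tcp --state SYN_SENT
# tcp  6 108 SYN_SENT src=10.0.1.23 dst=172.20.5.10 sport=53412 dport=80 \
#   [UNREPLIED] src=10.0.7.42 dst=10.0.1.23 sport=80 dport=53412 mark=0 use=1
#
# original tuple: client 10.0.1.23 -> Service VIP 172.20.5.10
# reply tuple:    backend Pod IP 10.0.7.42 -> client (the frozen DNAT decision)
# Step 7 then only requires that whatever owns 10.0.7.42 answers this tuple.
```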

My main question is:

If the backend IP behind a still-tracked SYN_SENT flow later becomes reachable again, is it expected that Linux conntrack continues using the original NAT/backend mapping instead of forcing a fresh kube-proxy backend decision?

Reproduction evidence

We captured conntrack entries on the client-side node during reproduction.
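
The capture itself needs nothing special; a minimal loop, assuming conntrack-tools and watch are available on the node and using an illustrative client Pod IP, is:

```
# Snapshot conntrack state for one client Pod every second.
watch -n1 'conntrack -L -p tcp --src 10.0.1.23 2>/dev/null | grep -E "SYN_SENT|ESTABLISHED"'
```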

Observed class 1: flows that stayed SYN_SENT [UNREPLIED] only

We saw multiple entries like:

  • SYN_SENT [UNREPLIED]

  • client pod IP / ephemeral source port

  • destination = Service VIP

  • reply tuple/backend = a specific backend Pod IP

  • timeout counting down normally

Those entries never became established and simply expired.

Observed class 2: one flow that was first SYN_SENT [UNREPLIED], later ESTABLISHED

For one specific flow, we observed:

  • same client IP

  • same client source port

  • same Service VIP

  • same backend IP in the reply tuple

It was first seen as SYN_SENT [UNREPLIED], and later it showed as:

  • ESTABLISHED

  • [ASSURED]

This is the important point for us: the same flow/backend mapping appears to have survived the gap and later become a valid TCP connection.

The gap for that flow was roughly 70 seconds between the first observed SYN_SENT [UNREPLIED] and the later ESTABLISHED.
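
For illustration only (made-up addresses, same conventions as the earlier example), the two observations differ only in state and timeout; the tuples, including the reply-side backend IP, are identical:

```
# First observation: unanswered SYN, DNAT already fixed to backend 10.0.7.42.
# tcp  6  83 SYN_SENT src=10.0.1.23 dst=172.20.5.10 sport=53412 dport=80 \
#   [UNREPLIED] src=10.0.7.42 dst=10.0.1.23 sport=80 dport=53412
#
# ~70s later: same tuples, now a valid connection to whatever owns 10.0.7.42,
# with the timeout reset to the 86400s ESTABLISHED timer.
# tcp  6 86398 ESTABLISHED src=10.0.1.23 dst=172.20.5.10 sport=53412 dport=80 \
#   src=10.0.7.42 dst=10.0.1.23 sport=80 dport=53412 [ASSURED]
```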

Client behavior

The client uses pooled HTTP transports and keep-alive. Once the wrong backend started returning valid HTTP responses, the client transport kept reusing that healthy TCP connection, which made the bad application behavior persist well beyond the original routing window.

Why this matters

The key practical question for us is whether this is simply:

  • expected Linux conntrack/NAT behavior for the same in-flight flow,

  • and Kubernetes expects users to mitigate it only with:

    • preStop

    • sufficient terminationGracePeriodSeconds

    • graceful drain / connection close

Or whether kube-proxy / Service handling is expected to do something more aggressive here.

What we are currently doing as mitigation

Our current mitigation direction, sketched below, is:

  • fail readiness immediately on shutdown

  • add preStop

  • increase terminationGracePeriodSeconds

  • keep terminating Pods alive long enough for EndpointSlice + kube-proxy convergence

  • gracefully close/drain existing connections
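
As a sketch of that direction (the Deployment/container names and the durations are illustrative, not tuned values), the shape of the change is:

```
# Illustrative only: the app fails readiness on shutdown itself; preStop then
# delays SIGTERM so EndpointSlice updates and kube-proxy reprogramming can
# converge, and the grace period leaves room to drain in-flight connections.
kubectl patch deployment service-a --type=strategic -p '
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: app
        lifecycle:
          preStop:
            exec:
              command: ["sh", "-c", "sleep 15"]
'
```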

This seems like the correct practical mitigation, but I want to understand whether the reproduced behavior itself is considered expected or whether upstream sees this as a gap.

Specific questions for SIG Network / kube-proxy experts

  1. Is a SYN_SENT [UNREPLIED] -> ESTABLISHED transition on the same Service/client/backend tuple after backend Pod termination / IP reuse considered expected Linux conntrack behavior in iptables-mode kube-proxy?

  2. For a still-tracked TCP flow in SYN_SENT, should later SYN retransmits continue to use the original conntrack/NAT backend mapping rather than a fresh kube-proxy lookup?

  3. Is there any expected kube-proxy behavior that should invalidate or flush such stale TCP conntrack/NAT mappings on endpoint removal? (A sketch of the kind of manual flush we mean follows this list.)

  4. Is this fundamentally “works as designed at Linux/netfilter level, mitigate with graceful termination,” or is this something upstream would consider a kube-proxy correctness gap?

  5. Would nftables mode behave differently here in a meaningful way, or is this still fundamentally a conntrack/lifecycle problem?
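
For concreteness on question 3: the invalidation we have in mind would be equivalent to manually deleting entries whose reply tuple points at the removed endpoint (illustrative Pod IP; conntrack-tools syntax; a stopgap sketch, not a recommendation):

```
# Delete conntrack entries whose reply source is the removed backend Pod IP.
conntrack -D -p tcp --reply-src 10.0.7.42
```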