Pod End of Life: Still serving after SIGTERM?

Hello,
in the Pod Lifecycle | Kubernetes docs:

  1. You use the kubectl tool to manually delete a specific Pod, with the default grace period (30 seconds).
  2. The Pod in the API server is updated with the time beyond which the Pod is considered “dead” along with the grace period […]
    2.2. The kubelet triggers the container runtime to send a TERM signal to process 1 inside each container.
    […]
  3. At the same time as the kubelet is starting graceful shutdown, the control plane removes that shutting-down Pod from EndpointSlice (and Endpoints) objects where these represent a Service with a configured selector. ReplicaSets and other workload resources no longer treat the shutting-down Pod as a valid, in-service replica. Pods that shut down slowly cannot continue to serve traffic as load balancers (like the service proxy) remove the Pod from the list of endpoints as soon as the termination grace period begins.

We see traffic hitting our workloads for a small period of time after the SIGTERM happens. My question is: why is the order of steps 2 and 3 not reversed? In other words, I would like the Pod to no longer be serving strictly before termination starts.
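To make the window concrete: below is a minimal Go sketch (a hypothetical server, not our actual workload) that shuts down the moment SIGTERM arrives. Any request that a not-yet-reprogrammed node routes here between step 2.2 and step 3 finishing everywhere will fail once the listener closes:

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	srv := &http.Server{Addr: ":8080"}
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	})

	// Step 2.2: the container runtime delivers SIGTERM to PID 1.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM)

	done := make(chan struct{})
	go func() {
		<-stop
		// Shutting down immediately closes the listener, so requests
		// from nodes whose dataplane still lists this endpoint fail.
		srv.Shutdown(context.Background())
		close(done)
	}()

	if err := srv.ListenAndServe(); err != http.ErrServerClosed {
		log.Fatal(err)
	}
	<-done // wait for in-flight requests to drain
}
```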

Thanks, Alex

Please read through this and see if it answers your question.


Thank you @thockin !

I know it's not super satisfying. We want a deterministic answer, but I hope you can now reason through why this is hard. You can explore more by emulating the "better" process.

It does “seem” like the k8s controller could dictate the order a bit harder, but I will take your word for it.

Thanks

Thinking on it further (for anyone else looking at this): I believe the EndpointSlice update has to propagate to all the other k8s nodes before those nodes actually stop sending traffic to the pod.
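In the meantime, the usual workaround appears to be keeping the pod serving through a short drain window after SIGTERM, either with a preStop sleep hook or in-process. A minimal Go sketch of the in-process variant (the 10-second window is a guess, and it has to fit inside terminationGracePeriodSeconds, 30s by default):

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	})

	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM)

	done := make(chan struct{})
	go func() {
		<-stop
		// Keep serving while the EndpointSlice update propagates to
		// every node's dataplane, then shut down gracefully. The
		// window length is an assumption, not a recommendation.
		time.Sleep(10 * time.Second)
		srv.Shutdown(context.Background())
		close(done)
	}()

	if err := srv.ListenAndServe(); err != http.ErrServerClosed {
		log.Fatal(err)
	}
	<-done
}
```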

The problem is there is an arbitrary number of things which can take an arbitrary amount of time to program and that is almost entirely OUTSIDE of the core of Kubernetes.

The deterministic answer is something like the following (roughly sketched in code after the list):

  • every interested agent (subsystem, controller, etc) registers their interest in individual endpoints
  • stopping a pod first changes something about the endpoint
  • wait for every interested agent to ACK, probably with a timeout
    • note: this means EVERY NODE and LB controller has to ack that its own dataplane was updated
  • once complete, then start pod shutdown
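For anyone curious what that would look like, here is a rough Go sketch of the fan-out-and-ACK step. Every name in it (Agent, Deprogram, drainEndpoint, fakeNode) is invented for illustration; nothing like this exists in Kubernetes today:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// Agent is anything that programs a dataplane for this endpoint:
// a node's service proxy, an external LB controller, and so on.
type Agent interface {
	// Deprogram returns only once the agent's own dataplane has
	// actually stopped routing to the endpoint (the ACK).
	Deprogram(ctx context.Context, endpoint string) error
}

// drainEndpoint fans out to every registered agent and waits for
// all ACKs, bounded by a timeout; only then may pod shutdown begin.
func drainEndpoint(ctx context.Context, agents []Agent, endpoint string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	var wg sync.WaitGroup
	errs := make(chan error, len(agents))
	for _, a := range agents {
		wg.Add(1)
		go func(a Agent) {
			defer wg.Done()
			errs <- a.Deprogram(ctx, endpoint)
		}(a)
	}
	wg.Wait()
	close(errs)

	for err := range errs {
		if err != nil {
			// An agent timed out or failed. Whether to block the
			// pod or proceed anyway (as today) is a policy choice.
			return fmt.Errorf("drain incomplete: %w", err)
		}
	}
	return nil
}

// fakeNode stands in for a node whose dataplane takes some time
// to reprogram.
type fakeNode struct{ delay time.Duration }

func (n fakeNode) Deprogram(ctx context.Context, endpoint string) error {
	select {
	case <-time.After(n.delay):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	agents := []Agent{fakeNode{10 * time.Millisecond}, fakeNode{2 * time.Second}}
	err := drainEndpoint(context.Background(), agents, "10.0.0.7:8080", time.Second)
	fmt.Println("drain result:", err) // now (and only now) start pod shutdown
}
```

The hard part the sketch glosses over is exactly the point above: the set of agents is open-ended and mostly outside Kubernetes' core, and a single slow or dead agent either blocks every pod deletion or forces the timeout path, which puts you right back where we are today.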

It’s not impossible; it’s just fairly low-RoI effort.
