How to reduce pod crash detection time in Kubernetes


Cluster information:

Kubernetes version: All
Cloud being used: public cloud (GKE)

Hi all,
I am trying to reduce pod crash detection time in Kubernetes/GKE. Generally, when I simulate a pod crash, Kubernetes detects it in approximately 1 to 1.5 seconds.
I want to know:

  1. What configuration parameters can minimize pod crash detection time, and how low can they be set? Are there any links I could visit?
  2. If there are no such configuration parameters, what modules are involved in detecting a pod crash, and which of them contribute the propagation delay that adds up to the 1 to 1.5 seconds of detection time?
This is urgent for us, so a quick response would be much appreciated.
Any pointer will help a lot.
Looking forward to your help.
    Thanks

Hi:

There is the PreStop hook (see the Container hooks documentation) that you may use to react to a pod failure:

This hook is called immediately before a container is terminated due to an API request or management event such as a liveness/startup probe failure, preemption, resource contention and others. A call to the PreStop hook fails if the container is already in a terminated or completed state and the hook must complete before the TERM signal to stop the container can be sent. The Pod’s termination grace period countdown begins before the PreStop hook is executed, so regardless of the outcome of the handler, the container will eventually terminate within the Pod’s termination grace period. No parameters are passed to the handler.
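
If what you want is to shorten the window between a failure and its handling, a pod spec along these lines is the usual place to start. This is a minimal sketch, not a tested setup: the pod name, image, and probe endpoint are placeholders, and the timings shown are simply the lowest values those fields accept:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fast-detect-demo            # hypothetical name
spec:
  terminationGracePeriodSeconds: 5  # shorten the shutdown window (the default is 30)
  containers:
  - name: app
    image: nginx                    # placeholder image
    livenessProbe:
      httpGet:
        path: /                     # placeholder endpoint
        port: 80
      periodSeconds: 1              # probe every second (minimum allowed value)
      failureThreshold: 1           # restart after a single failed probe
      timeoutSeconds: 1
    lifecycle:
      preStop:
        exec:
          # runs before TERM is sent, within the grace period
          command: ["/bin/sh", "-c", "echo terminating"]
```

Note that, as far as I understand, the liveness probe only catches a container that is running but unresponsive; an outright crash (the main process exiting) is noticed by the kubelet through the container runtime, without any probe involved.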

The previous page points to Termination of Pods, which describes in detail the steps from the moment the TERM signal is received by the main process in the container. It also describes what happens when `kubectl delete pod <podname> --force --grace-period=0` is used.

As the kubelet is responsible for the pod, you may look into the kubelet's configuration to check whether there is a parameter that speeds up pod failure detection: see the kubelet reference page in the Kubernetes documentation.
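
For what it's worth, I don't think the detection loop itself is exposed as a flag: if I remember correctly, the kubelet's Pod Lifecycle Event Generator (PLEG) relists container states on a hardcoded one-second period, which would match the 1 to 1.5 seconds you measured. The nearby knobs that do exist in KubeletConfiguration are the sync and status frequencies; here is a sketch, with field names from the kubelet reference and values chosen only as examples:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# How often the kubelet re-syncs pod state; the default is 1m
syncFrequency: 10s
# How often the kubelet posts node status to the API server; the default is 10s
nodeStatusUpdateFrequency: 5s
```

Neither of these changes the PLEG relist period, so I would not expect crash detection to drop much below one second by tuning them.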

As the kubelet is terminating the failed pod, the control plane is working to replace it:

At the same time as the kubelet is starting graceful shutdown, the control plane removes that shutting-down Pod from Endpoints (and, if enabled, EndpointSlice) objects where these represent a Service with a configured selector.

Once the failed pod has been removed, the responsible controller will see a difference between the desired and the actual state, for example the number of replicas in a Deployment. The controller will request a new pod, the scheduler will pick a node for it, and the kubelet on the selected node will start the new pod.
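
To make those last steps concrete, this is the kind of object pair the quoted paragraph refers to; the names and labels are hypothetical. The Service's selector is what lets the control plane drop the terminating pod from the Endpoints/EndpointSlice objects, and the Deployment's replicas field is the desired state the controller reconciles against:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                  # hypothetical name
spec:
  replicas: 3                # desired state; crashed pods are replaced to keep 3 running
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: app
        image: nginx         # placeholder image
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web                 # pods matching this label back the Service's endpoints
  ports:
  - port: 80
```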

Best regards,

Xavi