Liveness/readiness probes do not follow failure count policy on timeout

JohnPolansky · August 22, 2022, 6:31pm

Cluster information:

Kubernetes version: 1.21.13
Cloud being used: AWS
Installation method:
Host OS: bottlerocket

Hey all,

We have our application configured in Kubernetes with liveness and readiness probes like this:

    Liveness:   http-get https://:3443/status delay=300s timeout=5s period=30s #success=1 #failure=20
    Readiness:  http-get https://:3443/status delay=60s timeout=5s period=30s #success=1 #failure=20
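
As a rough sanity check on those settings, here is a minimal Python sketch that parses the liveness line above and estimates how long continuous failures should be tolerated. The helper name `parse_probe` is hypothetical, and the estimate simply assumes failures are counted once per period:

```python
import re

def parse_probe(line: str) -> dict:
    """Extract the numeric settings from a `kubectl describe` probe line."""
    # Matches pairs such as delay=300s or #failure=20 and drops the '#' and 's'.
    fields = dict(re.findall(r"#?(\w+)=(\d+)s?", line))
    return {name: int(value) for name, value in fields.items()}

liveness = parse_probe(
    "http-get https://:3443/status delay=300s timeout=5s period=30s #success=1 #failure=20"
)

# Rough estimate (assumption: one counted failure per period) of how many
# seconds of continuous failure the threshold should absorb before a restart.
tolerated = liveness["failure"] * liveness["period"]
print(liveness, tolerated)
```

With the values shown, the threshold should absorb roughly ten minutes of consecutive failures, which is why a restart after one timeout looks wrong.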

Notice the failure count is 20, which to my knowledge means the probe should not take any direct action until it has failed 20 times. That works for normal non-200 responses. However, we’ve been experiencing issues where the probe calls the ‘/status’ URL and the request times out or is reset by the peer. This is due to the application being overloaded, which yes, I know is our issue to resolve. What I don’t understand is why, when the timeout happens, Kubernetes immediately restarts the pods without waiting for 20 failures.
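
To make the expected behavior concrete, here is a minimal Python sketch of a consecutive-failure counter. It is only a model of the failure-count semantics as described above, not the kubelet's actual code, and it assumes that timeouts and connection resets count as ordinary failures. The names `ProbeCounter` and `first_restart` are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class ProbeCounter:
    failure_threshold: int
    consecutive_failures: int = 0

    def record(self, outcome: str) -> bool:
        """Record one probe outcome; return True when a restart would be due."""
        if outcome == "success":
            # A single success resets the run of failures.
            self.consecutive_failures = 0
            return False
        # Assumption: any other outcome (non-200, timeout, reset) is a failure.
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold

def first_restart(outcomes, failure_threshold):
    """Return the 1-based attempt at which a restart is due, or None."""
    counter = ProbeCounter(failure_threshold)
    for attempt, outcome in enumerate(outcomes, start=1):
        if counter.record(outcome):
            return attempt
    return None

# One timeout with a threshold of 20 should not trigger a restart.
print(first_restart(["timeout"], 20))
# Twenty consecutive timeouts should trigger it on the 20th attempt.
print(first_restart(["timeout"] * 20, 20))
```

Under this model a single timeout never causes a restart when the threshold is 20, so the behavior in the logs would only match a threshold of 1.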

Liveness probe failed: Get "https://10.2.2.89:3443/status": read tcp 10.2.2.149:36702->10.2.2.89:3443: read: connection reset by peer

Liveness probe failed: Get http://10.0.1.220:3000/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

The problem, of course, is that restarting the application after a single failed probe hurts us, since a simple network hiccup or a few seconds of overload can take it down.

Could anyone suggest a solution to this? Is there another “failure count” for network issues, or some other way to get it to allow multiple attempts? We are working to resolve the overload issue in the product, but I’m honestly confused about why it isn’t following the failure count.

Thanks!
