Kubernetes version: 1.21.13
Cloud being used: AWS
Host OS: Bottlerocket
We have our application configured in Kubernetes with liveness and readiness probes like this:
Liveness: http-get https://:3443/status delay=300s timeout=5s period=30s #success=1 #failure=20
Readiness: http-get https://:3443/status delay=60s timeout=5s period=30s #success=1 #failure=20
Notice the failure count is 20, which to my knowledge means Kubernetes should not take any direct action until the probe has failed 20 times. That works as expected for normal non-200 responses. However, we have been experiencing issues where the probe calls the '/status' URL and the request times out or is reset by the peer. This is because the application is overloaded, which I know is our issue to resolve. What I don't understand is that when the timeout happens, Kubernetes immediately restarts the pods without waiting for 20 failures.
Liveness probe failed: Get "https://10.2.2.89:3443/status": read tcp 10.2.2.149:36702->10.2.2.89:3443: read: connection reset by peer
Liveness probe failed: Get http://10.0.1.220:3000/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
The problem, of course, is that restarting the application after a single failed probe hurts us, since a simple network hiccup or a few seconds of overload can take it down.
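To make my expectation concrete, here is a minimal Python sketch of how I understand the failure count to work. It is only a toy model, not kubelet code: the names `ProbeState` and `probes_until_restart` are made up for illustration, and it assumes that failures are counted consecutively, that a success resets the count, and that a timeout or connection reset counts the same as a non-200 response.

```python
from dataclasses import dataclass


@dataclass
class ProbeState:
    """Toy model of a probe's failure counter (hypothetical, not kubelet code)."""
    failure_threshold: int = 20   # "#failure=20" from the probe settings above
    consecutive_failures: int = 0

    def record(self, succeeded: bool) -> bool:
        """Record one probe result; return True if a restart would be triggered."""
        if succeeded:
            # Assumption: any success resets the failure streak.
            self.consecutive_failures = 0
            return False
        # Assumption: timeouts and resets count like any other failed probe.
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


def probes_until_restart(results, failure_threshold=20):
    """Return the 1-based index of the probe that triggers a restart, or None."""
    state = ProbeState(failure_threshold=failure_threshold)
    for index, succeeded in enumerate(results, start=1):
        if state.record(succeeded):
            return index
    return None


if __name__ == "__main__":
    period_seconds = 30  # "period=30s" from the probe settings above

    # A single timeout should not be enough to trigger a restart.
    print(probes_until_restart([False]))        # None

    # Twenty failures in a row should trigger a restart on the 20th probe.
    print(probes_until_restart([False] * 20))   # 20

    # Rough time the app would have to stay unhealthy before a restart.
    print(20 * period_seconds)                  # 600
```

Under this model, with period=30s and a failure count of 20, I would expect the application to have to fail for roughly 600 seconds before a restart, which is not what we are seeing.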
Could anyone suggest a solution? Is there a separate "failure count" for network issues, or some other way to get it to allow multiple attempts? We are working to resolve the overload issue in the product, but I'm honestly confused about why it isn't following the failure count.