When going over the backoff limit, my Job pod seems to restart and then get killed right away

sneko · February 10, 2022, 6:10pm

Hi,

I’m having trouble with a Job.

It has backoffLimit: 1 and restartPolicy: OnFailure (so it should make 2 attempts at most)
I trigger the job
The pod fails the first time…
So it restarts and fails once again.

(From there I’m expecting the job to stop immediately)

BUT it retries a third time (and runs for a few seconds before being interrupted).

It’s this last step that I don’t like. If my job does some writes/mutations this can be sensitive: it’s possible that on this “third time” the job would execute normally… until it gets interrupted early in its process.
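
For reference, a minimal sketch of the kind of Job described above. The Job name, container name and image are placeholders (assumptions, not taken from the real setup); only `backoffLimit: 1` and `restartPolicy: OnFailure` come from the description.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: check-balances            # hypothetical name
spec:
  backoffLimit: 1                 # retries allowed before the Job is marked failed
  template:
    spec:
      restartPolicy: OnFailure    # the kubelet restarts the failed container in place
      containers:
        - name: main              # hypothetical container name
          image: example.com/app:latest   # placeholder image
          command: ["/app/main", "check-balances"]
```

With `OnFailure`, the restart itself is done by the kubelet inside the same pod, while the backoff limit is enforced separately by the Job controller.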

`restartPolicy: OnFailure` has always been tricky (knowledge check)

I’ve been around for some time, and from what I remember, setting restartPolicy: OnFailure just tells the pod to restart in case of failure… and since a basic pod has no direct supervisor, if it fails all the time it should restart indefinitely.

But with jobs, there is something supervising, and when it sees the backoffLimit being reached, it kills the pod (it has no other choice because the pod cannot stop its own restart logic).

Could you first confirm that I’m right about this?

If that’s the case, I suppose the supervisor takes a moment to react to the second failure, so the pod restarts a 3rd time, the container runs some instructions for a couple of seconds, and then it gets the SIGTERM that ends my process.

Does anyone know if that’s the case? Or is it totally impossible, meaning I probably have another issue?

(Just as a test, I set a backoffLimit of 0, and I did see my job run completely the 1st time but fail, then get launched a second time and be interrupted after a few seconds.)

Thank you,

Cluster information:

Kubernetes version: v1.21

[EDIT] Additional information

At first I wanted to keep the post simple with a “basic example” to get your feedback/thoughts on this… but it’s probably worth mentioning that my pod contains 2 containers:

mine (the current job)
an Istio sidecar container (that allows me to communicate with other services due to mTLS)

Since an Istio sidecar is a long-running process, it won’t stop when my container finishes, so I have to explicitly tell the sidecar to stop.

This is done with a trick in my container’s command:

    # "/quitquitquit" is a workaround to force the Istio sidecar to close at the end
    # (ref: https://github.com/istio/istio/issues/6324)
    command: [
      "/bin/sh",
      "-c",
      "/app/main check-balances && curl -X POST http://localhost:15020/quitquitquit",
    ]

But if my own container ends with a failure (so the pod should restart), the curl command that asks the Istio sidecar to stop should not be executed, since it is chained with &&; it only runs when my job finishes properly.

Maybe this is what happens: my container is about to restart at the moment the backoffLimit should stop it, but since my Istio sidecar is still alive, the whole pod doesn’t stop right away, which gives my container time to restart a third time before the interruption takes effect?

Sorry if it’s unclear… there are so many parameters hmmm
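
The `&&` behaviour relied on here can be spelled out explicitly. The following is only a sketch of the same command written as a small inline script, to make it visible that the sidecar is asked to quit only when the job exits with status 0. One deliberate difference from the original one-liner, which is an assumption about what is wanted: the container exits with the job's own status rather than curl's.

```yaml
# Sketch of the container's command only; the rest of the pod spec is unchanged.
command:
  - /bin/sh
  - -c
  - |
    /app/main check-balances
    status=$?
    # Only ask the Istio sidecar to quit when the job itself succeeded.
    if [ "$status" -eq 0 ]; then
      curl -X POST http://localhost:15020/quitquitquit
    fi
    # Report the job's exit code, not curl's.
    exit "$status"
```

On failure the sidecar is left running in both versions, so the pod stays alive and the failed container can be restarted in place.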

Thank you,

camilioni · March 22, 2024, 7:38pm

Hi @sneko ,

How did you solve this or do you still see this behaviour? I am seeing the exact same behaviour and funny enough, am using the same workaround to quit my istio sidecar!

Regards

sneko · March 24, 2024, 12:13am

@camilioni I don’t remember finding a better solution, sorry. I’m no longer using Kubernetes, to stay in good health
