After exceeding the backoff limit, my Job pod seems to restart and then get killed right away

Hi,

I’m having trouble with a Job.

  1. It has backoffLimit: 1 and restartPolicy: OnFailure (so it should make at most 2 attempts)
  2. I trigger the job
  3. The pod fails the first time…
  4. So it restarts and fails once again.

(At that point I expect the job to stop immediately.)

  5. BUT it retries a third time (and runs for a few seconds before being interrupted).

It’s this fifth step that I don’t like. If my job performs writes/mutations, this can be risky: on this “third time” the job could start executing normally… until it gets interrupted early in its processing.
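
For reference, here is a minimal sketch of the kind of setup I mean, written as a shell snippet that saves a manifest to a file. The job name (backoff-repro), the busybox image and the always-failing command are placeholders made up for illustration, not my real workload; only backoffLimit: 1 and restartPolicy: OnFailure come from my actual job.

```shell
# Write a minimal Job manifest to a local file (names and image are hypothetical).
cat > job-repro.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: backoff-repro
spec:
  backoffLimit: 1                 # same value as in my real job
  template:
    spec:
      restartPolicy: OnFailure    # container is restarted in place on failure
      containers:
        - name: main
          image: busybox
          # Always fails after a few seconds, so every attempt is visible.
          command: ["/bin/sh", "-c", "echo attempt started; sleep 5; exit 1"]
EOF

# Against a cluster, this could then be applied and observed with:
#   kubectl apply -f job-repro.yaml
#   kubectl get pods -l job-name=backoff-repro --watch
```

Counting the “attempt started” lines in the container logs should show whether a third attempt really begins before the pod is terminated.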

restartPolicy: OnFailure has always been tricky (knowledge check)

I’ve been around for some time, and from what I remember, setting restartPolicy: OnFailure just tells the pod to restart in case of failure… and since a basic pod has no direct supervisor, if it fails every time it will restart indefinitely.

But with jobs, something is supervising, and when it sees that the backoffLimit has been reached, it kills the pod (it has no other choice, because the pod cannot stop its own restart logic).

Could you first confirm that I’m right about this?

If that’s the case, I suppose the supervisor takes a moment to react to the second failure, so the pod restarts a 3rd time, the container executes some instructions for a couple of seconds, and then it receives the SIGTERM that ends my process.

Does anyone know whether that’s the case? Or is it totally impossible, and I probably have another issue?

(Just as a test, I set a backoffLimit of 0, and I saw my job run completely the first time but fail, then get launched a second time and be interrupted after a few seconds.)

Thank you,

Cluster information:

Kubernetes version: v1.21

[EDIT] Additional information

At first I wanted to keep the post simple with a “basic example” to get your feedback/thoughts on this… but it’s probably worth mentioning that my pod contains 2 containers:

  • mine (the current job)
  • an Istio sidecar container (that allows me to communicate with other services due to mTLS)

Since an Istio sidecar is a long-running process, it won’t stop when my container finishes, so I have to explicitly tell the sidecar to stop.

This is done with a trick in my container’s command:

              command: [
                  "/bin/sh",
                  "-c",
                  "/app/main check-balances && curl -X POST http://localhost:15020/quitquitquit",
                ] # "/quitquitquit" is a workaround to force Istio sidecar closing at the end (ref: https://github.com/istio/istio/issues/6324)

But if my own container ends with a failure (so the pod should restart), the curl command asking the Istio sidecar to stop should not be executed, since it’s chained with &&; it only runs when my program finishes successfully.
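
To double-check that reasoning about &&, here is a small sketch with stand-ins. The functions run_job and stop_sidecar are hypothetical replacements for “/app/main check-balances” and the curl call to /quitquitquit, and the exit code 1 is just an assumed failure code:

```shell
# Hypothetical stand-in for "/app/main check-balances" ending in failure.
run_job() { return 1; }

# Hypothetical stand-in for "curl -X POST http://localhost:15020/quitquitquit".
stop_sidecar() { echo "sidecar stop requested"; }

# Same shape as my container command: the right-hand side only runs
# when the left-hand side exits with status 0.
status=0
output=$(run_job && stop_sidecar) || status=$?

echo "output='${output}' status=${status}"
# prints: output='' status=1
```

So when my program fails, the shell exits with the program’s non-zero status and the sidecar is never asked to quit, which means the sidecar is still running while my container gets restarted.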

Maybe, because my Istio sidecar is still alive when my container is about to restart (although it is expected to be stopped right away by the backoffLimit), my whole pod doesn’t stop immediately, which gives my container time to restart a third time before the interruption takes effect?

Sorry if it’s unclear… there are so many parameters, hmmm.

Thank you,