Handling Jobs With Long-Running InitContainers

Hey folks!

Just recently ran into an issue that someone might have some insight into. I currently have a K8s Job with a long-running init container (it waits for a service that can often take a while to come up), and after a recent update to GKE 1.26, I’ve noticed that after roughly 5 minutes of waiting, a new instance of the Job gets spun up. Having two of these running in parallel isn’t great (both make API calls that could conflict, etc.):

apiVersion: batch/v1
kind: Job
metadata:
  name: my-job-{{ now | date "20060102150405" }}
  labels:
    app: my-job
spec:
  backoffLimit: 0
  template:
    metadata:
      labels:
        app: my-job
      annotations:
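        # Allow the cluster autoscaler to evict this pod during node scale-down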
        "cluster-autoscaler.kubernetes.io/safe-to-evict": "true"
    spec:
      restartPolicy: Never
      ...
      initContainers:
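      # Blocks the main container until someService responds with HTTP 200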
      - name: wait-service
        ...
        command: ['bash', '-c', 'while [[ "$(curl -s -o /dev/null -w ''%{http_code}'' http://someService/api/v1/status)" != "200" ]]; do echo waiting for service; sleep 2s; done']
      containers:
        - name: run-job
          ...
      volumes:
          ...
      tolerations: 
          ...

I’m trying to determine the best mechanism for tolerating a long-running initContainer so that this single Job keeps trying until it ultimately fails. I initially attempted to set the spec.activeDeadlineSeconds property to some larger value like 20 minutes; however, I noticed that after 5 minutes of waiting the secondary job still spun up.
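For reference, the placement I tried was roughly this (activeDeadlineSeconds at the Job spec level; 1200 here is just a stand-in for the ~20 minutes I used):

apiVersion: batch/v1
kind: Job
metadata:
  name: my-job-{{ now | date "20060102150405" }}
spec:
  # Deadline for the Job as a whole (init wait included) before it is marked failed
  activeDeadlineSeconds: 1200
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      ...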

I feel like there should be a simple fix here, but I’m just not quite sure what specifically to set to give my initContainer time to finish before the job is attempted again:

│ wait-service waiting for service
│ failed container "run-job" in pod "my-job-20230721165715-rh6s2" is waiting to start: PodInitializing for .../my-job-20230721165715-rh6s2 (run-job)
│ wait-service waiting for service

The “failed container” message above is the point where I see the separate job begin to spin up, and things end up in an overall bad state. I’m just trying to figure out the best way to handle this, as the job itself isn’t going to function well in a scenario where multiple instances of it are executing at once.

Thanks!

After researching this a bit more, I’m wondering if restartPolicy: Never is actually causing this problem, as the underlying container (run-job) fails after 5 minutes while the initContainer is still waiting. Since the policy is set to never restart, it gets treated as a failure instead of continuing with the original job.

Perhaps I need to adjust the restartPolicy and add an activeDeadlineSeconds to control when the job is truly considered failed?

Does that make sense? Or maybe there’s a better way of handling this?

I attempted to apply the changes mentioned above, namely:

restartPolicy: OnFailure
...
activeDeadlineSeconds: 1600
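
Placement-wise, that was roughly (activeDeadlineSeconds still at the Job spec level, restartPolicy on the pod template):

apiVersion: batch/v1
kind: Job
spec:
  activeDeadlineSeconds: 1600
  backoffLimit: 0
  template:
    spec:
      # OnFailure restarts failed containers in place rather than failing the pod outright
      restartPolicy: OnFailure
      ...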

However, it still appeared that another job was created after the failed container message showed up in the original pod’s logs (after ~5 minutes of trying).