Handling Jobs With Long-Running InitContainers

Hey folks!

Just recently ran into an issue that someone might have some insight into. I currently have a K8s Job with a long-running init container (it waits for a service that can often take a while to come up), and after a recent update to GKE 1.26, I’ve noticed that after roughly 5 minutes of waiting, a new instance of the Job gets spun up. Having two of these running in parallel isn’t great (both make API calls that could conflict, etc.):

apiVersion: batch/v1
kind: Job
metadata:
  name: my-job-{{ now | date "20060102150405" }}
  labels:
    app: my-job
spec:
  backoffLimit: 0
  template:
    metadata:
      labels:
        app: my-job
      annotations:
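        # Allow the cluster autoscaler to evict this pod during node scale-down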
        "cluster-autoscaler.kubernetes.io/safe-to-evict": "true"
    spec:
      restartPolicy: Never
      ...
      initContainers:
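      # Blocks the main container until someService responds with HTTP 200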
      - name: wait-service
        ...
        command: ['bash', '-c', 'while [[ "$(curl -s -o /dev/null -w ''%{http_code}'' http://someService/api/v1/status)" != "200" ]]; do echo waiting for service; sleep 2s; done']
      containers:
        - name: run-job
          ...
      volumes:
          ...
      tolerations: 
          ...

I’m trying to determine the best mechanism for tolerating a long-running initContainer so that this single Job keeps trying until it ultimately fails. I initially attempted to set the spec.activeDeadlineSeconds property to some larger value like 20 minutes; however, I noticed that after 5 minutes of waiting the secondary job still spun up.
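For reference, the placement I tried was roughly this (activeDeadlineSeconds at the Job spec level; 1200 here is just a stand-in for the ~20 minutes I used):

apiVersion: batch/v1
kind: Job
metadata:
  name: my-job-{{ now | date "20060102150405" }}
spec:
  # Deadline for the Job as a whole (init wait included) before it is marked failed
  activeDeadlineSeconds: 1200
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      ...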

I feel like there should be a simple fix here, but I’m just not quite sure what specifically to set to give my initContainer time to finish before the job is attempted again:

│ wait-service waiting for service
│ failed container "run-job" in pod "my-job-20230721165715-rh6s2" is waiting to start: PodInitializing for .../my-job-20230721165715-rh6s2 (run-job)
│ wait-service waiting for service

The “failed container” message above is the point where I see the separate job begin to spin up, and things end up in an overall bad state. I’m just trying to figure out the best way to handle this, as the job itself isn’t going to function well in a scenario where multiple instances of it are executing at once.

Thanks!

After researching this a bit more, I’m wondering if restartPolicy: Never is actually causing this problem, as the underlying container (run-job) fails after 5 minutes while the initContainer is still waiting. Since the policy is set to never restart, it gets treated as a failure instead of continuing with the original job.

Perhaps I need to adjust the restartPolicy and add an activeDeadlineSeconds to control when the job is truly considered failed?

Does that make sense? Or maybe there’s a better way of handling this?

I attempted to apply the changes mentioned above, namely:

restartPolicy: OnFailure
...
activeDeadlineSeconds: 1600
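
Placement-wise, that was roughly (activeDeadlineSeconds still at the Job spec level, restartPolicy on the pod template):

apiVersion: batch/v1
kind: Job
spec:
  activeDeadlineSeconds: 1600
  backoffLimit: 0
  template:
    spec:
      # OnFailure restarts failed containers in place rather than failing the pod outright
      restartPolicy: OnFailure
      ...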

However, it still appeared that another job was created after the failed container message showed up in the original pod’s logs (after ~5 minutes of trying).