Hey folks!
Just recently ran into an issue that someone might have some insight into. I currently have a K8s Job with a long-running init container (it waits for a service that can often take a while to come up). After a recent upgrade to GKE 1.26, I've noticed that after roughly 5 minutes of waiting, a new instance of the Job gets spun up. Having two of these running in parallel isn't great (both make API calls that could conflict, etc.):
apiVersion: batch/v1
kind: Job
metadata:
  name: my-job-{{ now | date "20060102150405" }}
  labels:
    app: my-job
spec:
  backoffLimit: 0
  template:
    metadata:
      labels:
        app: my-job
      annotations:
        "cluster-autoscaler.kubernetes.io/safe-to-evict": "true"
    spec:
      restartPolicy: Never
      ...
      initContainers:
        - name: wait-service
          ...
          command: ['bash', '-c', 'while [[ "$(curl -s -o /dev/null -w ''%{http_code}'' http://someService/api/v1/status)" != "200" ]]; do echo waiting for service; sleep 2s; done']
      containers:
        - name: run-job
          ...
      volumes:
        ...
      tolerations:
        ...
I'm trying to determine the best mechanism to tolerate a long-running initContainer, so that this single Job keeps trying until it ultimately fails on its own. I initially attempted to set the spec.activeDeadlineSeconds property to a larger value (around 20 minutes), but after 5 minutes of waiting the second Job still spun up.
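For reference, this is roughly what that attempt looked like; the 1200 is just my 20-minute deadline expressed in seconds, and the rest of the spec is unchanged from above:

apiVersion: batch/v1
kind: Job
metadata:
  name: my-job-{{ now | date "20060102150405" }}
spec:
  # Fail the whole Job (init container time included) after 20 minutes.
  activeDeadlineSeconds: 1200
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      ...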
I feel like there should be a simple fix here, but I'm just not quite sure what specifically to set to give my initContainer time to finish before another instance of the Job is attempted. Here's what the logs look like around the time the second instance appears:
│ wait-service waiting for service
│ failed container "run-job" in pod "my-job-20230721165715-rh6s2" is waiting to start: PodInitializing for .../my-job-20230721165715-rh6s2 (run-job)
│ wait-service waiting for service
The “failed container” message above is the point at which I see the separate Job get spun up, and things end up in an overall bad state. Just trying to figure out the best way to handle this, since the Job isn't going to function well with multiple instances of it executing at once.
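One idea I've been toying with is bounding the wait loop itself, so the init container exits non-zero on its own after a fixed number of attempts instead of looping forever. Below is only a sketch of that; the 600-attempt cap (roughly 20 minutes at 2-second intervals) is just a number I picked, and I'm not sure whether this is the right approach versus a Job- or Pod-level setting:

      initContainers:
        - name: wait-service
          ...
          # Sketch: give up after ~20 minutes (600 attempts x 2s) so the init
          # container fails on its own rather than waiting indefinitely.
          command:
            - bash
            - -c
            - |
              for ((i = 1; i <= 600; i++)); do
                code="$(curl -s -o /dev/null -w '%{http_code}' http://someService/api/v1/status)"
                if [[ "$code" == "200" ]]; then
                  exit 0
                fi
                echo "waiting for service (attempt $i)"
                sleep 2s
              done
              echo "service never became ready" >&2
              exit 1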
Thanks!