Kubernetes version: 1.18.4
Cloud being used: bare-metal
Host OS: Ubuntu
Dear all,
I have a job launched with the following settings:
completions: 30
parallelism: 30
requests:
  memory: "15.0Gi"
limits:
  memory: "15.0Gi"
restartPolicy: Never
backoffLimit: 60
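For completeness, this is roughly how those settings sit in the Job manifest; the container name and image below are placeholders, the rest matches what I deployed:

apiVersion: batch/v1
kind: Job
metadata:
  name: jobid5614799
  namespace: lenai
spec:
  completions: 30
  parallelism: 30
  backoffLimit: 60
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker                            # placeholder container name
        image: registry.example/worker:latest   # placeholder image
        resources:
          requests:
            memory: "15.0Gi"
          limits:
            memory: "15.0Gi"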
The pods currently fail because they need more memory than allocated (i.e. the requested 15Gi is not enough), so they end up OOMKilled.
However, I expected the job to keep retrying until the backoff limit was reached.
Instead, in the job description I have:
Pods Statuses: 0 Running / 0 Succeeded / 19 Failed
So overall the job tried only 19 times, instead of the 60 I had assumed it would.
More than 12 hours have passed since the last pod was created, and I have even seen in the job events that some pods were deleted (instead of being left in OOMKilled status), so I guess the job has given up and ended.
However, I do still have 13 pods in OOMKilled status that belong to that job.
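I am listing them with the job-name label (assuming the pods still carry that standard label added by the job controller):

kubectl get pods -n lenai -l job-name=jobid5614799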
Doing
kubectl get jobs jobid5614799 -n lenai -o jsonpath='{.status.conditions}'
I get
[map[lastProbeTime:2020-12-29T16:13:42Z lastTransitionTime:2020-12-29T16:13:42Z message:Job has reached the specified backoff limit reason:BackoffLimitExceeded status:True type:Failed]]
So indeed the job has given up.
Notice that I started the job on Tue, 29 Dec 2020 at 11:58, so in the roughly 4 hours before the BackoffLimitExceeded condition at 16:13 it could not have launched and failed 60 pods: each pod exceeds its memory after about 3 hours, and at most 12 pods can run concurrently under the current quota in that namespace.
So I think the 19 failed pods are indeed all the pods that were ever launched, well below the backoffLimit of 60.
I do not have activeDeadlineSeconds set in the job.
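(For reference, this can be verified with

kubectl get job jobid5614799 -n lenai -o jsonpath='{.spec.activeDeadlineSeconds}'

which should come back empty when the field is unset.)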
What has happened? Why did it give up after only 19 failures?
Why did it delete some of the failed pods?