(CronJob) Pods are deleted immediately when a job fails


#1

I have a question regarding the CronJob resource. In the spec I have set both successfulJobsHistoryLimit and failedJobsHistoryLimit to 3, and backoffLimit to 0, so that the history of at least the last 3 jobs is always kept and a job is not retried if it fails for any reason (e.g. OOM). This works, but what I am seeing is that the Pods are terminated (deleted) immediately when they fail, even though the Job objects are kept according to the history limits.
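For context, this is a minimal sketch of the kind of CronJob spec described above; the name, schedule, image, and command are placeholders, not the actual workload:

```yaml
apiVersion: batch/v1beta1        # CronJob API group on Kubernetes 1.11
kind: CronJob
metadata:
  name: example-cron             # placeholder name
spec:
  schedule: "*/5 * * * *"        # placeholder schedule
  successfulJobsHistoryLimit: 3  # keep the last 3 successful Jobs
  failedJobsHistoryLimit: 3      # keep the last 3 failed Jobs
  jobTemplate:
    spec:
      backoffLimit: 0            # never retry a failed pod
      template:
        spec:
          restartPolicy: Never   # required for Job pods; OnFailure is the other option
          containers:
          - name: main
            image: busybox       # placeholder image
            command: ["sh", "-c", "exit 1"]
```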

According to the Job docs:
When a Job completes, no more Pods are created, but the Pods are not deleted either. Keeping them around allows you to still view the logs of completed pods to check for errors, warnings, or other diagnostic output. The job object also remains after it is completed so that you can view its status.

This seems to contradict the behavior I am seeing (pods are deleted). Is this by any chance different for CronJobs or is there a bug?

I am running GKE 1.11.7-gke.4

I want to be able to find out why a previous job has failed but this is not possible if the pod is deleted immediately.


#2

So, you see this behavior only with failed CronJobs and not with successful ones?

Have you checked whether it happens with plain Jobs (instead of CronJobs) too?

Also, what do you mean by “even though the jobs are kept according to the history limits”? Can you be more specific about what exactly is preserved? I get that the pods aren’t, but I’m not sure I follow what is :slight_smile:

Thanks!


#3

@rata

I only tested this with failed jobs, as I ran into this while debugging an issue with my CronJob. I have not tried it with a plain Job.

What I meant was that the Job objects themselves were not deleted and were kept in accordance with the numbers set for successfulJobsHistoryLimit and failedJobsHistoryLimit, respectively. The Pods were deleted as soon as they failed, though.

After posting I deleted and re-created the CronJob object (instead of just editing it) and now it seems to work fine with the following settings:

jobTemplate.spec.backoffLimit = 0
successfulJobsHistoryLimit = 1
failedJobsHistoryLimit = 1
podSpec.restartPolicy = Never
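For reference, those dotted paths nest as follows in the CronJob manifest (a fragment showing only the relevant fields, not a complete spec):

```yaml
spec:
  successfulJobsHistoryLimit: 1  # keep only the last successful Job
  failedJobsHistoryLimit: 1      # keep only the last failed Job
  jobTemplate:
    spec:
      backoffLimit: 0            # do not retry a failed pod
      template:
        spec:
          restartPolicy: Never   # podSpec.restartPolicy
```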

#4

Pods are not deleted now? (Only the last one is kept?)

And if you set the limits back to 3, does the same thing happen again?

It would be nice to try with successful pods too, to better understand the problem, IMHO.


#5

Yes, I will do more testing with both failed and successful pods but so far it seems to be working.


#6

Glad!

Please report back. Even if it just happens to work with both limits set to 3 now :slight_smile: