How to limit amount of time spent on ImagePullBackOff

I am running a batch/v1 Job whose pod references an invalid image repository and/or tag. The pod spends a significant amount of time with PodScheduled=true and Ready=false, constantly retrying the image pull with a back-off algorithm. kubectl describe pod shows the following events:

  Type     Reason     Age                   From                                                Message
  ----     ------     ----                  ----                                                -------
  Normal   Scheduled  12m                   default-scheduler                                   Successfully assigned default/da10566d0b04462ebebcbbe6b964274c-85jdg to ory-cloud-tests-integration-control-plane
  Normal   Pulling    11m (x4 over 12m)     kubelet, ory-cloud-tests-integration-control-plane  Pulling image "this-image-does-not-exist82f21188-71fa-4179-aced-f26716a1a129:testing"
  Warning  Failed     11m (x4 over 12m)     kubelet, ory-cloud-tests-integration-control-plane  Failed to pull image "this-image-does-not-exist82f21188-71fa-4179-aced-f26716a1a129:testing": rpc error: code = Unknown desc = failed to resolve image "docker.io/library/this-image-does-not-exist82f21188-71fa-4179-aced-f26716a1a129:testing": no available registry endpoint: pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed
  Warning  Failed     11m (x4 over 12m)     kubelet, ory-cloud-tests-integration-control-plane  Error: ErrImagePull
  Warning  Failed     10m (x6 over 12m)     kubelet, ory-cloud-tests-integration-control-plane  Error: ImagePullBackOff
  Normal   BackOff    2m31s (x41 over 12m)  kubelet, ory-cloud-tests-integration-control-plane  Back-off pulling image "this-image-does-not-exist82f21188-71fa-4179-aced-f26716a1a129:testing"

I would like to limit the number of retries and/or cap the amount of time the pod can remain in this state, because I want my Job to finish fast when such an error occurs.

Is that possible?

My JobSpec is created with the Go client (client-go) and looks like this:

if _, err := t.d.KubernetesClient().BatchV1().Jobs(job.Namespace).Create(&batchv1.Job{
	ObjectMeta: metav1.ObjectMeta{
		Name:      job.ID,
		Namespace: job.Namespace,
		Labels: map[string]string{
			kindName:         "tenant_runner",
			labelTenantID:    tenant.ID,
			labelTenantSlug:  tenant.Slug,
			labelTenantJobID: job.ID,
		},
	},
	Spec: batchv1.JobSpec{
		// No parallelism!!
		Parallelism: pointerx.Int32(1),

		// Run the job to completion only once!
		Completions: pointerx.Int32(1),

		// Retry running the job at most 5 times.
		BackoffLimit: pointerx.Int32(t.JobBackoffLimit),

		// This cleans up the jobs after a day when finished.
		TTLSecondsAfterFinished: pointerx.Int32(int32((time.Hour * 24).Seconds())),

		Template: v1.PodTemplateSpec{
			Spec: v1.PodSpec{
				// This needs to be Never because the Job controller takes care of restarting the pod!
				RestartPolicy: v1.RestartPolicyNever,

				ActiveDeadlineSeconds: pointerx.Int64(int64((t.podExecutionTimeout).Seconds())),
				ServiceAccountName: t.c.TenantExecutorServiceAccount(),

				Containers: []v1.Container{
					{
						// ImagePullPolicy must be PullIfNotPresent or kind will fail integration tests,
						// as it would try to fetch the image from the remote registry.
						// see: https://kind.sigs.k8s.io/docs/user/quick-start/#loading-an-image-into-your-cluster
						//
						// It's also very important to use a predictable image tag here (e.g. v1.0).
						ImagePullPolicy: v1.PullIfNotPresent,
						Name:            job.ID,
						Image:           t.c.TenantExecutorImage(),
						Command:         []string{"backoffice"},
						Args:            args,
						Env:             []v1.EnvVar{{Name: "LOG_LEVEL", Value: "trace"}, {Name: TenantJSONEnv, Value: encodedTenant}},
					},
				},
			},
		},
	},
}); err != nil {
	return err
}
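
For reference, batchv1.JobSpec also has its own ActiveDeadlineSeconds field, distinct from the pod-level one I set above. As far as I understand, it is an overall deadline rather than a retry limit: once it elapses, the Job controller terminates the pods and marks the Job failed with reason DeadlineExceeded, regardless of what state the pods are in, so it would also bound time spent in ImagePullBackOff. A minimal sketch (the job name, the 300-second value and the pointer helpers are placeholders; the pod template is omitted):

package main

import (
	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func int32Ptr(i int32) *int32 { return &i }
func int64Ptr(i int64) *int64 { return &i }

// jobWithDeadline sketches a Job whose total lifetime is capped, which also
// limits how long its pods can sit in ImagePullBackOff.
func jobWithDeadline() *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "example-job"},
		Spec: batchv1.JobSpec{
			// BackoffLimit only counts pods that actually fail; a pod stuck in
			// ImagePullBackOff typically stays Pending, so it never counts here.
			BackoffLimit: int32Ptr(5),
			// Job-level deadline, measured from the Job's start time. When it is
			// exceeded, the pods are terminated and the Job is marked Failed with
			// reason DeadlineExceeded.
			ActiveDeadlineSeconds: int64Ptr(300),
			// Template omitted - see the full spec above.
		},
	}
}

func main() { _ = jobWithDeadline() }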

Kubernetes set a timeout limit on image pulls

I don’t think the provided link to SO is a valid answer to this question: that link describes setting a timeout per image pull, whereas this thread is about giving up on the image pull after a few tries (because the specified image simply doesn’t exist).
When working with batch/v1 Jobs (and not at the individual Pod level, for example when building a custom operator), this issue causes the Job to stall forever.
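
A possible workaround, sketched below: poll the Job’s pods for containers whose waiting reason is ErrImagePull or ImagePullBackOff and delete the Job early instead of letting it back off forever. This is only a rough sketch in the same pre-context client-go style as the spec above; the clientset, namespace, job name, label selector and polling interval are all placeholders to be wired into the surrounding code.

package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// failFastOnImagePull polls the pods selected by labelSelector (for example a
// selector built from labelTenantJobID) and deletes the Job as soon as one of
// its containers is stuck pulling an image. It returns without error if the
// timeout elapses first.
func failFastOnImagePull(clientset kubernetes.Interface, namespace, jobName, labelSelector string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		pods, err := clientset.CoreV1().Pods(namespace).List(metav1.ListOptions{LabelSelector: labelSelector})
		if err != nil {
			return err
		}
		for _, pod := range pods.Items {
			for _, cs := range pod.Status.ContainerStatuses {
				w := cs.State.Waiting
				if w == nil {
					continue
				}
				if w.Reason == "ErrImagePull" || w.Reason == "ImagePullBackOff" {
					// The image is not going to appear, so delete the Job (and, via
					// foreground propagation, its pods) instead of waiting.
					policy := metav1.DeletePropagationForeground
					return clientset.BatchV1().Jobs(namespace).Delete(jobName, &metav1.DeleteOptions{PropagationPolicy: &policy})
				}
			}
		}
		time.Sleep(5 * time.Second)
	}
	return nil
}

// main is left empty; wiring up a real clientset is outside the scope of this sketch.
func main() {}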
