I am running a batch/v1 Job whose pod references an invalid image repository and/or tag. The pod spends a significant amount of time in the PodScheduled=true, Ready=false state, repeatedly trying to pull the image with exponential back-off:
foobars-MacBook-Pro:~ foobar$ kubectl describe pod
Type     Reason     Age                   From                                                 Message
----     ------     ----                  ----                                                 -------
Normal   Scheduled  12m                   default-scheduler                                    Successfully assigned default/da10566d0b04462ebebcbbe6b964274c-85jdg to ory-cloud-tests-integration-control-plane
Normal   Pulling    11m (x4 over 12m)     kubelet, ory-cloud-tests-integration-control-plane   Pulling image "this-image-does-not-exist82f21188-71fa-4179-aced-f26716a1a129:testing"
Warning  Failed     11m (x4 over 12m)     kubelet, ory-cloud-tests-integration-control-plane   Failed to pull image "this-image-does-not-exist82f21188-71fa-4179-aced-f26716a1a129:testing": rpc error: code = Unknown desc = failed to resolve image "docker.io/library/this-image-does-not-exist82f21188-71fa-4179-aced-f26716a1a129:testing": no available registry endpoint: pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed
Warning  Failed     11m (x4 over 12m)     kubelet, ory-cloud-tests-integration-control-plane   Error: ErrImagePull
Warning  Failed     10m (x6 over 12m)     kubelet, ory-cloud-tests-integration-control-plane   Error: ImagePullBackOff
Normal   BackOff    2m31s (x41 over 12m)  kubelet, ory-cloud-tests-integration-control-plane   Back-off pulling image "this-image-does-not-exist82f21188-71fa-4179-aced-f26716a1a129:testing"
I would like to limit the number of pull retries and/or the amount of time the pod can remain in this state, because I want my Job to fail fast when such an error occurs.
Is that possible?
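The closest knob I have found so far is ActiveDeadlineSeconds on the JobSpec itself, which as far as I understand caps the Job's total wall-clock time, including time a pod spends in ImagePullBackOff; once exceeded, the pods are terminated and the Job is marked failed with reason DeadlineExceeded. A minimal sketch (the 600 seconds is just an illustration, not my real value):

	// Sketch only: cap the whole Job at 10 minutes, after which its pods are
	// terminated and the Job is failed with reason DeadlineExceeded.
	spec := batchv1.JobSpec{
		Parallelism:           pointerx.Int32(1),
		Completions:           pointerx.Int32(1),
		ActiveDeadlineSeconds: pointerx.Int64(600),
	}

But that punishes slow-but-healthy runs the same way as broken images, so I would rather limit the pull retries directly.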
My JobSpec is created with the Go client (client-go) and looks like this:
if _, err := t.d.KubernetesClient().BatchV1().Jobs(job.Namespace).Create(&batchv1.Job{
	ObjectMeta: metav1.ObjectMeta{
		Name:      job.ID,
		Namespace: job.Namespace,
		Labels: map[string]string{
			kindName:         "tenant_runner",
			labelTenantID:    tenant.ID,
			labelTenantSlug:  tenant.Slug,
			labelTenantJobID: job.ID,
		},
	},
	Spec: batchv1.JobSpec{
		// No parallelism!
		Parallelism: pointerx.Int32(1),
		// Run the job to completion exactly once.
		Completions: pointerx.Int32(1),
		// Retry running the job at most t.JobBackoffLimit (5) times.
		BackoffLimit: pointerx.Int32(t.JobBackoffLimit),
		// Clean up the job one day after it has finished.
		TTLSecondsAfterFinished: pointerx.Int32(int32((time.Hour * 24).Seconds())),
		Template: v1.PodTemplateSpec{
			Spec: v1.PodSpec{
				// This needs to be Never because the job controller will restart the pod!
				RestartPolicy:         v1.RestartPolicyNever,
				ActiveDeadlineSeconds: pointerx.Int64(int64(t.podExecutionTimeout.Seconds())),
				ServiceAccountName:    t.c.TenantExecutorServiceAccount(),
				Containers: []v1.Container{
					{
						// ImagePullPolicy must be PullIfNotPresent or kind will fail the
						// integration tests, because it would try to fetch the image from
						// the remote registry.
						// See: https://kind.sigs.k8s.io/docs/user/quick-start/#loading-an-image-into-your-cluster
						//
						// It's also very important to use a predictable tag here (e.g. v1.0).
						ImagePullPolicy: v1.PullIfNotPresent,
						Name:            job.ID,
						Image:           t.c.TenantExecutorImage(),
						Command:         []string{"backoffice"},
						Args:            args,
						Env: []v1.EnvVar{
							{Name: "LOG_LEVEL", Value: "trace"},
							{Name: TenantJSONEnv, Value: encodedTenant},
						},
					},
				},
			},
		},
	},
}); err != nil {
	// The surrounding function returns the creation error to the caller.
	return err
}
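If there is no built-in setting, my fallback idea is to poll the Job's pods and delete the Job as soon as a container is stuck waiting on the image pull. A rough, untested sketch against the same (pre-context) client-go API; the clientset, namespace, and jobName parameters are placeholders for my setup:

	package jobwatch

	import (
		metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
		"k8s.io/client-go/kubernetes"
	)

	// failJobFastOnImagePullError deletes the Job as soon as one of its pods
	// is stuck waiting on an image pull, instead of letting the kubelet back
	// off for minutes. clientset, namespace and jobName are placeholders.
	func failJobFastOnImagePullError(clientset kubernetes.Interface, namespace, jobName string) error {
		// The Job controller labels the pods it creates with job-name=<job name>.
		pods, err := clientset.CoreV1().Pods(namespace).List(metav1.ListOptions{
			LabelSelector: "job-name=" + jobName,
		})
		if err != nil {
			return err
		}
		for _, pod := range pods.Items {
			for _, cs := range pod.Status.ContainerStatuses {
				if w := cs.State.Waiting; w != nil &&
					(w.Reason == "ErrImagePull" || w.Reason == "ImagePullBackOff") {
					// Foreground deletion also removes the Job's pods.
					policy := metav1.DeletePropagationForeground
					return clientset.BatchV1().Jobs(namespace).Delete(jobName, &metav1.DeleteOptions{
						PropagationPolicy: &policy,
					})
				}
			}
		}
		return nil
	}

I would much prefer a spec-level setting to running a poller next to every Job, though.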