Failed Job/Pod/Container troubleshooting

Cluster information:

Kubernetes version: 1.11.5
Cloud being used: bare-metal
Installation method: kubeadm
Host OS: Ubuntu 18.04.1 LTS
CNI and version: flannel (version unknown, possibly 0.10.0)
CRI and version: docker 18.09.2

Hello All,

I’ve built a bare-metal K8S cluster on premises with three controller nodes (also running etcd) and five workers. Bear with me for asking novice questions; I’m relatively new to this.

Side note: I’ve reached out on the K8S Slack channel with this issue and have had no luck. Either I’m not using Slack well, or the application isn’t well suited to “forum” style discussions. I mostly see a lot of other asks for help scrolling by, and it seems like a roll of the dice whether anyone will see my post, and have knowledge of the issue, and take time to reply.

I have devs submitting Jobs to the cluster from an in-house workflow management stack, sometimes thousands at a time. Each Job launches one Pod with one container; most complete normally and are cleaned up by the stack. But an increasing number of Jobs now get stuck with the pod hung at “CreateContainerError.”

Specifically, the pod is complaining with:

    state:
      waiting:
        message: 'Error response from daemon: Conflict. The container name "jobname-podstring_namespace_k8sCreatedUID_0" is already in use by container "big_old_docker_UID".
          You have to remove (or rename) that container to be able to reuse that name.'
        reason: CreateContainerError

I’ve found lots of references online describing similar situations where docker users needed to delete or rename a container, restart docker, or even reboot the node, but I don’t know how that would apply in this situation: kubernetes is the one creating these containers, so I have no control over the container name, and it doesn’t seem realistic to have to intervene by hand on every worker node.

From kubectl get job -o yaml:

    apiVersion: batch/v1
    kind: Job
    metadata:
      creationTimestamp: 2019-05-01T19:07:55Z
      labels:
        application_instance: delphi-evaluator-dcostanz1
        queue_token: 428d08dc-58f3-40fb-ac23-2562ae8391ce
      name: delphi-evaluator-dcostanz1-job-144438
      namespace: xapps
      resourceVersion: "43194423"
      selfLink: /apis/batch/v1/namespaces/xapps/jobs/delphi-evaluator-dcostanz1-job-144438
      uid: 6cfa2af4-6c44-11e9-b73b-e2d6d3513984
    spec:
      activeDeadlineSeconds: 3600
      backoffLimit: 0
      completions: 1
      parallelism: 1
      selector:
        matchLabels:
          controller-uid: 6cfa2af4-6c44-11e9-b73b-e2d6d3513984
      template:
        metadata:
          creationTimestamp: null
          labels:
            application_instance: delphi-evaluator-dcostanz1
            controller-uid: 6cfa2af4-6c44-11e9-b73b-e2d6d3513984
            job-name: delphi-evaluator-dcostanz1-job-144438
            queue_token: 428d08dc-58f3-40fb-ac23-2562ae8391ce
          name: delphi-evaluator-dcostanz1-pod-144438
        spec:
          containers:
          - args:
            - -c
            - ./run.sh
            command:
            - /bin/sh
            image: dreg.scharp.org/xapps-delphi-transformation-type-configurable-r1
            imagePullPolicy: Always
            name: delphi-evaluator-dcostanz1-container-144438
            resources: {}
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /dataset-directory
              name: dataset-directory
            workingDir: /dataset-directory/
          dnsPolicy: ClusterFirst
          imagePullSecrets:
          - name: regcred
          restartPolicy: Never
          schedulerName: default-scheduler
          securityContext:
            fsGroup: 1000
            runAsUser: 1058
          terminationGracePeriodSeconds: 30
          volumes:
          - name: dataset-directory
            nfs:
              path: /scharp_delphi_evaluator/delphi_evaluator/dcostanz_1/428d08dc-58f3-40fb-ac23-2562ae8391ce
              server: scharpdata3.pc.scharp.org
    status:
      conditions:
      - lastProbeTime: 2019-05-01T19:12:08Z
        lastTransitionTime: 2019-05-01T19:12:08Z
        message: Job has reached the specified backoff limit
        reason: BackoffLimitExceeded
        status: "True"
        type: Failed
      failed: 1
      startTime: 2019-05-01T19:07:55Z

From kubectl get pod -o yaml:

    apiVersion: v1
    kind: Pod
    metadata:
      creationTimestamp: 2019-05-01T19:07:55Z
      generateName: delphi-evaluator-dcostanz1-job-144438-
      labels:
        application_instance: delphi-evaluator-dcostanz1
        controller-uid: 6cfa2af4-6c44-11e9-b73b-e2d6d3513984
        job-name: delphi-evaluator-dcostanz1-job-144438
        queue_token: 428d08dc-58f3-40fb-ac23-2562ae8391ce
      name: delphi-evaluator-dcostanz1-job-144438-tmmrg
      namespace: xapps
      ownerReferences:
      - apiVersion: batch/v1
        blockOwnerDeletion: true
        controller: true
        kind: Job
        name: delphi-evaluator-dcostanz1-job-144438
        uid: 6cfa2af4-6c44-11e9-b73b-e2d6d3513984
      resourceVersion: "43194421"
      selfLink: /api/v1/namespaces/xapps/pods/delphi-evaluator-dcostanz1-job-144438-tmmrg
      uid: 6cfae749-6c44-11e9-ab21-96ca041346e4
    spec:
      containers:
      - args:
        - -c
        - ./run.sh
        command:
        - /bin/sh
        image: dreg.scharp.org/xapps-delphi-transformation-type-configurable-r1
        imagePullPolicy: Always
        name: delphi-evaluator-dcostanz1-container-144438
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /dataset-directory
          name: dataset-directory
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: default-token-8jnkb
          readOnly: true
        workingDir: /dataset-directory/
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: regcred
      nodeName: kw-prod-e03
      priority: 0
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1000
        runAsUser: 1058
      serviceAccount: default
      serviceAccountName: default
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 300
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 300
      volumes:
      - name: dataset-directory
        nfs:
          path: /scharp_delphi_evaluator/delphi_evaluator/dcostanz_1/428d08dc-58f3-40fb-ac23-2562ae8391ce
          server: scharpdata3.pc.scharp.org
      - name: default-token-8jnkb
        secret:
          defaultMode: 420
          secretName: default-token-8jnkb
    status:
      conditions:
      - lastProbeTime: null
        lastTransitionTime: 2019-05-01T19:07:59Z
        status: "True"
        type: Initialized
      - lastProbeTime: null
        lastTransitionTime: 2019-05-01T19:07:59Z
        message: 'containers with unready status: [delphi-evaluator-dcostanz1-container-144438]'
        reason: ContainersNotReady
        status: "False"
        type: Ready
      - lastProbeTime: null
        lastTransitionTime: null
        message: 'containers with unready status: [delphi-evaluator-dcostanz1-container-144438]'
        reason: ContainersNotReady
        status: "False"
        type: ContainersReady
      - lastProbeTime: null
        lastTransitionTime: 2019-05-01T19:07:56Z
        status: "True"
        type: PodScheduled
      containerStatuses:
      - containerID: docker://371f7df847f3d2cd727399011a4cd9f90a474301c099979d63527dcb7d9eb52e
        image: dreg.scharp.org/xapps-delphi-transformation-type-configurable-r1:latest
        imageID: docker-pullable://dreg.scharp.org/xapps-delphi-transformation-type-configurable-r1@sha256:f521e98c550560f496e0cc21f5f7af3f5a62fc61dc9e29b33d750a29664abc24
        lastState:
          terminated:
            containerID: docker://371f7df847f3d2cd727399011a4cd9f90a474301c099979d63527dcb7d9eb52e
            exitCode: 0
            finishedAt: null
            startedAt: null
        name: delphi-evaluator-dcostanz1-container-144438
        ready: false
        restartCount: 0
        state:
          waiting:
            message: 'Error response from daemon: Conflict. The container name "/k8s_delphi-evaluator-dcostanz1-container-144438_delphi-evaluator-dcostanz1-job-144438-tmmrg_xapps_6cfae749-6c44-11e9-ab21-96ca041346e4_0"
              is already in use by container "371f7df847f3d2cd727399011a4cd9f90a474301c099979d63527dcb7d9eb52e".
              You have to remove (or rename) that container to be able to reuse that name.'
            reason: CreateContainerError
      hostIP: 140.107.117.52
      phase: Failed
      podIP: 10.244.5.228
      qosClass: BestEffort
      startTime: 2019-05-01T19:07:59Z

Thank you in advance for any help!

(note that I did use the ctrl-shift-C tip in the instructions to try to format my yaml but it doesn’t seem to have had any effect)

This doesn’t answer your question, but when the ctrl+shift+C thing doesn’t work you can do it manually by adding backticks for code blocks:

```
apiVersion: batch/v1
kind: Job
metadata: 
  creationTimestamp: 2019-05-01T19:07:55Z
  labels: 
```

and it’ll preserve your formatting.

Thanks! It didn’t make any difference initially but did after I removed the blocks that were hiding my cluster info. Interesting mash-up of options: HTML, Markdown, and I don’t know what else. It’s like trying to read someone else’s Perl code.

Unfortunately I don’t have much for you. The only thing I can find is that the error originates in the docker daemon itself and is essentially passed through unchanged. It might be resource starvation on the node at the time, surfacing as this error? If you put resource requests and limits on the pods, the scheduler should be able to place them better.
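
As a rough sketch of that suggestion: requests and limits go under the container entry in the Job’s pod template, where the posted spec currently has `resources: {}` (which is why the pod reports `qosClass: BestEffort`). The CPU and memory numbers below are placeholders I made up, not values measured for this workload, so treat them as assumptions to tune.

```yaml
# Fragment of the Job's spec.template.spec; only the resources block is new.
containers:
- name: delphi-evaluator-dcostanz1-container-144438
  image: dreg.scharp.org/xapps-delphi-transformation-type-configurable-r1
  resources:
    requests:        # what the scheduler reserves when choosing a node
      cpu: 500m      # placeholder value
      memory: 512Mi  # placeholder value
    limits:          # ceiling enforced on the node at runtime
      cpu: "1"       # placeholder value
      memory: 1Gi    # placeholder value
```

With requests set, the scheduler stops packing an unbounded number of these pods onto one worker. Whether that actually makes the docker name conflict go away is a guess on my part.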

Hi,
It may be too late to help here, but based on the error you are receiving, it seems like the Pods spun up by the Job are still lingering around. You may have to clean them up after they run. The Jobs page in the Kubernetes documentation has more details on having this done automatically.
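
The automatic cleanup is presumably the `ttlSecondsAfterFinished` field on the Job spec. Here is a minimal sketch reusing the names from the original post. Note the assumptions: the field first appeared as an alpha feature in Kubernetes 1.12 (behind the `TTLAfterFinished` feature gate), so it would not be available on the 1.11.5 cluster described above without an upgrade, and the 600-second value is just an example.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: delphi-evaluator-dcostanz1-job-144438
  namespace: xapps
spec:
  # Assumed value: delete the Job and its pods 10 minutes after it finishes
  # (whether it completed or failed). Not available on Kubernetes 1.11.
  ttlSecondsAfterFinished: 600
  backoffLimit: 0
  activeDeadlineSeconds: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: delphi-evaluator-dcostanz1-container-144438
        image: dreg.scharp.org/xapps-delphi-transformation-type-configurable-r1
        command: ["/bin/sh"]
        args: ["-c", "./run.sh"]
```

On a cluster without this field, the workflow stack would have to keep deleting finished Jobs itself, as it already does for the ones that complete normally.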

I hope this helps someone. Cheers!