FailedScheduling GPU node as taint is not tolerated

MattChoy · February 15, 2022, 7:21am

I'm getting the error "1 node(s) had taint {nvidia.com/gpu: }, that the pod didn't tolerate." on the hook-image-awaiter pod, which I believe the OKE engine creates. I don't understand how I can write a Kubernetes configuration file for that pod if it's created by the Kubernetes engine.

Is there a way to add the nvidia.com/gpu toleration to the pod as a workaround? Alternatively, does anyone have a suggestion as to how to create the kubeconfig file? Thanks.

Cluster information:

Kubernetes version: v1.21.5 (OKE-Created Cluster)
Cloud being used: Oracle Cloud Infrastructure, Kubernetes Cluster created using Oracle Kubernetes Engine
Installation method: Created using Cluster Quick-Create on OKE
Host OS: n/a, but can update if required
CNI and version: Unsure
CRI and version: Unsure

Kubernetes PodSpec YAML File

apiVersion: v1 # What version of the Kubernetes API to use
kind: Pod      # What kind of object you want to create
metadata:      # Data that helps uniquely identify the object, including a name string, UID, and optional namespace
  name: nvidia-gpu-workload-v2
spec:          # What state you desire for the object, differs for every type of Kubernetes object.
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: k8s.gcr.io/cuda-vector-add:v0.1
      resources:
        limits:
          nvidia.com/gpu: 1
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    effect: "NoSchedule"

ERRORS

$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                   READY   STATUS    RESTARTS   AGE
kube-system   coredns-68474555d4-rqss4               0/1     Pending   0          21h
kube-system   csi-oci-node-2tzpr                     1/1     Running   0          21h
kube-system   kube-dns-autoscaler-84cdfb8898-kkwnn   0/1     Pending   0          21h
kube-system   kube-flannel-ds-gnnr8                  1/1     Running   0          21h
kube-system   kube-proxy-b29fr                       1/1     Running   0          21h
kube-system   nvidia-gpu-device-plugin-j7h9b         1/1     Running   0          21h
kube-system   proxymux-client-c7mpp                  1/1     Running   0          21h
neuro-k1      hook-image-awaiter-5tq5c               0/1     Pending   0          21h
matt_choy@cloudshell:~ (ap-sydney-1)$ kubectl describe pod hook-image-awaiter-5tq5c -n neuro-k1
Name:           hook-image-awaiter-5tq5c
Namespace:      neuro-k1
Priority:       0
Node:           <none>
Labels:         app=jupyterhub
                component=image-puller
                controller-uid=f20a880b-c16c-479c-ab4f-5985ad3d5ac4
                job-name=hook-image-awaiter
                release=neuro-jupyterhub-3
Annotations:    <none>
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  Job/hook-image-awaiter
Containers:
  hook-image-awaiter:
    Image:      jupyterhub/k8s-image-awaiter:1.2.0
    Port:       <none>
    Host Port:  <none>
    Command:
      /image-awaiter
      -ca-path=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      -auth-token-path=/var/run/secrets/kubernetes.io/serviceaccount/token
      -api-server-address=https://kubernetes.default.svc:$(KUBERNETES_SERVICE_PORT)
      -namespace=neuro-k1
      -daemonset=hook-image-puller
      -pod-scheduling-wait-duration=10
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m7trb (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  kube-api-access-m7trb:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 hub.jupyter.org/dedicated=core:NoSchedule
                             hub.jupyter.org_dedicated=core:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  21h   default-scheduler  0/1 nodes are available: 1 node(s) had taint {nvidia.com/gpu: }, that the pod didn't tolerate.
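
To make the event above easier to read, here is a minimal Python sketch of the taint/toleration matching rule as I understand it from the Kubernetes documentation. It is not the scheduler's actual code. The Taint and Toleration classes and the tolerates function are hypothetical names made up for illustration, and the taint's effect is assumed to be NoSchedule because the event message only shows the key and an empty value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key with operator Exists matches every key
    operator: str = "Equal"  # "Equal" is the default operator
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if this single toleration matches the taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    if tol.operator == "Equal":
        return tol.value == taint.value
    return False


# Taint from the FailedScheduling event; the effect is an assumption.
gpu_taint = Taint(key="nvidia.com/gpu", value="", effect="NoSchedule")

# Tolerations listed by `kubectl describe pod hook-image-awaiter-5tq5c`.
awaiter_tolerations = [
    Toleration("hub.jupyter.org/dedicated", "Equal", "core", "NoSchedule"),
    Toleration("hub.jupyter.org_dedicated", "Equal", "core", "NoSchedule"),
    Toleration("node.kubernetes.io/not-ready", "Exists", "", "NoExecute"),
    Toleration("node.kubernetes.io/unreachable", "Exists", "", "NoExecute"),
]

# Toleration from the nvidia-gpu-workload-v2 PodSpec (Equal, no value).
workload_toleration = Toleration("nvidia.com/gpu", "Equal", "", "NoSchedule")

awaiter_ok = any(tolerates(t, gpu_taint) for t in awaiter_tolerations)
workload_ok = tolerates(workload_toleration, gpu_taint)

print("hook-image-awaiter tolerates GPU taint:", awaiter_ok)
print("nvidia-gpu-workload-v2 tolerates GPU taint:", workload_ok)
```

Under these assumptions, none of the hook-image-awaiter tolerations has the key nvidia.com/gpu, which is consistent with the scheduler rejecting the only node, while the toleration in my own PodSpec would match a taint with an empty value.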
