I am getting the following scheduling error on the hook-image-awaiter pod, which I believe the OKE engine creates: 1 node(s) had taint {nvidia.com/gpu: }, that the pod didn't tolerate.
I don't understand how I can write a Kubernetes configuration file for that pod if it is created by the Kubernetes engine.
Is there a way to add the nvidia.com/gpu toleration to the pod as a workaround? Alternatively, does anyone have a suggestion as to how to create the configuration file for it? Thanks.
Cluster information:
Kubernetes version: v1.21.5 (OKE-Created Cluster)
Cloud being used: Oracle Cloud Infrastructure, Kubernetes Cluster created using Oracle Kubernetes Engine
Installation method: Created using Cluster Quick-Create on OKE
Host OS: n/a, but can update if required
CNI and version: Unsure
CRI and version: Unsure
Kubernetes PodSpec YAML File
apiVersion: v1 # What version of the Kubernetes API to use
kind: Pod # What kind of object you want to create
metadata: # Data that helps uniquely identify the object, including a name string, UID and optional namespace
  name: nvidia-gpu-workload-v2
spec: # What state you desire for the object, differs for every type of Kubernetes object.
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: k8s.gcr.io/cuda-vector-add:v0.1
      resources:
        limits:
          nvidia.com/gpu: 1
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      effect: "NoSchedule"
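For reference, the toleration in the spec above could also be written with the Exists operator, which matches the nvidia.com/gpu taint whatever its value is. This is only a sketch of the tolerations portion of the Pod spec, and it has not been confirmed to behave any differently on this cluster:

```yaml
# Sketch only: goes under spec:, at the same level as containers:.
# With operator "Exists" no "value" field is given, so the toleration
# matches the nvidia.com/gpu taint regardless of the taint's value.
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
```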
ERRORS
$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                   READY   STATUS    RESTARTS   AGE
kube-system   coredns-68474555d4-rqss4               0/1     Pending   0          21h
kube-system   csi-oci-node-2tzpr                     1/1     Running   0          21h
kube-system   kube-dns-autoscaler-84cdfb8898-kkwnn   0/1     Pending   0          21h
kube-system   kube-flannel-ds-gnnr8                  1/1     Running   0          21h
kube-system   kube-proxy-b29fr                       1/1     Running   0          21h
kube-system   nvidia-gpu-device-plugin-j7h9b         1/1     Running   0          21h
kube-system   proxymux-client-c7mpp                  1/1     Running   0          21h
neuro-k1      hook-image-awaiter-5tq5c               0/1     Pending   0          21h
matt_choy@cloudshell:~ (ap-sydney-1)$ kubectl describe pod hook-image-awaiter-5tq5c -n neuro-k1
Name:           hook-image-awaiter-5tq5c
Namespace:      neuro-k1
Priority:       0
Node:           <none>
Labels:         app=jupyterhub
                component=image-puller
                controller-uid=f20a880b-c16c-479c-ab4f-5985ad3d5ac4
                job-name=hook-image-awaiter
                release=neuro-jupyterhub-3
Annotations:    <none>
Status:         Pending
IP:
IPs:            <none>
Controlled By:  Job/hook-image-awaiter
Containers:
  hook-image-awaiter:
    Image:      jupyterhub/k8s-image-awaiter:1.2.0
    Port:       <none>
    Host Port:  <none>
    Command:
      /image-awaiter
      -ca-path=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      -auth-token-path=/var/run/secrets/kubernetes.io/serviceaccount/token
      -api-server-address=https://kubernetes.default.svc:$(KUBERNETES_SERVICE_PORT)
      -namespace=neuro-k1
      -daemonset=hook-image-puller
      -pod-scheduling-wait-duration=10
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m7trb (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-m7trb:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     hub.jupyter.org/dedicated=core:NoSchedule
                 hub.jupyter.org_dedicated=core:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  21h   default-scheduler  0/1 nodes are available: 1 node(s) had taint {nvidia.com/gpu: }, that the pod didn't tolerate.
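Judging by the labels (app=jupyterhub, release=neuro-jupyterhub-3), the jupyterhub/k8s-image-awaiter image and the controlling Job/hook-image-awaiter, the pending pod appears to come from the JupyterHub Helm chart's image pre-puller hook, so the toleration would probably have to be supplied through the chart's values rather than through a hand-written Pod file. A minimal sketch, assuming the installed chart version exposes prePuller.hook.tolerations and prePuller.extraTolerations (worth checking against the chart's configuration reference); the file name config.yaml is hypothetical:

```yaml
# config.yaml (hypothetical file name): Helm values for the JupyterHub release.
# Assumption: these keys exist in the installed chart version.
prePuller:
  hook:
    # Intended for the hook-image-awaiter Job pod
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
  # Intended for the image-puller DaemonSet pods
  extraTolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
```

A values file like this would be passed to helm upgrade with the --values flag for the existing release in the neuro-k1 namespace. Other pods from the chart may need equivalent tolerations if they also have to run on the tainted GPU node.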