Hi all,
I have a 2-node MicroK8s cluster where the GPU server is running version 550 of the NVIDIA driver.
When I enable the nvidia addon in MicroK8s, it fails as shown in the logs below.
I found a similar issue reported against the NVIDIA GPU Operator, but I am not sure how to apply that fix to MicroK8s.
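For reference, the driver itself can be sanity-checked on the GPU node with standard host-side commands (nothing MicroK8s-specific; the 550.x version expectation comes from my setup above):

```shell
# Confirm the NVIDIA kernel modules are loaded on the GPU node
lsmod | grep nvidia

# Confirm the userspace driver responds and report its version (should show 550.x)
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```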
$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
gpu-operator-resources gpu-feature-discovery-zk42r 0/1 Init:0/1 0 28m
gpu-operator-resources gpu-operator-999cc8dcc-dsxtq 1/1 Running 0 29m
gpu-operator-resources gpu-operator-node-feature-discovery-gc-7cc7ccfff8-t25hj 1/1 Running 0 29m
gpu-operator-resources gpu-operator-node-feature-discovery-master-d8597d549-pqv4g 1/1 Running 0 29m
gpu-operator-resources gpu-operator-node-feature-discovery-worker-6rp8g 1/1 Running 0 29m
gpu-operator-resources gpu-operator-node-feature-discovery-worker-zhswg 1/1 Running 0 29m
gpu-operator-resources nvidia-container-toolkit-daemonset-vpbf4 0/1 Init:CrashLoopBackOff 7 (49s ago) 11m
gpu-operator-resources nvidia-dcgm-exporter-29l5w 0/1 Init:0/1 0 28m
gpu-operator-resources nvidia-device-plugin-daemonset-zgqzn 0/1 Init:0/1 0 28m
gpu-operator-resources nvidia-operator-validator-rzm56 0/1 Init:0/4 0 28m
ingress nginx-ingress-microk8s-controller-69wld 1/1 Running 0 33m
ingress nginx-ingress-microk8s-controller-nm4kf 1/1 Running 1 (31m ago) 35m
kube-system calico-kube-controllers-77bd7c5b-wwkrd 1/1 Running 1 (31m ago) 36m
kube-system calico-node-l8xvg 1/1 Running 1 (31m ago) 33m
kube-system calico-node-s5pzr 1/1 Running 1 (31m ago) 33m
kube-system coredns-864597b5fd-8p5sc 1/1 Running 1 (31m ago) 36m
kube-system dashboard-metrics-scraper-5657497c4c-c54dc 1/1 Running 1 (31m ago) 34m
kube-system hostpath-provisioner-756cd956bc-m7g5m 1/1 Running 2 (31m ago) 35m
kube-system kubernetes-dashboard-54b48fbf9-tt8tc 1/1 Running 1 (31m ago) 34m
kube-system metrics-server-848968bdcd-rnkwf 1/1 Running 1 (31m ago) 34m
metallb-system controller-5f7bb57799-xdh6f 1/1 Running 1 (31m ago) 35m
metallb-system speaker-7rhwt 1/1 Running 0 33m
metallb-system speaker-v7qlf 1/1 Running 1 (31m ago) 35m
$ microk8s.kubectl describe pod nvidia-container-toolkit-daemonset-vpbf4 -n gpu-operator-resources
Name: nvidia-container-toolkit-daemonset-vpbf4
Namespace: gpu-operator-resources
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: nvidia-container-toolkit
Node: gpu01/10.30.1.116
Start Time: Wed, 11 Dec 2024 18:40:28 -0800
Labels: app=nvidia-container-toolkit-daemonset
app.kubernetes.io/managed-by=gpu-operator
controller-revision-hash=78c9c56f56
helm.sh/chart=gpu-operator-v23.9.1
pod-template-generation=1
Annotations: cni.projectcalico.org/containerID: db563df4328d7f9eee358647bf88d6b3437bd00a4116d57c6f230755b6986b53
cni.projectcalico.org/podIP: 10.1.69.135/32
cni.projectcalico.org/podIPs: 10.1.69.135/32
Status: Pending
IP: 10.1.69.135
IPs:
IP: 10.1.69.135
Controlled By: DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
driver-validation:
Container ID: containerd://b0785b1c365451c16e3c8149221e79ea2d4b6827ebbfc263825d50cc33a5e274
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:549ec806717ecd832a1dd219d3cb671024d005df0cfd54269441d21a0083ee51
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Wed, 11 Dec 2024 18:42:10 -0800
Finished: Wed, 11 Dec 2024 18:42:10 -0800
Ready: False
Restart Count: 4
Environment:
WITH_WAIT: true
COMPONENT: driver
Mounts:
/host from host-root (ro)
/host-dev-char from host-dev-char (rw)
/run/nvidia/driver from driver-install-path (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kzrnl (ro)
Containers:
nvidia-container-toolkit-ctr:
Container ID:
Image: nvcr.io/nvidia/k8s/container-toolkit:v1.14.3-ubuntu20.04
Image ID:
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
Args:
/bin/entrypoint.sh
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
ROOT: /usr/local/nvidia
RUNTIME_ARGS:
NVIDIA_CONTAINER_RUNTIME_MODES_CDI_DEFAULT_KIND: management.nvidia.com/gpu
NVIDIA_VISIBLE_DEVICES: void
CONTAINERD_CONFIG: /runtime/config-dir/containerd-template.toml
CONTAINERD_SOCKET: /runtime/sock-dir/containerd.sock
CONTAINERD_SET_AS_DEFAULT: 1
RUNTIME: containerd
CONTAINERD_RUNTIME_CLASS: nvidia
Mounts:
/bin/entrypoint.sh from nvidia-container-toolkit-entrypoint (ro,path="entrypoint.sh")
/host from host-root (ro)
/run/nvidia from nvidia-run-path (rw)
/runtime/config-dir/ from containerd-config (rw)
/runtime/sock-dir/ from containerd-socket (rw)
/usr/local/nvidia from toolkit-install-dir (rw)
/usr/share/containers/oci/hooks.d from crio-hooks (rw)
/var/run/cdi from cdi-root (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kzrnl (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
nvidia-container-toolkit-entrypoint:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: nvidia-container-toolkit-entrypoint
Optional: false
nvidia-run-path:
Type: HostPath (bare host directory volume)
Path: /run/nvidia
HostPathType: DirectoryOrCreate
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
driver-install-path:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/driver
HostPathType:
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
toolkit-install-dir:
Type: HostPath (bare host directory volume)
Path: /usr/local/nvidia
HostPathType:
crio-hooks:
Type: HostPath (bare host directory volume)
Path: /run/containers/oci/hooks.d
HostPathType:
host-dev-char:
Type: HostPath (bare host directory volume)
Path: /dev/char
HostPathType:
cdi-root:
Type: HostPath (bare host directory volume)
Path: /var/run/cdi
HostPathType: DirectoryOrCreate
containerd-config:
Type: HostPath (bare host directory volume)
Path: /var/snap/microk8s/current/args
HostPathType: DirectoryOrCreate
containerd-socket:
Type: HostPath (bare host directory volume)
Path: /var/snap/microk8s/common/run
HostPathType:
kube-api-access-kzrnl:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.container-toolkit=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m56s default-scheduler Successfully assigned gpu-operator-resources/nvidia-container-toolkit-daemonset-vpbf4 to gpu01
Normal Pulled 75s (x5 over 2m55s) kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1" already present on machine
Normal Created 75s (x5 over 2m55s) kubelet Created container driver-validation
Normal Started 75s (x5 over 2m55s) kubelet Started container driver-validation
Warning BackOff 44s (x10 over 2m48s) kubelet Back-off restarting failed container driver-validation in pod nvidia-container-toolkit-daemonset-vpbf4_gpu-operator-resources(f321456f-85f5-4a51-86da-afedb0318dd4)
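To dig further into why the driver-validation step exits with code 1, its logs can be pulled directly (pod and container names taken from the describe output above):

```shell
# Logs from the crashing driver-validation init container
microk8s.kubectl logs -n gpu-operator-resources \
  nvidia-container-toolkit-daemonset-vpbf4 -c driver-validation

# Logs from the previous (terminated) attempt, in case the
# current restart has not produced output yet
microk8s.kubectl logs -n gpu-operator-resources \
  nvidia-container-toolkit-daemonset-vpbf4 -c driver-validation --previous
```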