Enabling NVIDIA GPU in MicroK8s fails with NVIDIA driver 550

Hi All,

I have a 2-node MicroK8s cluster where the GPU server is running version 550 of the NVIDIA driver.
When I enable the nvidia addon in MicroK8s, it fails as shown in the logs below.

I did find a similar issue reported on the NVIDIA side, but I'm not sure how to fix it for MicroK8s.
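For reference, this is roughly how I check the host driver and enable the add-on (assuming the default add-on name; the exact invocation may differ on your setup):

# on the GPU node, confirm the host driver version (expected: 550.x)
$ nvidia-smi --query-gpu=driver_version --format=csv,noheader
# then enable the add-on from a cluster node
$ microk8s enable nvidia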

$ kubectl get pods -A
NAMESPACE                NAME                                                         READY   STATUS                  RESTARTS      AGE
gpu-operator-resources   gpu-feature-discovery-zk42r                                  0/1     Init:0/1                0             28m
gpu-operator-resources   gpu-operator-999cc8dcc-dsxtq                                 1/1     Running                 0             29m
gpu-operator-resources   gpu-operator-node-feature-discovery-gc-7cc7ccfff8-t25hj      1/1     Running                 0             29m
gpu-operator-resources   gpu-operator-node-feature-discovery-master-d8597d549-pqv4g   1/1     Running                 0             29m
gpu-operator-resources   gpu-operator-node-feature-discovery-worker-6rp8g             1/1     Running                 0             29m
gpu-operator-resources   gpu-operator-node-feature-discovery-worker-zhswg             1/1     Running                 0             29m
gpu-operator-resources   nvidia-container-toolkit-daemonset-vpbf4                     0/1     Init:CrashLoopBackOff   7 (49s ago)   11m
gpu-operator-resources   nvidia-dcgm-exporter-29l5w                                   0/1     Init:0/1                0             28m
gpu-operator-resources   nvidia-device-plugin-daemonset-zgqzn                         0/1     Init:0/1                0             28m
gpu-operator-resources   nvidia-operator-validator-rzm56                              0/1     Init:0/4                0             28m
ingress                  nginx-ingress-microk8s-controller-69wld                      1/1     Running                 0             33m
ingress                  nginx-ingress-microk8s-controller-nm4kf                      1/1     Running                 1 (31m ago)   35m
kube-system              calico-kube-controllers-77bd7c5b-wwkrd                       1/1     Running                 1 (31m ago)   36m
kube-system              calico-node-l8xvg                                            1/1     Running                 1 (31m ago)   33m
kube-system              calico-node-s5pzr                                            1/1     Running                 1 (31m ago)   33m
kube-system              coredns-864597b5fd-8p5sc                                     1/1     Running                 1 (31m ago)   36m
kube-system              dashboard-metrics-scraper-5657497c4c-c54dc                   1/1     Running                 1 (31m ago)   34m
kube-system              hostpath-provisioner-756cd956bc-m7g5m                        1/1     Running                 2 (31m ago)   35m
kube-system              kubernetes-dashboard-54b48fbf9-tt8tc                         1/1     Running                 1 (31m ago)   34m
kube-system              metrics-server-848968bdcd-rnkwf                              1/1     Running                 1 (31m ago)   34m
metallb-system           controller-5f7bb57799-xdh6f                                  1/1     Running                 1 (31m ago)   35m
metallb-system           speaker-7rhwt                                                1/1     Running                 0             33m
metallb-system           speaker-v7qlf                                                1/1     Running                 1 (31m ago)   35m


$ microk8s.kubectl describe pod nvidia-container-toolkit-daemonset-vpbf4 -n gpu-operator-resources 
Name:                 nvidia-container-toolkit-daemonset-vpbf4
Namespace:            gpu-operator-resources
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-container-toolkit
Node:                 gpu01/10.30.1.116
Start Time:           Wed, 11 Dec 2024 18:40:28 -0800
Labels:               app=nvidia-container-toolkit-daemonset
                      app.kubernetes.io/managed-by=gpu-operator
                      controller-revision-hash=78c9c56f56
                      helm.sh/chart=gpu-operator-v23.9.1
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: db563df4328d7f9eee358647bf88d6b3437bd00a4116d57c6f230755b6986b53
                      cni.projectcalico.org/podIP: 10.1.69.135/32
                      cni.projectcalico.org/podIPs: 10.1.69.135/32
Status:               Pending
IP:                   10.1.69.135
IPs:
  IP:           10.1.69.135
Controlled By:  DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
  driver-validation:
    Container ID:  containerd://b0785b1c365451c16e3c8149221e79ea2d4b6827ebbfc263825d50cc33a5e274
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:549ec806717ecd832a1dd219d3cb671024d005df0cfd54269441d21a0083ee51
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 11 Dec 2024 18:42:10 -0800
      Finished:     Wed, 11 Dec 2024 18:42:10 -0800
    Ready:          False
    Restart Count:  4
    Environment:
      WITH_WAIT:  true
      COMPONENT:  driver
    Mounts:
      /host from host-root (ro)
      /host-dev-char from host-dev-char (rw)
      /run/nvidia/driver from driver-install-path (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kzrnl (ro)
Containers:
  nvidia-container-toolkit-ctr:
    Container ID:  
    Image:         nvcr.io/nvidia/k8s/container-toolkit:v1.14.3-ubuntu20.04
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
    Args:
      /bin/entrypoint.sh
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      ROOT:                                             /usr/local/nvidia
      RUNTIME_ARGS:                                     
      NVIDIA_CONTAINER_RUNTIME_MODES_CDI_DEFAULT_KIND:  management.nvidia.com/gpu
      NVIDIA_VISIBLE_DEVICES:                           void
      CONTAINERD_CONFIG:                                /runtime/config-dir/containerd-template.toml
      CONTAINERD_SOCKET:                                /runtime/sock-dir/containerd.sock
      CONTAINERD_SET_AS_DEFAULT:                        1
      RUNTIME:                                          containerd
      CONTAINERD_RUNTIME_CLASS:                         nvidia
    Mounts:
      /bin/entrypoint.sh from nvidia-container-toolkit-entrypoint (ro,path="entrypoint.sh")
      /host from host-root (ro)
      /run/nvidia from nvidia-run-path (rw)
      /runtime/config-dir/ from containerd-config (rw)
      /runtime/sock-dir/ from containerd-socket (rw)
      /usr/local/nvidia from toolkit-install-dir (rw)
      /usr/share/containers/oci/hooks.d from crio-hooks (rw)
      /var/run/cdi from cdi-root (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kzrnl (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 False 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  nvidia-container-toolkit-entrypoint:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      nvidia-container-toolkit-entrypoint
    Optional:  false
  nvidia-run-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:  
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  toolkit-install-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/nvidia
    HostPathType:  
  crio-hooks:
    Type:          HostPath (bare host directory volume)
    Path:          /run/containers/oci/hooks.d
    HostPathType:  
  host-dev-char:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/char
    HostPathType:  
  cdi-root:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/cdi
    HostPathType:  DirectoryOrCreate
  containerd-config:
    Type:          HostPath (bare host directory volume)
    Path:          /var/snap/microk8s/current/args
    HostPathType:  DirectoryOrCreate
  containerd-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /var/snap/microk8s/common/run
    HostPathType:  
  kube-api-access-kzrnl:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.container-toolkit=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  2m56s                 default-scheduler  Successfully assigned gpu-operator-resources/nvidia-container-toolkit-daemonset-vpbf4 to gpu01
  Normal   Pulled     75s (x5 over 2m55s)   kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1" already present on machine
  Normal   Created    75s (x5 over 2m55s)   kubelet            Created container driver-validation
  Normal   Started    75s (x5 over 2m55s)   kubelet            Started container driver-validation
  Warning  BackOff    44s (x10 over 2m48s)  kubelet            Back-off restarting failed container driver-validation in pod nvidia-container-toolkit-daemonset-vpbf4_gpu-operator-resources(f321456f-85f5-4a51-86da-afedb0318dd4)

Is there anything interesting in the logs of the crashlooping pods? Can you share them?
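For anyone hitting this, the failing init container's logs can be pulled with something like the following (container name taken from the describe output above):

$ microk8s.kubectl logs -n gpu-operator-resources nvidia-container-toolkit-daemonset-vpbf4 -c driver-validation --previous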

I believe I faced the same problem today. The issue happens because the validator tries to create symlinks in /dev/char for the NVIDIA devices, and that fails due to an invalid device node. You can skip that step and work around the issue by applying the following patch:

$ microk8s.kubectl patch clusterpolicy cluster-policy \
    --type=json \
    -p='[{"op": "add", "path": "/spec/validator/driver/env", "value": [{"name": "DISABLE_DEV_CHAR_SYMLINK_CREATION", "value": "true"}]}]'
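If it helps, you can confirm that the environment variable landed in the ClusterPolicy with something like this (the jsonpath simply mirrors the patch path above):

$ microk8s.kubectl get clusterpolicy cluster-policy -o jsonpath='{.spec.validator.driver.env}'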

After that I disabled and re-enabled the add-on, and this time the operator deployed OK.
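For the record, the add-on cycle I mean is roughly this (add-on name as used when it was originally enabled; adjust if yours differs):

$ microk8s disable nvidia
$ microk8s enable nvidia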