Enabling NVIDIA GPU in MicroK8s fails with NVIDIA driver 550

Hi All,

I have a 2-node MicroK8s cluster where the GPU server is running version 550 of the NVIDIA driver.
When I enable the nvidia addon in MicroK8s, it fails as shown in the logs below.

I did find a similar issue reported on the NVIDIA side, but I'm not sure how to fix it for MicroK8s.
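For reference, this is roughly how I check the host driver and enable the add-on (assuming the default add-on name; the exact invocation may differ on your setup):

# on the GPU node, confirm the host driver version (expected: 550.x)
$ nvidia-smi --query-gpu=driver_version --format=csv,noheader
# then enable the add-on from a cluster node
$ microk8s enable nvidia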

$ kubectl get pods -A
NAMESPACE                NAME                                                         READY   STATUS                  RESTARTS      AGE
gpu-operator-resources   gpu-feature-discovery-zk42r                                  0/1     Init:0/1                0             28m
gpu-operator-resources   gpu-operator-999cc8dcc-dsxtq                                 1/1     Running                 0             29m
gpu-operator-resources   gpu-operator-node-feature-discovery-gc-7cc7ccfff8-t25hj      1/1     Running                 0             29m
gpu-operator-resources   gpu-operator-node-feature-discovery-master-d8597d549-pqv4g   1/1     Running                 0             29m
gpu-operator-resources   gpu-operator-node-feature-discovery-worker-6rp8g             1/1     Running                 0             29m
gpu-operator-resources   gpu-operator-node-feature-discovery-worker-zhswg             1/1     Running                 0             29m
gpu-operator-resources   nvidia-container-toolkit-daemonset-vpbf4                     0/1     Init:CrashLoopBackOff   7 (49s ago)   11m
gpu-operator-resources   nvidia-dcgm-exporter-29l5w                                   0/1     Init:0/1                0             28m
gpu-operator-resources   nvidia-device-plugin-daemonset-zgqzn                         0/1     Init:0/1                0             28m
gpu-operator-resources   nvidia-operator-validator-rzm56                              0/1     Init:0/4                0             28m
ingress                  nginx-ingress-microk8s-controller-69wld                      1/1     Running                 0             33m
ingress                  nginx-ingress-microk8s-controller-nm4kf                      1/1     Running                 1 (31m ago)   35m
kube-system              calico-kube-controllers-77bd7c5b-wwkrd                       1/1     Running                 1 (31m ago)   36m
kube-system              calico-node-l8xvg                                            1/1     Running                 1 (31m ago)   33m
kube-system              calico-node-s5pzr                                            1/1     Running                 1 (31m ago)   33m
kube-system              coredns-864597b5fd-8p5sc                                     1/1     Running                 1 (31m ago)   36m
kube-system              dashboard-metrics-scraper-5657497c4c-c54dc                   1/1     Running                 1 (31m ago)   34m
kube-system              hostpath-provisioner-756cd956bc-m7g5m                        1/1     Running                 2 (31m ago)   35m
kube-system              kubernetes-dashboard-54b48fbf9-tt8tc                         1/1     Running                 1 (31m ago)   34m
kube-system              metrics-server-848968bdcd-rnkwf                              1/1     Running                 1 (31m ago)   34m
metallb-system           controller-5f7bb57799-xdh6f                                  1/1     Running                 1 (31m ago)   35m
metallb-system           speaker-7rhwt                                                1/1     Running                 0             33m
metallb-system           speaker-v7qlf                                                1/1     Running                 1 (31m ago)   35m


$ microk8s.kubectl describe pod nvidia-container-toolkit-daemonset-vpbf4 -n gpu-operator-resources 
Name:                 nvidia-container-toolkit-daemonset-vpbf4
Namespace:            gpu-operator-resources
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-container-toolkit
Node:                 gpu01/10.30.1.116
Start Time:           Wed, 11 Dec 2024 18:40:28 -0800
Labels:               app=nvidia-container-toolkit-daemonset
                      app.kubernetes.io/managed-by=gpu-operator
                      controller-revision-hash=78c9c56f56
                      helm.sh/chart=gpu-operator-v23.9.1
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: db563df4328d7f9eee358647bf88d6b3437bd00a4116d57c6f230755b6986b53
                      cni.projectcalico.org/podIP: 10.1.69.135/32
                      cni.projectcalico.org/podIPs: 10.1.69.135/32
Status:               Pending
IP:                   10.1.69.135
IPs:
  IP:           10.1.69.135
Controlled By:  DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
  driver-validation:
    Container ID:  containerd://b0785b1c365451c16e3c8149221e79ea2d4b6827ebbfc263825d50cc33a5e274
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:549ec806717ecd832a1dd219d3cb671024d005df0cfd54269441d21a0083ee51
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 11 Dec 2024 18:42:10 -0800
      Finished:     Wed, 11 Dec 2024 18:42:10 -0800
    Ready:          False
    Restart Count:  4
    Environment:
      WITH_WAIT:  true
      COMPONENT:  driver
    Mounts:
      /host from host-root (ro)
      /host-dev-char from host-dev-char (rw)
      /run/nvidia/driver from driver-install-path (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kzrnl (ro)
Containers:
  nvidia-container-toolkit-ctr:
    Container ID:  
    Image:         nvcr.io/nvidia/k8s/container-toolkit:v1.14.3-ubuntu20.04
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
    Args:
      /bin/entrypoint.sh
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      ROOT:                                             /usr/local/nvidia
      RUNTIME_ARGS:                                     
      NVIDIA_CONTAINER_RUNTIME_MODES_CDI_DEFAULT_KIND:  management.nvidia.com/gpu
      NVIDIA_VISIBLE_DEVICES:                           void
      CONTAINERD_CONFIG:                                /runtime/config-dir/containerd-template.toml
      CONTAINERD_SOCKET:                                /runtime/sock-dir/containerd.sock
      CONTAINERD_SET_AS_DEFAULT:                        1
      RUNTIME:                                          containerd
      CONTAINERD_RUNTIME_CLASS:                         nvidia
    Mounts:
      /bin/entrypoint.sh from nvidia-container-toolkit-entrypoint (ro,path="entrypoint.sh")
      /host from host-root (ro)
      /run/nvidia from nvidia-run-path (rw)
      /runtime/config-dir/ from containerd-config (rw)
      /runtime/sock-dir/ from containerd-socket (rw)
      /usr/local/nvidia from toolkit-install-dir (rw)
      /usr/share/containers/oci/hooks.d from crio-hooks (rw)
      /var/run/cdi from cdi-root (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kzrnl (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 False 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  nvidia-container-toolkit-entrypoint:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      nvidia-container-toolkit-entrypoint
    Optional:  false
  nvidia-run-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:  
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  toolkit-install-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/nvidia
    HostPathType:  
  crio-hooks:
    Type:          HostPath (bare host directory volume)
    Path:          /run/containers/oci/hooks.d
    HostPathType:  
  host-dev-char:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/char
    HostPathType:  
  cdi-root:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/cdi
    HostPathType:  DirectoryOrCreate
  containerd-config:
    Type:          HostPath (bare host directory volume)
    Path:          /var/snap/microk8s/current/args
    HostPathType:  DirectoryOrCreate
  containerd-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /var/snap/microk8s/common/run
    HostPathType:  
  kube-api-access-kzrnl:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.container-toolkit=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  2m56s                 default-scheduler  Successfully assigned gpu-operator-resources/nvidia-container-toolkit-daemonset-vpbf4 to gpu01
  Normal   Pulled     75s (x5 over 2m55s)   kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1" already present on machine
  Normal   Created    75s (x5 over 2m55s)   kubelet            Created container driver-validation
  Normal   Started    75s (x5 over 2m55s)   kubelet            Started container driver-validation
  Warning  BackOff    44s (x10 over 2m48s)  kubelet            Back-off restarting failed container driver-validation in pod nvidia-container-toolkit-daemonset-vpbf4_gpu-operator-resources(f321456f-85f5-4a51-86da-afedb0318dd4)

Is there anything interesting in the logs of the crashlooping pods? Can you share them?
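For anyone hitting this, the failing init container's logs can be pulled with something like the following (container name taken from the describe output above):

$ microk8s.kubectl logs -n gpu-operator-resources nvidia-container-toolkit-daemonset-vpbf4 -c driver-validation --previous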

I believe I faced the same problem today. The issue happens because the validator tries to create symlinks in /dev/char for the NVIDIA devices, and that fails due to an invalid device node. You can skip that step and work around the issue by applying the following patch:

$ microk8s.kubectl patch clusterpolicy cluster-policy \
    --type=json \
    -p='[{"op": "add", "path": "/spec/validator/driver/env", "value": [{"name": "DISABLE_DEV_CHAR_SYMLINK_CREATION", "value": "true"}]}]'
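If it helps, you can confirm that the environment variable landed in the ClusterPolicy with something like this (the jsonpath simply mirrors the patch path above):

$ microk8s.kubectl get clusterpolicy cluster-policy -o jsonpath='{.spec.validator.driver.env}'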

After that I disabled and re-enabled the add-on, and this time the operator deployed OK.
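For the record, the add-on cycle I mean is roughly this (add-on name as used when it was originally enabled; adjust if yours differs):

$ microk8s disable nvidia
$ microk8s enable nvidia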