MicroK8s 1.31 + Ubuntu 24.04.1 + Nvidia 4060 Ti 16GB

Hi,

I recently reinstalled my home-lab test Kubernetes server from scratch, partly to confirm that a fresh install of the OS and MicroK8s with my previous installation steps still works as expected, and partly just to go through the process once more (the main reason was to move to newer versions of the OS and MicroK8s).
Prior to this, my test MicroK8s cluster (on Ubuntu 22.04) with the MicroK8s Nvidia add-on enabled was working fine and I had no issues; a few deployments used the Nvidia GPU in my server, such as Ollama/OpenWeb-UI and DCGM.

Now I have tried the same steps, but I cannot get the MicroK8s Nvidia add-on to work.
I am currently on Ubuntu 24.04.1 and MicroK8s 1.31 and have installed all the necessary Nvidia drivers and toolkits (the official/proprietary Nvidia driver 565 along with the Nvidia CUDA Toolkit 12.6 and the latest Nvidia Container Runtime). In addition, I have the latest Docker Engine (for Ubuntu) installed, and with Docker the Nvidia runtime works normally: I have deployed Ollama/OpenWeb-UI for some LLM testing on Docker, and the Nvidia GPU (drivers/runtime) works fine there.
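For reference, a quick way to confirm the Docker side is to run nvidia-smi from a CUDA container; the image tag below is only an example, use whatever CUDA base image you have available:

docker run --rm --gpus all nvidia/cuda:12.6.2-base-ubuntu24.04 nvidia-smi

If that prints the usual GPU table, the driver and the Nvidia runtime are fine as far as Docker is concerned.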

The only issue I have is with enabling the MicroK8s Nvidia add-on.
Here is the output of the command that enables the Nvidia add-on on the MicroK8s cluster:

microk8s enable nvidia

Infer repository core for addon nvidia
Addon core/dns is already enabled
Addon core/helm3 is already enabled
Checking if NVIDIA driver is already installed
GPU 0: NVIDIA GeForce RTX 4060 Ti (UUID: GPU-61c2501c-3060-023a-46fe-41cf4c24886f)
GPU 1: Quadro P600 (UUID: GPU-7adad9f5-dc00-f630-8e4c-05bcf17ac96c)
"nvidia" already exists with the same configuration, skipping
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
Deploy NVIDIA GPU operator
Using host GPU driver
W1104 10:35:13.338745 1040135 warnings.go:70] spec.template.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].key: node-role.kubernetes.io/master is use "node-role.kubernetes.io/control-plane" instead
W1104 10:35:13.341208 1040135 warnings.go:70] spec.template.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].key: node-role.kubernetes.io/master is use "node-role.kubernetes.io/control-plane" instead
NAME: gpu-operator
LAST DEPLOYED: Mon Nov 4 10:35:12 2024
NAMESPACE: gpu-operator-resources
STATUS: deployed
REVISION: 1
TEST SUITE: None
Deployed NVIDIA GPU operator

The two warnings in the output are new; I did not see them with my previous installation.

I monitored the deployment on the Kubernetes Dashboard and via kubectl and found some problems that I didn't have before.
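For anyone following along, the kubectl side of that monitoring was something like the following (namespace name taken from the helm output above, <pod-name> is a placeholder for whichever pod is failing):

microk8s kubectl get pods -n gpu-operator-resources
microk8s kubectl describe pod <pod-name> -n gpu-operator-resources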
For nvidia-operator-validator, gpu-feature-discovery, nvidia-dcgm-exporter and nvidia-device-plugin-daemonset I get the following error:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

And for nvidia-container-toolkit-daemonset I get the following error:

Back-off restarting failed container driver-validation in pod nvidia-container-toolkit-daemonset-85tj6_gpu-operator-resources(c7becfab-a7a7-4021-ae40-a3a9d52e4506)
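In case it helps anyone debugging the same symptom: the "no runtime for nvidia" message comes from the containerd instance that MicroK8s runs from its snap, so two quick sanity checks (paths and names below are the MicroK8s snap defaults plus the daemonset/container names from the errors above) are whether the nvidia runtime entry ever made it into that containerd config, and what the driver-validation container logs say:

grep -n nvidia /var/snap/microk8s/current/args/containerd-template.toml
microk8s kubectl logs -n gpu-operator-resources daemonset/nvidia-container-toolkit-daemonset -c driver-validation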

After looking through the documentation for both the Nvidia GPU Operator and MicroK8s, it seems that Ubuntu 24.04 (and 24.04.1) is not yet supported.

Can someone who has solved an issue like this please point me in the right direction?
Also, any advice going forward would be greatly appreciated.
I am still searching for a fix to this issue and would like to get to the bottom of it.
My end goal is to go through a clean OS install and MicroK8s configuration along with any necessary steps so that I gain experience in deploying MicroK8s and working with Kubernetes in general.
Thanks!

It seems I have found the solution: the Nvidia GPU Operator and all related containers/DaemonSets are now working normally.
I deployed the Nvidia GPU Operator manually, following Nvidia's instructions, instead of enabling the add-on with the microk8s enable nvidia command.

The one thing I did differently was to point the Nvidia GPU Operator options for MicroK8s to the containerd configuration file that the Nvidia Container Runtime scripts update, rather than to the MicroK8s containerd template.
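To be concrete about which file that is: the Nvidia Container Toolkit registers the nvidia runtime in the system containerd config, so it is worth confirming the runtime block is actually present there before installing the operator; if it is missing, the toolkit's nvidia-ctk helper can write it, as described in the Nvidia Container Toolkit docs (the grep pattern below is just an example, and the system containerd service needs a restart after the file changes):

grep -A3 'runtimes.nvidia' /etc/containerd/config.toml
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd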

So instead of using the proposed Nvidia GPU Operator installation for MicroK8s with the following options:

microk8s helm install gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator $HELM_OPTIONS \
  --set toolkit.env[0].name=CONTAINERD_CONFIG \
  --set toolkit.env[0].value=/var/snap/microk8s/current/args/containerd-template.toml \
  --set toolkit.env[1].name=CONTAINERD_SOCKET \
  --set toolkit.env[1].value=/var/snap/microk8s/common/run/containerd.sock \
  --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
  --set toolkit.env[2].value=nvidia \
  --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
  --set-string toolkit.env[3].value=true

I used the following:

microk8s helm install gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator $HELM_OPTIONS \
  --set toolkit.env[0].name=CONTAINERD_CONFIG \
  --set toolkit.env[0].value=/etc/containerd/config.toml \
  --set toolkit.env[1].name=CONTAINERD_SOCKET \
  --set toolkit.env[1].value=/var/snap/microk8s/common/run/containerd.sock \
  --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
  --set toolkit.env[2].value=nvidia \
  --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
  --set-string toolkit.env[3].value=true

So far things are working normally and GPUs are available and recognized.
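For anyone wanting to verify the same thing, a few checks along these lines should do (the namespace matches the helm command above, and the runtime class name comes from the CONTAINERD_RUNTIME_CLASS value):

microk8s kubectl get pods -n gpu-operator
microk8s kubectl get runtimeclass nvidia
microk8s kubectl describe node | grep -i 'nvidia.com/gpu'

All operator pods should end up Running or Completed, and the node description should list nvidia.com/gpu under Capacity and Allocatable.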