Node "soft-locks" when requesting multiple GPUs for pod

Hi, I’ve encountered a weird issue: I can create a container with either 1 GPU or all available GPUs of a node assigned to it, but requesting any other amount "soft-locks" the entire node. Specifically, the Pod gets scheduled onto a node with sufficient GPUs and then stays stuck in the Pending state forever without the container ever starting. While existing pods keep running, it is no longer possible to create new pods or terminate existing ones on that node, regardless of whether they have a GPU assigned. The only way to recover is to stop and start microk8s on that node.
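For reference, this is roughly how I'm requesting the GPUs (illustrative manifest; the image and pod name are just placeholders, and `nvidia.com/gpu` is the resource name exposed by the NVIDIA device plugin that the gpu-operator deploys):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test            # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:11.6.2-base-ubuntu20.04   # placeholder image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 2   # 1 works, all GPUs works, anything in between hangs
```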

Has anyone ever had a similar problem on a microk8s cluster? I’ve also tried increasing the logging verbosity for kubelite and containerd but could not identify any obvious errors, so I’m not even sure whether this is a bug or whether I’m just getting something wrong.

I’ve encountered this problem on the 1.24/stable channel with the latest 1.11.0 release of the gpu-operator, which I installed through the microk8s add-on.