Node "soft-locks" when requesting multiple GPUs for pod

Hi, I’ve encountered a weird issue: I can create a container with either 1 GPU or all available GPUs of a node assigned to it, but requesting any other amount "soft-locks" the entire node. Specifically, the Pod gets scheduled onto a node with sufficient GPUs and then stays stuck in the Pending state forever without the container ever starting. While existing pods keep running, it is no longer possible to create new pods or terminate existing ones on that node, regardless of whether they have a GPU assigned. The only way to recover is to stop and start microk8s on that node.
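For reference, this is roughly how I'm requesting the GPUs (illustrative manifest; the image and pod name are just placeholders, and `nvidia.com/gpu` is the resource name exposed by the NVIDIA device plugin that the gpu-operator deploys):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test            # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:11.6.2-base-ubuntu20.04   # placeholder image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 2   # 1 works, all GPUs works, anything in between hangs
```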

Has anyone ever had a similar problem on a microk8s cluster? I’ve also tried increasing the logging verbosity for kubelite and containerd but could not identify any obvious errors, so I’m not even sure whether this is a bug or whether I’m just getting something wrong.

I’ve encountered this problem on the 1.24/stable channel with the latest 1.11.0 release of the gpu-operator, which I installed through the microk8s add-on.