Kubernetes version: kubelet=1.23.6-00
Cloud being used: (put bare-metal if not on a public cloud)
Host OS: Ubuntu 18.04
I am trying to launch multiple jobs at once from a shell script on a multi-GPU machine. When I launch 10 jobs in a single run, they consume roughly 90% of the memory on GPU index 0 and complete successfully. But when I launch 20 jobs, instead of the workload being spread across the GPUs, every job is placed on that same single GPU, and I get an "out of memory" error.
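For reference, one common workaround outside of Kubernetes scheduling is to pin each job to a GPU yourself via `CUDA_VISIBLE_DEVICES`, round-robin style, so each process only sees one device. This is a minimal sketch, not the original script: `NUM_GPUS` is an assumed GPU count and `run_job` is a placeholder standing in for the real workload command.

```shell
#!/bin/sh
# Hypothetical sketch: round-robin 20 jobs across NUM_GPUS devices by
# setting CUDA_VISIBLE_DEVICES per job, so each process sees one GPU.
NUM_GPUS=4    # assumed GPU count; adjust to your machine
NUM_JOBS=20

run_job() {
    # placeholder for the real training/inference command
    echo "job $2 pinned to GPU $1"
}

i=0
while [ "$i" -lt "$NUM_JOBS" ]; do
    gpu=$((i % NUM_GPUS))
    # the env-var prefix applies only to this one background job
    CUDA_VISIBLE_DEVICES=$gpu run_job "$gpu" "$i" &
    i=$((i + 1))
done
wait    # block until all background jobs finish
```

If the jobs run as Kubernetes pods rather than bare processes, the equivalent is to request GPUs through the device plugin (e.g. a `nvidia.com/gpu` resource limit) so the scheduler spreads them, instead of relying on the processes to pick a device themselves.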