GPU jobs are as important as CPU jobs in the AI era. Today, pod scheduling is constrained by the physical location of the GPU: a pod that needs a GPU has to run on the node where the card is installed. GPU pooling may be one of the best ways to remove that constraint.
Are there any plans to add GPU resource pooling, such as VMware vSphere Bitfusion or H3C CAS CVM?
GPU pooling could provide:
- Physical GPU slicing along two dimensions, compute and video memory, with 1% compute granularity and 1 MB video memory granularity, so that a task needing less than one physical GPU card gets only the compute power it asks for.
- Remote invocation, i.e., deploying an AI task on a CPU-only server and accelerating it by calling GPU resources remotely over the network, without needing a local GPU card.
- Resource aggregation, which aggregates multiple GPU cards in the resource pool to a single computing task, so that a single task can use more GPU card resources without having to pay attention to the number of GPUs on a single machine.
- On-demand, dynamic expansion of GPU resources according to the demand for computing power, without restarting the virtual machine or container.
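
To make the slicing idea concrete, here is a minimal sketch of how a pool might hand out fractional GPUs at 1% compute and 1 MB memory granularity. Everything in it is an assumption for illustration: `PooledGPU`, `allocate_slice`, and the first-fit policy are hypothetical names and choices, not the API of Bitfusion, CAS CVM, or any Kubernetes component.

```python
from dataclasses import dataclass


@dataclass
class PooledGPU:
    """Hypothetical record of what is still free on one physical card."""
    name: str
    free_compute_pct: int = 100      # remaining compute, in 1% units
    free_memory_mb: int = 32 * 1024  # remaining video memory, in 1 MB units


def allocate_slice(pool, compute_pct, memory_mb):
    """First-fit: reserve the slice on the first GPU that can host it.

    Returns the GPU name, or None if no card in the pool has room.
    """
    for gpu in pool:
        if gpu.free_compute_pct >= compute_pct and gpu.free_memory_mb >= memory_mb:
            gpu.free_compute_pct -= compute_pct
            gpu.free_memory_mb -= memory_mb
            return gpu.name
    return None


pool = [PooledGPU("gpu-0"), PooledGPU("gpu-1")]
a = allocate_slice(pool, compute_pct=30, memory_mb=4096)  # fits on gpu-0
b = allocate_slice(pool, compute_pct=80, memory_mb=8192)  # gpu-0 has only 70% left
c = allocate_slice(pool, compute_pct=70, memory_mb=1)     # uses the rest of gpu-0
```

Because the caller only sees the pool, not the node that owns each card, the same interface could in principle back remote invocation as well; the network transport is deliberately left out of this sketch.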
With GPU pooling, k8s could additionally offer:
- support for over-subscription, or over-selling of GPU resources, which is a capability that cloud vendors would like to have;
- GPU video memory expansion, which uses CPU (host) memory to extend the GPU's physical video memory; for example, CUDA programs could allocate and use more than 32 GB of video memory on a GPU that has only 32 GB;
- queuing of GPU requests, with configurable queue priorities, when GPU resources are in short supply;
- in the case of resource contention between multiple GPU tasks, the high-priority tasks can be guaranteed resources;
- support for the unified management of physical GPUs and virtual GPUs, switching between each other on demand, and so on.
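
The queuing and priority items above could be sketched as a strict-priority admission queue. This is only an illustration under stated assumptions: `GPUQueue`, its methods, and the task names are hypothetical, and capacity is modelled as a single percentage budget (setting it above the physical total would model over-subscription).

```python
import heapq
import itertools


class GPUQueue:
    """Hypothetical strict-priority queue for pending GPU requests."""

    def __init__(self, capacity_pct):
        self.free_pct = capacity_pct
        self._heap = []
        self._order = itertools.count()  # tie-breaker: FIFO within a priority

    def submit(self, task, priority, compute_pct):
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._order), task, compute_pct))

    def schedule(self):
        """Admit tasks in priority order until the head of the queue no longer fits.

        Lower-priority tasks never jump ahead of a waiting high-priority task.
        """
        admitted = []
        while self._heap and self._heap[0][3] <= self.free_pct:
            _, _, task, pct = heapq.heappop(self._heap)
            self.free_pct -= pct
            admitted.append(task)
        return admitted


q = GPUQueue(capacity_pct=100)
q.submit("batch-train", priority=1, compute_pct=60)
q.submit("online-infer", priority=10, compute_pct=50)
q.submit("notebook", priority=5, compute_pct=40)
admitted = q.schedule()  # "batch-train" stays queued until capacity is freed
```

Stopping at the first task that does not fit is a deliberate choice here: it keeps the guarantee for high-priority work simple, at the cost of leaving some capacity idle that a backfilling scheduler would use.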
Additional context:
Related GPU virtualization topic: rCUDA
Bitfusion with k8s
Some enterprise solutions, such as OrionX, already integrate this feature into RKE.