My adventures with MicroK8s to enable GPU and use MIG on a DGX A100

Hi folks. I wanted to share my story about using MicroK8s and MIG (Multi-Instance GPU) on an Nvidia DGX A100 server. This required, among other things, running microk8s enable gpu, but even that step wasn’t quite so simple. So here goes.

First of all, it looks like enabling GPU in MicroK8s has been a challenge for a while. There’s an issue on GitHub where, for the last 2 years, users have been posting about the trouble they’ve run into just trying to do microk8s enable gpu.

The new MicroK8s version 1.21 in the beta snap channel (i.e. sudo snap install microk8s --channel=1.21/beta --classic) now installs the Nvidia GPU operator. This K8s operator handles loading the nvidia driver in the kernel, which means you don’t need to have the nvidia drivers installed on the host. In fact, you have to make sure that there are no nvidia drivers installed on the host at all. So that’s step 1…

Note: I ran these steps on Ubuntu 20.04 on an Nvidia DGX A100 machine with 8 x A100 GPUs.

Step 1) Make sure to completely remove Nvidia drivers from the host

Find all the nvidia-driver-xyz packages (dpkg -l | grep nvidia-driver) and apt purge them. I simply removed every package with the words nvidia or cuda in them. Better yet, start with a brand new Ubuntu 20.04 machine.
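If you do want to clean up an existing machine, something along these lines should do it. This is just a sketch; review the list dpkg prints before purging anything:

# See what's installed first, then purge everything with nvidia or cuda in the name
dpkg -l | awk '/nvidia|cuda/ {print $2}'
sudo apt-get purge -y $(dpkg -l | awk '/nvidia|cuda/ {print $2}')
sudo apt-get autoremove --purge -y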

Step 2) Blocklist the nouveau driver

The second step is to make sure the nouveau driver is blocked from loading in the kernel, otherwise the nvidia driver daemonset won’t be able to load the nvidia drivers, since nouveau will be holding and locking the devices.

Edit /etc/default/grub and add the following options to GRUB_CMDLINE_LINUX_DEFAULT:

modprobe.blacklist=nouveau nouveau.modeset=0
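For example, on a stock install where the line only contains quiet splash, it could end up looking like this (keep whatever options you already have in there):

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash modprobe.blacklist=nouveau nouveau.modeset=0"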

Run sudo update-grub to update the boot options and then reboot the machine. Once you log back in, you shouldn’t have any nvidia or nouveau drivers loaded in your kernel. The following command should return nothing:

$ lsmod | grep -i -e cuda -e nvidia -e nouveau

Step 3) Install fabric manager

Now because these are A100 GPUs, I need to install the nvidia fabric manager package, otherwise the gpu operator won’t be able to use the A100 GPUs. This is not required for non-A100 GPUs (e.g. T4).

I followed the instructions from Nvidia’s website to install the cuda repository on my machine. This is what I did:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.2.2/local_installers/cuda-repo-ubuntu2004-11-2-local_11.2.2-460.32.03-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-2-local_11.2.2-460.32.03-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu2004-11-2-local/7fa2af80.pub
sudo apt-get update

You don’t want to run the last command from the website (sudo apt-get -y install cuda) because you don’t want CUDA on the machine. You just want the repo. This set of commands will install a local repo right on your machine that currently matches the driver version that the gpu operator loads (460.32.03 if I remember correctly).
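Before installing anything from it, you can double-check that the local repo is being picked up; the fabric manager package from the next step should show a candidate version coming from that repo:

apt-cache policy nvidia-fabricmanager-460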

And now install the fabric manager package:

sudo apt install nvidia-fabricmanager-460

This will add a systemd service, but you won’t be able to enable it yet, because you don’t have any nvidia drivers at this point. You can see that with systemctl status nvidia-fabricmanager.

Step 4) Install MicroK8s 1.21/beta

As far as I know, 1.21/beta is the first version that supports the GPU operator and actually works.

sudo snap install microk8s --channel=1.21/beta --classic

As of this writing, the snap version installed with this command is v1.21.0-beta.1 2021-03-12 (2085).

Of course you might want to add yourself to the microk8s group as usual:

sudo usermod -a -G microk8s ubuntu
sudo chown -f -R ubuntu ~/.kube

… and logout and log back in.

Step 5) Enable DNS and make sure it works

If you are in a standard environment that can reach out to the Google DNS servers, you don’t need to do this step (enable gpu will do it for you). But if your environment doesn’t let you do that, you need to enable DNS manually. Here’s how I enabled DNS with my own DNS server:

microk8s enable dns:10.229.32.21
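You can confirm the addon actually came up by looking for the coredns pod:

microk8s kubectl get pods -n kube-system | grep coredns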

I suggest launching a test pod here and making sure you can reach the Internet. My favorite pod to do this is tutum/dnsutils. So:

kubectl run my-shell --rm -i --tty --image tutum/dnsutils -- bash

And make sure you can resolve an external hostname from inside the pod:

nslookup ubuntu.com
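While you’re in that pod, it doesn’t hurt to check that cluster-internal names resolve as well, using the kubernetes service that every cluster has:

nslookup kubernetes.default.svc.cluster.local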

Step 6) Enable GPU and fabric manager

Now comes the fun part. You can enable the GPU addon, but you need to enable the fabric manager service as soon as the gpu operator has loaded the nvidia drivers into the kernel.

I recommend having 2 windows and doing watch "lsmod | grep nvidia" in one window while you run the following command in the other window:

microk8s enable gpu

Once you see the nvidia drivers are loaded in the kernel like this:

nvidia_modeset       1228800  0
nvidia_uvm           1011712  0
nvidia              34037760  101 nvidia_uvm,nvidia_modeset

… you can stop the watch command and run the following commands to enable fabric manager and check its status:

sudo systemctl --now enable nvidia-fabricmanager
systemctl status nvidia-fabricmanager

You should see something like this:

● nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2021-03-29 19:22:11 UTC; 3s ago
    Process: 192212 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=0/SUCCESS)
   Main PID: 192218 (nv-fabricmanage)
      Tasks: 18 (limit: 629145)
     Memory: 15.3M
     CGroup: /system.slice/nvidia-fabricmanager.service
             └─192218 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg

Mar 29 19:22:01 blanka systemd[1]: Starting NVIDIA fabric manager service...
Mar 29 19:22:11 blanka nv-fabricmanager[192218]: Successfully configured all the available GPUs and NVSwitches.
Mar 29 19:22:11 blanka systemd[1]: Started NVIDIA fabric manager service.

This will allow the gpu operator pods to run successfully. If you don’t enable the service, the nvidia-device-plugin-daemonset pod will be stuck in CrashLoopBackOff.
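To keep an eye on things while the operator settles, you can list its pods; they should all eventually end up Running or Completed:

microk8s kubectl get pods -A | grep -i -e nvidia -e gpu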

If you do a microk8s kubectl get nodes -o yaml | grep gpu, you’ll notice that you have 8 GPUs, which is what we’d expect since we have 8 A100 GPUs in the DGX A100.
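Depending on the labels the operator adds to the node, that grep can match quite a few lines, but the important ones are the node’s capacity and allocatable counts, which should look roughly like this:

      nvidia.com/gpu: "8"
      nvidia.com/gpu: "8"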

Step 7) Enable MIG

By now, you should have MicroK8s running with GPU enabled. However, we are not using MIG yet. MIG allows you to take each of the 8 A100 GPUs on the DGX A100 and split them into up to seven slices, for a total of up to 56 usable GPUs on the DGX A100.

Install the nvidia utilities. This command should install the utils from the local cuda repo that we previously installed:

sudo apt-get install nvidia-utils-460

By default, MIG mode is not enabled on the NVIDIA A100. You can run nvidia-smi to show that MIG mode is disabled:

ubuntu@blanka:~$ nvidia-smi
Mon Mar 29 20:37:39 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |     
[...]
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      Off  | 00000000:0F:00.0 Off |                    0 |
| N/A   24C    P0    53W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      Off  | 00000000:47:00.0 Off |                    0 |
| N/A   25C    P0    53W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
[...]

MIG mode can be enabled on a per-GPU basis, but for this example I enabled it for all GPUs. If no GPU ID is specified, then MIG mode is applied to all the GPUs on the system. Note that MIG mode (Disabled or Enabled states) is persistent across system reboots.

Also, it’s important to note that there are 2 MIG strategies: single and mixed. The GPU operator uses the single strategy by default. So that’s what we get when we do microk8s enable gpu. I didn’t try the mixed strategy myself.

The single MIG strategy requires that all 8 GPUs on the DGX A100 have MIG enabled and they must also be partitioned exactly the same way.

You run this command to enable MIG on all GPUs:

sudo nvidia-smi -mig 1
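If you only wanted to flip MIG mode on a single GPU instead, nvidia-smi also takes a -i selector, e.g. for GPU 0:

sudo nvidia-smi -i 0 -mig 1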

Now you’ll notice that this is a delayed enable. You need to reset the GPUs or reboot the server to really enable MIG on the GPUs. I found it easier to just reboot.

However, there’s currently a bug with microk8s where kubelite and containerd keep restarting after a reboot. This is a known issue that should be fixed in the current 1.21/candidate channel, which at the time of this writing corresponds to v1.21.0-rc.0 2021-03-29 (2101). Hopefully this won’t be a problem for you. Otherwise you can do what I did: remove MicroK8s, reinstall it, and re-enable the GPU addon.
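That recovery roughly boils down to re-running the earlier steps (adjust the DNS server to your own):

sudo snap remove microk8s
sudo snap install microk8s --channel=1.21/beta --classic
microk8s enable dns:10.229.32.21
microk8s enable gpu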

Don’t forget to re-enable the fabric manager service once the nvidia drivers are loaded back in the kernel by the gpu operator.

Step 8) Create the GPU slices

Now that MIG is enabled (check with nvidia-smi again), we can proceed to slicing up the 8 GPUs into more usable GPUs for MicroK8s.

The NVIDIA driver provides a number of profiles that users can opt into when configuring the MIG feature on the A100. The profiles are the sizes and capabilities of the GPU instances that can be created by the user. The driver also provides information about the placements, which indicate the type and number of instances that can be created.

$ sudo nvidia-smi mig -lgip
+--------------------------------------------------------------------------+
| GPU instance profiles:                                                   |
| GPU   Name          ID    Instances   Memory     P2P    SM    DEC   ENC  |
|                           Free/Total   GiB              CE    JPEG  OFA  |
|==========================================================================|
|   0  MIG 1g.5gb     19     7/7        4.95       No     14     0     0   |
|                                                          1     0     0   |
+--------------------------------------------------------------------------+
|   0  MIG 2g.10gb    14     3/3        9.90       No     28     1     0   |
|                                                          2     0     0   |
+--------------------------------------------------------------------------+
|   0  MIG 3g.20gb     9     2/2        19.79      No     42     2     0   |
|                                                          3     0     0   |
+--------------------------------------------------------------------------+
|   0  MIG 4g.20gb     5     1/1        19.79      No     56     2     0   |
|                                                          4     0     0   |
+--------------------------------------------------------------------------+
|   0  MIG 7g.40gb     0     1/1        39.59      No     98     5     0   |
|                                                          7     1     1   |
+--------------------------------------------------------------------------+

You can list the possible placements available using the following command. The syntax of a placement is {<start indices>}:<slice count> and shows where instances of that profile can be placed on the GPU.

$ sudo nvidia-smi mig -lgipp
GPU  0 Profile ID 19 Placements: {0,1,2,3,4,5,6}:1
GPU  0 Profile ID 14 Placements: {0,2,4}:2
GPU  0 Profile ID  9 Placements: {0,4}:4
GPU  0 Profile ID  5 Placement : {0}:4
GPU  0 Profile ID  0 Placement : {0}:8

The command shows that the user can create two instances of type 3g.20gb (profile ID 9) or seven instances of 1g.5gb (profile ID 19).

Before starting to use MIG, the user needs to create GPU instances using the -cgi option. One of three options can be used to specify the instance profiles to be created:

  • Profile ID (e.g. 9, 14, 5)
  • Short name of the profile (e.g. 3g.20gb)
  • Full profile name of the instance (e.g. MIG 3g.20gb)

Once the GPU instances are created, one needs to create the corresponding Compute Instances (CI). By using the -C option, nvidia-smi creates these instances.

Note: Without creating GPU instances (and corresponding compute instances), CUDA workloads cannot be run on the GPU. In other words, simply enabling MIG mode on the GPU is not sufficient. Also note that the created MIG devices are not persistent across system reboots. Thus, the user or system administrator needs to recreate the desired MIG configurations if the GPU or system is reset. For automated tooling support for this purpose, refer to the NVIDIA MIG Partition Editor (or mig-parted) tool.

The following example shows how the user can create GPU instances (and corresponding compute instances). Here, two GPU instances of type 3g.20gb are created, each with half of the available compute and memory capacity. We purposefully use the profile ID for one and the short profile name for the other to showcase how either option can be used:

$ sudo nvidia-smi mig -cgi 9,3g.20gb -C
Successfully created GPU instance ID  2 on GPU  0 using profile MIG 3g.20gb (ID  9)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  2 using profile MIG 3g.20gb (ID  2)
Successfully created GPU instance ID  1 on GPU  0 using profile MIG 3g.20gb (ID  9)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  1 using profile MIG 3g.20gb (ID  2)

Now list the available GPU instances:

$ sudo nvidia-smi mig -lgi
+----------------------------------------------------+
| GPU instances:                                     |
| GPU   Name          Profile  Instance   Placement  |
|                       ID       ID       Start:Size |
|====================================================|
|   0  MIG 3g.20gb       9        1          4:4     |
+----------------------------------------------------+
|   0  MIG 3g.20gb       9        2          0:4     |
+----------------------------------------------------+
[...]

Now verify that the GIs and corresponding CIs are created:

$ nvidia-smi
+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |                      | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    1   0   0  |     11MiB / 20224MiB | 42      0 |  3   0    2    0    0 |
+------------------+----------------------+-----------+-----------------------+
|  0    2   0   1  |     11MiB / 20096MiB | 42      0 |  3   0    2    0    0 |
+------------------+----------------------+-----------+-----------------------+
[...]        

Now if you do another microk8s kubectl get nodes -o yaml | grep gpu, you’ll notice that you have a GPU count of 16 instead of 8.
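Since the single strategy exposes each slice as a regular nvidia.com/gpu resource, a pod requests a MIG slice just like a full GPU. Here’s a minimal sketch I haven’t actually run myself; the pod name and image are assumptions, and any CUDA-capable image should do:

cat <<EOF | microk8s kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: mig-test                                 # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.2.2-base-ubuntu20.04   # assumption: any CUDA image works here
    command: ["nvidia-smi", "-L"]                # list the device(s) the pod was given
    resources:
      limits:
        nvidia.com/gpu: 1                        # one MIG slice under the single strategy
EOF

microk8s kubectl logs mig-test should then show the single MIG device assigned to the pod.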

And that’s it. I didn’t push this further for now. I just wanted to see the GPU count go from 8 to something more. Hopefully this helps some people get up and running with MIG and GPU in microk8s!

The video of my demo is up on youtube: https://www.youtube.com/watch?v=4ALztZDlkJ0
