Support for allocating GPUs in Passthrough-Mode #183

Open
wants to merge 11 commits into
base: main

varunrsekar

This PR introduces a new DeviceClass, vfiopci.nvidia.com, that allocates a full GPU in passthrough (PT) mode by binding the GPU to the vfio-pci driver.
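
For readers unfamiliar with the mechanism, binding a PCI device to vfio-pci is usually done through the kernel's sysfs interface. The following is a minimal sketch of that sequence, not the code in this PR. The function name `bind_to_vfio_pci` and the `SYSFS_ROOT` override are hypothetical, introduced only for illustration, and the sketch assumes the vfio-pci module is already loaded and the caller is root.

```shell
#!/usr/bin/env bash
# Sketch only: rebind one PCI device to vfio-pci via sysfs.
# bind_to_vfio_pci and SYSFS_ROOT are hypothetical names, not part of the PR.

bind_to_vfio_pci() {
    local bdf="$1"                     # PCI address, e.g. 0000:01:00.0
    local root="${SYSFS_ROOT:-/sys}"   # overridable so the logic can be dry-run
    local dev="$root/bus/pci/devices/$bdf"

    [ -d "$dev" ] || { echo "no such PCI device: $bdf" >&2; return 1; }

    # Tell the PCI core that only vfio-pci may claim this device.
    echo "vfio-pci" > "$dev/driver_override"

    # Detach the currently bound driver (for example nvidia), if any.
    if [ -e "$dev/driver" ]; then
        echo "$bdf" > "$dev/driver/unbind"
    fi

    # Re-probe the device; driver_override routes it to vfio-pci.
    echo "$bdf" > "$root/bus/pci/drivers_probe"
}
```

On a real host this would be called as `sudo bash -c 'source bind.sh; bind_to_vfio_pci 0000:01:00.0'`, where the address and file name are placeholders.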

The primary use cases for this new DeviceClass are Kata Containers and KubeVirt VMs, which require the GPU to be in PT mode and made available to a pod that then spins up a guest with the GPU attached.
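
As a rough illustration of how a pod might request a GPU through this DeviceClass, the sketch below prints a ResourceClaimTemplate and a pod that references it. Everything except the DeviceClass name, the namespace, and the pod name is an assumption: the `resource.k8s.io/v1alpha3` API version, the template name `single-gpu`, the container image, and the helper name `render_vfiopci_demo` are illustrative and may not match the manifests in this PR. A real Kata or KubeVirt workload would need additional settings that are not shown here.

```shell
#!/usr/bin/env bash
# Sketch only: emit a hypothetical claim template and pod for vfiopci.nvidia.com.
# The API version, template name, and image are assumptions, not taken from the PR.

render_vfiopci_demo() {
    cat <<'EOF'
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test-vfiopci
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: vfiopci.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test-vfiopci
  name: pod1
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["sleep", "infinity"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
EOF
}
```

The output can be piped to `kubectl apply -f -` once the namespace exists.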

Note: Regular pod workloads will not benefit from this DeviceClass and should not use it.

As part of this change, I've introduced some (but not all) of the modifications to the kind cluster config that this DeviceClass needs in order to work. Host-level modifications needed:

# Example on Ubuntu:

# Enable IOMMU on the host kernel
if ! grep -q "GRUB_CMDLINE_LINUX_DEFAULT=.*intel_iommu=on" /etc/default/grub; then
   sudo sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="/GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on /g' /etc/default/grub
fi
sudo update-grub

# Disable GDM
sudo systemctl stop gdm && sudo systemctl disable gdm
# Unload nvidia-drm
sudo modprobe -r nvidia-drm
# Reboot the node
sudo reboot
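
After the reboot, it can help to confirm that the IOMMU setting took effect before trying the DeviceClass. The sketch below is one way to check, assuming an Intel host as in the example above. The helper names `has_iommu_flag` and `iommu_group_count` are hypothetical and not part of this PR.

```shell
#!/usr/bin/env bash
# Sketch only: post-reboot sanity checks for the host preparation steps.
# has_iommu_flag and iommu_group_count are hypothetical helper names.

# Succeeds if the given kernel command line contains intel_iommu=on.
has_iommu_flag() {
    local cmdline="$1"
    case " $cmdline " in
        *" intel_iommu=on "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Prints how many IOMMU groups exist under <root>/kernel/iommu_groups.
# The root defaults to /sys and is a parameter only so the check can be dry-run.
iommu_group_count() {
    local root="${1:-/sys}"
    local n=0 g
    for g in "$root"/kernel/iommu_groups/*; do
        if [ -d "$g" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

A typical check would be `has_iommu_flag "$(cat /proc/cmdline)" && [ "$(iommu_group_count)" -gt 0 ]`. A group count of zero suggests the IOMMU is still disabled in the kernel or firmware.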

Validated on a kind cluster with a Quadro P2000 GPU:

$ nvidia-smi -L
GPU 0: Quadro P2000 (UUID: GPU-7bea1569-778c-fb4d-7801-df6b6b85ceac)
$ k get resourceclaim -n gpu-test-vfiopci
NAME             STATE                AGE
pod1-gpu-k9w6g   allocated,reserved   21s

$ k get pod -n gpu-test-vfiopci
NAME   READY   STATUS    RESTARTS   AGE
pod1   1/1     Running   0          2m20s

Open items:

  • How can sysfs on the kind cluster node be mounted read-write?
  • Handling kubelet plugin restarts while the GPU is bound to the vfio-pci driver, since no device discovery is possible in that state.
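
On the second open item, one possible direction (an assumption, not something this PR implements) is to fall back to sysfs when a GPU is bound to vfio-pci. The sketch below lists PCI devices with the NVIDIA vendor ID (0x10de) whose bound driver is vfio-pci. The function name `list_nvidia_vfio_devices` is hypothetical, and the sketch does not filter by PCI class, so non-GPU NVIDIA functions such as audio controllers would also be reported if they were bound to vfio-pci.

```shell
#!/usr/bin/env bash
# Sketch only: find NVIDIA PCI devices currently bound to vfio-pci via sysfs.
# list_nvidia_vfio_devices is a hypothetical name, not part of the PR.

list_nvidia_vfio_devices() {
    local root="${1:-/sys}"   # parameterized so the scan can be dry-run
    local dev vendor driver
    for dev in "$root"/bus/pci/devices/*; do
        [ -r "$dev/vendor" ] || continue
        vendor="$(cat "$dev/vendor")"
        [ "$vendor" = "0x10de" ] || continue   # NVIDIA PCI vendor ID
        [ -L "$dev/driver" ] || continue       # skip devices with no driver bound
        driver="$(basename "$(readlink "$dev/driver")")"
        if [ "$driver" = "vfio-pci" ]; then
            basename "$dev"                    # print the PCI address
        fi
    done
    return 0
}
```

Calling `list_nvidia_vfio_devices` with no arguments scans the live `/sys` tree and prints one PCI address per line.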

Varun Ramachandra Sekar added 11 commits October 15, 2024 11:05
Signed-off-by: Varun Ramachandra Sekar <[email protected]>
@varunrsekar
Author

/cc @klueska
