nvproxy: Support GPU capability segmentation #10856

Open
EtiennePerot opened this issue Sep 4, 2024 · 0 comments
Labels
type: enhancement New feature or request

Comments

@EtiennePerot
Contributor

EtiennePerot commented Sep 4, 2024

Description

Currently, gVisor's NVIDIA GPU support (nvproxy) only covers CUDA-related commands (ioctls, allocation classes, etc.). There have been multiple requests to expand this set to support non-CUDA GPU workloads, such as video transcoding (NVENC, NVDEC) in #9452. Vulkan has also come up.

One aspect of nvproxy's design is that it inherently limits the exposed NVIDIA kernel driver ABI to the set of commands that nvproxy understands. Like all attack-surface-reduction measures, doing so offers some security benefits.

If we keep adding commands to the single undifferentiated set that nvproxy currently knows about, this benefit will erode over time. That has been acceptable so far because all of the workloads nvproxy aims to support are of the same type (compute/CUDA workloads) and can therefore be expected to require largely overlapping sets of commands. However, if support for e.g. video transcoding workloads were added to this same set, video-transcoding ABI commands would become exposed to CUDA workloads that do not need them. This feature request is about avoiding that.

Is this feature related to a specific bug?

#9452 and other discussions.

Do you have a specific solution in mind?

This feature request is about implementing a capability segmentation scheme for nvproxy commands, so that commands not required by CUDA workloads are only exposed when explicitly requested.

NVIDIA has the concept of "driver capabilities", each mapping to a set of shared libraries (.so files) that roughly corresponds to the high-level functionality users of that capability need. They are:

  • Compute: Hardware-accelerated number-crunching. Used by CUDA and OpenCL applications.
  • Graphics: Hardware-accelerated 3D and 2D rendering. Used by OpenGL and Vulkan applications.
  • Video: Hardware-accelerated video encoding (NVENC) and decoding (NVDEC).
  • Display: Rendering to physical monitors. Used by X11 and Wayland applications.
  • Utility: GPU hardware info and management. Used by nvidia-smi and NVML.

NVIDIA exposes the choice of these GPU capabilities using the NVIDIA_DRIVER_CAPABILITIES environment variable, similar to the NVIDIA_VISIBLE_DEVICES environment variable.
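For example, `NVIDIA_DRIVER_CAPABILITIES=video,utility` requests only video transcoding and GPU management support, while the special value `all` requests every capability.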

We can reuse this scheme: it is already established and easy for users to understand and specify, while still making it possible to keep large portions of the kernel driver ABI unexposed.
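As a rough sketch of what such segmentation could look like, here is a minimal Go example of parsing the variable into a capability set and gating ioctls on it. All names here (`DriverCap`, `ParseDriverCaps`, `ioctlAllowed`, the allowlist shape) are hypothetical illustrations, not gVisor's actual API:

```go
package nvcaps // hypothetical package name, for illustration only

import "strings"

// DriverCap names one NVIDIA driver capability, using the same strings
// that NVIDIA_DRIVER_CAPABILITIES uses.
type DriverCap string

const (
	CapCompute  DriverCap = "compute"
	CapGraphics DriverCap = "graphics"
	CapVideo    DriverCap = "video"
	CapDisplay  DriverCap = "display"
	CapUtility  DriverCap = "utility"
)

var allCaps = []DriverCap{CapCompute, CapGraphics, CapVideo, CapDisplay, CapUtility}

// ParseDriverCaps converts a comma-separated NVIDIA_DRIVER_CAPABILITIES
// value into a capability set. Unknown entries are ignored, and the
// special value "all" enables every capability.
func ParseDriverCaps(env string) map[DriverCap]bool {
	if env == "" {
		// Assumed default: compute+utility, mirroring the
		// "--compute --utility" behavior noted in the runsc commit below.
		env = "compute,utility"
	}
	caps := make(map[DriverCap]bool)
	for _, f := range strings.Split(env, ",") {
		c := DriverCap(strings.TrimSpace(f))
		if c == "all" {
			for _, a := range allCaps {
				caps[a] = true
			}
			continue
		}
		for _, a := range allCaps {
			if c == a {
				caps[c] = true
			}
		}
	}
	return caps
}

// ioctlAllowed reports whether an ioctl number is exposed, given the
// enabled capabilities and a per-capability allowlist of ioctl numbers.
func ioctlAllowed(caps map[DriverCap]bool, allowlist map[DriverCap]map[uint32]bool, nr uint32) bool {
	for c, enabled := range caps {
		if enabled && allowlist[c][nr] {
			return true
		}
	}
	return false
}
```

Representing the configuration as a set of capabilities keeps the mapping from capability to allowed ioctls explicit, which is exactly the property the segmentation scheme is after.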

@EtiennePerot EtiennePerot added the type: enhancement New feature or request label Sep 4, 2024
copybara-service bot pushed a commit that referenced this issue Sep 4, 2024
This wraps all GPU tests' command lines with the nvproxy ioctl sniffer.

This has multiple functions:

- Verifying that the application does not call ioctls unsupported by
  nvproxy. This is controlled by an `AllowIncompatibleIoctl` option, which
  is initially set to `true` in all tests to mirror current behavior, but
  should be flipped as we verify that they do not call unsupported ioctls.
- Verifying that the sniffer itself works transparently for a wide range
  of applications.
- Later down the line, enforcing that the application only calls ioctls
  that are part of GPU capabilities that it has a need for. This is
  controlled by a capability string which is currently only used to set
  the `NVIDIA_DRIVER_CAPABILITIES` environment variable.

Updates issue #10856

PiperOrigin-RevId: 670751227
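A minimal sketch of the option this commit describes, with hypothetical names (`SnifferOpts`, `checkIoctl`) rather than gVisor's actual test API:

```go
package sniffertest // hypothetical package, for illustration only

import "fmt"

// SnifferOpts mirrors the commit's description: when
// AllowIncompatibleIoctl is true, ioctls unsupported by nvproxy are
// tolerated; flipping it to false makes them an error.
type SnifferOpts struct {
	AllowIncompatibleIoctl bool
}

// checkIoctl gates a single ioctl number against the supported set.
func checkIoctl(opts SnifferOpts, supported map[uint32]bool, nr uint32) error {
	if supported[nr] || opts.AllowIncompatibleIoctl {
		return nil
	}
	return fmt.Errorf("ioctl %#x is not supported by nvproxy", nr)
}
```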
copybara-service bot pushed a commit that referenced this issue Sep 6, 2024
runsc attempts to emulate nvidia-container-runtime-hook. But it was always
passing "--compute --utility" as driver capability flags to
the `nvidia-container-cli configure` command.

Fix runsc to emulate nvidia-container-runtime-hook correctly by parsing
NVIDIA_DRIVER_CAPABILITIES and converting that comma-separated list to flags.

This is in preparation for adding support for non-compute GPU workloads in
nvproxy :)

Updates #9452
Updates #10856

PiperOrigin-RevId: 671644915
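A minimal sketch of the conversion this commit describes, reusing the hypothetical `DriverCap` names from the earlier example (not the actual runsc code):

```go
// capsToCLIFlags converts a parsed capability set into per-capability
// flags for `nvidia-container-cli configure`, e.g. "--compute".
func capsToCLIFlags(caps map[DriverCap]bool) []string {
	var flags []string
	for _, c := range allCaps { // fixed iteration order for stable output
		if caps[c] {
			flags = append(flags, "--"+string(c))
		}
	}
	return flags
}
```

With the default `compute,utility` set, this yields `["--compute", "--utility"]`, matching the previously hardcoded behavior.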
copybara-service bot pushed a commit that referenced this issue Sep 9, 2024
This wraps all GPU tests' command lines with the nvproxy ioctl sniffer.

This has multiple functions:

- Verifying that the application does not call ioctls unsupported by
  nvproxy. This is controlled by an `AllowIncompatibleIoctl` option, which
  is initially set to `true` in all tests to mirror current behavior, but
  should be flipped as we verify that they do not call unsupported ioctls.
- Verifying that the sniffer itself works transparently for a wide range
  of applications.
- Later down the line, enforcing that the application only calls ioctls
  that are part of GPU capabilities that it has a need for. This is
  controlled by a capability string which is currently only used to set
  the `NVIDIA_DRIVER_CAPABILITIES` environment variable.

Updates issue #10856

PiperOrigin-RevId: 672714520
copybara-service bot pushed a commit that referenced this issue Sep 24, 2024
runsc attempts to emulate nvidia-container-runtime-hook. But it was always
passing "--compute --utility" as driver capability flags to
the `nvidia-container-cli configure` command.

Fix runsc to emulate nvidia-container-runtime-hook correctly by parsing
NVIDIA_DRIVER_CAPABILITIES and converting that comma-separated list to flags.

This is in preparation for adding support for non-compute GPU workloads in
nvproxy :)

Updates #9452
Updates #10856

PiperOrigin-RevId: 678064565