Releases: NVIDIA/k8s-device-plugin

v0.17.0

31 Oct 15:36
d475b2c

What's Changed

v0.17.0

  • Promote v0.17.0-rc.1 to GA

v0.17.0-rc.1

  • Add CAP_SYS_ADMIN if volume-mounts list strategy is included
  • Remove unneeded DEVICE_PLUGIN_MODE envvar
  • Fix applying SELinux label for MPS
  • Use a base image that aligns with the ubi-minimal base image
  • Switch to a ubi9-based base image
  • Remove namespace field from cluster-scoped resources
  • Generate labels for IMEX clique and domain
  • Add optional injection of the default IMEX channel
  • Allow kubelet-socket to be specified as command line argument

v0.17.0-rc.1

31 Oct 15:40
a2c760c
Pre-release

What's Changed

  • Add CAP_SYS_ADMIN if volume-mounts list strategy is included
  • Remove unneeded DEVICE_PLUGIN_MODE envvar
  • Fix applying SELinux label for MPS
  • Use a base image that aligns with the ubi-minimal base image
  • Switch to a ubi9-based base image
  • Remove namespace field from cluster-scoped resources
  • Generate labels for IMEX clique and domain
  • Add optional injection of the default IMEX channel
  • Allow kubelet-socket to be specified as command line argument

v0.16.2

08 Aug 11:02
42a0fa9

What's Changed

  • Fix applying SELinux label for MPS
  • Remove unneeded DEVICE_PLUGIN_MODE envvar
  • Add CAP_SYS_ADMIN if volume-mounts list strategy is included (fixes #856)

Full Changelog: v0.16.1...v0.16.2

v0.16.1

26 Jul 18:37
cb6e45e

Changelog

What's Changed

  • Bump nvidia-container-toolkit to v1.16.1 to fix a bug with CDI spec generation for MIG devices

Full Changelog: v0.16.0...v0.16.1

v0.16.0

16 Jul 13:29
d2eea55

Changelog

v0.16.0

  • Fixed logic of atomic writing of the feature file
  • Replaced WithDialer with WithContextDialer
  • Fixed SELinux context of MPS pipe directory.
  • Changed behavior for empty MIG devices to issue a warning instead of an error when the mixed strategy is selected
  • Added a GFD node label for the GPU mode.
  • Update CUDA base image version to 12.5.1
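
These notes do not say what was wrong with the atomic write of the feature file. For context, a common way to write a file atomically is to write to a temporary file in the same directory and then rename it over the target, so that readers see either the old or the new content and never a partial file. The sketch below shows that general pattern in Python. It is not the plugin's implementation (the plugin is written in Go); the function name is hypothetical.

```python
import os
import tempfile


def write_file_atomically(path: str, data: str) -> None:
    """Write data to path so that readers see either the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # otherwise the final rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # flush the content to disk before renaming
        os.replace(tmp_path, path)  # rename over the target in one step
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Creating the temporary file next to the target, rather than in the system temp directory, is what keeps the rename on a single filesystem.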

v0.16.0-rc.1

  • Skip container updates if only CDI is selected
  • Allow cdi hook path to be set
  • Add nvidiaDevRoot config option
  • Detect devRoot for driver installation
  • Changed the automatically created MPS /dev/shm to half of the total memory as obtained from /proc/meminfo
  • Remove redundant version log
  • Remove provenance information from image manifests
  • add ngc image signing job for auto signing
  • fix: target should be binaries
  • Allow device discovery strategy to be specified
  • Refactor cdi handler construction
  • Add addMigMonitorDevices field to nvidia-device-plugin.options helper
  • Fix allPossibleMigStrategiesAreNone helm chart helper
  • use the helm quote function to wrap boolean values in quotes
  • Fix usage of hasConfigMap
  • Make info, nvml, and device lib construction explicit
  • Clean up construction of WSL devices
  • Remove unused function
  • Don't require node-name to be set if not needed
  • Make vgpu failures non-fatal
  • Use HasTegraFiles over IsTegraSystem
  • Raise error for MPS when using MIG
  • Align container driver root envvars
  • Update github.com/NVIDIA/go-nvml to v0.12.0-6
  • Add unit test cases for sanitise func
  • Improving logic to sanitize GFD generated node labels
  • Add newline to pod logs
  • Adding vfio manager
  • Add prepare-release.sh script
  • Don't require node-name to be set if not needed
  • Remove GitLab pipeline .gitlab.yml
  • E2E test: fix object names
  • strip parentheses from the gpu product name
  • E2E test: instantiate a logger for helm outputs
  • E2E test: enhance logging via ginkgo/gomega
  • E2E test: remove e2elogs helper pkg
  • E2E test: Create HelmClient during Framework init
  • E2E test: Add -ginkgo.v flag to increase verbosity
  • E2E test: Create DiagnosticsCollector
  • Update vendoring
  • Replace go-nvlib/pkg/nvml with go-nvml/pkg/nvml
  • Add dependabot updates for release-0.15

Full Changelog: v0.15.0...v0.16.0
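
Several entries above concern sanitizing GFD-generated node labels, including stripping parentheses from the GPU product name. Kubernetes label values may contain only alphanumerics, "-", "_" and ".", must begin and end with an alphanumeric, and are limited to 63 characters. The sketch below shows one way such a cleanup could look in Python. It is an illustration under those rules only, not the logic GFD actually uses; the function name and the example product string are hypothetical.

```python
import re


def sanitize_label_value(value: str, max_length: int = 63) -> str:
    """Turn an arbitrary string into a valid Kubernetes label value (sketch)."""
    # Drop parentheses outright, keeping the text inside them.
    value = value.replace("(", "").replace(")", "")
    # Collapse runs of whitespace into single hyphens.
    value = re.sub(r"\s+", "-", value.strip())
    # Remove every character that is not allowed in a label value.
    value = re.sub(r"[^A-Za-z0-9._-]", "", value)
    # Truncate, then make sure the value starts and ends with an alphanumeric.
    value = value[:max_length]
    return value.strip("._-")
```

With this sketch, `sanitize_label_value("NVIDIA A100 (80GB)")` returns `"NVIDIA-A100-80GB"`.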

v0.15.1

25 Jun 12:07
682d9fa

Changelog

  • Fix inconsistent usage of hasConfigMap helm template. This addresses cases where certain resources (roles and service accounts) would be created even if they were not required.
  • Raise an error in GFD when MPS is used with MIG. This ensures that the behavior across GFD and the Device Plugin is consistent.
  • Remove provenance information from published images.
  • Use half of total memory for size of MPS tmpfs by default.
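
To make the last item concrete: the v0.16.0 notes state that the total memory is obtained from /proc/meminfo. A minimal Python sketch of that calculation follows. It assumes the `MemTotal` line is the value used and that the result is kept in KiB; the function name is hypothetical, and the plugin's own Go code may differ.

```python
def half_of_total_memory_kib(meminfo_text: str) -> int:
    """Return half of MemTotal (in KiB) from the text of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # A typical line looks like: "MemTotal:       16384256 kB"
            fields = line.split()
            return int(fields[1]) // 2
    raise ValueError("MemTotal not found in meminfo text")
```

On a Linux node, the function can be fed the contents of /proc/meminfo, for example `half_of_total_memory_kib(open("/proc/meminfo").read())`.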

v0.16.0-rc.1

18 Jun 15:02
0403911
Pre-release

Changelog

  • Add script to create release
  • Fix handling of device-discovery-strategy for GFD
  • Skip README updates for rc releases
  • Fix generate-changelog.sh script
  • Skip container updates if only CDI is selected
  • Allow cdi hook path to be set
  • Add nvidiaDevRoot config option
  • Detect devRoot for driver installation
  • Set /dev/shm size from /proc/meminfo
  • Remove redundant version log
  • Remove provenance information from image manifests
  • add ngc image signing job for auto signing
  • fix: target should be binaries
  • Allow device discovery strategy to be specified
  • Refactor cdi handler construction
  • Add addMigMonitorDevices field to nvidia-device-plugin.options helper
  • Fix allPossibleMigStrategiesAreNone helm chart helper
  • use the helm quote function to wrap boolean values in quotes
  • Fix usage of hasConfigMap
  • Make info, nvml, and device lib construction explicit
  • Clean up construction of WSL devices
  • Remove unused function
  • Don't require node-name to be set if not needed
  • Make vgpu failures non-fatal
  • Use HasTegraFiles over IsTegraSystem
  • Raise error for MPS when using MIG
  • Align container driver root envvars
  • Update github.com/NVIDIA/go-nvml to v0.12.0-6
  • Add unit test cases for sanitise func
  • Improving logic to sanitize GFD generated node labels
  • Add newline to pod logs
  • Adding vfio manager
  • Add prepare-release.sh script
  • Don't require node-name to be set if not needed
  • Remove GitLab pipeline .gitlab.yml
  • E2E test: fix object names
  • strip parentheses from the gpu product name
  • E2E test: instantiate a logger for helm outputs
  • E2E test: enhance logging via ginkgo/gomega
  • E2E test: remove e2elogs helper pkg
  • E2E test: Create HelmClient during Framework init
  • E2E test: Add -ginkgo.v flag to increase verbosity
  • E2E test: Create DiagnosticsCollector
  • Update vendoring
  • Replace go-nvlib/pkg/nvml with go-nvml/pkg/nvml
  • Add dependabot updates for release-0.15

v0.15.0

17 Apr 12:22
435bfb7

The NVIDIA GPU Device Plugin v0.15.0 release includes the following major changes:

Consolidated the NVIDIA GPU Device Plugin and NVIDIA GPU Feature Discovery repositories

Since the NVIDIA GPU Device Plugin and GPU Feature Discovery (GFD) components are often used together, we have consolidated the repositories. The primary goal was to streamline the development and release process; functionality remains unchanged. The user-facing changes are as follows:

  • The two components will use the same version, meaning that the GFD version jumps from v0.8.2 to v0.15.0.
  • The two components use the same container image, meaning that nvcr.io/nvidia/k8s-device-plugin is to be used instead of nvcr.io/nvidia/gpu-feature-discovery. Note that this may mean that the gpu-feature-discovery command needs to be explicitly specified.

In order to facilitate the transition for users that rely on a standalone GFD deployment, this release includes a gpu-feature-discovery helm chart in the device plugin helm repository.

Added experimental support for GPU partitioning using MPS.

This release of the NVIDIA GPU Device Plugin includes experimental support for GPU sharing using CUDA MPS. Feedback on this feature is appreciated.

This functionality is not production ready and has a number of known issues, including:

  • The device plugin may show as started before it is ready to allocate shared GPUs while waiting for the CUDA MPS control daemon to come online.
  • There is no synchronization between the CUDA MPS control daemon and the GPU Device Plugin under restarts or configuration changes. This means that workloads may crash if they lose access to shared resources controlled by the CUDA MPS control daemon.
  • MPS is only supported for full GPUs.
  • It is not possible to "combine" MPS GPU requests to allow for access to more memory by a single container.

Deprecation Notice

The following table shows a set of new CUDA driver and runtime version labels and their existing equivalents. The existing labels should be considered deprecated and will be removed in a future release.

New Label                                  Deprecated Label
nvidia.com/cuda.driver-version.major       nvidia.com/cuda.driver.major
nvidia.com/cuda.driver-version.minor       nvidia.com/cuda.driver.minor
nvidia.com/cuda.driver-version.revision    nvidia.com/cuda.driver.rev
nvidia.com/cuda.driver-version.full        (none)
nvidia.com/cuda.runtime-version.major      nvidia.com/cuda.runtime.major
nvidia.com/cuda.runtime-version.minor      nvidia.com/cuda.runtime.minor
nvidia.com/cuda.runtime-version.full       (none)
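
For anyone who selects nodes on the deprecated labels, the mapping in this deprecation notice can be applied mechanically. The Python sketch below rewrites the keys of a node selector; the label names come from the notice, while the function name and the example selector values are hypothetical.

```python
# Deprecated label -> new label, taken from the deprecation notice.
DEPRECATED_TO_NEW = {
    "nvidia.com/cuda.driver.major": "nvidia.com/cuda.driver-version.major",
    "nvidia.com/cuda.driver.minor": "nvidia.com/cuda.driver-version.minor",
    "nvidia.com/cuda.driver.rev": "nvidia.com/cuda.driver-version.revision",
    "nvidia.com/cuda.runtime.major": "nvidia.com/cuda.runtime-version.major",
    "nvidia.com/cuda.runtime.minor": "nvidia.com/cuda.runtime-version.minor",
}


def migrate_node_selector(selector: dict[str, str]) -> dict[str, str]:
    """Return a copy of selector with deprecated label keys replaced."""
    return {DEPRECATED_TO_NEW.get(key, key): value for key, value in selector.items()}
```

Keys that are not in the mapping, such as `kubernetes.io/os`, pass through unchanged.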

Full Changelog: v0.14.0...v0.15.0

Changes since v0.15.0-rc.2

  • Moved the nvidia-device-plugin.yml static deployment from the root of the repository to deployments/static/nvidia-device-plugin.yml.
  • Simplify PCI device classes in NFD worker configuration.
  • Update CUDA base image version to 12.4.1.
  • Switch to an Ubuntu 22.04-based CUDA image for the default image.
  • Add new CUDA driver and runtime version labels to align with other NFD version labels.
  • Update NFD dependency to v0.15.3.

v0.15.0-rc.2

  • Bump CUDA base image version to 12.3.2
  • Add cdi-cri device list strategy. This uses the CDIDevices CRI field to request CDI devices instead of annotations.
  • Set MPS memory limit by device index and not device UUID. This is a workaround for an issue where these limits are not applied for devices if set by UUID.
  • Update MPS sharing to disallow requests for multiple devices if MPS sharing is configured.
  • Set mps device memory limit by index.
  • Explicitly set sharing.mps.failRequestsGreaterThanOne = true.
  • Run tail -f for each MPS daemon to output logs.
  • Enforce replica limits for MPS sharing.

v0.15.0-rc.1

  • Import GPU Feature Discovery into the GPU Device Plugin repo. This means that the same version and container image are used for both components.
  • Add tooling to create a kind cluster for local development and testing.
  • Update go-gpuallocator dependency to migrate away from the deprecated gpu-monitoring-tools NVML bindings.
  • Remove legacyDaemonsetAPI config option. This was only required for k8s versions < 1.16.
  • Add support for MPS sharing.
  • Bump CUDA base image version to 12.3.1

v0.15.0-rc.2

18 Mar 11:48

What's Changed

  • Bump CUDA base image version to 12.3.2
  • Add cdi-cri device list strategy. This uses the CDIDevices CRI field to request CDI devices instead of annotations.
  • Set MPS memory limit by device index and not device UUID. This is a workaround for an issue where these limits are not applied for devices if set by UUID.
  • Update MPS sharing to disallow requests for multiple devices if MPS sharing is configured.
  • Set mps device memory limit by index.
  • Explicitly set sharing.mps.failRequestsGreaterThanOne = true.
  • Run tail -f for each MPS daemon to output logs.
  • Enforce replica limits for MPS sharing.

v0.14.5

29 Feb 10:23
3d549fb

What's Changed

  • Update the nvidia-container-toolkit go dependency. This fixes a bug in CDI spec generation on systems where lib -> usr/lib symlinks exist.
  • Update the CUDA base images to 12.3.2

Full Changelog: v0.14.4...v0.14.5