-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add API for CDI --devices flag in Docker and Podman for mapping GPUs #3290
base: main
Are you sure you want to change the base?
Changes from 3 commits
0486254
5ff36e4
830a0dd
0d20235
6e4314d
b125982
d2a13ba
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -212,6 +212,13 @@ func runArgsForNode(node *config.Node, clusterIPFamily config.ClusterIPFamily, n | |
args..., | ||
) | ||
|
||
// Append CDI device args (used for GPU support) | ||
if len(node.Devices) > 0 { | ||
for _, device := range node.Devices { | ||
args = append(args, "--device", strings.TrimSpace(device)) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Instead of having to remember to call |
||
} | ||
} | ||
|
||
// convert mounts and port mappings to container run args | ||
args = append(args, generateMountBindings(node.ExtraMounts...)...) | ||
mappingArgs, err := generatePortMappings(clusterIPFamily, node.ExtraPortMappings...) | ||
|
@@ -302,7 +309,6 @@ type podmanNetworks []struct { | |
func getSubnets(networkName string) ([]string, error) { | ||
cmd := exec.Command("podman", "network", "inspect", networkName) | ||
out, err := exec.Output(cmd) | ||
|
||
if err != nil { | ||
return nil, errors.Wrap(err, "failed to get subnets") | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -114,6 +114,10 @@ func (n *Node) Validate() error { | |
errs = append(errs, errors.New("image is a required field")) | ||
} | ||
|
||
if err := validateDevices(n.Devices); err != nil { | ||
errs = append(errs, errors.Wrapf(err, "invalid devices")) | ||
} | ||
|
||
// validate extra port forwards | ||
for _, mapping := range n.ExtraPortMappings { | ||
if err := validatePort(mapping.HostPort); err != nil { | ||
|
@@ -192,6 +196,17 @@ func validatePortMappings(portMappings []PortMapping) error { | |
return nil | ||
} | ||
|
||
func validateDevices(devices []string) error { | ||
for _, device := range devices { | ||
device := strings.TrimSpace(device) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You should be able to use It should also be fine to allow There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'll give that a try. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Please update to what Evan suggests to reduce the dependencies pulled in. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @BenTheElder What do you think? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @klueska I am using parser.ParseQualifiedName(device) instead of parser.ParseQualifiedName(device). The results are the same except you get the additional error message about why your device string is invalid. I think this is preferable. ParseQualifiedName() is called in IsQualifiedName(). This doesn't change the dependencies at all. There are tests on github.com/container-orchestrated-devices/container-device-interface/pkg/parser pulling in the additional packages. Maybe we should put the tests in a different package? @elezar wdyt? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I can update the tests in that package to use a different import for comparison or to be in the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't personally have a preference - I haven't found any with fewer dependencies. I was thinking use the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I have created cncf-tags/container-device-interface#149 with an update to the test package. This should prevent There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry, a few things came up back in the real world and here with #3277 etc., the dependency discussion wound up forked in https://github.com/kubernetes-sigs/kind/pull/3290/files#r1304873501 I'm going to try to get #3335 after which when this PR is rebased and See linked comment for stance on deps. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Currently for example we have a lot of CRI inspired types with in-tree dependency-free code instead, to avoid creating dep hell for our users. That may not be so reasonable here. |
||
// validate device string is not empty | ||
if len(device) == 0 { | ||
return errors.Errorf("invalid device string: '%v'. Empty Strings not allowed", device) | ||
} | ||
} | ||
return nil | ||
} | ||
|
||
func validatePort(port int32) error { | ||
// NOTE: -1 is a special value for auto-selecting the port in the container | ||
// backend where possible as opposed to in kind itself. | ||
|
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,7 +8,7 @@ menu: | |
toc: true | ||
description: |- | ||
This guide covers how to configure KIND cluster creation. | ||
|
||
We know this is currently a bit lacking and will expand it over time - PRs welcome! | ||
--- | ||
## Getting Started | ||
|
@@ -281,10 +281,71 @@ nodes: | |
image: kindest/node:v1.16.4@sha256:b91a2c2317a000f3a783489dfb755064177dbc3a0b2f4147d50f04825d016f55 | ||
{{< /codeFromInline >}} | ||
|
||
[Reference](https://kind.sigs.k8s.io/docs/user/quick-start/#creating-a-cluster) | ||
[Reference](https://kind.sigs.k8s.io/docs/user/quick-start/#creating-a-cluster) | ||
|
||
**Note**: Kubernetes versions are expressed as x.y.z, where x is the major version, y is the minor version, and z is the patch version, following [Semantic Versioning](https://semver.org/) terminology. For more information, see [Kubernetes Release Versioning.](https://github.com/kubernetes/sig-release/blob/master/release-engineering/versioning.md#kubernetes-release-versioning) | ||
|
||
### GPU Support | ||
|
||
There are two ways to map GPUs in to a KinD cluster. The first is using the `devices` API and the second is using the `extraMounts` API. | ||
|
||
#### Using the Devices API | ||
|
||
As a pre-requisite you install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) installed on the host. | ||
|
||
Using `devices` for GPU support requires Docker v25 or later. See notes on CDI Container Support [here.](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#container-device-interface-cdi-support) | ||
lukeogg marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
GPU devices can be mapped to Kind node copntainers with the devices API: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Question: Is it There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
|
||
All GPUs mapped to a single control-plane: | ||
|
||
{{< codeFromInline lang="yaml" >}} | ||
kind: Cluster | ||
apiVersion: kind.x-k8s.io/v1alpha4 | ||
nodes: | ||
- role: control-plane | ||
devices: | ||
- "nvidia.com/gpu=all" | ||
{{< /codeFromInline >}} | ||
|
||
Specific GPUs mapped to specific worker nodes based on index: | ||
|
||
{{< codeFromInline lang="yaml" >}} | ||
kind: Cluster | ||
apiVersion: kind.x-k8s.io/v1alpha4 | ||
nodes: | ||
- role: control-plane | ||
- role: worker | ||
devices: | ||
- "nvidia.com/gpu=0" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Note that this will not give you access to just GPU 0, but rather all GPUs on the system. This is because --privileged is passed to the docker call that creates the node, thus giving it access to all devices under /dev (including all GPU devices). Individual GPU injection will only be supported once if/when kind is able to start nodes without --privileged. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok, this reduces the value of adding an API for CDI devices in my opinion. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To be clear, this is the same for all other mechanisms too. The There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This isn't true. The use of the device mounts uses our "legacy" stack which has a number of shortcomings that using CDI addresses. It also means that users can access devices from other vendors such as Intel or their own device / resource definitions. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I don't supposed we can just trigger driver library injection directly in a simple way and reduce the API surface to "yes I'd like the GPU drivers"? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I spent today putting this together: It should be a good starting point for using GPUs with kind independent of this PR There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've also pinged the docker maintainers that helped us push CDI through to see what their thoughts are here: There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Was pointed to this on that thread: There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If moby/moby#47663 becomes available we'll have to version detect to use it to avoid breaking all of the users on older docker installs, and I'll be pretty wary of the branching codepath behaving differently vs --privileged. We have only once required a newer docker version (when we started to require cgroupns=private be available) which caused a lot of consternation even though that change was aimed at mitigating the runc misc issues that caused kind to very flaky/broken ... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It would've been nice when kind was being built, but now we have to consider that people undoubtedly have test environments that depend on being in a --privileged node :/ |
||
- role: worker | ||
devices: | ||
- "nvidia.com/gpu=1" | ||
{{< /codeFromInline >}} | ||
|
||
#### Using the Extra Mounts API | ||
|
||
GPUs can also be mapped using the `extraMounts` API. This method passes a list of GPUs to inject as volume mounts rather than the environment variable `NVIDIA_VISIBLE_DEVICES`. | ||
|
||
Steps to enable this: | ||
|
||
1. Add nvidia as your default runtime in /etc/docker/daemon.json | ||
lukeogg marked this conversation as resolved.
Show resolved
Hide resolved
|
||
1. Restart docker (as necessary) | ||
1. Set `accept-nvidia-visible-devices-as-volume-mounts = true` in `/etc/nvidia-container-runtime/config.toml` | ||
1. Add the `extraMounts` to any kind nodes you want to have access to all GPUs in the system: | ||
|
||
{{< codeFromInline lang="yaml" >}} | ||
kind: Cluster | ||
apiVersion: kind.x-k8s.io/v1alpha4 | ||
nodes: | ||
- role: control-plane | ||
extraMounts: | ||
- hostPath: /dev/null | ||
containerPath: /var/run/nvidia-container-devices/all | ||
{{< /codeFromInline >}} | ||
|
||
Note: this method only support adding `all` GPUs to a single node. If you want to add specific GPUs to specific nodes, you will need to use the `devices` API. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't know whether this is strictly true. Using cc @klueska There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As mentioned in #3257 (comment), we can't achieve any finer granularity than There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Doing some testing today and this does allow access to all GPUs. I'll update the documentation. It is fairly convenient with Kind, however, as you don't need to inject drivers or install them on the node images. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This limitation of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Same here. |
||
|
||
### Extra Mounts | ||
|
||
Extra mounts can be used to pass through storage on the host to a kind node | ||
|
@@ -300,10 +361,10 @@ For more information see the [Docker file sharing guide.](https://docs.docker.co | |
|
||
### Extra Port Mappings | ||
|
||
Extra port mappings can be used to port forward to the kind nodes. This is a | ||
cross-platform option to get traffic into your kind cluster. | ||
Extra port mappings can be used to port forward to the kind nodes. This is a | ||
cross-platform option to get traffic into your kind cluster. | ||
|
||
If you are running Docker without the Docker Desktop Application on Linux, you can simply send traffic to the node IPs from the host without extra port mappings. | ||
If you are running Docker without the Docker Desktop Application on Linux, you can simply send traffic to the node IPs from the host without extra port mappings. | ||
With the installation of the Docker Desktop Application, whether it is on macOs, Windows or Linux, you'll want to use these. | ||
|
||
You may also want to see the [Ingress Guide]. | ||
|
@@ -401,11 +462,11 @@ nodes: | |
|
||
### Kubeadm Config Patches | ||
|
||
KIND uses [`kubeadm`](/docs/design/principles/#leverage-existing-tooling) | ||
KIND uses [`kubeadm`](/docs/design/principles/#leverage-existing-tooling) | ||
to configure cluster nodes. | ||
|
||
Formally KIND runs `kubeadm init` on the first control-plane node, we can customize the flags by using the kubeadm | ||
[InitConfiguration](https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm-init/#config-file) | ||
[InitConfiguration](https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm-init/#config-file) | ||
([spec](https://godoc.org/k8s.io/kubernetes/cmd/kubeadm/app/apis/kubeadm/v1beta3#InitConfiguration)) | ||
|
||
{{< codeFromInline lang="yaml" >}} | ||
|
@@ -436,9 +497,9 @@ nodes: | |
enable-admission-plugins: NodeRestriction,MutatingAdmissionWebhook,ValidatingAdmissionWebhook | ||
{{< /codeFromInline >}} | ||
|
||
On every additional node configured in the KIND cluster, | ||
On every additional node configured in the KIND cluster, | ||
worker or control-plane (in HA mode), | ||
KIND runs `kubeadm join` which can be configured using the | ||
KIND runs `kubeadm join` which can be configured using the | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The docs consistently use "KIND" or "kind" but not "KinD". We're not using the acronym anymore as it may be podman or in the future something else. Generally just There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We should put a note in https://kind.sigs.k8s.io/docs/contributing/development/#documentation |
||
[JoinConfiguration](https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm-join/#config-file) | ||
([spec](https://godoc.org/k8s.io/kubernetes/cmd/kubeadm/app/apis/kubeadm/v1beta3#JoinConfiguration)) | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use
docker info
we can also extract information as to whetherexperimental
is enabled.Does changing this function to
assertCDISupported()
or something similar make sense here? We could then check the combination of server version and experimental to see whether it is supported.