Add API for CDI --devices flag in Docker and Podman for mapping GPUs #3290

Open · wants to merge 7 commits into base: main

Changes from 3 commits
4 changes: 4 additions & 0 deletions pkg/apis/config/v1alpha4/types.go
@@ -118,6 +118,10 @@ type Node struct {
// binded to a host Port
ExtraPortMappings []PortMapping `yaml:"extraPortMappings,omitempty" json:"extraPortMappings,omitempty"`

// Devices allows access to GPUs through CDI using the --device flag, with CDI support added in Docker v25.
// https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#container-device-interface-cdi-support
Devices []string `yaml:"devices,omitempty" json:"devices,omitempty"`

// KubeadmConfigPatches are applied to the generated kubeadm config as
// merge patches. The `kind` field must match the target object, and
// if `apiVersion` is specified it will only be applied to matching objects.
5 changes: 5 additions & 0 deletions pkg/apis/config/v1alpha4/zz_generated.deepcopy.go

Some generated files are not rendered by default.

14 changes: 14 additions & 0 deletions pkg/cluster/internal/providers/docker/provision.go
@@ -255,6 +255,20 @@ func runArgsForNode(node *config.Node, clusterIPFamily config.ClusterIPFamily, n
args = append(args, "-e", "KUBECONFIG=/etc/kubernetes/admin.conf")
}

// Append CDI device args (used for GPU support)
if len(node.Devices) > 0 {
// CDI support for --device requires Docker >= v25; compare the major version numerically
ver := Version()
if major, convErr := strconv.Atoi(strings.Split(ver, ".")[0]); convErr != nil || major < 25 {
return nil, errors.Errorf("using devices api in kind requires Docker >= v25, but found %q", ver)
}

// Append args for each device
for _, device := range node.Devices {
args = append(args, "--device", strings.TrimSpace(device))
}
}

// finally, specify the image to run
return append(args, node.Image), nil
}
10 changes: 10 additions & 0 deletions pkg/cluster/internal/providers/docker/util.go
@@ -33,6 +33,16 @@ func IsAvailable() bool {
return strings.HasPrefix(lines[0], "Docker version")
}

// Version gets the version of docker available on the system
func Version() string {
cmd := exec.Command("docker", "version", "--format", "'{{.Server.Version}}'")
Review comment:

If we use docker info we can also extract information as to whether experimental is enabled.

Does changing this function to assertCDISupported() or something similar make sense here? We could then check the combination of server version and experimental to see whether it is supported.
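
A rough sketch of what such a helper could look like, assuming `docker info` exposes `.ServerVersion` and `.ExperimentalBuild` and that CDI stays gated behind the experimental daemon setting (the field names and gating are assumptions, not part of this PR):

func assertCDISupported() error {
	// Query the daemon once for both the server version and the experimental flag.
	cmd := exec.Command("docker", "info", "--format", "'{{.ServerVersion}} {{.ExperimentalBuild}}'")
	lines, err := exec.OutputLines(cmd)
	if err != nil || len(lines) != 1 {
		return errors.New("failed to query docker info")
	}
	fields := strings.Fields(strings.Trim(lines[0], "'"))
	if len(fields) != 2 {
		return errors.Errorf("unexpected docker info output: %q", lines[0])
	}
	// Compare the major version numerically rather than as a string.
	major, convErr := strconv.Atoi(strings.Split(fields[0], ".")[0])
	if convErr != nil || major < 25 {
		return errors.Errorf("using CDI devices in kind requires Docker >= v25, but found %q", fields[0])
	}
	if fields[1] != "true" {
		return errors.New("using CDI devices requires the Docker daemon's experimental features to be enabled")
	}
	return nil
}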

lines, err := exec.OutputLines(cmd)
if err != nil || len(lines) != 1 {
return ""
}
return strings.Trim(lines[0], "'")
}

// usernsRemap checks if userns-remap is enabled in dockerd
func usernsRemap() bool {
cmd := exec.Command("docker", "info", "--format", "'{{json .SecurityOptions}}'")
8 changes: 7 additions & 1 deletion pkg/cluster/internal/providers/podman/provision.go
@@ -212,6 +212,13 @@ func runArgsForNode(node *config.Node, clusterIPFamily config.ClusterIPFamily, n
args...,
)

// Append CDI device args (used for GPU support)
if len(node.Devices) > 0 {
for _, device := range node.Devices {
args = append(args, "--device", strings.TrimSpace(device))
Review comment:

Instead of having to remember to call strings.TrimSpace wherever the elements of node.CDIDevices are used, can we do that somewhere once?
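
One way to handle this in a single place would be to normalize the strings during config conversion instead of at each call site; a minimal sketch (not part of this diff, assuming convertv1alpha4Node keeps its current shape):

	// Normalize device strings once while converting the public config,
	// so providers can use node.Devices directly without re-trimming.
	for i := range in.Devices {
		out.Devices[i] = strings.TrimSpace(in.Devices[i])
	}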

}
}

// convert mounts and port mappings to container run args
args = append(args, generateMountBindings(node.ExtraMounts...)...)
mappingArgs, err := generatePortMappings(clusterIPFamily, node.ExtraPortMappings...)
@@ -302,7 +309,6 @@ type podmanNetworks []struct {
func getSubnets(networkName string) ([]string, error) {
cmd := exec.Command("podman", "network", "inspect", networkName)
out, err := exec.Output(cmd)

if err != nil {
return nil, errors.Wrap(err, "failed to get subnets")
}
5 changes: 5 additions & 0 deletions pkg/internal/apis/config/convert_v1alpha4.go
@@ -56,6 +56,7 @@ func convertv1alpha4Node(in *v1alpha4.Node, out *Node) {
out.ExtraMounts = make([]Mount, len(in.ExtraMounts))
out.ExtraPortMappings = make([]PortMapping, len(in.ExtraPortMappings))
out.KubeadmConfigPatchesJSON6902 = make([]PatchJSON6902, len(in.KubeadmConfigPatchesJSON6902))
out.Devices = make([]string, len(in.Devices))

for i := range in.ExtraMounts {
convertv1alpha4Mount(&in.ExtraMounts[i], &out.ExtraMounts[i])
@@ -68,6 +69,10 @@ func convertv1alpha4Node(in *v1alpha4.Node, out *Node) {
for i := range in.KubeadmConfigPatchesJSON6902 {
convertv1alpha4PatchJSON6902(&in.KubeadmConfigPatchesJSON6902[i], &out.KubeadmConfigPatchesJSON6902[i])
}

for i := range in.Devices {
out.Devices[i] = in.Devices[i]
}
}

func convertv1alpha4PatchJSON6902(in *v1alpha4.PatchJSON6902, out *PatchJSON6902) {
4 changes: 4 additions & 0 deletions pkg/internal/apis/config/types.go
@@ -98,6 +98,10 @@ type Node struct {
// binded to a host Port
ExtraPortMappings []PortMapping

// Devices allows access to GPUs through CDI using the --device flag, with CDI support added in Docker v25.
// https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#container-device-interface-cdi-support
Devices []string

// KubeadmConfigPatches are applied to the generated kubeadm config as
// strategic merge patches to `kustomize build` internally
// https://github.com/kubernetes/community/blob/a9cf5c8f3380bb52ebe57b1e2dbdec136d8dd484/contributors/devel/sig-api-machinery/strategic-merge-patch.md
15 changes: 15 additions & 0 deletions pkg/internal/apis/config/validate.go
@@ -114,6 +114,10 @@ func (n *Node) Validate() error {
errs = append(errs, errors.New("image is a required field"))
}

if err := validateDevices(n.Devices); err != nil {
errs = append(errs, errors.Wrapf(err, "invalid devices"))
}

// validate extra port forwards
for _, mapping := range n.ExtraPortMappings {
if err := validatePort(mapping.HostPort); err != nil {
@@ -192,6 +196,17 @@ func validatePortMappings(portMappings []PortMapping) error {
return nil
}

func validateDevices(devices []string) error {
for _, device := range devices {
device := strings.TrimSpace(device)
Review comment:

You should be able to use "github.com/container-orchestrated-devices/container-device-interface/pkg/parser" and call parser.IsQualifiedName(device) here. This is used in docker/cli#4084 and significantly reduces the dependencies pulled in.

It should also be fine to allow podman and / or docker to perform this validation though.
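
A sketch of what validateDevices could look like on top of that package, assuming the parser.ParseQualifiedName(device) signature (vendor, class, device name, error); this is not what the current commit does:

import "github.com/container-orchestrated-devices/container-device-interface/pkg/parser"

func validateDevices(devices []string) error {
	for _, device := range devices {
		// ParseQualifiedName reports why a string is malformed, which gives a
		// friendlier error than a plain boolean IsQualifiedName check.
		if _, _, _, err := parser.ParseQualifiedName(strings.TrimSpace(device)); err != nil {
			return errors.Wrapf(err, "invalid device %q", device)
		}
	}
	return nil
}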

Review comment (Author):

I'll give that a try.

Review comment:

Please update to what Evan suggests to reduce the dependencies pulled in.

Review comment (Author):

@BenTheElder What do you think?

Review comment (Author):

@klueska I am using parser.ParseQualifiedName(device) instead of parser.IsQualifiedName(device). The results are the same, except you get the additional error message about why your device string is invalid. I think this is preferable. ParseQualifiedName() is called in IsQualifiedName().

This doesn't change the dependencies at all. There are tests on github.com/container-orchestrated-devices/container-device-interface/pkg/parser that pull in the additional packages. Maybe we should put the tests in a different package? @elezar wdyt?

Review comment:

I can update the tests in that package to use a different import for comparison, or to be in the parser_test package if required. As a matter of interest, which package equivalent to require would be preferred?

Review comment (Author):

I don't personally have a preference - I haven't found any with fewer dependencies. I was thinking we could use the parser_test package to keep things very lightweight.

Review comment:

I have created cncf-tags/container-device-interface#149 with an update to the test package. This should prevent github.com/stretchr/testify/require from getting pulled in.

Review comment (Member):

Sorry, a few things came up back in the real world and here with #3277 etc., and the dependency discussion wound up forked in https://github.com/kubernetes-sigs/kind/pull/3290/files#r1304873501

I'm going to try to get #3335 in first; after this PR is rebased and `go mod tidy` is run, it should be clearer what we're actually talking about for the new dependencies.

See linked comment for stance on deps.

Review comment (Member):

Currently, for example, we have a lot of CRI-inspired types with in-tree, dependency-free code instead, to avoid creating dependency hell for our users. That may not be so reasonable here.
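
If the project prefers to stay dependency-free here, a rough in-tree approximation of a CDI qualified-name check could look like the following; the regular expression is an illustrative assumption and is looser than the actual CDI grammar:

var qualifiedDeviceRE = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9.-]*\.[a-zA-Z]+/[a-zA-Z0-9][a-zA-Z0-9_.-]*=[a-zA-Z0-9_.:-]+$`)

func validateDevices(devices []string) error {
	for _, device := range devices {
		device := strings.TrimSpace(device)
		// Accepts strings shaped like "vendor.com/class=name", e.g. "nvidia.com/gpu=all".
		if !qualifiedDeviceRE.MatchString(device) {
			return errors.Errorf("invalid device string %q: expected a CDI qualified name like vendor.com/class=name", device)
		}
	}
	return nil
}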

// validate device string is not empty
if len(device) == 0 {
return errors.Errorf("invalid device string: '%v'. Empty Strings not allowed", device)
}
}
return nil
}

func validatePort(port int32) error {
// NOTE: -1 is a special value for auto-selecting the port in the container
// backend where possible as opposed to in kind itself.
29 changes: 24 additions & 5 deletions pkg/internal/apis/config/validate_test.go
@@ -18,9 +18,10 @@ package config

import (
"fmt"
"sigs.k8s.io/kind/pkg/internal/assert"
"testing"

"sigs.k8s.io/kind/pkg/internal/assert"

"sigs.k8s.io/kind/pkg/errors"
)

@@ -251,7 +252,7 @@ func TestClusterValidate(t *testing.T) {
}

for _, tc := range cases {
tc := tc //capture loop variable
tc := tc // capture loop variable
t.Run(tc.Name, func(t *testing.T) {
t.Parallel()
err := tc.Cluster.Validate()
@@ -343,6 +344,24 @@ func TestNodeValidate(t *testing.T) {
}(),
ExpectErrors: 1,
},
{
TestName: "Empty Devices",
Node: func() Node {
cfg := newDefaultedNode(ControlPlaneRole)
cfg.Devices = []string{" ", ""}
return cfg
}(),
ExpectErrors: 1,
},
{
TestName: "Valid Devices",
Node: func() Node {
cfg := newDefaultedNode(ControlPlaneRole)
cfg.Devices = []string{"vendor1.com/device=test", "nvidia.com/gpu=1", "nvidia.com/gpu=all", "vendor.com/foo=1", "foo.bar.baz/foo-bar123.B_az=all"}
return cfg
}(),
ExpectErrors: 0,
},
{
TestName: "Invalid HostPort",
Node: func() Node {
@@ -360,7 +379,7 @@
}

for _, tc := range cases {
tc := tc //capture loop variable
tc := tc // capture loop variable
t.Run(tc.TestName, func(t *testing.T) {
t.Parallel()
err := tc.Node.Validate()
@@ -414,7 +433,7 @@ func TestPortValidate(t *testing.T) {
}

for _, tc := range cases {
tc := tc //capture loop variable
tc := tc // capture loop variable
t.Run(tc.TestName, func(t *testing.T) {
t.Parallel()
err := validatePort(tc.Port)
@@ -537,7 +556,7 @@ func TestValidatePortMappings(t *testing.T) {
}

for _, tc := range cases {
tc := tc //capture loop variable
tc := tc // capture loop variable
t.Run(tc.testName, func(t *testing.T) {
t.Parallel()

5 changes: 5 additions & 0 deletions pkg/internal/apis/config/zz_generated.deepcopy.go

Some generated files are not rendered by default.

79 changes: 70 additions & 9 deletions site/content/docs/user/configuration.md
@@ -8,7 +8,7 @@ menu:
toc: true
description: |-
This guide covers how to configure KIND cluster creation.

We know this is currently a bit lacking and will expand it over time - PRs welcome!
---
## Getting Started
@@ -281,10 +281,71 @@ nodes:
image: kindest/node:v1.16.4@sha256:b91a2c2317a000f3a783489dfb755064177dbc3a0b2f4147d50f04825d016f55
{{< /codeFromInline >}}

[Reference](https://kind.sigs.k8s.io/docs/user/quick-start/#creating-a-cluster)
[Reference](https://kind.sigs.k8s.io/docs/user/quick-start/#creating-a-cluster)

**Note**: Kubernetes versions are expressed as x.y.z, where x is the major version, y is the minor version, and z is the patch version, following [Semantic Versioning](https://semver.org/) terminology. For more information, see [Kubernetes Release Versioning.](https://github.com/kubernetes/sig-release/blob/master/release-engineering/versioning.md#kubernetes-release-versioning)

### GPU Support

There are two ways to map GPUs into a kind cluster. The first is using the `devices` API and the second is using the `extraMounts` API.

#### Using the Devices API

As a prerequisite, you must have the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) installed on the host.

Using `devices` for GPU support requires Docker v25 or later. See the notes on Container Device Interface (CDI) support [here.](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#container-device-interface-cdi-support)

GPU devices can be mapped into kind node containers with the `devices` API:
Review comment:

Question: Is it Kind or KinD?

Review comment (Member):

KIND or kind. #3290 (comment)


All GPUs mapped to a single control-plane:

{{< codeFromInline lang="yaml" >}}
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
devices:
- "nvidia.com/gpu=all"
{{< /codeFromInline >}}

Specific GPUs mapped to specific worker nodes based on index:

{{< codeFromInline lang="yaml" >}}
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
devices:
- "nvidia.com/gpu=0"
Review comment:

Note that this will not give you access to just GPU 0, but rather all GPUs on the system.

This is because --privileged is passed to the docker call that creates the node, thus giving it access to all devices under /dev (including all GPU devices).

Individual GPU injection will only be supported if/when kind is able to start nodes without --privileged.

Review comment (Author):

Ok, this reduces the value of adding an API for CDI devices in my opinion.

Review comment:

To be clear, this is the same for all other mechanisms too. The --privileged flag will mean that any container will have access to all devices and the only value add is that the driver libraries are also injected.

Review comment:

Ok, this reduces the value of adding an API for CDI devices in my opinion.

This isn't true. The use of the device mounts uses our "legacy" stack which has a number of shortcomings that using CDI addresses. It also means that users can access devices from other vendors such as Intel or their own device / resource definitions.

Review comment (Member):

the only value add is that the driver libraries are also injected.

I don't suppose we can just trigger driver library injection directly in a simple way and reduce the API surface to "yes I'd like the GPU drivers"?

Review comment:

I spent today putting this together:
https://github.com/klueska/kind-with-gpus-examples

It should be a good starting point for using GPUs with kind independent of this PR

Review comment:

I've also pinged the docker maintainers that helped us push CDI through to see what their thoughts are here:
https://dockercommunity.slack.com/archives/C04MZQZJE94/p1712086101448089

Review comment:

Was pointed to this on that thread:
moby/moby#39702 (comment)

Review comment (Member):

If moby/moby#47663 becomes available we'll have to version detect to use it to avoid breaking all of the users on older docker installs, and I'll be pretty wary of the branching codepath behaving differently vs --privileged.

We have only once required a newer docker version (when we started to require cgroupns=private be available), which caused a lot of consternation even though that change was aimed at mitigating the runc misc issues that caused kind to be very flaky/broken ...

Review comment (Member):

It would've been nice when kind was being built, but now we have to consider that people undoubtedly have test environments that depend on being in a --privileged node :/

- role: worker
devices:
- "nvidia.com/gpu=1"
{{< /codeFromInline >}}

#### Using the Extra Mounts API

GPUs can also be mapped using the `extraMounts` API. This method passes the list of GPUs to inject as volume mounts rather than via the `NVIDIA_VISIBLE_DEVICES` environment variable.

Steps to enable this:

1. Add nvidia as your default runtime in /etc/docker/daemon.json
1. Restart docker (as necessary)
1. Set `accept-nvidia-visible-devices-as-volume-mounts = true` in `/etc/nvidia-container-runtime/config.toml`
1. Add the `extraMounts` to any kind nodes you want to have access to all GPUs in the system:

{{< codeFromInline lang="yaml" >}}
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
extraMounts:
- hostPath: /dev/null
containerPath: /var/run/nvidia-container-devices/all
{{< /codeFromInline >}}

Note: this method only supports adding `all` GPUs to a single node. If you want to add specific GPUs to specific nodes, you will need to use the `devices` API.
Review comment:

I don't know whether this is strictly true. Using /var/run/nvidia-container-devices/0 or /var/run/nvidia-container-devices/{{DEVICE_UUID}} should allow individual devices to be made available.

cc @klueska

Review comment:

As mentioned in #3257 (comment), we can't achieve any finer granularity than all (regardless of whether we do volume mounts or use CDI) because the node's container is started with --privileged. Even if you tell it to selectively inject just a single GPU, the node will be able to see all of them because --privileged pulls in all device nodes under /dev.

Review comment (Author):

Doing some testing today and this does allow access to all GPUs. I'll update the documentation. It is fairly convenient with Kind, however, as you don't need to inject drivers or install them on the node images.

Review comment:

This limitation of all is true even in the CDI case.

Review comment (Author, @lukeogg, Jul 24, 2023):

Ok, this reduces the value of adding an API for CDI devices in my opinion.

Same here.


### Extra Mounts

Extra mounts can be used to pass through storage on the host to a kind node
@@ -300,10 +361,10 @@ For more information see the [Docker file sharing guide.](https://docs.docker.co

### Extra Port Mappings

Extra port mappings can be used to port forward to the kind nodes. This is a
cross-platform option to get traffic into your kind cluster.
Extra port mappings can be used to port forward to the kind nodes. This is a
cross-platform option to get traffic into your kind cluster.

If you are running Docker without the Docker Desktop Application on Linux, you can simply send traffic to the node IPs from the host without extra port mappings.
If you are running Docker without the Docker Desktop Application on Linux, you can simply send traffic to the node IPs from the host without extra port mappings.
With the installation of the Docker Desktop Application, whether it is on macOS, Windows or Linux, you'll want to use these.

You may also want to see the [Ingress Guide].
@@ -401,11 +462,11 @@ nodes:

### Kubeadm Config Patches

KIND uses [`kubeadm`](/docs/design/principles/#leverage-existing-tooling)
KIND uses [`kubeadm`](/docs/design/principles/#leverage-existing-tooling)
to configure cluster nodes.

Formally KIND runs `kubeadm init` on the first control-plane node, we can customize the flags by using the kubeadm
[InitConfiguration](https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm-init/#config-file)
[InitConfiguration](https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm-init/#config-file)
([spec](https://godoc.org/k8s.io/kubernetes/cmd/kubeadm/app/apis/kubeadm/v1beta3#InitConfiguration))

{{< codeFromInline lang="yaml" >}}
@@ -436,9 +497,9 @@ nodes:
enable-admission-plugins: NodeRestriction,MutatingAdmissionWebhook,ValidatingAdmissionWebhook
{{< /codeFromInline >}}

On every additional node configured in the KIND cluster,
On every additional node configured in the KIND cluster,
worker or control-plane (in HA mode),
KIND runs `kubeadm join` which can be configured using the
KIND runs `kubeadm join` which can be configured using the
Review comment:

Kind vs KinD vs KIND?

Review comment (Member, @BenTheElder, Aug 24, 2023):

The docs consistently use "KIND" or "kind" but not "KinD". We're not using the acronym anymore as it may be podman or in the future something else. Generally just kind like the CLI, but sometimes "KIND" or kind (code formatted) is less ambiguously the tool vs the english word.

[JoinConfiguration](https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm-join/#config-file)
([spec](https://godoc.org/k8s.io/kubernetes/cmd/kubeadm/app/apis/kubeadm/v1beta3#JoinConfiguration))
