Add a MIG example #182

Open: wants to merge 1 commit into base: main

@yuanchen8911 (Collaborator) commented Oct 16, 2024

This PR adds a new MIG example, gpu-test-mig.yaml, to the demo/quickstart folder. The example uses the new apiVersion: resource.k8s.io/v1alpha3.

Validated on an A100 machine with the following MIG configuration.

$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4cf8db2d-06c0-7d70-1a51-e59b25b2c16c)
  MIG 3g.20gb     Device  0: (UUID: MIG-ba67e81d-d10e-5b3f-9bba-1af4b97b4b18)
  MIG 2g.10gb     Device  1: (UUID: MIG-c22ee1da-dd8f-57be-8f0a-d951c67ad3f3)
  MIG 1g.5gb      Device  2: (UUID: MIG-2f18f1a5-2ea8-5c05-a674-aee0e69e22ca)
  MIG 1g.5gb      Device  3: (UUID: MIG-a4bbead5-b0b1-5339-af1b-6239e2e6b4bd)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
  MIG 3g.20gb     Device  0: (UUID: MIG-a4a1fc88-103b-5c5b-ba1c-8c5c617fba7d)
  MIG 2g.10gb     Device  1: (UUID: MIG-6eb2e7a3-4440-562d-98e8-536a814b5ffd)
  MIG 1g.5gb      Device  2: (UUID: MIG-2286c62b-847f-5aaa-85b2-21b147544503)
  MIG 1g.5gb      Device  3: (UUID: MIG-0f47e714-65d5-5e11-b1bb-0d49bdaa5b29)
GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54)
  MIG 3g.20gb     Device  0: (UUID: MIG-8b4bd806-4301-5a29-8bca-a603a52e7192)
  MIG 2g.10gb     Device  1: (UUID: MIG-c0c44878-d0a8-5c7e-9386-df111231427d)
  MIG 1g.5gb      Device  2: (UUID: MIG-ba4a9d35-943c-5eac-b7a9-513b58c39ae0)
  MIG 1g.5gb      Device  3: (UUID: MIG-d1ac450c-f47e-53a9-bbbe-9cb23a589a6c)
GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-662077db-fa3f-0d8f-9502-21ab0ef058a2)
  MIG 3g.20gb     Device  0: (UUID: MIG-7a6eeb45-6baa-5c5e-8366-b58486c7748a)
  MIG 2g.10gb     Device  1: (UUID: MIG-74b6741b-0bca-5cf4-b6b8-364fbd875d7f)
  MIG 1g.5gb      Device  2: (UUID: MIG-cb6813d1-f566-5bd4-ab58-e3f7016e28c1)
  MIG 1g.5gb      Device  3: (UUID: MIG-f7c166a5-e33a-579c-ac74-5fcd27ecd212)
GPU 4: NVIDIA A100-SXM4-40GB (UUID: GPU-ec9d53cc-125d-d4a3-9687-304df8eb4749)
GPU 5: NVIDIA A100-SXM4-40GB (UUID: GPU-3eb87630-93d5-b2b6-b8ff-9b359caf4ee2)
GPU 6: NVIDIA A100-SXM4-40GB (UUID: GPU-8216274a-c05d-def0-af18-c74647300267)
GPU 7: NVIDIA A100-SXM4-40GB (UUID: GPU-b1028956-cfa2-0990-bf4a-5da9abb51763)
$ kubectl get resourceclaim -n gpu-test-mig
NAME                                    STATE                AGE
pod-646f7467bc-6kt6k-mig-ts-gpu-22fh2   allocated,reserved   7m38s
pod-646f7467bc-bb8lp-mig-ts-gpu-jmdrh   allocated,reserved   7m38s
pod-646f7467bc-lf5tg-mig-ts-gpu-kmxgv   allocated,reserved   7m38s
pod-646f7467bc-tt989-mig-ts-gpu-ck85v   allocated,reserved   7m38s

$ kubectl get pods -n gpu-test-mig
NAME                   READY   STATUS    RESTARTS   AGE
pod-646f7467bc-6kt6k   1/1     Running   0          7m45s
pod-646f7467bc-bb8lp   1/1     Running   0          7m45s
pod-646f7467bc-lf5tg   1/1     Running   0          7m45s
pod-646f7467bc-tt989   1/1     Running   0          7m45s

Signed-off-by: Yuan Chen <[email protected]>

Create a new example for MIG

Signed-off-by: Yuan Chen <[email protected]>

Update comment

Signed-off-by: Yuan Chen <[email protected]>
@yuanchen8911 changed the title from "Add an MIG example" to "Add a MIG example" on Oct 16, 2024
Comment on lines +25 to +27
constraints:
- requests: []
  matchAttribute: "gpu.nvidia.com/parentUUID"
Collaborator

There is no need for this with just a single MIG device request. The constraint ties the different requests together and ensures that their allocations come from the same underlying GPU.
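For contrast, a case where the constraint would do real work might look like the sketch below: one claim template with two MIG requests. The request names, the `mig.nvidia.com` device class, and the `profile` attribute used in the CEL selectors are assumptions for illustration and are not taken from this PR.

```yaml
# Sketch only: request names, device class, and selector attribute are assumed.
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test-mig
  name: mig-devices
spec:
  spec:
    devices:
      requests:
      # Two separate MIG device requests (hypothetical names).
      - name: mig-1g-5gb
        deviceClassName: mig.nvidia.com
        selectors:
        - cel:
            expression: "device.attributes['gpu.nvidia.com'].profile == '1g.5gb'"
      - name: mig-2g-10gb
        deviceClassName: mig.nvidia.com
        selectors:
        - cel:
            expression: "device.attributes['gpu.nvidia.com'].profile == '2g.10gb'"
      constraints:
      # An empty requests list applies the constraint to every request in the claim,
      # so both MIG devices must report the same parent GPU UUID.
      - requests: []
        matchAttribute: "gpu.nvidia.com/parentUUID"
```

With only one request in the claim there is nothing to match against, which is why the constraint is redundant in the single-device example.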

args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
resources:
  claims:
  - name: mig-ts-gpu
Collaborator

I wouldn't use the name *-ts-* here unless you put an explicit timeSlicing config on the request.
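If the `-ts-` name were kept, an explicit time-slicing config on the request might look roughly like the sketch below. The opaque parameter schema (`MigDeviceConfig`, `sharing.strategy: TimeSlicing`), the `mig.nvidia.com` device class, and the request name `ts-mig` are assumptions about the driver's config API, not something this PR shows. Check them against the driver's API types before relying on them.

```yaml
# Sketch only: the opaque parameters are an assumed schema for the
# gpu.nvidia.com driver; verify against the driver's API types.
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test-mig
  name: mig-devices
spec:
  spec:
    devices:
      requests:
      - name: ts-mig                      # hypothetical request name
        deviceClassName: mig.nvidia.com   # assumed device class
      config:
      # Attach a sharing config to the request named above.
      - requests: ["ts-mig"]
        opaque:
          driver: gpu.nvidia.com
          parameters:
            apiVersion: gpu.nvidia.com/v1alpha1   # assumed
            kind: MigDeviceConfig                 # assumed
            sharing:
              strategy: TimeSlicing               # assumed
```

Without a config block like this, a neutral claim name such as `mig-gpu` describes the request more accurately.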

Comment on lines +46 to +48
resourceClaims:
- name: mig-ts-gpu
  resourceClaimTemplateName: mig-devices
Collaborator

Can we move this below the list of containers?
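The requested ordering could be sketched as follows, with the pod-level `resourceClaims` list placed after `containers`. This is a bare Pod rather than the PR's actual manifest. The pod name, container name, image, and the claim name `mig-gpu` (chosen to drop the `-ts-` infix) are placeholders.

```yaml
# Sketch only: a bare Pod with assumed names and image, not the PR's manifest.
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test-mig
  name: pod
spec:
  containers:
  - name: ctr                 # hypothetical container name
    image: ubuntu:22.04       # assumed image
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      # Must match an entry in spec.resourceClaims below.
      - name: mig-gpu
  # Pod-level claims listed after the containers, as the review asks.
  resourceClaims:
  - name: mig-gpu
    resourceClaimTemplateName: mig-devices
```

Field order has no effect on how Kubernetes interprets the spec. The change only helps readability, since the claim is defined right after the container that consumes it.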

kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test-mig
  name: mig-devices
Collaborator

Suggested change
-  name: mig-devices
+  name: mig-device

2 participants