Comprehensive example task running training workload on GPUs using JobSet #429

Open
Tracked by #438 ...
danielvegamyhre opened this issue Feb 16, 2024 · 15 comments
Labels
good first issue: Denotes an issue ready for a new contributor, according to the "help wanted" guidelines.

Comments

@danielvegamyhre
Contributor

What would you like to be added:
A comprehensive example showing how to run a training workload on GPUs using JobSet. We could have one example per major cloud provider.

Why is this needed:
We need more concrete examples to reduce friction in user onboarding. Right now we mostly have toy examples that use sleep containers to demonstrate the functionality of different features.

@danielvegamyhre added the "good first issue" label on Feb 16, 2024
@uroy-personal

/assign

@danielvegamyhre
Contributor Author

@uroy-personal are you still working on these? If not, I am going to unassign them so someone else can work on them.

@uroy-personal

uroy-personal commented Mar 13, 2024

Hi @danielvegamyhre,
Yes, I am on it. I need to add the example here, right?
https://github.com/kubernetes-sigs/jobset/blob/main/docs/concepts/README.md

Also, please help me decide what content (example YAML) to put there. I hope to finish all the open tasks assigned to me by this weekend.

@danielvegamyhre
Contributor Author

> Hi @danielvegamyhre, Yes, I am on it. I need to add the example here, right? https://github.com/kubernetes-sigs/jobset/blob/main/docs/concepts/README.md
>
> Also, please help me decide what content (example YAML) to put there. I hope to finish all the open tasks assigned to me by this weekend.

Yes, you can reference some examples in the examples/ directory to help you get started.

@danielvegamyhre
Contributor Author

Also note that it would be nice for the provisioning step to show example commands for all 3 major cloud providers (AWS, GCP, Azure).

@uroy-personal

> Also note that it would be nice for the provisioning step to show example commands for all 3 major cloud providers (AWS, GCP, Azure).

Thanks. I am working on it. Hope to raise the PR in the next few days.

@danielvegamyhre
Contributor Author

@uroy-personal Just following up, are you still working on this?

@uroy-personal

Yes @danielvegamyhre, I am on it. I made the changes but found that the above README page was removed. I will complete it within this week for sure! Thanks.

@uroy-personal

Good morning @danielvegamyhre,
I have started the ball rolling here. So far I have added the examples from examples/ to the site's concepts page. Where can I get the example commands for the cloud providers (GCP, AWS, and Azure)? Please help, and I will update the PR accordingly.

@uroy-personal

It seems this issue needs GPU access. Is there a way to get GPU access, @danielvegamyhre?

@uroy-personal

/unassign

@danielvegamyhre
Contributor Author

@uroy-personal To make this easier, let's not include the steps to provision GPU nodes on each cloud provider. Instead, let's just use a generic placeholder nodeSelector (e.g. your.cloud.provider.com/gpu-type) to indicate to the user that it should be replaced.
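
For illustration, a JobSet manifest using such a placeholder might look roughly like the sketch below. This is not taken from the repository's examples/ directory: the JobSet name, replicated job name, image, command, replica counts, and GPU count are all hypothetical, and the `nvidia.com/gpu` resource assumes the NVIDIA device plugin is installed on the GPU nodes.

```yaml
# Sketch only: every name, image, command, and size here is an illustrative assumption.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: gpu-training-example        # hypothetical name
spec:
  replicatedJobs:
  - name: workers                   # hypothetical replicated job name
    replicas: 1
    template:
      spec:
        parallelism: 2              # two worker pods run at once
        completions: 2
        backoffLimit: 0
        template:
          spec:
            restartPolicy: Never
            nodeSelector:
              # Placeholder: replace key and value with your cloud provider's GPU node label.
              your.cloud.provider.com/gpu-type: your-gpu-type
            containers:
            - name: trainer
              image: your-registry/your-training-image:latest   # placeholder image
              command: ["python", "train.py"]                   # placeholder entrypoint
              resources:
                limits:
                  nvidia.com/gpu: 1   # assumes the NVIDIA device plugin exposes this resource
```

Keeping the selector generic means the same manifest can be reused on any provider once the label key and value are swapped for the real ones.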

@uroy-personal

Thanks @danielvegamyhre, I will have a look and get back to you as soon as possible!

@googs1025
Member

/assign I currently have a GPU environment. The GPU card is not the latest, but I can try it and see.

@googs1025
Member

/assign


3 participants