[SDK] Allow customising base trainer and storage images in Train API #2261

Merged

varshaprasad96
Contributor

What this PR does / why we need it:
Allow customising the base storage_initializer and trainer images through environment variables.
Example use case: the Train API could be extended to use ROCm libraries in addition to CUDA.
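
A minimal sketch of that use case from the user's side. It assumes the variables are named `STORAGE_INITIALIZER_IMAGE` and `TRAINER_TRANSFORMER_IMAGE` (the names proposed in the review further down this thread, spelling assumed), and the image references are placeholders, not published images. `apply_image_overrides` is a hypothetical helper, not part of the SDK.

```python
import os

# Placeholder ROCm-based images; the registry, names, and tags are assumptions.
CUSTOM_IMAGES = {
    "STORAGE_INITIALIZER_IMAGE": "registry.example.com/storage-initializer:rocm",
    "TRAINER_TRANSFORMER_IMAGE": "registry.example.com/trainer-huggingface:rocm",
}


def apply_image_overrides(overrides, environ=os.environ):
    """Hypothetical helper: export each image override as an environment variable."""
    for name, image in overrides.items():
        environ[name] = image


# Apply the overrides before the SDK is imported or used.
apply_image_overrides(CUSTOM_IMAGES)
```

Setting the variables before importing the SDK is the safer order, because a module-level `os.getenv` call is evaluated once, at import time.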

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #2247

TODO: Docs to be updated in https://github.com/kubeflow/website.

Checklist:

  • Docs included if any changes are user facing

Allow customizing base storage_initializer and trainer images through
Env vars.

Signed-off-by: Varsha Prasad Narsing <[email protected]>
@coveralls

Pull Request Test Coverage Report for Build 10927951593

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall first build on sdk/fetch-base-image at 100.0%

Totals Coverage Status
Change from base Build 10927738808: 100.0%
Covered Lines: 66
Relevant Lines: 66

💛 - Coveralls

Member

@tenzen-y tenzen-y left a comment

Thank you for creating this PR!
/approve

@deepanker13 Do you have any other comments?
If not, you can just say /lgtm, and then this will be merged into the master branch.

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tenzen-y
Member

/assign @deepanker13

@deepanker13
Contributor

Thanks @varshaprasad96
/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Sep 19, 2024
@google-oss-prow google-oss-prow bot merged commit ee6756b into kubeflow:master Sep 19, 2024
39 checks passed
Member

@andreyvelich andreyvelich left a comment

Thank you for doing this @varshaprasad96!
Could you please submit a PR to address my comment?

@@ -82,14 +82,14 @@

 # TODO (andreyvelich): We should add image tag for Storage Initializer and Trainer.
-STORAGE_INITIALIZER_IMAGE = "docker.io/kubeflow/storage-initializer"
+STORAGE_INITIALIZER_IMAGE_DEFAULT = "docker.io/kubeflow/storage-initializer"
Member

@andreyvelich andreyvelich Sep 23, 2024

@varshaprasad96 Could you please submit a PR that makes the following change in constants.py:

STORAGE_INITIALIZER_IMAGE = os.getenv("STORAGE_INITIALIZER_IMAGE", "docker.io/kubeflow/storage-initializer")
TRAINER_TRANSFORMER_IMAGE = os.getenv("TRAINER_TRANSFORMER_IMAGE", "docker.io/kubeflow/trainer-huggingface")

That will let users quickly see which environment variables they can modify, instead of searching through training_client.py.
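
For illustration, here is a self-contained sketch of the lookup-with-default pattern proposed above. It assumes the environment variable is spelled `STORAGE_INITIALIZER_IMAGE`; `load_image_constants` and the `DEFAULT_*` names are hypothetical helpers added so that the behaviour can be exercised without touching the real process environment. It is not the SDK's actual code.

```python
import os

# Defaults taken from the snippet in this review comment.
DEFAULT_STORAGE_INITIALIZER_IMAGE = "docker.io/kubeflow/storage-initializer"
DEFAULT_TRAINER_TRANSFORMER_IMAGE = "docker.io/kubeflow/trainer-huggingface"


def load_image_constants(environ=None):
    """Hypothetical helper: resolve both images, falling back to the defaults."""
    env = os.environ if environ is None else environ
    return {
        "STORAGE_INITIALIZER_IMAGE": env.get(
            "STORAGE_INITIALIZER_IMAGE", DEFAULT_STORAGE_INITIALIZER_IMAGE
        ),
        "TRAINER_TRANSFORMER_IMAGE": env.get(
            "TRAINER_TRANSFORMER_IMAGE", DEFAULT_TRAINER_TRANSFORMER_IMAGE
        ),
    }


if __name__ == "__main__":
    # No overrides: both defaults are used.
    print(load_image_constants({}))
    # Override only the trainer image (placeholder reference, not a real image).
    print(load_image_constants(
        {"TRAINER_TRANSFORMER_IMAGE": "registry.example.com/trainer-huggingface:rocm"}
    ))
```

One consequence of placing `os.getenv` at module level in constants.py is that the value is read once, when the module is imported, so an override has to be exported before the SDK is imported.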

Contributor Author

I see! Here we go: #2268

varshaprasad96 added a commit to varshaprasad96/website that referenced this pull request Sep 24, 2024
Follow up from kubeflow/training-operator#2261 as
this is a user facing change.

Signed-off-by: Varsha Prasad Narsing <[email protected]>
varshaprasad96 added a commit to varshaprasad96/website that referenced this pull request Sep 27, 2024
Follow up from kubeflow/training-operator#2261 as
this is a user facing change.

Signed-off-by: Varsha Prasad Narsing <[email protected]>
Successfully merging this pull request may close these issues.

[SDK] Issues with trying to use train API with TinyLlama LLM