Validate pytorchjob workers are configured when elasticpolicy is configured #2320

Open: wants to merge 1 commit into base: master

tarat44 commented Nov 4, 2024

What this PR does / why we need it:
This PR adds a check to the PyTorchJob validating webhook: if a user configures an elastic policy, they must also define a Worker replica spec. Previously, defining an elastic policy without a Worker spec caused a nil pointer dereference on this line, crashing the training operator. The webhook now also validates that the Worker spec defines at least one replica.
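The check described above can be sketched in Go, the language the training operator is written in. This is a minimal, dependency-free sketch under stated assumptions, not the code in this PR: `ReplicaType`, `ReplicaSpec`, `ElasticPolicy`, `PyTorchJobSpec`, `validateElasticWorkers`, and `int32Ptr` are hypothetical, simplified stand-ins for the project's real types, and plain `error` values replace the `field.ErrorList` that a Kubernetes validating webhook would more typically return. The sketch also treats an unset `Replicas` as zero, which may differ from the real webhook if defaulting runs first.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, simplified stand-ins for the PyTorchJob API types; only the
// fields this check needs are modeled.
type ReplicaType string

const (
	ReplicaTypeMaster ReplicaType = "Master"
	ReplicaTypeWorker ReplicaType = "Worker"
)

type ReplicaSpec struct {
	Replicas *int32 // nil means "unset"; treated as zero in this sketch
}

// ElasticPolicy's fields are omitted; only its presence matters here.
type ElasticPolicy struct{}

type PyTorchJobSpec struct {
	ElasticPolicy       *ElasticPolicy
	PyTorchReplicaSpecs map[ReplicaType]*ReplicaSpec
}

// validateElasticWorkers (hypothetical) rejects a spec that sets an elastic
// policy without a usable Worker replica spec, so later code that reads
// PyTorchReplicaSpecs[Worker] never dereferences a nil pointer.
func validateElasticWorkers(spec PyTorchJobSpec) error {
	if spec.ElasticPolicy == nil {
		return nil // no elastic policy, nothing to check
	}
	// Looking up a missing key (even in a nil map) is safe in Go; it yields (nil, false).
	worker, ok := spec.PyTorchReplicaSpecs[ReplicaTypeWorker]
	if !ok || worker == nil {
		return errors.New("elasticPolicy requires a Worker replica spec")
	}
	if worker.Replicas == nil || *worker.Replicas < 1 {
		return errors.New("elasticPolicy requires at least one Worker replica")
	}
	return nil
}

func int32Ptr(v int32) *int32 { return &v }

func main() {
	elastic := &ElasticPolicy{}

	noWorker := PyTorchJobSpec{
		ElasticPolicy: elastic,
		PyTorchReplicaSpecs: map[ReplicaType]*ReplicaSpec{
			ReplicaTypeMaster: {Replicas: int32Ptr(1)},
		},
	}
	fmt.Println(validateElasticWorkers(noWorker)) // elasticPolicy requires a Worker replica spec

	zeroWorkers := PyTorchJobSpec{
		ElasticPolicy: elastic,
		PyTorchReplicaSpecs: map[ReplicaType]*ReplicaSpec{
			ReplicaTypeWorker: {Replicas: int32Ptr(0)},
		},
	}
	fmt.Println(validateElasticWorkers(zeroWorkers)) // elasticPolicy requires at least one Worker replica

	valid := PyTorchJobSpec{
		ElasticPolicy: elastic,
		PyTorchReplicaSpecs: map[ReplicaType]*ReplicaSpec{
			ReplicaTypeWorker: {Replicas: int32Ptr(2)},
		},
	}
	fmt.Println(validateElasticWorkers(valid)) // <nil>
}
```

In this design the job is rejected at admission time, so the controller never reaches the code path that dereferences the missing Worker spec.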

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #2278

Checklist:

  • Docs included if any changes are user facing


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: tarat44 <[email protected]>
Co-authored-by: ricardov1 <[email protected]>
Co-authored-by: alenawang <[email protected]>
tarat44 changed the title from "Validate workers for elasticpolicy" to "Validate workers are configured when elasticpolicy is configured" on Nov 4, 2024
tarat44 changed the title from "Validate workers are configured when elasticpolicy is configured" to "Validate pytorchjob workers are configured when elasticpolicy is configured" on Nov 4, 2024
coveralls commented Nov 4, 2024

Pull Request Test Coverage Report for Build 11706701329


  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals
Change from base Build 11663764609: 0.0%
Covered Lines: 77
Relevant Lines: 77

💛 - Coveralls

tenzen-y (Member) commented Nov 5, 2024

tarat44 force-pushed the validate-workers-for-elasticpolicy branch from fad6922 to a4a0050 on November 6, 2024 15:28
tarat44 (Author) commented Nov 6, 2024

> @tarat44 In advance, could you sign to DCO?
>
> https://github.com/kubeflow/training-operator/pull/2320/checks?check_run_id=32498218264

@tenzen-y Thanks for bringing this to my attention, and I apologize for the delay. I squashed the commits into one and added the DCO sign-off.

Development

Successfully merging this pull request may close these issues.

Training Operator crashes when submitting PyTorchJob with elasticPolicy but without worker template defined