fix volcano podgroup update issue #2079
Conversation
Force-pushed from b4457b7 to c4b4547.
Thank you for this fix @ckyuto!
Could you please rebase it?
Coveralls: Pull Request Test Coverage Report for Build 9296657539.
Force-pushed from 9397c3d to ada7d06.
Force-pushed from c8d7650 to 25715d2.
@andreyvelich Can you help review?
@ckyuto Could you eliminate irrelevant commits?
Force-pushed from f3c56ef to 88347fa.
Force-pushed from 0fd9120 to b885e4d.
@andreyvelich @tenzen-y I think there's a simple way to fix this. Could I get another review?
/lgtm
Force-pushed from 01fffe9 to 7597718.
@tenzen-y The failed workflow looks like a transient error. Could you rerun it?
Generally, lgtm
Could you extend the PyTorchJob integration test to verify the validations?
It("Should get the corresponding resources successfully", func() { |
Force-pushed from 8a96561 to d42625b.
Updated
I'm not sure why the CI keeps showing a running state even though it succeeded.
So, could you try rebasing this PR?
updatedJob := &kubeflowv1.PyTorchJob{}
Expect(testK8sClient.Get(ctx, client.ObjectKeyFromObject(job), updatedJob)).Should(Succeed(), "Failed to get PyTorchJob")

updatedJob.Spec.RunPolicy.SchedulingPolicy.Queue = "test"
err := testK8sClient.Update(ctx, updatedJob)

By("Checking that the queue update fails")
Expect(err).To(HaveOccurred(), "Expected an error when updating the queue, but update succeeded")
Expect(err.Error()).To(ContainSubstring("spec.runPolicy.schedulingPolicy.queue is immutable"), "The error message did not contain the expected message")
Suggested change:

Eventually(func(g Gomega) {
    updatedJob := &kubeflowv1.PyTorchJob{}
    g.Expect(testK8sClient.Get(ctx, client.ObjectKeyFromObject(job), updatedJob)).Should(Succeed(), "Failed to get PyTorchJob")
    updatedJob.Spec.RunPolicy.SchedulingPolicy.Queue = "test"
    err := testK8sClient.Update(ctx, updatedJob)
    By("Checking that the queue update fails")
    g.Expect(err).To(HaveOccurred(), "Expected an error when updating the queue, but update succeeded")
    g.Expect(err.Error()).To(ContainSubstring("spec.runPolicy.schedulingPolicy.queue is immutable"), "The error message did not contain the expected message")
}, testutil.Timeout, testutil.Interval).Should(Succeed())
The update operation often fails for other reasons, so could you use a retry mechanism to avoid flaky tests?
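For context on why this helps: when Eventually is given a function that takes a Gomega argument, any failed assertion made through that argument simply causes the body to be polled again until testutil.Timeout elapses, so transient update failures (for example, a stale resourceVersion conflict) are retried instead of failing the spec immediately, and the final Should(Succeed()) passes once a full run observes the expected immutability error.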
Updated
Force-pushed from d42625b to 1a23620.
Force-pushed from 1a23620 to 6fb2548.
Thank you!
/lgtm
/approve
Expect(err).To(MatchError(ContainSubstring("spec.runPolicy.schedulingPolicy.queue is immutable"), "The error message did not contain the expected message"))
return err != nil
}, testutil.Timeout, testutil.Interval).Should(BeTrue())
Actually, I don't prefer this approach, since it can hide the root cause. OK, let me refine this in another PR.
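For what it's worth, one way such a refinement might look (a sketch only, reusing the same test context as the snippets above; this is not necessarily what the follow-up PR does) is to assert the specific immutability error inside the retried function, so that unrelated update failures keep polling and eventually surface as a timeout rather than being counted as the expected outcome:

Eventually(func(g Gomega) {
    updatedJob := &kubeflowv1.PyTorchJob{}
    // Re-read the job on each attempt so the update uses a fresh resourceVersion.
    g.Expect(testK8sClient.Get(ctx, client.ObjectKeyFromObject(job), updatedJob)).Should(Succeed())
    updatedJob.Spec.RunPolicy.SchedulingPolicy.Queue = "test"
    // Require the specific immutability error; any other error fails this attempt and is retried.
    g.Expect(testK8sClient.Update(ctx, updatedJob)).Should(
        MatchError(ContainSubstring("spec.runPolicy.schedulingPolicy.queue is immutable")))
}, testutil.Timeout, testutil.Interval).Should(Succeed())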
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: tenzen-y, Tomcli.
* fix volcano podgroup update issue
* queue value shouldn't be reset once it has been set
* make queue immutable
* add unit test
* add retry for update operation
Signed-off-by: Weiyu Yen <[email protected]>
#2130: Refine the integration tests for the immutable PyTorchJob (#2139) Signed-off-by: Yuki Iwai <[email protected]> Co-authored-by: Weiyu Yen <[email protected]>
What this PR does / why we need it:
This fixes an issue caused by an earlier PR: minMember may be updated when the number of replicas changes, but that update also accidentally changes the queue value. It also syncs the queue value in the PodGroup with the value in runPolicy.SchedulingPolicy.Queue, which is not applicable to all use cases.
In our use case we inject the queue value according to the org the user belongs to, and this sync overrides the value we set. The queue value should not be updated once it has been set.
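As a rough illustration of the intended behaviour (a sketch only; the function name and placement are hypothetical, not the actual training-operator code, and it assumes the kubeflow.org/v1 API types where RunPolicy.SchedulingPolicy.Queue is a string), the queue is treated as write-once: it may be set while it is still empty, but a later sync or update must not change it.

package validation // hypothetical package for this sketch

import (
    "fmt"

    kubeflowv1 "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1"
)

// validateQueueUpdate allows the queue to be set for the first time but rejects any
// later change, matching the "spec.runPolicy.schedulingPolicy.queue is immutable"
// error the tests in this PR check for.
func validateQueueUpdate(oldRP, newRP *kubeflowv1.RunPolicy) error {
    if oldRP == nil || oldRP.SchedulingPolicy == nil || oldRP.SchedulingPolicy.Queue == "" {
        // The queue was never set, so any value (including one injected per org) may be assigned.
        return nil
    }
    if newRP == nil || newRP.SchedulingPolicy == nil || newRP.SchedulingPolicy.Queue != oldRP.SchedulingPolicy.Queue {
        return fmt.Errorf("spec.runPolicy.schedulingPolicy.queue is immutable")
    }
    return nil
}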
Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #
Checklist: