fix volcano podgroup update issue #2079
Conversation
Force-pushed from b4457b7 to c4b4547.
Thank you for this fix @ckyuto!
Could you please rebase it?
Coveralls: Pull Request Test Coverage Report for Build 9296657539.
Force-pushed from 9397c3d to ada7d06.
Force-pushed from c8d7650 to 25715d2.
@andreyvelich Can you help review?
@ckyuto Could you eliminate irrelevant commits?
Force-pushed from f3c56ef to 88347fa.
Force-pushed from 0fd9120 to b885e4d.
@andreyvelich @tenzen-y I think there's a simple way to fix this. Could I get another review?
/lgtm
Force-pushed from 01fffe9 to 7597718.
@tenzen-y The failed workflow looks like a transient error. Could you rerun it?
Generally, lgtm
Could you extend the PyTorchJob integration test to verify the validations?
It("Should get the corresponding resources successfully", func() { |
Force-pushed from 8a96561 to d42625b.
Updated
I'm not sure why the CI keeps showing a running state even though it succeeded.
So, could you try rebasing this PR?
updatedJob := &kubeflowv1.PyTorchJob{}
Expect(testK8sClient.Get(ctx, client.ObjectKeyFromObject(job), updatedJob)).Should(Succeed(), "Failed to get PyTorchJob")

updatedJob.Spec.RunPolicy.SchedulingPolicy.Queue = "test"
err := testK8sClient.Update(ctx, updatedJob)

By("Checking that the queue update fails")
Expect(err).To(HaveOccurred(), "Expected an error when updating the queue, but update succeeded")
Expect(err.Error()).To(ContainSubstring("spec.runPolicy.schedulingPolicy.queue is immutable"), "The error message did not contain the expected message")
Suggested change:

Eventually(func(g Gomega) {
    updatedJob := &kubeflowv1.PyTorchJob{}
    g.Expect(testK8sClient.Get(ctx, client.ObjectKeyFromObject(job), updatedJob)).Should(Succeed(), "Failed to get PyTorchJob")
    updatedJob.Spec.RunPolicy.SchedulingPolicy.Queue = "test"
    err := testK8sClient.Update(ctx, updatedJob)
    By("Checking that the queue update fails")
    g.Expect(err).To(HaveOccurred(), "Expected an error when updating the queue, but update succeeded")
    g.Expect(err.Error()).To(ContainSubstring("spec.runPolicy.schedulingPolicy.queue is immutable"), "The error message did not contain the expected message")
}, testutil.Timeout, testutil.Interval).Should(Succeed())
The update operation often fails for other reasons, so could you use a retry mechanism to avoid flaky tests?
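For context on why this helps: when Eventually is given a function that takes a Gomega argument, any failed assertion made through that argument simply causes the body to be polled again until testutil.Timeout elapses, so transient update failures (for example, a stale resourceVersion conflict) are retried instead of failing the spec immediately, and the final Should(Succeed()) passes once a full run observes the expected immutability error.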
Updated
Force-pushed from d42625b to 1a23620.
Force-pushed from 1a23620 to 6fb2548.
Thank you!
/lgtm
/approve
Expect(err).To(MatchError(ContainSubstring("spec.runPolicy.schedulingPolicy.queue is immutable"), "The error message did not contain the expected message"))
return err != nil
}, testutil.Timeout, testutil.Interval).Should(BeTrue())
Actually, I don't prefer this approach, since it can hide the root cause. OK, let me refine this in another PR.
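For what it's worth, one way such a refinement might look (a sketch only, reusing the same test context as the snippets above; this is not necessarily what the follow-up PR does) is to assert the specific immutability error inside the retried function, so that unrelated update failures keep polling and eventually surface as a timeout rather than being counted as the expected outcome:

Eventually(func(g Gomega) {
    updatedJob := &kubeflowv1.PyTorchJob{}
    // Re-read the job on each attempt so the update uses a fresh resourceVersion.
    g.Expect(testK8sClient.Get(ctx, client.ObjectKeyFromObject(job), updatedJob)).Should(Succeed())
    updatedJob.Spec.RunPolicy.SchedulingPolicy.Queue = "test"
    // Require the specific immutability error; any other error fails this attempt and is retried.
    g.Expect(testK8sClient.Update(ctx, updatedJob)).Should(
        MatchError(ContainSubstring("spec.runPolicy.schedulingPolicy.queue is immutable")))
}, testutil.Timeout, testutil.Interval).Should(Succeed())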
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: tenzen-y, Tomcli.
* fix volcano podgroup update issue
* queue value shouldn't be reset once it has been set
* make queue immutable
* add unit test
* add retry for update operation
Signed-off-by: Weiyu Yen <[email protected]>
#2130: Refine the integration tests for the immutable PyTorchJob (#2139) Signed-off-by: Yuki Iwai <[email protected]> Co-authored-by: Weiyu Yen <[email protected]>
What this PR does / why we need it:
This fixes an issue caused by an earlier PR: minMember may be updated when the number of replicas changes, but that update also accidentally changes the queue value. It also syncs the queue value in the PodGroup with the value in runPolicy.SchedulingPolicy.Queue, which is not applicable to all use cases.
In our use case we inject the queue value according to the org the user belongs to, and this sync overrides the value we set. The queue value should not be updated once it has been set.
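As a rough illustration of the intended behaviour (a sketch only; the function name and placement are hypothetical, not the actual training-operator code, and it assumes the kubeflow.org/v1 API types where RunPolicy.SchedulingPolicy.Queue is a string), the queue is treated as write-once: it may be set while it is still empty, but a later sync or update must not change it.

package validation // hypothetical package for this sketch

import (
    "fmt"

    kubeflowv1 "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1"
)

// validateQueueUpdate allows the queue to be set for the first time but rejects any
// later change, matching the "spec.runPolicy.schedulingPolicy.queue is immutable"
// error the tests in this PR check for.
func validateQueueUpdate(oldRP, newRP *kubeflowv1.RunPolicy) error {
    if oldRP == nil || oldRP.SchedulingPolicy == nil || oldRP.SchedulingPolicy.Queue == "" {
        // The queue was never set, so any value (including one injected per org) may be assigned.
        return nil
    }
    if newRP == nil || newRP.SchedulingPolicy == nil || newRP.SchedulingPolicy.Queue != oldRP.SchedulingPolicy.Queue {
        return fmt.Errorf("spec.runPolicy.schedulingPolicy.queue is immutable")
    }
    return nil
}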
Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #
Checklist: