-
Notifications
You must be signed in to change notification settings - Fork 698
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-2170: Add the TrainJob state transition design #2298
KEP-2170: Add the TrainJob state transition design #2298
Conversation
/hold |
Pull Request Test Coverage Report for Build 11645733536Details
💛 - Coveralls |
851738c
to
069345f
Compare
39bfab2
to
8ba2206
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this @tenzen-y!
Please take a look @kubeflow/wg-training-leads @kannon92 @akshaychitneni @varshaprasad96 @ahg-g @vsoch @danielvegamyhre
// TrainJobSuspended means the TrainJob is suspended. | ||
TrainJobSuspended string = "Suspended" | ||
|
||
// TrainJobCompleted means that the actual jobs have completed its execution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be TrainJob, and similar for other statuses ?
// TrainJobCompleted means that the actual jobs have completed its execution. | |
// TrainJobCompleted means that the TrainJob has completed its execution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -279,6 +279,43 @@ type TrainJob struct { | |||
Status TrainJobStatus `json:"status,omitempty"` | |||
} | |||
|
|||
const ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we define those consts as TrainJobConditionType
similar to Batch/Job: https://github.com/kubernetes/api/blob/master/batch/v1/types.go#L617 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so since our conditions typed are metav1.Condition
and the Type
is typed string.
The batch/v1 Job uses the typed JobConditionType
since Job conditions typed are dedicated JobCondition
: https://github.com/kubernetes/api/blob/2be23f6c5a7184af9846ff458a11751765ea2bde/batch/v1/types.go#L662
When we use the dedicated typed TrainJobConditionType
, we need to cast the TrainJobConditionType
to string everywhere. That is not ideal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, you are right.
|
||
// TrainJobResumedReason is the "Suspended" condition reason. | ||
// When the TrainJob suspension is changed from True to False, this is added. | ||
TrainJobResumedReason string = "Resumed" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we have. In the batch/v1 Job level, JobResumed
is used as a reason for the Suspended
condition type.
In the JobSet level, ResumeJobs
is used as a reason for the Suspended
condition type.
// TrainJobJobsCreationSucceededReason is the "Created" condition reason. | ||
// When the creating objects succeeded after building succeeded, this is added. | ||
TrainJobJobsCreationSucceededReason string = "JobsCreationSucceeded" | ||
|
||
// TrainJobJobsBuildFailedReason is the "Created" condition reason. | ||
// When the building objects based on the TrainJob and the specified runtime failed, | ||
// this is added. | ||
TrainJobJobsBuildFailedReason string = "JobsBuildFailed" | ||
|
||
// TrainJobJobsCreationFailedReason is "Created" condition reason. | ||
// When the creating objects failed even though building succeeded, this is added. | ||
TrainJobJobsCreationFailedReason string = "JobsCreationFailed" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we introduce those conditions in the 2nd iteration ?
I think, we should discuss if users really want to decouple conditions when TrainJob's object creation failed and when TrainJob fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are reasons, so we use these in the following. Additionally, if we do not distinguish JobCreationFailed
and JobBuildFailed
, it is hard to understand which part failed. So, these reasons must be included in the first phase.
- JobsBuildFailed:
objs, err := run.NewObjects(ctx, trainJob) - JobsCreationFailed:
return creationErr
status:
conditions:
- type: Created
status: false
reason: JobsCreationFailed
status:
conditions:
- type: Created
status: false
reason: JobsBuildFailed
status:
conditions:
- type: Created
status: true
reason: JobsCreationSucceeded
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we have validation webhook, what is the use-case you see when we can hit the JobsBuildFailed reason ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can not validate in advance if NewObject succeed since we can not know in advance what happens during the NewObject.
Only during reconciling, we can know the actual error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I guess one of the examples could be if NewObject calls Kubernetes API server and this call fails.
In that case, we want to transition TrainJob to JobsBuildFailed status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, we want to transition TrainJob to JobsBuildFailed status.
Yes, that's right. In the current JobSet plugin, the building error could happen in the following:
return nil, err return nil, err
Additionally, every plugin could return errors during Kubeflow Job Pipeline Framerowk.
- type: JobSetSuspended | ||
status: false | ||
- type: JobSetCompleted | ||
status: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to propagate JobSet conditions to the TrainJob ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propagated these JobSet conditions so that runtimes can be propagated to the arbitrary conditions.
But, in the JobSet case, these seem to be slightly redundant. So, I'm wondering if we only extend the runtime interface so that the plugin can propagate arbitrary conditions, but JobSet plugin does not propagate own conditions, then TrainJob does not have the JobSet conditions.
But, I guess that the JobSet StartupPolocy conditions could help to understand what happens in the actual Jobs.
So, we can propagate only StartupPolicyInProgress
and StartupPolicyCompleted
.
@andreyvelich WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I'm wondering if we only extend the runtime interface so that the plugin can propagate arbitrary conditions, but JobSet plugin does not propagate own conditions, then TrainJob does not have the JobSet conditions.
I would suggest that we add those once we have initial implementation of TrainJob status.
I think, this is valid option, but we should discuss what conditions be be propagated by runtime plugins.
Any thoughts @tenzen-y ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest that we add those once we have initial implementation of TrainJob status.
I think, this is valid option, but we should discuss what conditions be be propagated by runtime plugins.
I think that we need to implement the conditions propagation mechanism, but it's not clear in the JobSet situation. Especially, which conditions should be propagated to the TrainJob. So, I would propose that we extend the runtime interface so that plugins can propagate the conditions to TrainJob. But, in the first status implementation, we do not propagate any JobSet conditions to the TrainJob.
What do you think? @andreyvelich
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I would propose that we extend the runtime interface so that plugins can propagate the conditions to TrainJob.
Sure, we can have this. Maybe we should create a tracking issue to implement it ?
I don't think is has a high priority compare to other tasks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sure. I will open the issue.
But, let me mention the propagation mechanism in this KEP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
[...] | ||
status: | ||
conditions: | ||
- type: Created |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be helpful to add transition time and also reason for failure within conditions? I believe it would be help debugging failures
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pseudo conditions. So, we actually support all fields supported in metav1.Conditions including transition time as you can see there:
training-operator/pkg/apis/kubeflow.org/v2alpha1/trainjob_types.go
Lines 261 to 262 in 66ebecd
// Conditions for the TrainJob. | |
Conditions []metav1.Condition `json:"conditions,omitempty"` |
@andreyvelich I addressed all your comments. PTAL, thanks. |
created_choice --> Created=True: Succeeded to build and deploy Jobs. | ||
created_choice --> Created=False: Failed to build and deploy Jobs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be helpful to add small clarified message what does Failed to build Jobs mean ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
[*] --> created_choice: TrainJob is submitted. | ||
created_choice --> Created=True: Succeeded to build and deploy Jobs. | ||
created_choice --> Created=False: Failed to build and deploy Jobs. | ||
Created=False --> Created=False: Wait for updated appropriate TrainJob. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by Wait for updated appropriate TrainJob ?
I thought if build or deploy Jobs fails, we transition TrainJob to Failed condition ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TrainJob is the mutable object. So, we wait for updating TrainJob with proper fields here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some API in TrainJob is immutable, like managedBy, but the Trainer API is mutable, right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's right. We allow users to modify the TrainJob except for some fields like managedBy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense, in that case my question is why do we transition TrainJob in the Failed state if the underlying JobSet is Failed ? Since TrainJob image is mutable, user can override it.
Or the main motivation is that when TrainJob is in Created state, it only can transition to Failed or Succeeded ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in that case my question is why do we transition TrainJob in the Failed state if the underlying JobSet is Failed ? Since TrainJob image is mutable, user can override it.
AFAIK, both batch/v1 Job, and JobSet can not be restarted once those have reached the terminal phase (complete or failed).
Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
…amework interfaces for each plugin Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
cdf9267
to
1cca202
Compare
@andreyvelich I addressed all your feedback. PTAL, thanks. |
Failed=True --> [*] | ||
|
||
#COMPLETION | ||
terminal_choice --> Completed=True: Actual Jobs (e.g., JobSet) completed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we actually call this condition Succeeded to be consistent with JobSet: https://github.com/kubernetes-sigs/jobset/blob/main/api/jobset/v1alpha2/jobset_types.go#L182-L183 ?
From my understanding Completed: True mens that TrainJob is in Failed or Succeeded state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, Job and JobSet succeed (.replicatedJobsStatus.succeed), and failed (.replicatedJobsStatus.failed) count are not terminal phase, and those are intermediate one. Additionally those are not guaranteed consistency when Success and Failure criteria are conflicts.
So, instead of success count, we should rely on the .status.terminalState or conditions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we going to have Success state for the TrainJob conditions ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TrainJob has only Suspend, Completed, Failed, and Created conditions.
Signed-off-by: Yuki Iwai <[email protected]>
58d8587
to
b81a633
Compare
// TrainJobComplete means that the TrainJob has completed its execution. | ||
TrainJobComplete string = "Complete" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use "Complete" condition type name to align with batch/v1 Job for now, then we will open an issue to rename "Completed" to "Complete" in the JobSet side.
Thanks for the update @tenzen-y! |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andreyvelich The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
All green, thanks! /hold cancel |
What this PR does / why we need it:
I added the TrainJob state machine design.
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...
format, will close the issue(s) when PR gets merged):Part-of #2207
Relates to: #2170
Checklist: