-
Notifications
You must be signed in to change notification settings - Fork 698
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-2170: Implement Initializer builders in the JobSet plugin #2316
KEP-2170: Implement Initializer builders in the JobSet plugin #2316
Conversation
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
SecretRef *corev1.SecretReference `json:"secretRef,omitempty"` | ||
// Reference to the secret with credentials to download dataset. | ||
// Secret must be created in the TrainJob's namespace. | ||
SecretRef *corev1.LocalObjectReference `json:"secretRef,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y I updated the API to use *corev1.LocalObjectReference
type for the SecretRef.
I think, we should use this since we don't want to support cross-namespace reference for this secret.
pkg/constants/constants.go
Outdated
// TODO (andreyvelich): Add validation to check that initializer ReplicatedJob has the following volumes. | ||
VolumeMountModelInitializer = corev1.VolumeMount{ | ||
Name: VolumeNameInitializer, | ||
MountPath: "/workspace/model", | ||
} | ||
|
||
// VolumeMountModelInitializer is the volume mount for the dataset initializer container. | ||
VolumeMountDatasetInitializer = corev1.VolumeMount{ | ||
Name: VolumeNameInitializer, | ||
MountPath: "/workspace/dataset", | ||
} | ||
|
||
// VolumeInitializer is the volume for the initializer ReplicatedJob. | ||
// TODO (andreyvelich): We should make VolumeSource configurable. | ||
VolumeInitializer = corev1.Volume{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@akshaychitneni @tenzen-y I think, we should validate that runtime has the correct configuration for the volume if it has the Initializer
ReplicatedJob.
I added these consts in this PR, so we can re-use it in the Validation PR.
Otherwise, users will get many errors while using our custom Storage Initializers.
Pull Request Test Coverage Report for Build 11636022716Details
💛 - Coveralls |
Signed-off-by: Andrey Velichkevich <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you.
@@ -47,7 +47,7 @@ manifests: controller-gen ## Generate WebhookConfiguration, ClusterRole and Cust | |||
output:rbac:artifacts:config=manifests/v2/base/rbac \ | |||
output:webhook:artifacts:config=manifests/v2/base/webhook | |||
|
|||
generate: controller-gen ## Generate apidoc, sdk and code containing DeepCopy, DeepCopyInto, and DeepCopyObject method implementations. | |||
generate: controller-gen manifests ## Generate apidoc, sdk and code containing DeepCopy, DeepCopyInto, and DeepCopyObject method implementations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you remove this?
training-operator/.github/workflows/test-go.yaml
Lines 32 to 35 in 7c5ea70
- name: Check manifests | |
run: | | |
make manifests && git add manifests && | |
git diff --cached --exit-code || (echo 'Please run "make manifests" to generate manifests' && exit 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y Oh, you want to separate it.
Don't we want to include manifests
command into generate script ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to say removing "Check manifests" from workflow because we will perform manifests verification with make generate
since this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, you are right!
pkg/constants/constants.go
Outdated
// This is the temporary container that we use in the initializer ReplicatedJob. | ||
// TODO (andreyvelich): Once JobSet supports execution policy, we can remove it. | ||
// TODO (andreyvelich): Add validation to check that initializer ReplicatedJob has this container. | ||
ContainerBusyBox corev1.Container = corev1.Container{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that this is only for the testing wrappers.
In that case, we should not expose this as a global constants. Instead of this, could you move this to util.v2/testing/constants.go
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned here, I think we should use these consts for Webhook validation: #2316 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. In that case, could we move these default values into the JobSet plugins since these are semantically tied into the JobSet, then we can reuse these in the JobSet CustomValidationPlugin
and ComponentBuilderPlugin
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, that makes sense!
pkg/constants/constants.go
Outdated
|
||
// VolumeMountModelInitializer is the volume mount for the model initializer container. | ||
// TODO (andreyvelich): Add validation to check that initializer ReplicatedJob has the following volumes. | ||
VolumeMountModelInitializer = corev1.VolumeMount{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
pkg/constants/constants.go
Outdated
} | ||
|
||
// VolumeMountModelInitializer is the volume mount for the dataset initializer container. | ||
VolumeMountDatasetInitializer = corev1.VolumeMount{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
@@ -51,6 +52,90 @@ func NewBuilder(objectKey client.ObjectKey, jobSetTemplateSpec kubeflowv2.JobSet | |||
} | |||
} | |||
|
|||
// mergeInitializerEnvs merges the TrainJob and Runtime Pod envs. | |||
func mergeInitializerEnvs(storageUri *string, trainJobEnvs, containerEnv []corev1.EnvVar) []corev1.EnvVar { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func mergeInitializerEnvs(storageUri *string, trainJobEnvs, containerEnv []corev1.EnvVar) []corev1.EnvVar { | |
func mergeInitializerEnvs(storageUri string, trainJobEnvs, containerEnv []corev1.EnvVar) []corev1.EnvVar { |
It seems that this pointer is less valuable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I use this pointer here to verify if StorageUri is set:
if storageUri != nil { |
That also allow us to not use ptr.Deref() here: #2316 (comment)
Does it make sense ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I was just missed the line. That makes sense. Thanks.
// mergeInitializerEnvs merges the TrainJob and Runtime Pod envs. | ||
func mergeInitializerEnvs(storageUri *string, trainJobEnvs, containerEnv []corev1.EnvVar) []corev1.EnvVar { | ||
envNames := sets.New[string]() | ||
envs := []corev1.EnvVar{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
envs := []corev1.EnvVar{} | |
var envs []corev1.EnvVar |
Avoiding unneeded malloc would be better.
Co-authored-by: Yuki Iwai <[email protected]> Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this contribution!
/lgtm
/approve
/hold for CI
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: tenzen-y The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/hold cancel |
Part of: #2290, #2170.
I implemented builder for dataset and model initializers.
/assign @kubeflow/wg-training-leads @varshaprasad96 @akshaychitneni @deepanker13 @helenxie-bit @Electronic-Waste @saileshd1402 @kannon92 @kuizhiqing @shravan-achar