Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-2170: Add TrainJob and TrainingRuntime APIs #2223

Merged
merged 12 commits into from
Aug 27, 2024

Conversation

andreyvelich
Copy link
Member

Fixes: #2206

I added APIs for TrainJob, TrainingRuntime, and ClusterTrainingRuntime resources.

/assign @kubeflow/wg-training-leads @kannon92 @mimowo @vsoch @ahg-g @kuizhiqing @alculquicondor @zw0610 @franciscojavierarceo @shravan-achar

/hold for review.

@ahg-g
Copy link

ahg-g commented Aug 15, 2024

@danielvegamyhre

Comment on lines +105 to +107
// Number of training nodes.
// TODO (andreyvelich): Do we want to support dynamic num of nodes in TrainJob for PyTorch elastic: `--nnodes=1:4` ?
NumNodes *int32 `json:"numNodes,omitempty"`
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y What do you think about using *string here since for the elastic training user can set:

torchrun --nnodes=1:4

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer int here to refer to active/effective number instead of string.
IMO, the string value of nnodes if calculated from a config from like ElasticPolicy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, we can specify the elastically policy in the JobSetSpec. So, let's use the typed int here.
After we find some advantages for string, we can introduce IntOrString like Deployment rollingUpdate.

@coveralls
Copy link

coveralls commented Aug 15, 2024

Pull Request Test Coverage Report for Build 10566761788

Details

  • 0 of 622 (0.0%) changed or added relevant lines in 3 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-1.7%) to 31.801%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/apis/kubeflow.org/v2alpha1/trainingruntime_types.go 0 3 0.0%
pkg/apis/kubeflow.org/v2alpha1/trainjob_types.go 0 3 0.0%
pkg/apis/kubeflow.org/v2alpha1/zz_generated.deepcopy.go 0 616 0.0%
Files with Coverage Reduction New Missed Lines %
pkg/controller.v1/mpi/mpijob.go 1 91.06%
Totals Coverage Status
Change from base Build 10512072223: -1.7%
Covered Lines: 3950
Relevant Lines: 12421

💛 - Coveralls

}

// TranJobList is a collection of training jobs.
type TranJobList struct {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TrainJobList

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch!

metav1.ObjectMeta `json:"metadata,omitempty"`

// Specification of the desired ClusterTrainingRuntime.
Spec TrainingRuntimeSpec `json:"spec,omitempty"`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the spec ever be empty for ClusterTrainingRUntime?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, but I noticed that for all Kubernetes APIs the spec is set with omitempty: https://github.com/kubernetes/api/blob/master/apps/v1/types.go#L820.
@tenzen-y @kannon92 Any specific reason why we do this ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, in any case, the spec field is defined as an optional field in the Kubernetes. So, the optional TrainingRuntime spec would be better.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL.

type TorchSpec struct {
// Number of processes per node.
// This value is inserted into the `--nproc-per-node` argument of the `torchrun` CLI.
// Supported values: `auto`, `cpu`, `gpu`, or int value.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could probably use KubeBuilder validations for the enums here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 111 to 117
type MLSpec struct {
// Configuration for the PyTorch runtime.
TorchSpec *TorchSpec `json:"torchSpec,omitempty"`

// Configuration for the MPI Runtime.
MPISpec *MPISpec `json:"mpiSpec,omitempty"`
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One concern I have is that frameworks configurations/spec change quite often. Have we considered using configmap for this so that we don't have a lot of responsibilities to maintain the compatibility, etc.?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our goal is not to add all frameworks configurations here, but only parameters that require additional orchestration such as Elastic Policy, MPISpec. For example, in the future we can add SlurmSpec or FluxSpec here as we discussed with @vsoch here: #2171 (comment).

For Torch our assumption is that torchrun CLI is quite stable and probably won't change in the near future.

@terrytangyuan How do you think we can use ConfigMap for those parameters ?

@tenzen-y
Copy link
Member

tenzen-y commented Aug 16, 2024

@andreyvelich First of all, could you generate / createregister.go, deepcopygen and so on by controller-tools?

Unless those functions, we can not use the API in the controllers/webhooks.

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
@google-oss-prow google-oss-prow bot added size/XXL and removed size/L labels Aug 16, 2024
Signed-off-by: Andrey Velichkevich <[email protected]>
@andreyvelich
Copy link
Member Author

@andreyvelich First of all, could you generate / createregister.go, deepcopygen and so on by controller-tools?

Unless those functions, we can not use the API in the controllers/webhooks.

I registered APIs with scheme and added deepcopygen via controller-gen. I think, we can add the defaulters and other parameters required for clients, listers, informers, in the following PRs.
Does it look good @tenzen-y ?

@tenzen-y
Copy link
Member

@andreyvelich First of all, could you generate / createregister.go, deepcopygen and so on by controller-tools?
Unless those functions, we can not use the API in the controllers/webhooks.

I registered APIs with scheme and added deepcopygen via controller-gen. I think, we can add the defaulters and other parameters required for clients, listers, informers, in the following PRs. Does it look good @tenzen-y ?

That sounds good to me. We can create a separate issue "KEP-2170: Provide client-go library for TrainJob and TrainingRuntime".

@tenzen-y
Copy link
Member

@andreyvelich First of all, could you generate / createregister.go, deepcopygen and so on by controller-tools?
Unless those functions, we can not use the API in the controllers/webhooks.

I registered APIs with scheme and added deepcopygen via controller-gen. I think, we can add the defaulters and other parameters required for clients, listers, informers, in the following PRs. Does it look good @tenzen-y ?

That sounds good to me. We can create a separate issue "KEP-2170: Provide client-go library for TrainJob and TrainingRuntime".

Created: #2224


// PodGroupSpec represents a PodGroup configuration to enable gang-scheduling.
type PodGroupSpec struct {
// Plugin for the gang-scheduling.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we going forward with a default?

Copy link
Member Author

@andreyvelich andreyvelich Aug 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default, the gang-scheduling is disabled for TrainJob, since it requires plugin to be installed (coscheduling or volcano).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense.

NumNodes *int32 `json:"numNodes,omitempty"`

// JobSet configuration which will be used by TrainJob.
JobSetSpec *jobsetv1alpha2.JobSetSpec `json:",inline"`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why inline here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, we discussed it before, here is the example: https://github.com/kubeflow/training-operator/tree/master/docs/proposals/2170-kubeflow-training-v2#pytorch-distributed-runtime.
This will make API consistent with the JobSet. WDYT @kannon92 @tenzen-y ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm asking because I don't know what that argument actually does. I usually only see if for really small objects. Never a Spec so not sure if that means literally we are putting the object "inline" or it skips protobuf or api generation? like I've seen it for type..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to give user functionality to set the whole JobSet spec under Training Runtimes.

Copy link
Member

@tenzen-y tenzen-y Aug 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type TrainingRuntimeSpec struct {
	MLPolicy *MLPolicy `json:"mlPolicy,omitempty"`

	JobSetSpec jonsetv2alpha2.JobSetSpec `json:"spec"`
}

type MLPolicy struct {
	// Number of training nodes.
	// Defaults to 1.
	NumNodes *int32 `json:"numNodes,omitempty"`

	MLPolicySource `json:",inline"`
}

type MLPolicySource struct {
	PyTorch ...
}

Maybe, we want to dedicated field for the JobSetSpec so that we can identify the JobSetSpec.
@andreyvelich @kannon92 WDYT?

metav1.ObjectMeta `json:"metadata,omitempty"`

// Specification of the desired TrainingRuntime.
Spec TrainingRuntimeSpec `json:"spec"`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Spec TrainingRuntimeSpec `json:"spec"`
Spec TrainingRuntimeSpec `json:"spec, omitempty"`

Similar to other specs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice Catch!

}

// PodSpecOverrides represents the custom overrides that will be applied for the TrainJob's resources.
type PodSpecOverrides struct {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y do we need schedulingGates here? I can't remember what we decided a few weeks ago..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we do not need to add the schedulingGate since we would expect the schedulingGates are added by the webhook based on the Pod creation event.

Indeed, Kueue adds the schedulingGates to Pods, not Job by webhook, when the Pods are created.

Copy link
Member

@kuizhiqing kuizhiqing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work !

Suspend *bool `json:"suspend,omitempty"`

// ManagedBy field indicates the controller that manages a TrainJob.
ManagedBy *string `json:"managedBy,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the type of ManagedBy is string, I propose ControllerName inspired by SchedulerName.
Since ManagedBy reminder me of managedFields which is another struct.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's right. This API is similar to Batch Job managedBy.
Here is the implementation from @mszadkow and @mimowo in V1 APIs: #2203

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I've missed it.

Comment on lines +105 to +107
// Number of training nodes.
// TODO (andreyvelich): Do we want to support dynamic num of nodes in TrainJob for PyTorch elastic: `--nnodes=1:4` ?
NumNodes *int32 `json:"numNodes,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer int here to refer to active/effective number instead of string.
IMO, the string value of nnodes if calculated from a config from like ElasticPolicy.

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
NumProcPerNode *int32 `json:"numProcPerNode,omitempty"`

// Implementation name for the MPI to create the appropriate hostfile.
MPIImplementation *MPIImplementation `json:"mpiImplementation"`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since MPIImplementation is a const, does this need to be a pointer? With a pointer, a nil is a valid value. Do we want user to pass a nil value?

Also with omitempty not specified, the zero value of this would be nil. The zero value will be included in all patch requests which seems unnecessary.

Maybe you meant to include the omitempty but accept a nil value?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think we should remove pointer from here. @tenzen-y What do you think ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

	// Implementation name for the MPI to create the appropriate hostfile.
	// Defaults to OpenMPI
	MPIImplementation *MPIImplementation `json:"mpiImplementation,omitempty"`

In most cases of optional fields, we should use the omitempty tag.
Please refer to more details here: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#optional-vs-required

// Every training runtime contains `trainer` container which represents Trainer.
type Trainer struct {
// Docker image for the training container.
Image string `json:"image,omitempty"`
Copy link

@shravan-achar shravan-achar Aug 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could Image be a pointer to the string? The validation can differentiate between unspecified and empty string.

This marks the field as required in the CRD but a nil value can be passed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y @shravan-achar Do we want to introduce pointer or +optional for image since user can skip the image for Trainer?
E.g. Kubernetes for container users +optional validation: https://github.com/kubernetes/api/blob/master/core/v1/types.go#L2708C6-L2709

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand, we can skip specifying this image field when the TrainingRuntime has the image.

    // +optional
    Image *string `json:"image,omitempty"`

Regarding to the optional vs required, please refer to the above my comment.

// the `dataset-initializer` container in the `Initializer` Job.
type DatasetConfig struct {
// Storage uri for the dataset provider.
StorageUri string `json:"storageUri"`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could StorageUri be a pointer to the string? The validation could differentiate between unspecified and empty string.

Also omitempty is still necessary? From a PATCH perspective, if this field is unspecified, its zero value will be included in the serialization.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make the storageUri optional given that Training Runtime can contain the default dataset and model.

// InputModel represents the desired pre-trained model configuration.
type InputModel struct {
// Storage uri for the model provider.
StorageUri string `json:"storageUri"`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here as on L141

Env []corev1.EnvVar `json:"env,omitempty"`

// Reference to the TrainJob's secrets to download dataset.
SecretRef corev1.SecretReference `json:"secretRef,omitempty"`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not take a pointer to corev1.SecretReference ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

// PodSpecOverrides represents the custom overrides that will be applied for the TrainJob's resources.
type PodSpecOverrides struct {
// Names of the training job replicas in the training runtime template to apply the overrides.
TargetReplicatedJobs []string `json:"targetReplicatedJobs"`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't see the benefit of skipping omitempty . Probably we should include almost everywhere. Let me know your thoughts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we skip it from here, how we can make the targetReplicatedJobs mandatory parameter ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It's possible to use Kubebuilder tags to make it required on the CRD. The omitempty will help with serialization. kubernetes-sigs/controller-tools#944

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
TargetReplicatedJobs []string `json:"targetReplicatedJobs"`
// +required
TargetReplicatedJobs []string `json:"targetReplicatedJobs"`

We can make this field as required like this.

Copy link

@andreyvelich: GitHub didn't allow me to assign the following users: shravan-achar, kannon92.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

Are there any other comments before we can merge this PR and start working on the controller implementation ?

/assign @shravan-achar @tenzen-y @kannon92 @kuizhiqing @terrytangyuan @johnugeorge
/hold cancel

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
@kannon92
Copy link

LGTM on my end.

@andreyvelich
Copy link
Member Author

We had some discussions with @tenzen-y about APIs and we proposed the following changes:

  • Define JobSetTemplateSpec under TrainingRuntimeSpec API. For some use-cases, Platform Engineers might want to add custom labels and annotations to JobSet for various features, such as alpha.jobset.sigs.k8s.io/exclusive-topology or alpha.jobset.sigs.k8s.io/node-selector: Document labels, annotations and taints for JobSet kubernetes/website#47383.
    We don't want to define custom propagation mechanism from TrainingRuntime metadata to the JobSet, since it is not straighforward.

  • Move numNodes under MLSpec

  • We are still debating between MLSpec vs MLPolicy API name. Any thoughts @kannon92 @kubeflow/wg-training-leads @kuizhiqing @shravan-achar @vsoch ?

@andreyvelich
Copy link
Member Author

/rerun-all

}

// TorchMLSpecSource represents a PyTorch runtime configuration.
type TorchMLSpecSource struct {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y I removed standalone from TorchMLSpecSource API.
I think, we can detect when numNodes=1 and numProcPerNode>1 to set the standalone parameter which sets the default values for rendezvous.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When users will need it, we can add it in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good to me.
Thank you for the update.

Signed-off-by: Andrey Velichkevich <[email protected]>
@andreyvelich
Copy link
Member Author

andreyvelich commented Aug 26, 2024

We made the final changes with @tenzen-y:

  • Rename MLSpec to MLPolicy since spec usually represents another Kubernetes resources that will be deployed.
  • Rename PodGroupSpec to PodGroupPolicy for the same reason, and make various schedulers (coscheduling, volcano, or YuniKorn) as one of API.

If we don't have any followup suggestions, we can merge it.

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the updates!
/hold
/lgtm
/approve

Feel free to merge this PR.
Additionally, could you update the KEP in a separate PR?

@google-oss-prow google-oss-prow bot added the lgtm label Aug 26, 2024
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tenzen-y
Copy link
Member

/hold

Signed-off-by: Andrey Velichkevich <[email protected]>
@tenzen-y
Copy link
Member

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Aug 26, 2024
Copy link
Member Author

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks everyone for the review, we should be ready to merge it.
I will create followup PR to update KEP with the correct APIs.
/hold cancel

@google-oss-prow google-oss-prow bot merged commit 181191e into kubeflow:master Aug 27, 2024
38 of 39 checks passed
@andreyvelich andreyvelich deleted the issue-2206-add-apis branch August 27, 2024 16:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

KEP-2170: Add APIs for TrainJob and TrainingRuntime
9 participants