KEP-2170: Add TrainJob and TrainingRuntime APIs #2223
@@ -15,3 +15,173 @@ limitations under the License.
*/

package v2alpha1

import (
    autoscalingv2 "k8s.io/api/autoscaling/v2"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    jobsetv1alpha2 "sigs.k8s.io/jobset/api/jobset/v1alpha2"
)

// ClusterTrainingRuntime represents a training runtime which can be referenced as part of
// the `trainingRuntimeRef` API in TrainJob. This resource is cluster-scoped and can be
// referenced by a TrainJob created in *any* namespace.
type ClusterTrainingRuntime struct {
    metav1.TypeMeta `json:",inline"`

    // Standard object's metadata.
    metav1.ObjectMeta `json:"metadata,omitempty"`

    // Specification of the desired ClusterTrainingRuntime.
    Spec TrainingRuntimeSpec `json:"spec,omitempty"`
}
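
For illustration, a minimal sketch of constructing this type in Go, assumed to sit alongside these definitions in package v2alpha1. The runtime name, node count, and the `kubeflow.org/v2alpha1` apiVersion are hypothetical, and `k8s.io/utils/ptr` is assumed for the pointer helper:

package v2alpha1

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/utils/ptr"
)

// newExampleClusterRuntime builds a ClusterTrainingRuntime value. Because the
// resource is cluster-scoped, ObjectMeta carries a name but never a namespace.
func newExampleClusterRuntime() *ClusterTrainingRuntime {
    return &ClusterTrainingRuntime{
        TypeMeta: metav1.TypeMeta{
            APIVersion: "kubeflow.org/v2alpha1", // group assumed for illustration
            Kind:       "ClusterTrainingRuntime",
        },
        ObjectMeta: metav1.ObjectMeta{Name: "torch-distributed"},
        Spec: TrainingRuntimeSpec{
            NumNodes: ptr.To[int32](4), // defaults to 1 when omitted
        },
    }
}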

// ClusterTrainingRuntimeList is a collection of cluster training runtimes.
type ClusterTrainingRuntimeList struct {
    metav1.TypeMeta `json:",inline"`

    // Standard list metadata.
    metav1.ListMeta `json:"metadata,omitempty"`

    // List of ClusterTrainingRuntimes.
    Items []ClusterTrainingRuntime `json:"items"`
}

// TrainingRuntime represents a training runtime which can be referenced as part of
// the `trainingRuntimeRef` API in TrainJob. This resource is namespace-scoped and can be
// referenced by a TrainJob created in the *same* namespace as the TrainingRuntime.
type TrainingRuntime struct {
    metav1.TypeMeta `json:",inline"`

    // Standard object's metadata.
    metav1.ObjectMeta `json:"metadata,omitempty"`

    // Specification of the desired TrainingRuntime.
    Spec TrainingRuntimeSpec `json:"spec"`

Review comment (suggested change): Similar to other specs?
Reply: Nice catch!
}

// TrainingRuntimeList is a collection of training runtimes.
type TrainingRuntimeList struct {
    metav1.TypeMeta `json:",inline"`

    // Standard list metadata.
    metav1.ListMeta `json:"metadata,omitempty"`

    // List of TrainingRuntimes.
    Items []TrainingRuntime `json:"items"`
}

// TrainingRuntimeSpec represents a specification of the desired training runtime.
type TrainingRuntimeSpec struct {
    // Configuration for the runtime-specific parameters, such as Torch or MPI.
    MLSpec *MLSpec `json:"mlSpec,omitempty"`

    // Number of training nodes.
    // Defaults to 1.
    NumNodes *int32 `json:"numNodes,omitempty"`

    // JobSet configuration which will be used by TrainJob.
    JobSetSpec *jobsetv1alpha2.JobSetSpec `json:",inline"`

Review comment: Why inline here?
Reply: I think we discussed it before; here is the example: https://github.com/kubeflow/training-operator/tree/master/docs/proposals/2170-kubeflow-training-v2#pytorch-distributed-runtime.
Review comment: I'm asking because I don't know what that argument actually does. I usually only see it for really small objects, never a Spec, so I'm not sure if it means we are literally putting the object "inline", or whether it skips protobuf or API generation. I've seen it for type...
Reply: We want to give users the functionality to set the whole JobSet spec under training runtimes.
Reply:

type TrainingRuntimeSpec struct {
    MLPolicy   *MLPolicy                 `json:"mlPolicy,omitempty"`
    JobSetSpec jobsetv1alpha2.JobSetSpec `json:"spec"`
}

type MLPolicy struct {
    // Number of training nodes.
    // Defaults to 1.
    NumNodes *int32 `json:"numNodes,omitempty"`

    MLPolicySource `json:",inline"`
}

type MLPolicySource struct {
    PyTorch ...
}

Maybe we want a dedicated field for the JobSetSpec so that we can identify the JobSetSpec. (A standalone sketch of how `,inline` behaves under encoding/json appears just after this struct.)

    // Configuration for the PodGroup to enable gang-scheduling via supported plugins.
    PodGroupSpec *PodGroupSpec `json:"podGroupSpec,omitempty"`

Review comment: Should this be inlined?

}
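
To ground the `,inline` discussion above: with Go's encoding/json (which sigs.k8s.io/yaml also relies on), flattening happens only for *embedded* fields; the `,inline` tag itself is a Kubernetes convention that the stock marshaler ignores, so a named field such as `JobSetSpec *jobsetv1alpha2.JobSetSpec` would not be flattened by encoding/json alone. A small self-contained sketch with illustrative types:

package main

import (
    "encoding/json"
    "fmt"
)

type Inner struct {
    Replicas int `json:"replicas"`
}

type Flattened struct {
    Inner `json:",inline"` // embedded: fields are promoted into the parent object
    Name  string           `json:"name"`
}

type Nested struct {
    Inner Inner  `json:"inner"` // named: serializes as a nested object
    Name  string `json:"name"`
}

func main() {
    f, _ := json.Marshal(Flattened{Inner{2}, "job"})
    n, _ := json.Marshal(Nested{Inner{2}, "job"})
    fmt.Println(string(f)) // {"replicas":2,"name":"job"}
    fmt.Println(string(n)) // {"inner":{"replicas":2},"name":"job"}
}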

// PodGroupSpec represents a PodGroup configuration to enable gang-scheduling.
type PodGroupSpec struct {
    // Plugin for the gang-scheduling.

Review comment: Are we going forward with a default?
Reply: By default, gang-scheduling is disabled for TrainJob, since it requires a plugin to be installed (coscheduling or Volcano).
Reply: That makes sense.

    Plugin GangSchedulerPlugin `json:"plugin"`

    // Time threshold to schedule the PodGroup for gang-scheduling.
    ScheduleTimeoutSeconds *string `json:"scheduleTimeoutSeconds,omitempty"`
}

// GangSchedulerPlugin represents one of the supported gang-scheduling plugins.
type GangSchedulerPlugin string

const (
    // Volcano plugin for gang-scheduling.
    GangSchedulerPluginVolcano GangSchedulerPlugin = "volcano"

    // Coscheduling plugin from the Kubernetes scheduler-plugins for gang-scheduling.
    GangSchedulerPluginCoscheduling GangSchedulerPlugin = "coscheduling"
)

// MLSpec represents the runtime-specific configuration for various technologies.
// One of the following specs can be set.
type MLSpec struct {
    // Configuration for the PyTorch runtime.
    TorchSpec *TorchSpec `json:"torchSpec,omitempty"`

    // Configuration for the MPI runtime.
    MPISpec *MPISpec `json:"mpiSpec,omitempty"`
}

Review comment: One concern I have is that framework configurations/specs change quite often. Have we considered using a ConfigMap for this so that we don't have a lot of responsibility for maintaining compatibility, etc.?
Reply: Our goal is not to add all framework configurations here, but only parameters that require additional orchestration, such as the elastic policy or MPISpec. For example, in the future we can add SlurmSpec or FluxSpec here, as we discussed with @vsoch here: #2171 (comment). For @terrytangyuan: how do you think we can use a ConfigMap for those parameters?

// TorchSpec represents a PyTorch runtime configuration.
type TorchSpec struct {
    // Number of processes per node.
    // This value is inserted into the `--nproc-per-node` argument of the `torchrun` CLI.
    // Supported values: `auto`, `cpu`, `gpu`, or an int value.

Review comment: You could probably use KubeBuilder validations for the enums here.
Reply: As we discussed offline, we will add validations in separate PRs.

    // Defaults to `auto`.
    NumProcPerNode *string `json:"numProcPerNode,omitempty"`

    // Whether to run single-node multi-worker training.
    // This value toggles the `--standalone` flag of the `torchrun` CLI.
    // Defaults to false.
    Standalone *bool `json:"standalone,omitempty"`

    // Elastic policy for the PyTorch training.
    ElasticPolicy *TorchElasticPolicy `json:"elasticPolicy,omitempty"`
}

// TorchElasticPolicy represents a configuration for the PyTorch elastic training.
// If this policy is set, the `.spec.numNodes` parameter must be omitted, since the min and max
// nodes are used to configure the `torchrun` CLI argument: `--nnodes=minNodes:maxNodes`.
// Only the `c10d` backend is supported for the rendezvous communication.
type TorchElasticPolicy struct {
    // How many times the training job can be restarted.
    // This value is inserted into the `--max-restarts` argument of the `torchrun` CLI and
    // the `.spec.failurePolicy.maxRestarts` parameter of the training Job.
    MaxRestarts *int32 `json:"maxRestarts,omitempty"`

    // Lower limit for the number of nodes to which the training job can scale down.
    MinNodes *int32 `json:"minNodes,omitempty"`

    // Upper limit for the number of nodes to which the training job can scale up.
    MaxNodes *int32 `json:"maxNodes,omitempty"`

    // Specifications which are used to calculate the desired number of nodes. See the individual
    // metric source types for more information about how each type of metric must respond.
    // An HPA will be created to perform auto-scaling.
    Metrics []autoscalingv2.MetricSpec `json:"metrics,omitempty"`
}
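
As a sketch of how these fields map onto the `torchrun` flags named in the comments above; `torchrunElasticArgs` is a hypothetical helper for illustration, not part of this API or its controller:

package v2alpha1

import "fmt"

// torchrunElasticArgs derives the elastic-related torchrun flags from the policy:
// minNodes/maxNodes feed `--nnodes=min:max`, maxRestarts feeds `--max-restarts`,
// and only the c10d rendezvous backend is supported per the API comment.
func torchrunElasticArgs(p *TorchElasticPolicy) []string {
    args := []string{"--rdzv-backend=c10d"}
    if p.MinNodes != nil && p.MaxNodes != nil {
        args = append(args, fmt.Sprintf("--nnodes=%d:%d", *p.MinNodes, *p.MaxNodes))
    }
    if p.MaxRestarts != nil {
        args = append(args, fmt.Sprintf("--max-restarts=%d", *p.MaxRestarts))
    }
    return args
}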

// MPISpec represents an MPI runtime configuration.
type MPISpec struct {
    // Number of processes per node.
    // This value is equal to the number of slots for each node in the hostfile.
    NumProcPerNode *int32 `json:"numProcPerNode,omitempty"`

    // Implementation name for the MPI to create the appropriate hostfile.
    MPIImplementation *MPIImplementation `json:"mpiImplementation"`

Review comment: Since MPIImplementation is a const, does this need to be a pointer? With a pointer, a `nil` value is possible. Maybe you meant to include the `omitempty`?
Reply: Yes, I think we should remove the pointer from here. @tenzen-y What do you think?
Reply (suggested change):

// Implementation name for the MPI to create the appropriate hostfile.
// Defaults to OpenMPI.
MPIImplementation *MPIImplementation `json:"mpiImplementation,omitempty"`

In most cases of optional fields, we should use the pointer with `omitempty`.

    // Directory where SSH keys are mounted.
    SSHAuthMountPath *string `json:"SSHAuthMountPath,omitempty"`

    // Whether to run the training process on the launcher Job.
    // Defaults to false.
    RunLauncherAsNode *bool `json:"runLauncherAsNode,omitempty"`
}

// MPIImplementation represents one of the supported MPI implementations.
type MPIImplementation string

const (
    MPIImplementationOpenMPI MPIImplementation = "OpenMPI"
    MPIImplementationIntel   MPIImplementation = "Intel"
    MPIImplementationMPICH   MPIImplementation = "MPICH"
)
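
To illustrate the slots relationship described on `NumProcPerNode`: a rough, hypothetical hostfile writer (not the actual controller logic), assuming the common syntaxes — OpenMPI's `host slots=N` form versus the `host:N` form typically used by Intel MPI and MPICH:

package v2alpha1

import (
    "fmt"
    "strings"
)

// buildHostfile writes one line per training node, giving each node
// slotsPerNode slots in the syntax its MPI implementation expects.
func buildHostfile(hosts []string, slotsPerNode int32, impl MPIImplementation) string {
    var b strings.Builder
    for _, h := range hosts {
        switch impl {
        case MPIImplementationOpenMPI:
            fmt.Fprintf(&b, "%s slots=%d\n", h, slotsPerNode)
        case MPIImplementationIntel, MPIImplementationMPICH:
            fmt.Fprintf(&b, "%s:%d\n", h, slotsPerNode)
        }
    }
    return b.String()
}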

// TODO: Enable this after controller implementation.
// func init() {
//     SchemeBuilder.Register(&ClusterTrainingRuntime{}, &ClusterTrainingRuntimeList{}, &TrainingRuntime{}, &TrainingRuntimeList{})
// }

Review comment: Should the spec ever be empty for ClusterTrainingRuntime?
Reply: Not really, but I noticed that for all Kubernetes APIs the `spec` is set with `omitempty`: https://github.com/kubernetes/api/blob/master/apps/v1/types.go#L820. @tenzen-y @kannon92 Any specific reason why we do this?
Reply: IIUC, in any case, the `spec` field is defined as an optional field in Kubernetes. So the optional TrainingRuntime spec would be better.
Reply: TIL.
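
A side note that may explain the thread above: for Go's encoding/json, `omitempty` never drops a non-pointer struct field, so a value-typed `spec` is serialized either way and the tag mostly documents optionality. A quick self-contained check with illustrative types:

package main

import (
    "encoding/json"
    "fmt"
)

type Spec struct {
    Replicas int `json:"replicas,omitempty"`
}

type WithOmitEmpty struct {
    Spec Spec `json:"spec,omitempty"` // struct values are never "empty" to encoding/json
}

type WithPointer struct {
    Spec *Spec `json:"spec,omitempty"` // a nil pointer, by contrast, is dropped
}

func main() {
    a, _ := json.Marshal(WithOmitEmpty{})
    b, _ := json.Marshal(WithPointer{})
    fmt.Println(string(a)) // {"spec":{}}
    fmt.Println(string(b)) // {}
}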