MLBatch supports PyTorchJobs, RayJobs, RayClusters, as well as AppWrappers, which can wrap and bundle together resources such as Pods, Jobs, Deployments, StatefulSets, ConfigMaps, or Secrets.
This document first explains queues, then discusses a few example workloads, monitoring, borrowing, priorities, and fault-tolerance.
It is not required to clone this repository to use an MLBatch system. However, if you want local copies of the examples so that you can easily try them, you can recursively clone and enter this repository:
```sh
git clone --recursive https://github.com/project-codeflare/mlbatch.git
cd mlbatch
```
Properly configuring a distributed PyTorchJob to make effective use of the MLBatch system and hardware accelerators (GPUs, RoCE GDR) can be tedious. To automate this process, we provide a Helm chart that captures best practices and common configuration options. Using this Helm chart helps eliminate common mistakes. Please see pytorchjob-generator for detailed usage instructions.
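For illustration only, a typical invocation renders the chart with a user-provided settings file and pipes the result to the cluster; the chart path and settings file name below are assumptions for this sketch, so consult the pytorchjob-generator documentation for the actual layout and schema:

```sh
# Sketch: render the chart with your settings and submit the generated
# AppWrapper to the team namespace. The chart location and the settings
# file are illustrative assumptions.
helm template -f settings.yaml tools/pytorchjob-generator/chart | oc apply -n team1 -f -
```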
If you have a Kubernetes YAML file containing one or more non-AppWrapper resources (e.g., Deployments, Pods, Services), you can use the appwrapper-packager tool to generate an AppWrapper YAML file containing those resources.
All workloads must target a local queue in their namespace. The local queue name is specified as a label as follows:
```yaml
apiVersion: ???
kind: ???
metadata:
  name: ???
  labels:
    kueue.x-k8s.io/queue-name: default-queue # queue name
```
In MLBatch, the default local queue name is `default-queue`.
Workloads submitted as AppWrappers do not need to explicitly specify the local queue name, as it will be automatically added if missing. However, other workload types (PyTorchJobs, RayJobs, RayClusters) must specify the local queue name as demonstrated above.
Workloads missing a local queue name will not be admitted. If you forget to label the workload, you must either delete and resubmit it or use `oc edit` to add the missing label to the metadata section of your workload object.
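Alternatively, the missing label can be added in place with `oc label`; the resource kind, name, and namespace below are illustrative assumptions:

```sh
# Add the missing queue-name label to an existing workload object
# (adjust the resource kind and name to match your workload).
oc -n team1 label pytorchjob sample-pytorchjob kueue.x-k8s.io/queue-name=default-queue
```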
Submitted workloads are queued and dispatched when enough quota is available, which eventually results in the creation of pods that are submitted to the cluster's scheduler. By default, this scheduler will schedule pods one at a time and spread pods across nodes to even the load across the cluster. Both behaviors are undesirable for large AI workloads such as pre-training jobs. MLBatch includes and configures Coscheduler to enable gang scheduling and packing. Concretely, Coscheduler as configured will strive to schedule all pods in a job at once using a minimal number of nodes.
PyTorchJobs, RayJobs, and RayClusters may be submitted directly to MLBatch. Please note, however, that these workloads will not benefit from the advanced logic provided by AppWrappers, for instance pertaining to fault-tolerance. Hence, wrapping objects into AppWrappers is the recommended way of submitting workloads.
To submit an unwrapped PyTorchJob to MLBatch, simply include the queue name:
```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: sample-pytorchjob
  labels:
    kueue.x-k8s.io/queue-name: default-queue # queue name (required)
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-fc858d1
              command:
                - "python3"
                - "/opt/pytorch-mnist/mnist.py"
                - "--epochs=1"
              resources:
                requests:
                  cpu: 1
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-fc858d1
              command:
                - "python3"
                - "/opt/pytorch-mnist/mnist.py"
                - "--epochs=1"
              resources:
                requests:
                  cpu: 1
```
Try the above with:

```sh
oc apply -n team1 -f samples/pytorchjob.yaml
```
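While the job runs, its progress can be followed by inspecting the PyTorchJob object and its pods; the label selector below assumes the Kubeflow Training Operator's usual `training.kubeflow.org/job-name` pod label:

```sh
# Inspect the PyTorchJob and the pods created for it
oc -n team1 get pytorchjobs
oc -n team1 get pods -l training.kubeflow.org/job-name=sample-pytorchjob
```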
MLBatch implicitly enables gang scheduling and packing for PyTorchJobs by configuring the Kubeflow Training Operator to automatically inject the necessary scheduling directives into all Pods it creates for PyTorchJobs.
A Job, a Pod, or a Deployment can be created using an AppWrapper, for example:
```yaml
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  name: sample-job
spec:
  components:
    - template:
        # job specification
        apiVersion: batch/v1
        kind: Job
        metadata:
          name: sample-job
        spec:
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: busybox
                  image: quay.io/project-codeflare/busybox:1.36
                  command: ["sh", "-c", "sleep 30"]
                  resources:
                    requests:
                      cpu: 1
```
Try the above with:

```sh
oc apply -n team1 -f samples/job.yaml
```
Concretely, the AppWrapper adds a simple prefix to the Job specification. See AppWrappers for more information and use cases.
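The progress of a wrapped workload can also be followed through the AppWrapper object itself; for instance, assuming the sample above was submitted to the `team1` namespace:

```sh
# List AppWrappers in the namespace; the reported status corresponds to the
# AppWrapper phase (e.g., Resuming, Running, Succeeded, Failed)
oc -n team1 get appwrappers
```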
MLBatch implicitly enables packing for AppWrappers. For workloads consisting of multiple pods, add a PodGroup to enable gang scheduling, for instance:
```yaml
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  name: sample-job
spec:
  components:
    - template:
        # pod group specification
        apiVersion: scheduling.x-k8s.io/v1alpha1
        kind: PodGroup
        metadata:
          name: sample-job
        spec:
          minMember: 2 # replica count
    - template:
        # job specification
        apiVersion: batch/v1
        kind: Job
        metadata:
          name: sample-job
        spec:
          parallelism: 2 # replica count
          completions: 2 # replica count
          template:
            metadata:
              labels:
                scheduling.x-k8s.io/pod-group: sample-job # pod group label
            spec:
              restartPolicy: Never
              containers:
                - name: busybox
                  image: quay.io/project-codeflare/busybox:1.36
                  command: ["sh", "-c", "sleep 5"]
                  resources:
                    requests:
                      cpu: 1
```
Try the above with:

```sh
oc apply -n team1 -f samples/job-with-podgroup.yaml
```
Check the status of the local queue for the namespace with:

```sh
oc -n team1 get localqueue

NAME                                      CLUSTERQUEUE          PENDING WORKLOADS   ADMITTED WORKLOADS
localqueue.kueue.x-k8s.io/default-queue   team1-cluster-queue   0                   1
```
Check the status of the workloads in the namespace with:

```sh
oc -n team1 get workloads

NAME                                 QUEUE           ADMITTED BY           AGE
pytorchjob-sample-pytorchjob-9fc41   default-queue   team1-cluster-queue   11m
```
As usual, replace `get` with `describe` for more details on the local queue or workloads. See Kueue for more information.
A workload can borrow unused quotas from other namespaces if not enough quota is available in the team namespace, unless disallowed by the ClusterQueue of the team namespace (`borrowingLimit`) or of the target namespace(s) (`lendingLimit`).
Borrowed quotas are immediately returned to the target namespace(s) upon request. In other words, the submission of a workload in a target namespace will preempt borrowers if necessary to obtain the quota requested by the new workload.
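For reference, both limits are declared on ClusterQueue resources by the cluster administrator. The fragment below is a minimal sketch assuming a Kueue `v1beta1` ClusterQueue named `team1-cluster-queue`; the cohort, flavor name, and quota numbers are illustrative:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team1-cluster-queue  # name is an assumption for this sketch
spec:
  cohort: default-cohort     # borrowing and lending happen within a cohort
  namespaceSelector: {}
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: default-flavor
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 8
              borrowingLimit: 8 # borrow at most 8 extra GPUs from the cohort
              lendingLimit: 4   # lend at most 4 GPUs to other teams
```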
A workload can specify a priority by means of pod priorities, for instance for a wrapped job:
```yaml
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  name: sample-job
spec:
  components:
    - template:
        # job specification
        apiVersion: batch/v1
        kind: Job
        metadata:
          name: sample-job
        spec:
          template:
            spec:
              restartPolicy: Never
              priorityClassName: high-priority # workload priority
              containers:
                - name: busybox
                  image: quay.io/project-codeflare/busybox:1.36
                  command: ["sh", "-c", "sleep 5"]
                  resources:
                    requests:
                      cpu: 1
```
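The referenced priority class must exist on the cluster; MLBatch deployments typically preconfigure classes such as `high-priority`, so check what is available with `oc get priorityclasses` before defining new ones. For reference, a priority class is a standard Kubernetes object; a minimal, purely illustrative definition looks like the following (the `value` is an assumption):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000               # illustrative; actual values are cluster-specific
globalDefault: false
description: "Illustrative priority class for latency-sensitive workloads"
```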
Workloads of equal priority are considered for admission in submission order. Higher-priority workloads are considered for admission before lower-priority workloads irrespective of arrival time. However, workloads that cannot be admitted will not block the admission of newer and possibly lower-priority workloads (if they fit within the quota).
A workload will preempt lower-priority workloads in the same namespace to meet its quota if necessary. It may also preempt newer, equal-priority workloads in the same namespace.
Preemption across namespaces can only be triggered by the reclamation of borrowed quota, which is independent of priorities.
AppWrappers are the mechanism used by the MLBatch system to automate fault detection and retry/recovery of executing workloads. By adding automation, we can achieve higher levels of system utilization by greatly reducing the reliance on constant human monitoring of workload health. AppWrappers should be used to submit all workloads that are intended to run without close human supervision of their progress.
```mermaid
---
title: Overview of AppWrapper Fault Tolerance Phase Transitions
---
stateDiagram-v2
    rn : Running
    s  : Succeeded
    f  : Failed
    rt : Resetting
    rs : Resuming

    %% Happy Path
    rn --> s

    %% Requeuing
    rn --> f : Retries Exceeded
    rn --> rt : Workload Unhealthy
    rt --> rs : All Resources Removed
    rs --> rn : All Resources Recreated

    classDef quota fill:lightblue
    class rs quota
    class rn quota
    class rt quota

    classDef failed fill:pink
    class f failed

    classDef succeeded fill:lightgreen
    class s succeeded
```
Throughout the execution of the workload, the AppWrapper controller monitors the number and health of the workload's Pods. It also watches the top-level created resources and, for selected resource types, understands how to interpret their status information. This information is combined to determine if a workload is unhealthy. A workload can be deemed unhealthy if any of the following conditions are true:
- There are a non-zero number of `Failed` Pods.
- It takes longer than the `AdmissionGracePeriod` for the expected number of Pods to reach the `Pending` state.
- It takes longer than the `WarmupGracePeriod` for the expected number of Pods to reach the `Running` state.
- A non-zero number of `Running` Pods are using resources that Autopilot has tagged as `NoExecute`.
- The status information of a batch/v1 Job or PyTorchJob indicates that it has failed.
- A top-level wrapped resource is externally deleted.
If a workload is determined to be unhealthy by one of the first three Pod-level conditions above, the AppWrapper controller first waits for a `FailureGracePeriod` to allow the primary resource controller an opportunity to react and return the workload to a healthy state. The `FailureGracePeriod` is elided for the remaining conditions because the primary resource controller is not expected to take any further action. If the `FailureGracePeriod` passes and the workload is still unhealthy, the AppWrapper controller will reset the workload by deleting its resources, waiting for a `RetryPausePeriod`, and then creating new instances of the resources. During this retry pause, the AppWrapper does not release the workload's quota; this ensures that when the resources are recreated they will still have sufficient quota to execute. The number of times an AppWrapper is reset is tracked as part of its status; if the number of resets exceeds the `RetryLimit`, then the AppWrapper moves into a `Failed` state and its resources are deleted (thus finally releasing its quota). External deletion of a top-level wrapped resource will cause the AppWrapper to directly enter the `Failed` state independent of the `RetryLimit`.
To support debugging `Failed` workloads, an annotation can be added to an AppWrapper that adds a `DeletionOnFailureGracePeriod` between the time the AppWrapper enters the `Failed` state and when the process of deleting its resources begins. Since the AppWrapper continues to consume quota during this delayed deletion period, this annotation should be used sparingly and only when interactive debugging of the failed workload is being actively pursued.

All child resources of an AppWrapper that successfully completed will be automatically deleted a `SuccessTTLPeriod` after the AppWrapper entered the `Succeeded` state.
The parameters of the retry loop described above are configured at the system level, but can be customized by the user on a per-AppWrapper basis by adding annotations. The table below lists the parameters, gives their default values, and shows the annotation that can be used to customize them. The MLBatch Helm chart also supports customization of these values.
| Parameter | Default Value | Annotation |
| --- | --- | --- |
| AdmissionGracePeriod | 1 Minute | workload.codeflare.dev.appwrapper/admissionGracePeriodDuration |
| WarmupGracePeriod | 5 Minutes | workload.codeflare.dev.appwrapper/warmupGracePeriodDuration |
| FailureGracePeriod | 1 Minute | workload.codeflare.dev.appwrapper/failureGracePeriodDuration |
| RetryPausePeriod | 90 Seconds | workload.codeflare.dev.appwrapper/retryPausePeriodDuration |
| RetryLimit | 3 | workload.codeflare.dev.appwrapper/retryLimit |
| DeletionOnFailureGracePeriod | 0 Seconds | workload.codeflare.dev.appwrapper/deletionOnFailureGracePeriodDuration |
| SuccessTTL | 7 Days | workload.codeflare.dev.appwrapper/successTTLDuration |
| GracePeriodMaximum | 24 Hours | Not Applicable |
The `GracePeriodMaximum` imposes a system-wide upper limit on all other grace periods to limit the potential impact of user-added annotations on overall system utilization.
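For example, the per-AppWrapper overrides are expressed as metadata annotations using the keys from the table above. The sketch below reuses the earlier sample-job; the override values are illustrative, and the duration format is assumed to follow Go duration syntax (e.g., "30s", "5m", "1h"):

```yaml
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  name: sample-job
  annotations:
    # Illustrative overrides of the system-level retry parameters
    workload.codeflare.dev.appwrapper/retryLimit: "1"
    workload.codeflare.dev.appwrapper/retryPausePeriodDuration: "30s"
    workload.codeflare.dev.appwrapper/deletionOnFailureGracePeriodDuration: "1h"
spec:
  components:
    - template:
        # same job specification as in the earlier sample-job example
        apiVersion: batch/v1
        kind: Job
        metadata:
          name: sample-job
        spec:
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: busybox
                  image: quay.io/project-codeflare/busybox:1.36
                  command: ["sh", "-c", "sleep 30"]
                  resources:
                    requests:
                      cpu: 1
```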