diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
index c28a9b029c..0327c913ab 100644
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -1,7 +1,7 @@
diff --git a/docs/adopters.md b/ADOPTERS.md
similarity index 100%
rename from docs/adopters.md
rename to ADOPTERS.md
diff --git a/docs/development/developer_guide.md b/CONTRIBUTING.md
similarity index 100%
rename from docs/development/developer_guide.md
rename to CONTRIBUTING.md
diff --git a/README.md b/README.md
index 4778e070e1..f2a7566eb1 100644
--- a/README.md
+++ b/README.md
@@ -8,93 +8,78 @@

 Kubeflow Training Operator is a Kubernetes-native project for fine-tuning and
 scalable distributed training of machine learning (ML) models created with various ML frameworks
-such as PyTorch, Tensorflow, XGBoost, MPI, Paddle and others.
+such as PyTorch, TensorFlow, HuggingFace, JAX, DeepSpeed, XGBoost, PaddlePaddle and others.

-Training Operator allows you to use Kubernetes workloads to effectively train your large models
-via [Kubernetes Custom Resources APIs](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
-or using Training Operator Python SDK.
-
-> Note: Before v1.2 release, Kubeflow Training Operator only supports TFJob on Kubernetes.
+You can run high-performance computing (HPC) tasks with the Training Operator and `MPIJob` since it
+supports running Message Passing Interface (MPI) on Kubernetes, which is heavily used for HPC.
+The Training Operator implements the V1 API version of the MPI Operator. For the MPI Operator V2,
+please follow [this guide](https://www.kubeflow.org/docs/components/training/user-guides/mpi/) to
+install it.

-- For a complete reference of the custom resource definitions, please refer to the API Definition.
-  - [TensorFlow API Definition](pkg/apis/kubeflow.org/v1/tensorflow_types.go)
-  - [PyTorch API Definition](pkg/apis/kubeflow.org/v1/pytorch_types.go)
-  - [XGBoost API Definition](pkg/apis/kubeflow.org/v1/xgboost_types.go)
-  - [MPI API Definition](pkg/apis/kubeflow.org/v1/mpi_types.go)
-  - [PaddlePaddle API Definition](pkg/apis/kubeflow.org/v1/paddlepaddle_types.go)
-- For details of all-in-one operator design, please refer to the [All-in-one Kubeflow Training Operator](https://docs.google.com/document/d/1x1JPDQfDMIbnoQRftDH1IzGU0qvHGSU4W6Jl4rJLPhI/edit#heading=h.e33ufidnl8z6)
-- For details on its observability, please refer to the [monitoring design doc](docs/monitoring/README.md).
+The Training Operator allows you to use Kubernetes workloads to effectively train your large models
+via [Kubernetes Custom Resources APIs](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
+or using the Training Operator Python SDK.

 ## Prerequisites

-- Version >= 1.25 of Kubernetes cluster and `kubectl`
+Please check [the official Kubeflow documentation](https://www.kubeflow.org/docs/components/training/installation/#prerequisites)
+for the prerequisites to install the Training Operator.

 ## Installation

-### Master Branch
+Please follow [the Kubeflow Training Operator guide](https://www.kubeflow.org/docs/components/training/installation/#installing-the-training-operator)
+for detailed instructions on how to install the Training Operator.
-```bash
-kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
-```
+### Installing the Control Plane

-### Stable Release
+Run the following command to install the latest stable release of the Training Operator control plane: `v1.8.0`.

 ```bash
-kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"
+kubectl apply -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.8.0"
 ```

-### TensorFlow Release Only
-
-For users who prefer to use original TensorFlow controllers, please checkout `v1.2-branch`, patches for bug fixes will still be accepted to this branch.
+Run the following command to install the latest changes of the Training Operator control plane:

 ```bash
-kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.2.0"
+kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
 ```

-### Python SDK for Kubeflow Training Operator
+### Installing the Python SDK

-Training Operator provides Python SDK for the custom resources. To learn more about available
-SDK APIs check [the `TrainingClient`](sdk/python/kubeflow/training/api/training_client.py).
+The Training Operator [implements a Python SDK](https://pypi.org/project/kubeflow-training/)
+to simplify the creation of distributed training and fine-tuning jobs for data scientists.

-Use `pip install` command to install the latest release of the SDK:
+Run the following command to install the latest stable release of the Training SDK:

 ```
-pip install kubeflow-training
+pip install -U kubeflow-training
 ```

-Training Operator controller and Python SDK have the same release versions.
-
-## Quickstart
+## Getting Started

-Please refer to the [getting started guide](https://www.kubeflow.org/docs/components/training/overview/#getting-started)
-to quickly create your first Training Operator Job using Python SDK.
+Please refer to [the getting started guide](https://www.kubeflow.org/docs/components/training/getting-started/#getting-started-with-pytorchjob)
+to quickly create your first distributed training job using the Python SDK.

 If you want to work directly with Kubernetes Custom Resources provided by Training Operator, follow [the PyTorchJob MNIST guide](https://www.kubeflow.org/docs/components/training/pytorch/#creating-a-pytorch-training-job).

-## API Documentation
-
-Please refer to following API Documentation:
-
-- [Kubeflow.org v1 API Documentation](docs/api/kubeflow.org_v1_generated.asciidoc)
-
 ## Community

-The following links provide information about getting involved in the community:
+The following links provide information on how to get involved in the community:

-- Attend [the AutoML and Training Working Group](https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit) community meeting.
+- Attend [the bi-weekly AutoML and Training Working Group](https://bit.ly/2PWVCkV) community meeting.
 - Join our [`#kubeflow-training` Slack channel](https://www.kubeflow.org/docs/about/community/#kubeflow-slack).
-- Check out [who is using the Training Operator](./docs/adopters.md).
+- Check out [who is using the Training Operator](ADOPTERS.md).

 This is a part of Kubeflow, so please see [readme in kubeflow/kubeflow](https://github.com/kubeflow/kubeflow#get-involved) to get in touch with the community.

 ## Contributing

-Please refer to the [DEVELOPMENT](docs/development/developer_guide.md)
+Please refer to the [CONTRIBUTING guide](CONTRIBUTING.md).
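As a companion to the Python SDK section above, here is a minimal sketch of launching a distributed PyTorchJob with the SDK's `TrainingClient`. The `create_job` arguments (`train_func`, `num_workers`, and friends) follow the v1.8-era `kubeflow-training` package and may differ in other releases, so treat this as illustrative rather than canonical:

```python
from kubeflow.training import TrainingClient


def train_func():
    # Runs inside each worker Pod; keep imports local so the function
    # can be serialized and shipped to the cluster as-is.
    import torch

    print(f"CUDA available: {torch.cuda.is_available()}")


# Assumes a kubeconfig pointing at a cluster with the Training Operator installed.
client = TrainingClient()
client.create_job(
    job_kind="PyTorchJob",
    name="pytorch-ddp",
    train_func=train_func,
    num_workers=2,  # number of PyTorchJob worker replicas
)
client.get_job_logs(name="pytorch-ddp", follow=True)
```

Under the hood the SDK renders the same `PyTorchJob` custom resource you could otherwise write by hand, so both entry points end up at the same controller.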
 ## Change Log

-Please refer to [CHANGELOG](CHANGELOG.md)
+Please refer to [CHANGELOG](CHANGELOG.md).

 ## Version Matrix

@@ -102,21 +87,39 @@ The following table lists the most recent few versions of the operator.

 | Operator Version | API Version | Kubernetes Version |
 | ---------------------- | ----------- | ------------------ |
-| `v1.0.x` | `v1` | 1.16+ |
-| `v1.1.x` | `v1` | 1.16+ |
-| `v1.2.x` | `v1` | 1.16+ |
-| `v1.3.x` | `v1` | 1.18+ |
 | `v1.4.x` | `v1` | 1.23+ |
 | `v1.5.x` | `v1` | 1.23+ |
 | `v1.6.x` | `v1` | 1.23+ |
 | `v1.7.x` | `v1` | 1.25+ |
-| `latest` (master HEAD) | `v1` | 1.25+ |
+| `v1.8.x` | `v1` | 1.27+ |
+| `latest` (master HEAD) | `v1` | 1.27+ |

-## Acknowledgement
+## Reference

-This project was originally started as a distributed training operator for TensorFlow and later we merged efforts from other Kubeflow training operators to provide a unified and simplified experience for both users and developers. We are very grateful to all who filed issues or helped resolve them, asked and answered questions, and were part of inspiring discussions. We'd also like to thank everyone who's contributed to and maintained the original operators.
+For a complete reference of the custom resource definitions, please refer to the API Definition.
+
+- [TensorFlow API Definition](pkg/apis/kubeflow.org/v1/tensorflow_types.go)
+- [PyTorch API Definition](pkg/apis/kubeflow.org/v1/pytorch_types.go)
+- [XGBoost API Definition](pkg/apis/kubeflow.org/v1/xgboost_types.go)
+- [MPI API Definition](pkg/apis/kubeflow.org/v1/mpi_types.go)
+- [PaddlePaddle API Definition](pkg/apis/kubeflow.org/v1/paddlepaddle_types.go)
+
+For details on the Training Operator custom resources APIs, refer to
+[the following API documentation](docs/api/kubeflow.org_v1_generated.asciidoc).
+
+## Acknowledgement

-- PyTorch Operator: [list of contributors](https://github.com/kubeflow/pytorch-operator/graphs/contributors) and [maintainers](https://github.com/kubeflow/pytorch-operator/blob/master/OWNERS).
-- MPI Operator: [list of contributors](https://github.com/kubeflow/mpi-operator/graphs/contributors) and [maintainers](https://github.com/kubeflow/mpi-operator/blob/master/OWNERS).
-- XGBoost Operator: [list of contributors](https://github.com/kubeflow/xgboost-operator/graphs/contributors) and [maintainers](https://github.com/kubeflow/xgboost-operator/blob/master/OWNERS).
-- Common library: [list of contributors](https://github.com/kubeflow/common/graphs/contributors) and [maintainers](https://github.com/kubeflow/common/blob/master/OWNERS).
+This project was originally started as a distributed training operator for TensorFlow, and later we
+merged efforts from other Kubeflow training operators to provide a unified and simplified experience
+for both users and developers. We are very grateful to all who filed issues or helped resolve them,
+asked and answered questions, and were part of inspiring discussions.
+We'd also like to thank everyone who's contributed to and maintained the original operators.
+
+- PyTorch Operator: [list of contributors](https://github.com/kubeflow/pytorch-operator/graphs/contributors)
+  and [maintainers](https://github.com/kubeflow/pytorch-operator/blob/master/OWNERS).
+- MPI Operator: [list of contributors](https://github.com/kubeflow/mpi-operator/graphs/contributors)
+  and [maintainers](https://github.com/kubeflow/mpi-operator/blob/master/OWNERS).
+- XGBoost Operator: [list of contributors](https://github.com/kubeflow/xgboost-operator/graphs/contributors) + and [maintainers](https://github.com/kubeflow/xgboost-operator/blob/master/OWNERS). +- Common library: [list of contributors](https://github.com/kubeflow/common/graphs/contributors) and + [maintainers](https://github.com/kubeflow/common/blob/master/OWNERS). diff --git a/docs/roadmap.md b/ROADMAP.md similarity index 100% rename from docs/roadmap.md rename to ROADMAP.md diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000000..383701aec3 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,5 @@ +# Training Operator Documentation + +Welcome to Kubeflow Training Operator! + +The Training Operator documentation is available on [kubeflow.org](https://www.kubeflow.org/docs/components/training/). diff --git a/docs/design/tf_job_design_doc.md b/docs/design/tf_job_design_doc.md deleted file mode 100644 index 93cc8f494f..0000000000 --- a/docs/design/tf_job_design_doc.md +++ /dev/null @@ -1,117 +0,0 @@ -# Design Doc TFJob K8s CRD - - - -# Objective - -The goal is to make it easy to run TensorFlow training (and distributed training in particular) on Kubernetes (K8s). I propose doing this by creating a K8s custom resource descriptor (CRD) and associated controller. The CRD takes care of managing the K8s resources needed to run a training job. - -# Background - -Kubernetes makes it easy to manage processes by providing a process (as opposed to VM centric) view of the world. Kubernetes also provides essential building blocks for complex distributed applications. For example, K8s provides built in support for DNS, health checking, logs collections, metrics collection, storage, etc.... - -In K8s, [Controllers](https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/) are responsible for ensuring a set of [Pods](https://kubernetes.io/docs/concepts/workloads/pods/pod-overview/) are running. A Pod is the basic building block in K8s and describes one or more processes that should be colocated (same ip). K8s comes with a number of built in controllers. For example, a [ReplicaSet](https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/) can ensure N Pods are running with a particular specification. A [Job controller](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/) can be used to run a binary to completion. - -The built in [Controllers](https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/) are insufficient for running a distributed TensorFlow job. TensorFlow is a stateful application; each parameter server and worker needs to be uniquely addressable to support all the different patterns of distributed training. K8s has a [stateful sets controller](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/). However, stateful sets are intended for stateful services that run forever (e.g. a sharded in memory cache service like Redis) as opposed to jobs intended to run to completion. - -Consequently, running a distributed TF job on K8s today means cobbling together a solution out of the built in primitives. Typically, this means managing multiple resources manually. For example, a user could create 1 stateful set for parameter servers, 1 stateful set for the workers, and 1 job for the master. 
- -To address the limitations of the built in resources, K8s supports [Custom Resources (CRD) and Controllers.](https://kubernetes.io/docs/concepts/api-extension/custom-resources/) Using a CRD, it is easy to create a controller with the desired semantics for a particular workload while hiding users from the implementation. The K8s community has quickly adopted this pattern contributing [numerous CRDs](https://github.com/coreos/awesome-kubernetes-extensions) for various workloads. - - -# Requirements and Scale - -I think O(100) jobs is a reasonable upper bound for the number of TF training jobs the average K8s customer will be running simultaneously in a single cluster. - -The input from the K8s team that developed CRDs and various controllers is that most controllers use a non-distributed, multi-threaded design and that scaling is not a problem. - - -# Design - - -## TFJob Resource - -The TFJob CRD defines a TFJob resource for K8s. -The [TFJob](https://github.com/kubeflow/training-operator/blob/master/pkg/apis/tensorflow/v1/types.go#L29) -resource is a collection of TfReplicas. Each TfReplica corresponds to a -set of TensorFlow processes performing a role in the job; -e.g. master, parameter server or worker. The set of replica types can be expanded (it is just an enum) to support new TF patterns such as eval workers. Figure 1. shows an example yaml spec for a distributed job. - - -``` -apiVersion: "kubeflow.org/v1alpha1" -kind: "TFJob" -metadata: - name: "example-job" -spec: - replicaSpecs: - - replicas: 1 - tfReplicaType: MASTER - template: - spec: - containers: - - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff - name: tensorflow - args: - - --log_dir=gs://my-job/log-dir - restartPolicy: OnFailure - - replicas: 2 - tfReplicaType: WORKER - template: - spec: - containers: - - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff - name: tensorflow - args: - - --log_dir=gs://my-job/log-dir - restartPolicy: OnFailure - - replicas: 1 - tfReplicaType: PS -``` -**Fig 1.** An example job spec for a distributed Training job with 1 master, 2 workers and 1 PS. - -As illustrated by Fig 1, I made an explicit decision not to try to hide or replace K8s abstractions. For example, each TfReplica contains a standard K8s [PodTemplate](https://kubernetes.io/docs/api-reference/v1.7/#podtemplate-v1-core) to specify the processes (including TF) to run in each replica. I did this because K8s already provides a widely adopted and understood API. So introducing new concepts in place of K8s concepts is just confusing. Furthermore, exposing the [PodTemplate](https://kubernetes.io/docs/api-reference/v1.7/#podtemplate-v1-core) makes it easy for TFJob users to leverage K8s features. For example, TFJob users can use K8s to attach volumes to their TF processes. This makes it easy to use TF in conjunction with any storage system supported by K8s (e.g. PDs, NFS, etc...) - -**Defaults** - -The controller can be used to configure defaults for TFJob to create a simpler user experience. The most common use for this right now is supporting GPUs. To use GPUs, the NVIDIA drivers and libraries need to be mounted from the host into the container. This step should become unnecessary with Kubernetes 1.8. The TFJob controller will automatically add these volume mounts based on configuration specified when the controller is started. This prevents users from having to specify them for each job. Instead, only the cluster administrator who deploys the TFJob controller needs to know how the volumes should be configured. 
- -Another use case is minimizing the boilerplate users have to write to run standard processes (e.g. [Parameter Servers](https://github.com/kubeflow/training-operator/pull/36#discussion_r141135711)) using official TF Docker images. - - -## Controller - -The controller manages a distributed TFJob by creating a series of Job controllers Fig 2. The TFJob controller sets the environment variable TF_CONFIG to make the TensorFlow cluster spec and replica type (PS, WORKER, MASTER) and replica index available to TensorFlow code. The Job controller takes care of restarting TensorFlow processes that terminate due to an error. Additional logic in the TFJob controller looks at exit codes and fails the job if a TF process exits with an exit code indicating a permanent error. The TFJob controller treats exit codes of 1-127 as permanent errors; this is an arbitrary convention. - -When the master exits successfully or with a permanent error the job is considered finished. There is an open issue([issues/61](https://github.com/kubeflow/training-operator/issues/61)) to make the changes necessary to support evaluation with the Estimator API in 1.4. The pods aren't deleted until the TFJob is deleted. This allows the logs to be fetched via kubectl logs. - -![Resources for TFJob](./../diagrams/tfjob_k8s_resources.svg) - - -## Non-distributed training - -A TFJob can handle non-distributed training; the TFJob spec would consist of a single replica of type master. - - -## in-graph replication - -The current design can handle in-graph replication. In-graph vs between-graph replication is determined by the code the user runs in the workers and master. - - -## Testing - -TFJob is using [Prow](https://github.com/kubernetes/test-infra), K8s test infrastructure, to run E2E tests continuously; e.g. presubmits, postsubmits etc... The K8s test-infra team has allowed us to use the Prow instance they maintain so we don't need to support our own instance. - -One advantage of Prow over Jenkins is that its API is Kubernetes centric meaning it uses concepts (e.g. Pods, Secrets, etc...) that are very familiar to K8s developers. So Prow is much more intuitive to TFJob developers than Jenkins. - - -# Alternatives Considered - - -## Helm and/or Ksonnet - -Rather than use a CRD, we could use a tool like Helm or Ksonnet to create templates to simplify creating the different K8s resources needed to manage a TensorFlow job. This is in line with the current approach in [tensorflow/ecosystem](https://github.com/tensorflow/ecosystem/tree/master/kubernetes). - -One disadvantage of templates is that they do not provide a mechanism to add custom control logic. None of the K8s builtin controllers provide a mechanism for distinguishing between retryable and permanent errors. Furthermore, the built in controllers don't propagate errors; if worker i fails with a permanent error this error won't cause the parameter servers and master controllers to be terminated. - -Another major disadvantage is that the templating approach forces users to manually manage multiple K8s resources. 
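For concreteness about the `TF_CONFIG` contract described in the controller section of the design doc above, a replica can read its cluster spec and role roughly like this. This is a sketch: the host names in the comment are illustrative, while the field layout follows TensorFlow's distributed runtime convention:

```python
import json
import os

# TF_CONFIG as injected by the TFJob controller, e.g.:
# {"cluster": {"master": ["example-job-master-0:2222"],
#              "worker": ["example-job-worker-0:2222", "example-job-worker-1:2222"],
#              "ps": ["example-job-ps-0:2222"]},
#  "task": {"type": "worker", "index": 0}}
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))

cluster_spec = tf_config.get("cluster", {})  # replica type -> list of host:port addresses
task = tf_config.get("task", {})             # this process's role and index

print(f"Running as {task.get('type')} #{task.get('index')} "
      f"with replica types {sorted(cluster_spec)}")
```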
diff --git a/docs/diagrams/tfjob_k8s_resources.svg b/docs/diagrams/tfjob_k8s_resources.svg deleted file mode 100644 index 17dbe8a8cc..0000000000 --- a/docs/diagrams/tfjob_k8s_resources.svg +++ /dev/null @@ -1,3 +0,0 @@ - - - diff --git a/docs/testing/e2e_debugging.md b/docs/testing/e2e_debugging.md deleted file mode 100644 index 8169591daf..0000000000 --- a/docs/testing/e2e_debugging.md +++ /dev/null @@ -1,122 +0,0 @@ -# How to debug an E2E test for Kubeflow Training Operator - -TODO (andreyvelich): This doc is outdated. Currently, E2Es are located here: -[`sdk/python/test/e2e`](../../sdk/python/test/e2e) - -[E2E Testing](./e2e_testing.md) gives an overview of writing e2e tests. This guidance concentrates more on the e2e failure debugging. - -## Prerequsite - -1. Install python 3.7 - -2. Clone `kubeflow/testing` repo under `$GOPATH/src/kubeflow/` - -3. Install [ksonnet](https://ksonnet.io/) - -``` -wget https://github.com/ksonnet/ksonnet/releases/download/v0.13.1/ks_0.13.1_linux_amd64.tar.gz -tar -xvzf ks_0.13.1_linux_amd64.tar.gz -sudo cp ks_0.13.1_linux_amd64/ks /usr/local/bin/ks-13 -``` - -> We would like to deprecate `ksonnet` but may takes some time. Feel free to pick up [the issue](https://github.com/kubeflow/training-operator/issues/1468) if you are interested in it. -> If your platform is darwin or windows, feel free to download binaries in [ksonnet v0.13.1](https://github.com/ksonnet/ksonnet/releases/tag/v0.13.1) - -4. Deploy HEAD training operator version in your environment - -``` -IMG=kubeflow/training-operator:e2e-debug-prid make docker-build - -# Optional - load image into kind cluster if you are using kind -kind load docker-image kubeflow/training-operator:e2e-debug-1462 - -kubectl set image deployment.v1.apps/training-operator training-operator=kubeflow/training-operator:e2e-debug-1462 -``` - -## Run E2E Tests locally - -1. Set environments - -``` -export KUBEFLOW_PATH=$GOPATH/src/github.com/kubeflow -export KUBEFLOW_TRAINING_REPO=$KUBEFLOW_PATH/training-operator -export KUBEFLOW_TESTING_REPO=$KUBEFLOW_PATH/testing -export PYTHONPATH=$KUBEFLOW_TRAINING_REPO:$KUBEFLOW_TRAINING_REPO/py:$KUBEFLOW_TESTING_REPO/py:$KUBEFLOW_TRAINING_REPO/sdk/python -``` - -2. Install python dependencies - -``` -pip3 install -r $KUBEFLOW_TESTING_REPO/py/kubeflow/testing/requirements.txt -``` - -> Note: if you have meet problem install requirement, you may need to `sudo apt-get install libffi-dev`. Feel free to share error logs if you don't know how to handle it. - -3. Run Tests - -``` -# enter the ksonnet app to run tests -cd $KUBEFLOW_TRAINING_REPO/test/workflows - -# run individual test that failed in the presubmit job. -python3 -m kubeflow.tf_operator.pod_names_validation_tests --app_dir=$KUBEFLOW_TRAINING_REPO/test/workflows --params=name=pod-names-validation-tests-v1,namespace=kubeflow --tfjob_version=v1 --num_trials=1 --artifacts_path=/tmp/output/artifacts -python3 -m kubeflow.tf_operator.cleanpod_policy_tests --app_dir=$KUBEFLOW_TRAINING_REPO/test/workflows --params=name=cleanpod-policy-tests-v1,namespace=kubeflow --tfjob_version=v1 --num_trials=1 --artifacts_path=/tmp/output/artifacts -python3 -m kubeflow.tf_operator.simple_tfjob_tests --app_dir=$KUBEFLOW_TRAINING_REPO/test/workflows --params=name=simple-tfjob-tests-v1,namespace=kubeflow --tfjob_version=v1 --num_trials=2 --artifacts_path=/tmp/output/artifact -``` - -## Check results - -You can either check logs or check results in `/tmp/output/artifact`. 
- -``` -$ ls -al /tmp/output/artifact -junit_test_simple_tfjob_cpu.xml - -$ cat /tmp/output/artifact/junit_test_simple_tfjob_cpu.xml - -``` - -## Common issues - -1. ksonnet is not installed - -``` -ERROR|2021-11-16T03:06:06|/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/test_runner.py|57| There was a problem running the job; Exception [Errno 2] No such file or directory: 'ks-13': 'ks-13' -Traceback (most recent call last): - File "/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/test_runner.py", line 38, in run_test - test_func() - File "/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/pod_names_validation_tests.py", line 53, in test_pod_names - self.params) - File "/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/util.py", line 579, in setup_ks_app - cwd=app_dir) - File "/home/jiaxin.shan/go/src/github.com/kubeflow/testing/py/kubeflow/testing/util.py", line 59, in run - command, cwd=cwd, env=env, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) - File "/usr/local/lib/python3.7/subprocess.py", line 775, in __init__ - restore_signals, start_new_session) - File "/usr/local/lib/python3.7/subprocess.py", line 1522, in _execute_child - raise child_exception_type(errno_num, err_msg, err_filename) -FileNotFoundError: [Errno 2] No such file or directory: 'ks-13': 'ks-13' -``` - -Please check `Prerequsite` section to install ksonnet. - -2. TypeError: load() missing 1 required positional argument: 'Loader' - -``` -ERROR|2021-11-16T03:04:12|/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/test_runner.py|57| There was a problem running the job; Exception load() missing 1 required positional argument: 'Loader' -Traceback (most recent call last): - File "/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/test_runner.py", line 38, in run_test - test_func() - File "/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/pod_names_validation_tests.py", line 51, in test_pod_names - ks_cmd = ks_util.get_ksonnet_cmd(self.app_dir) - File "/home/jiaxin.shan/go/src/github.com/kubeflow/testing/py/kubeflow/testing/ks_util.py", line 47, in get_ksonnet_cmd - results = yaml.load(app_yaml) -TypeError: load() missing 1 required positional argument: 'Loader' -``` - -This is the pyyaml compatibility issue. Please check if you are using pyyaml==6.0.0. If so, downgrade to `5.4.1` instead. - -``` -pip3 uninstall pyyaml -pip3 install pyyaml==5.4.1 --user -``` diff --git a/docs/testing/e2e_testing.md b/docs/testing/e2e_testing.md deleted file mode 100644 index 80c33ac488..0000000000 --- a/docs/testing/e2e_testing.md +++ /dev/null @@ -1,91 +0,0 @@ -# How to Write an E2E Test for Kubeflow Training Operator - -TODO (andreyvelich): This doc is outdated. Currently, E2Es are located here: -[`sdk/python/test/e2e`](../../sdk/python/test/e2e) - -The E2E tests for Kubeflow Training operator are implemented as Argo workflows. For more background and details -about Argo (not required for understanding the rest of this document), please take a look at -[this link](https://github.com/kubeflow/testing/blob/master/README.md). - -Test results can be monitored at the [Prow dashboard](http://prow.kubeflow-testing.com/?repo=kubeflow%2Ftraining-operator). - -At a high level, the E2E test suites are structured as Python test classes. Each test class contains -one or more tests. 
A test typically runs the following: - -- Create a ksonnet component using a TFJob spec; -- Creates the specified TFJob; -- Verifies some expected results (e.g. number of pods started, job status); -- Deletes the TFJob. - -## Adding a Test Method - -An example can be found [here](https://github.com/kubeflow/training-operator/blob/master/py/kubeflow/tf_operator/simple_tfjob_tests.py). - -A test class can have several test methods. Each method executes a series of user actions (e.g. -starting or deleting a TFJob), and performs verifications of expected results (e.g. TFJob exits with -correct status, pods are deleted, etc). - -Test classes should follow this pattern: - -```python -class MyTest(test_util.TestCase): - def __init__(self, args): - # Initialize environment - - def test_case_1(self): - # Test code - - def test_case_2(self): - # Test code - -if __name__ == "__main__" - test_runner.main(module=__name__) -``` - -The code here ideally should only contain API calls. Any common functionalities used by the test code should -be added to one of the helper modules: - -- k8s_util - for K8s operations like querying/deleting a pod -- ks_util - for ksonnet operations -- tf_job_client - for TFJob-specific operations, such as waiting for the job to be in a certain phase - -## Adding a TFJob Spec - -This is needed if you want to use your own TFJob spec instead of an existing one. An example can be found -[here](https://github.com/kubeflow/training-operator/tree/master/test/workflows/components/simple_tfjob_v1.jsonnet). -All TFJob specs should be placed in the same directory. - -These are similar to actual TFJob specs. Note that many of these are using the -[training-operator-test-server](https://github.com/kubeflow/training-operator/tree/master/test/test-server) as the test image. -This gives us more control over when each replica exits, and allows us to send specific requests like fetching the -runtime TensorFlow config. - -## Adding a New Test Class - -This is needed if you are creating a new test class. Creating a new test class is recommended if you are implementing -a new feature, and want to group all relevant E2E tests together. - -New test classes should be added as Argo workflow steps to the -[workflows.libsonnet](https://github.com/kubeflow/training-operator/blob/master/test/workflows/components/workflows.libsonnet) file. - -Under the templates section, add the following to the dag: - -``` - { - name: "my-test", - template: "my-test", - dependencies: ["setup-kubeflow"], - }, -``` - -This will configure Argo to run `my-test` after setting up the Kubeflow cluster. - -Next, add the following lines toward the end of the file: - -``` - $.parts(namespace, name, overrides).e2e(prow_env, bucket).buildTestTemplate( - "my-test"), -``` - -This assumes that there is a corresponding Python file named `my_test.py` (note the difference between dashes and -underscores).