Update README and out-of-date docs

Signed-off-by: Andrey Velichkevich <[email protected]>
andreyvelich · Sep 9, 2024 · 7e04c49
1 parent 6ddeb2b
Showing 10 changed files with 66 additions and 391 deletions.
diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
@@ -1,7 +1,7 @@
 <!--  Thanks for sending a pull request! Here are some tips for you:
 1. If this is your first time, check our contributor guidelines: https://www.kubeflow.org/docs/about/contributing
 2. To know more about Training Operator, check the developer guide:
-    https://github.com/kubeflow/training-operator/blob/master/docs/development/developer_guide.md
+    https://github.com/kubeflow/training-operator/blob/master/CONTRIBUTING.md
 3. If you want *faster* PR reviews, check how: https://git.k8s.io/community/contributors/guide/pull-requests.md#best-practices-for-faster-reviews
 -->
 

diff --git a/docs/adopters.md b/ADOPTERS.md
diff --git a/docs/development/developer_guide.md b/CONTRIBUTING.md
diff --git a/README.md b/README.md
@@ -8,115 +8,118 @@
 
 Kubeflow Training Operator is a Kubernetes-native project for fine-tuning and
 scalable distributed training of machine learning (ML) models created with various ML frameworks
-such as PyTorch, Tensorflow, XGBoost, MPI, Paddle and others.
+such as PyTorch, TensorFlow, HuggingFace, Jax, DeepSpeed, XGBoost, PaddlePaddle and others.
 
-Training Operator allows you to use Kubernetes workloads to effectively train your large models
-via [Kubernetes Custom Resources APIs](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
-or using Training Operator Python SDK.
-
-> Note: Before v1.2 release, Kubeflow Training Operator only supports TFJob on Kubernetes.
+You can run high-performance computing (HPC) tasks with the Training Operator and `MPIJob` since it
+supports running Message Passing Interface (MPI) on Kubernetes which is heavily used for HPC.
+The Training Operator implements the V1 API version of MPI Operator. For the MPI Operator V2 version,
+please follow [this guide](https://www.kubeflow.org/docs/components/training/user-guides/mpi/) to
+install MPI Operator V2.
 
-- For a complete reference of the custom resource definitions, please refer to the API Definition.
-  - [TensorFlow API Definition](pkg/apis/kubeflow.org/v1/tensorflow_types.go)
-  - [PyTorch API Definition](pkg/apis/kubeflow.org/v1/pytorch_types.go)
-  - [XGBoost API Definition](pkg/apis/kubeflow.org/v1/xgboost_types.go)
-  - [MPI API Definition](pkg/apis/kubeflow.org/v1/mpi_types.go)
-  - [PaddlePaddle API Definition](pkg/apis/kubeflow.org/v1/paddlepaddle_types.go)
-- For details of all-in-one operator design, please refer to the [All-in-one Kubeflow Training Operator](https://docs.google.com/document/d/1x1JPDQfDMIbnoQRftDH1IzGU0qvHGSU4W6Jl4rJLPhI/edit#heading=h.e33ufidnl8z6)
-- For details on its observability, please refer to the [monitoring design doc](docs/monitoring/README.md).
+The Training Operator allows you to use Kubernetes workloads to effectively train your large models
+via [Kubernetes Custom Resources APIs](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
+or using the Training Operator Python SDK.
 
 ## Prerequisites
 
-- Version >= 1.25 of Kubernetes cluster and `kubectl`
+Please check [the official Kubeflow documentation](https://www.kubeflow.org/docs/components/training/installation/#prerequisites)
+for prerequisites to install the Training Operator.
 
 ## Installation
 
-### Master Branch
+Please follow [the Kubeflow Training Operator guide](https://www.kubeflow.org/docs/components/training/installation/#installing-the-training-operator)
+for the detailed instructions on how to install Training Operator.
 
-```bash
-kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
-```
+### Installing the Control Plane
 
-### Stable Release
+Run the following command to install the latest stable release of the Training Operator control plane: `v1.8.0`.
 
 ```bash
-kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"
+kubectl apply -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.8.0"
 ```
 
-### TensorFlow Release Only
-
-For users who prefer to use original TensorFlow controllers, please checkout `v1.2-branch`, patches for bug fixes will still be accepted to this branch.
+Run the following command to install the latest changes of the Training Operator control plane:
 
 ```bash
-kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.2.0"
+kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
 ```
 
-### Python SDK for Kubeflow Training Operator
+### Installing the Python SDK
 
-Training Operator provides Python SDK for the custom resources. To learn more about available
-SDK APIs check [the `TrainingClient`](sdk/python/kubeflow/training/api/training_client.py).
+The Training Operator [implements a Python SDK](https://pypi.org/project/kubeflow-training/)
+to simplify creation of distributed training and fine-tuning jobs for Data Scientists.
 
-Use `pip install` command to install the latest release of the SDK:
+Run the following command to install the latest stable release of the Training SDK:
 
 ```
-pip install kubeflow-training
+pip install -U kubeflow-training
 ```
 
-Training Operator controller and Python SDK have the same release versions.
-
-## Quickstart
+## Getting Started
 
-Please refer to the [getting started guide](https://www.kubeflow.org/docs/components/training/overview/#getting-started)
-to quickly create your first Training Operator Job using Python SDK.
+Please refer to [the getting started guide](https://www.kubeflow.org/docs/components/training/getting-started/#getting-started-with-pytorchjob)
+to quickly create your first distributed training job using Python SDK.
 
 If you want to work directly with Kubernetes Custom Resources provided by Training Operator,
 follow [the PyTorchJob MNIST guide](https://www.kubeflow.org/docs/components/training/pytorch/#creating-a-pytorch-training-job).
 
-## API Documentation
-
-Please refer to following API Documentation:
-
-- [Kubeflow.org v1 API Documentation](docs/api/kubeflow.org_v1_generated.asciidoc)
-
 ## Community
 
-The following links provide information about getting involved in the community:
+The following links provide information on how to get involved in the community:
 
-- Attend [the AutoML and Training Working Group](https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit) community meeting.
+- Attend [the bi-weekly AutoML and Training Working Group](https://bit.ly/2PWVCkV) community meeting.
 - Join our [`#kubeflow-training` Slack channel](https://www.kubeflow.org/docs/about/community/#kubeflow-slack).
-- Check out [who is using the Training Operator](./docs/adopters.md).
+- Check out [who is using the Training Operator](ADOPTERS.md).
 
 This is a part of Kubeflow, so please see [readme in kubeflow/kubeflow](https://github.com/kubeflow/kubeflow#get-involved) to get in touch with the community.
 
 ## Contributing
 
-Please refer to the [DEVELOPMENT](docs/development/developer_guide.md)
+Please refer to the [CONTRIBUTING guide](CONTRIBUTING.md).
 
 ## Change Log
 
-Please refer to [CHANGELOG](CHANGELOG.md)
+Please refer to [CHANGELOG](CHANGELOG.md).
 
 ## Version Matrix
 
 The following table lists the most recent few versions of the operator.
 
 | Operator Version       | API Version | Kubernetes Version |
 | ---------------------- | ----------- | ------------------ |
-| `v1.0.x`               | `v1`        | 1.16+              |
-| `v1.1.x`               | `v1`        | 1.16+              |
-| `v1.2.x`               | `v1`        | 1.16+              |
-| `v1.3.x`               | `v1`        | 1.18+              |
 | `v1.4.x`               | `v1`        | 1.23+              |
 | `v1.5.x`               | `v1`        | 1.23+              |
 | `v1.6.x`               | `v1`        | 1.23+              |
 | `v1.7.x`               | `v1`        | 1.25+              |
-| `latest` (master HEAD) | `v1`        | 1.25+              |
+| `v1.8.x`               | `v1`        | 1.27+              |
+| `latest` (master HEAD) | `v1`        | 1.27+              |
 
-## Acknowledgement
+## Reference
 
-This project was originally started as a distributed training operator for TensorFlow and later we merged efforts from other Kubeflow training operators to provide a unified and simplified experience for both users and developers. We are very grateful to all who filed issues or helped resolve them, asked and answered questions, and were part of inspiring discussions. We'd also like to thank everyone who's contributed to and maintained the original operators.
+For a complete reference of the custom resource definitions, please refer to the API Definition.
+
+- [TensorFlow API Definition](pkg/apis/kubeflow.org/v1/tensorflow_types.go)
+- [PyTorch API Definition](pkg/apis/kubeflow.org/v1/pytorch_types.go)
+- [XGBoost API Definition](pkg/apis/kubeflow.org/v1/xgboost_types.go)
+- [MPI API Definition](pkg/apis/kubeflow.org/v1/mpi_types.go)
+- [PaddlePaddle API Definition](pkg/apis/kubeflow.org/v1/paddlepaddle_types.go)
+
+For details on the Training Operator custom resources APIs, refer to
+[the following API documentation](docs/api/kubeflow.org_v1_generated.asciidoc)
+
+## Acknowledgement
 
-- PyTorch Operator: [list of contributors](https://github.com/kubeflow/pytorch-operator/graphs/contributors) and [maintainers](https://github.com/kubeflow/pytorch-operator/blob/master/OWNERS).
-- MPI Operator: [list of contributors](https://github.com/kubeflow/mpi-operator/graphs/contributors) and [maintainers](https://github.com/kubeflow/mpi-operator/blob/master/OWNERS).
-- XGBoost Operator: [list of contributors](https://github.com/kubeflow/xgboost-operator/graphs/contributors) and [maintainers](https://github.com/kubeflow/xgboost-operator/blob/master/OWNERS).
-- Common library: [list of contributors](https://github.com/kubeflow/common/graphs/contributors) and [maintainers](https://github.com/kubeflow/common/blob/master/OWNERS).
+This project was originally started as a distributed training operator for TensorFlow and later we
+merged efforts from other Kubeflow Training Operators to provide a unified and simplified experience
+for both users and developers. We are very grateful to all who filed issues or helped resolve them,
+asked and answered questions, and were part of inspiring discussions.
+We'd also like to thank everyone who's contributed to and maintained the original operators.
+
+- PyTorch Operator: [list of contributors](https://github.com/kubeflow/pytorch-operator/graphs/contributors)
+  and [maintainers](https://github.com/kubeflow/pytorch-operator/blob/master/OWNERS).
+- MPI Operator: [list of contributors](https://github.com/kubeflow/mpi-operator/graphs/contributors)
+  and [maintainers](https://github.com/kubeflow/mpi-operator/blob/master/OWNERS).
+- XGBoost Operator: [list of contributors](https://github.com/kubeflow/xgboost-operator/graphs/contributors)
+  and [maintainers](https://github.com/kubeflow/xgboost-operator/blob/master/OWNERS).
+- Common library: [list of contributors](https://github.com/kubeflow/common/graphs/contributors) and
+  [maintainers](https://github.com/kubeflow/common/blob/master/OWNERS).
diff --git a/docs/roadmap.md b/ROADMAP.md
diff --git a/docs/README.md b/docs/README.md
@@ -0,0 +1,5 @@
+# Training Operator Documentation
+
+Welcome to Kubeflow Training Operator!
+
+The Training Operator documentation is available on [kubeflow.org](https://www.kubeflow.org/docs/components/training/).
diff --git a/docs/design/tf_job_design_doc.md b/docs/design/tf_job_design_doc.md
diff --git a/docs/diagrams/tfjob_k8s_resources.svg b/docs/diagrams/tfjob_k8s_resources.svg