Followup to #111 which adds a tutorial for GKE (the prior PR was for Amazon EKS). After all comments are addressed, before merging I will run through the tutorial again manually to make sure it works.

- [ ] Final manual test

"Fast follow" followups include:

- [ ] Load test
- [ ] Multiple models
- [ ] Link to the tutorial from somewhere to make it discoverable. (Link to it from https://github.com/anyscale/aviary/blob/master/README.md I guess?)
- [ ] Add a production guide using [RayService](https://ray-project.github.io/kuberay/guidance/rayservice/)

More followups copied from #111:

> Autoscaler: The current image is missing some dependencies for KubeRay autoscaling.
> Frontend: The frontend cannot be launched directly due to some dependency issues (e.g. gradio, pymongo, boto3...).

---------

Signed-off-by: Archit Kulkarni <[email protected]>
Co-authored-by: Antoni Baum <[email protected]>
1 parent bdb7b59 · commit e1b2612 · 5 changed files with 277 additions and 2 deletions.
# Deploy Aviary on Google Kubernetes Engine (GKE) using KubeRay

In this tutorial, we will:

1. Set up a Kubernetes cluster on GKE.
2. Deploy the KubeRay operator and a Ray cluster on GKE.
3. Run an LLM with Aviary.

* Note that this document will be extended to cover Ray autoscaling and the deployment of multiple models in the near future.

## Step 1: Create a Kubernetes cluster on GKE

Run this command and all following commands on your local machine or in the [Google Cloud Shell](https://cloud.google.com/shell). If running from your local machine, you will need to install the [Google Cloud SDK](https://cloud.google.com/sdk/docs/install).

```sh
gcloud container clusters create aviary-gpu-cluster \
  --num-nodes=1 --min-nodes 0 --max-nodes 1 --enable-autoscaling \
  --zone=us-west1-b --machine-type e2-standard-8
```

This command creates a Kubernetes cluster named `aviary-gpu-cluster` with one node in the `us-west1-b` zone. In this example, we use the `e2-standard-8` machine type, which has 8 vCPUs and 32 GB RAM. Autoscaling is enabled, so the number of nodes can grow or shrink with the workload.

You can also create a cluster from the [Google Cloud Console](https://console.cloud.google.com/kubernetes/list).
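
Cluster creation can take a few minutes. As an optional sanity check (not required by the rest of the tutorial), you can confirm the cluster is up:

```sh
# The cluster should report RUNNING once provisioning finishes.
gcloud container clusters describe aviary-gpu-cluster --zone us-west1-b \
  --format="value(status)"
```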

## Step 2: Create a GPU node pool

Run the following command to create a GPU node pool for Ray GPU workers.
(You can also create it from the Google Cloud Console; see the [GKE documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/node-taints#create_a_node_pool_with_node_taints) for more details.)

```sh
gcloud container node-pools create gpu-node-pool \
  --accelerator type=nvidia-l4-vws,count=4 \
  --zone us-west1-b \
  --cluster aviary-gpu-cluster \
  --num-nodes 1 \
  --min-nodes 0 \
  --max-nodes 1 \
  --enable-autoscaling \
  --machine-type g2-standard-48 \
  --node-taints=ray.io/node-type=worker:NoSchedule
```

The `--accelerator` flag specifies the type and number of GPUs for each node in the node pool. In this example, we use the [NVIDIA L4](https://cloud.google.com/compute/docs/gpus#l4-gpus) GPU. The machine type `g2-standard-48` has 4 GPUs, 48 vCPUs, and 192 GB RAM.
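
If you want to use a different GPU or zone, you can first list the accelerator types GKE offers there. A quick sketch (the `--filter` expression is just one way to narrow the output):

```sh
# List the accelerator types available in the zone used by this tutorial.
gcloud compute accelerator-types list --filter="zone:us-west1-b"
```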

Because this tutorial deploys a single LLM, the maximum size of the GPU node pool is 1.
To deploy multiple LLMs in this cluster, you may need to raise `--max-nodes`.
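
If you later decide to host more models, one way to raise the ceiling is to update the node pool's autoscaling bounds. A sketch (the `--max-nodes 2` value is only an example):

```sh
# Example only: allow the GPU node pool to scale up to 2 nodes.
gcloud container node-pools update gpu-node-pool \
  --cluster aviary-gpu-cluster --zone us-west1-b \
  --enable-autoscaling --min-nodes 0 --max-nodes 2
```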

The taint `ray.io/node-type=worker:NoSchedule` prevents CPU-only Pods, such as the KubeRay operator, Ray head, and CoreDNS Pods, from being scheduled on this GPU node pool. GPU nodes are expensive, so we reserve this node pool for Ray GPU workers only.

Concretely, any Pod that does not have the following toleration will not be scheduled on this GPU node pool:

```yaml
tolerations:
- key: ray.io/node-type
  operator: Equal
  value: worker
  effect: NoSchedule
```

This toleration has already been added to the RayCluster YAML manifest `ray-cluster.aviary-gke.yaml` used in Step 6.

For more on taints and tolerations, see the [Kubernetes documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/).

## Step 3: Configure `kubectl` to connect to the cluster

Run the following command to download credentials and configure the Kubernetes CLI to use them.

```sh
gcloud container clusters get-credentials aviary-gpu-cluster --zone us-west1-b
```

For more details, see the [GKE documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl).
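
As a quick check, verify that `kubectl` now points at the new cluster and that the GPU node carries the taint from Step 2 (the `custom-columns` expression below is just one way to display taints):

```sh
# The current context should reference aviary-gpu-cluster.
kubectl config current-context

# The GPU node should list ray.io/node-type under TAINTS; the CPU node should not.
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
```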

## Step 4: Install NVIDIA GPU device drivers

This step is required for GPU support on GKE. See the [GKE documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers) for more details.

```sh
# Install the NVIDIA GPU device driver.
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml

# Verify that your nodes have allocatable GPUs.
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

# Example output:
# NAME   GPU
# ...    4
# ...    <none>
```
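
The driver installer runs as a DaemonSet in `kube-system`. If the GPU column shows `<none>` for the GPU node, it may simply still be installing; a hedged way to wait for it (the DaemonSet name below matches the manifest above at the time of writing, but verify it in your cluster):

```sh
# Block until the driver installer DaemonSet has rolled out on all GPU nodes.
kubectl rollout status daemonset/nvidia-driver-installer -n kube-system --timeout=10m
```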

## Step 5: Install the KubeRay operator

```sh
# Install both the CRDs and the KubeRay operator v0.5.0.
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator --version 0.5.0

# The operator Pod should be scheduled on the CPU node. If it is not, something is wrong.
```
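
To verify the placement, list the operator Pod together with the node it landed on (grepping by name is a loose match, but the chart names the Deployment `kuberay-operator`):

```sh
# The NODE column should show a node from the CPU pool, not gpu-node-pool.
kubectl get pods -o wide | grep kuberay-operator
```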

## Step 6: Create a RayCluster with Aviary

If you are running this tutorial on the Google Cloud Shell, please copy the file `deploy/kuberay/ray-cluster.aviary-gke.yaml` to the Google Cloud Shell. You may find it useful to use the [Cloud Shell Editor](https://cloud.google.com/shell/docs/editor-overview) to edit the file.

Now you can create a RayCluster with Aviary. Aviary is included in the image `anyscale/aviary:latest`, which is specified in the RayCluster YAML manifest `ray-cluster.aviary-gke.yaml`.

```sh
# path: deploy/kuberay
kubectl apply -f ray-cluster.aviary-gke.yaml
```
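
Pulling the `anyscale/aviary:latest` image can take several minutes. You can watch the cluster come up with:

```sh
# List the RayCluster resource and its head/worker Pods;
# the worker should land on the GPU node pool.
kubectl get rayclusters
kubectl get pods -o wide
```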

Note the following aspects of the YAML file:

* The `tolerations` for workers match the taint we specified in Step 2. This ensures that the Ray GPU workers are scheduled on the GPU node pool.

```yaml
# These tolerations match the taint applied to the GPU node pool in Step 2.
tolerations:
- key: "ray.io/node-type"
  operator: "Equal"
  value: "worker"
  effect: "NoSchedule"
```

* The field `rayStartParams.resources` has been configured to allow Ray to schedule Ray tasks and actors appropriately. The `mosaicml--mpt-7b-chat.yaml` file uses two [custom resources](https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#custom-resources), `accelerator_type_cpu` and `accelerator_type_a10`. See [the Ray documentation](https://docs.ray.io/en/latest/ray-core/scheduling/resources.html) for more details on resources.

```yaml
# Ray head: the head Pod has a resource limit of 2 CPUs.
rayStartParams:
  resources: '"{\"accelerator_type_cpu\": 2}"'

# Ray workers: each worker Pod has a resource limit of 48 CPUs and 4 GPUs.
rayStartParams:
  resources: '"{\"accelerator_type_cpu\": 48, \"accelerator_type_a10\": 4}"'
```
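
Once both Pods are running, one optional way to confirm that Ray registered these custom resources is to run `ray status` in the head Pod (this reuses the Pod-name lookup shown in Step 7.1):

```sh
# Expect accelerator_type_cpu and accelerator_type_a10 in the cluster's
# resource totals once the GPU worker has joined.
HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl exec -it $HEAD_POD -- ray status
```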

## Step 7: Deploy an LLM with Aviary

```sh
# Step 7.1: Log in to the head Pod.
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl exec -it $HEAD_POD -- bash

# Step 7.2: Deploy the `mosaicml/mpt-7b-chat` model.
aviary run --model ./models/mosaicml--mpt-7b-chat.yaml

# Step 7.3: Check the Serve application status.
serve status

# [Example output]
# name: default
# app_status:
#   status: RUNNING
#   message: ''
#   deployment_timestamp: 1686006910.9571936
# deployment_statuses:
# - name: default_mosaicml--mpt-7b-chat
#   status: HEALTHY
#   message: ''
# - name: default_RouterDeployment
#   status: HEALTHY
#   message: ''

# Step 7.4: List all models.
export AVIARY_URL="http://localhost:8000"
aviary models

# [Example output]
# Connecting to Aviary backend at: http://localhost:8000/
# mosaicml/mpt-7b-chat

# Step 7.5: Send a query to `mosaicml/mpt-7b-chat`.
aviary query --model mosaicml/mpt-7b-chat --prompt "What are the top 5 most popular programming languages?"

# [Example output]
# Connecting to Aviary backend at: http://localhost:8000/
# mosaicml/mpt-7b-chat:
# 1. Python
# 2. Java
# 3. JavaScript
# 4. C++
# 5. C#
```
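
The commands above run inside the head Pod. If you would rather query from your local machine, one option is to port-forward the head Pod's Serve port (8000, the same port used by `AVIARY_URL` above). This is a sketch, and it assumes you have the `aviary` CLI installed locally:

```sh
# Terminal 1: forward local port 8000 to the head Pod's Serve port.
kubectl port-forward $HEAD_POD 8000:8000

# Terminal 2: query through the tunnel from your local machine.
export AVIARY_URL="http://localhost:8000"
aviary models
```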

## Step 8: Clean up resources

**Warning: GPU nodes are extremely expensive. Remember to delete the cluster if you no longer need it.**

```sh
# Step 8.1: Delete the RayCluster.
# path: deploy/kuberay
kubectl delete -f ray-cluster.aviary-gke.yaml

# Step 8.2: Uninstall the KubeRay operator chart.
helm uninstall kuberay-operator

# Step 8.3: Delete the GKE cluster.
gcloud container clusters delete aviary-gpu-cluster --zone us-west1-b
```

See the [GKE documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/deleting-a-cluster) for more details on deleting a GKE cluster.
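
Finally, you can confirm that no billable clusters remain:

```sh
# The deleted cluster should no longer appear here.
gcloud container clusters list
```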
deploy/kuberay/ray-cluster.aviary-gke.yaml
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
  name: aviary
spec:
  rayVersion: 'nightly' # Should match the Ray version in the container image.
  # Ray head Pod configuration.
  headGroupSpec:
    # The `rayStartParams` are used to configure the `ray start` command.
    # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
    # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
    rayStartParams:
      resources: '"{\"accelerator_type_cpu\": 2}"'
      dashboard-host: '0.0.0.0'
    # Pod template
    template:
      spec:
        containers:
        - name: ray-head
          image: anyscale/aviary:latest
          resources:
            limits:
              cpu: 2
              memory: 8Gi
            requests:
              cpu: 2
              memory: 8Gi
          ports:
          - containerPort: 6379
            name: gcs-server
          - containerPort: 8265 # Ray dashboard
            name: dashboard
          - containerPort: 10001
            name: client
  workerGroupSpecs:
  # The number of worker Pods in this group.
  - replicas: 1
    minReplicas: 0
    maxReplicas: 1
    # Logical name of this worker group; in this example it is called gpu-group.
    groupName: gpu-group
    rayStartParams:
      resources: '"{\"accelerator_type_cpu\": 48, \"accelerator_type_a10\": 4}"'
    # Pod template
    template:
      spec:
        containers:
        - name: llm
          image: anyscale/aviary:latest
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          resources:
            limits:
              cpu: "48"
              memory: "192G"
              nvidia.com/gpu: 4
            requests:
              cpu: "36"
              memory: "128G"
              nvidia.com/gpu: 4
        # Please ensure the following taint has been applied to the GPU node in the cluster.
        tolerations:
        - key: "ray.io/node-type"
          operator: "Equal"
          value: "worker"
          effect: "NoSchedule"
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-l4-vws