Commit

Merge branch 'master' of github.com:ray-project/ray into jjyao/rettttry

jjyao committed Oct 17, 2024
2 parents 4bf5f52 + d5fa9a0 commit 8ea5340

Showing 106 changed files with 2,957 additions and 1,584 deletions.
2 changes: 2 additions & 0 deletions .vale/styles/config/vocabularies/General/accept.txt
@@ -17,3 +17,5 @@ GKE
namespace
ARM
breakpoint
deduplicate[s]
deduplication
16 changes: 16 additions & 0 deletions BUILD.bazel
@@ -93,8 +93,10 @@ ray_cc_library(
"src/ray/rpc/common.cc",
"src/ray/rpc/grpc_server.cc",
"src/ray/rpc/server_call.cc",
"src/ray/rpc/rpc_chaos.cc",
],
hdrs = glob([
"src/ray/rpc/rpc_chaos.h",
"src/ray/rpc/client_call.h",
"src/ray/rpc/common.h",
"src/ray/rpc/grpc_client.h",
@@ -514,6 +516,7 @@ ray_cc_library(
"@boost//:bimap",
"@com_github_grpc_grpc//src/proto/grpc/health/v1:health_proto",
"@com_google_absl//absl/container:btree",
"//src/ray/util:thread_checker",
],
)

@@ -1551,6 +1554,19 @@ ray_cc_test(
    ],
)

ray_cc_test(
    name = "rpc_chaos_test",
    size = "small",
    srcs = [
        "src/ray/rpc/test/rpc_chaos_test.cc",
    ],
    tags = ["team:core"],
    deps = [
        ":grpc_common_lib",
        "@com_google_googletest//:gtest_main",
    ],
)
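
To run the new chaos-injection test locally, something like the following should work (a sketch, assuming a checked-out Ray repo with a configured Bazel workspace; the `//:rpc_chaos_test` label follows from the target living in the root `BUILD.bazel`):

```shell
# Build and run only the RPC chaos test; show output on failure.
bazel test //:rpc_chaos_test --test_output=errors
```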

ray_cc_test(
    name = "core_worker_client_pool_test",
    size = "small",
2 changes: 2 additions & 0 deletions doc/source/cluster/kubernetes/k8s-ecosystem.md
@@ -9,6 +9,7 @@ k8s-ecosystem/ingress
k8s-ecosystem/prometheus-grafana
k8s-ecosystem/pyspy
k8s-ecosystem/volcano
k8s-ecosystem/yunikorn
k8s-ecosystem/kubeflow
k8s-ecosystem/kueue
k8s-ecosystem/istio
@@ -18,6 +19,7 @@ k8s-ecosystem/istio
* {ref}`kuberay-prometheus-grafana`
* {ref}`kuberay-pyspy-integration`
* {ref}`kuberay-volcano`
* {ref}`kuberay-yunikorn`
* {ref}`kuberay-kubeflow-integration`
* {ref}`kuberay-kueue`
* {ref}`kuberay-istio`
190 changes: 190 additions & 0 deletions doc/source/cluster/kubernetes/k8s-ecosystem/yunikorn.md
@@ -0,0 +1,190 @@
(kuberay-yunikorn)=

# KubeRay integration with Apache YuniKorn

[Apache YuniKorn](https://yunikorn.apache.org/) is a lightweight, universal resource scheduler for container orchestrator systems. It performs fine-grained resource sharing efficiently across diverse workloads in large-scale, multi-tenant, cloud-native environments. YuniKorn brings a unified, cross-platform scheduling experience for mixed workloads that consist of stateless batch workloads and stateful services.

KubeRay's Apache YuniKorn integration enables more efficient scheduling of Ray Pods in multi-tenant Kubernetes environments.

:::{note}

This feature requires KubeRay version 1.2.2 or newer, and it's in alpha testing.

:::

## Step 1: Create a Kubernetes cluster with KinD
Run the following command in a terminal:

```shell
kind create cluster
```
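
`kind` prints the new context name once the cluster is ready. As a quick sanity check, confirm the cluster is reachable (assuming the default cluster name `kind`):

```shell
kubectl cluster-info --context kind-kind
```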

## Step 2: Install Apache YuniKorn

Install Apache YuniKorn on your Kubernetes cluster before enabling the Apache YuniKorn integration with KubeRay.
See [Get Started](https://yunikorn.apache.org/docs/) for the official installation instructions.
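
For a local test setup, the quick start amounts to a Helm install along these lines (a sketch; the chart repository URL and release layout follow the current YuniKorn docs and may change between versions):

```shell
# Add the official YuniKorn chart repository and install the scheduler.
helm repo add yunikorn https://apache.github.io/yunikorn-release
helm repo update
helm install yunikorn yunikorn/yunikorn --namespace yunikorn --create-namespace

# Wait for the scheduler and web Pods to become ready.
kubectl get pods -n yunikorn
```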

## Step 3: Install the KubeRay operator with Apache YuniKorn support

When installing the KubeRay operator with Helm, pass the `--set batchScheduler.name=yunikorn` flag on the command line:

```shell
helm install kuberay-operator kuberay/kuberay-operator --version 1.2.2 --set batchScheduler.name=yunikorn
```
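
To confirm that the operator started with YuniKorn support enabled, check its Pod and logs (a sketch; the label selector and Deployment name assume the defaults from the KubeRay Helm chart):

```shell
kubectl get pods -l app.kubernetes.io/name=kuberay-operator
kubectl logs deploy/kuberay-operator | grep -i yunikorn
```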

## Step 4: Use Apache YuniKorn for gang scheduling

This example uses gang scheduling with Apache YuniKorn and KubeRay.

First, create a queue with a capacity of 4 CPUs and 6Gi of RAM by editing the `yunikorn-defaults` ConfigMap, which Helm creates during the installation of the Apache YuniKorn Helm chart:

Run `kubectl edit configmap -n yunikorn yunikorn-defaults`

Add a `queues.yaml` config under the `data` key. The ConfigMap should look like the following:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # Metadata for the ConfigMap, skip for brevity.
data:
  queues.yaml: |
    partitions:
      - name: default
        queues:
          - name: root
            queues:
              - name: test
                submitacl: "*"
                parent: false
                resources:
                  guaranteed:
                    memory: 6G
                    vcore: 4
                  max:
                    memory: 6G
                    vcore: 4
```
Save the changes and exit the editor. This configuration creates a queue named `root.test` with a capacity of 4 CPUs and 6Gi of RAM.
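
If you prefer a non-interactive alternative to `kubectl edit`, a merge patch achieves the same result (a sketch; `kubectl patch` accepts YAML-formatted patches):

```shell
kubectl patch configmap yunikorn-defaults -n yunikorn --type merge --patch "$(cat <<'EOF'
data:
  queues.yaml: |
    partitions:
      - name: default
        queues:
          - name: root
            queues:
              - name: test
                submitacl: "*"
                parent: false
                resources:
                  guaranteed:
                    memory: 6G
                    vcore: 4
                  max:
                    memory: 6G
                    vcore: 4
EOF
)"
```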

Next, create a RayCluster with a head node with 1 CPU and 2GiB of RAM, and two workers with 1 CPU and 1GiB of RAM each, for a total of 3 CPUs and 4GiB of RAM:

```shell
# Path: kuberay/ray-operator/config/samples
# Configure the necessary labels on the RayCluster custom resource for Apache YuniKorn scheduler's gang scheduling:
# - `ray.io/gang-scheduling-enabled`: Set to `true` to enable gang scheduling.
# - `yunikorn.apache.org/app-id`: Set to a unique identifier for the application in Kubernetes, even across different namespaces.
# - `yunikorn.apache.org/queue`: Set to the name of one of the queues in Apache YuniKorn.
wget https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster.yunikorn-scheduler.yaml
kubectl apply -f ray-cluster.yunikorn-scheduler.yaml
```

Check the RayCluster that the KubeRay operator created:

```shell
$ kubectl describe raycluster test-yunikorn-0

Name:         test-yunikorn-0
Namespace:    default
Labels:       ray.io/gang-scheduling-enabled=true
              yunikorn.apache.org/app-id=test-yunikorn-0
              yunikorn.apache.org/queue=root.test
Annotations:  <none>
API Version:  ray.io/v1
Kind:         RayCluster
Metadata:
  Creation Timestamp:  2024-09-29T09:52:30Z
  Generation:          1
  Resource Version:    951
  UID:                 cae1dbc9-5a67-4b43-b0d9-be595f21ab85
# Other fields are skipped for brevity
```

Note the labels on the RayCluster: `ray.io/gang-scheduling-enabled=true`, `yunikorn.apache.org/app-id=test-yunikorn-0`, and `yunikorn.apache.org/queue=root.test`.

:::{note}

You only need the `ray.io/gang-scheduling-enabled` label when you require gang scheduling. If you don't set this label, YuniKorn schedules the Ray cluster without enforcing gang scheduling.
:::

Because the queue has a capacity of 4 CPUs and 6GiB of RAM, all of this RayCluster's Pods schedule successfully.
```shell
$ kubectl get pods
NAME                                  READY   STATUS    RESTARTS   AGE
test-yunikorn-0-head-98fmp            1/1     Running   0          67s
test-yunikorn-0-worker-worker-42tgg   1/1     Running   0          67s
test-yunikorn-0-worker-worker-467mn   1/1     Running   0          67s
```

Verify the scheduling by checking the [Apache YuniKorn dashboard](https://yunikorn.apache.org/docs/#access-the-web-ui):

```shell
kubectl port-forward svc/yunikorn-service 9889:9889 -n yunikorn
```
Go to `http://localhost:9889/#/applications` to see the running apps.
![Apache YuniKorn dashboard](../images/yunikorn-dashboard-apps-running.png)
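
If you prefer the command line, the data behind the dashboard is also available from YuniKorn's REST API through the same port-forward, because the web service proxies API requests (a sketch; the endpoint path is assumed from the YuniKorn REST API docs):

```shell
# Inspect the queue hierarchy and its current resource usage.
curl -s http://localhost:9889/ws/v1/partition/default/queues
```
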
Next, create a second RayCluster with the same head and worker configuration, but with a different name:
```shell
# Replace the name with `test-yunikorn-1`.
sed 's/test-yunikorn-0/test-yunikorn-1/' ray-cluster.yunikorn-scheduler.yaml | kubectl apply -f-
```
Now all the Pods for `test-yunikorn-1` are in the `Pending` state:
```shell
$ kubectl get pods
NAME                                      READY   STATUS    RESTARTS   AGE
test-yunikorn-0-head-98fmp                1/1     Running   0          4m22s
test-yunikorn-0-worker-worker-42tgg       1/1     Running   0          4m22s
test-yunikorn-0-worker-worker-467mn       1/1     Running   0          4m22s
test-yunikorn-1-head-xl2r5                0/1     Pending   0          71s
test-yunikorn-1-worker-worker-l6ttz       0/1     Pending   0          71s
test-yunikorn-1-worker-worker-vjsts       0/1     Pending   0          71s
tg-test-yunikorn-1-headgroup-vgzvoot0dh   0/1     Pending   0          69s
tg-test-yunikorn-1-worker-eyti2bn2jv      1/1     Running   0          69s
tg-test-yunikorn-1-worker-k8it0x6s73      0/1     Pending   0          69s
```
Apache YuniKorn creates the placeholder Pods with the `tg-` prefix for gang scheduling purposes.

Go to `http://localhost:9889/#/applications` to see `test-yunikorn-1` in the `Accepted` state, waiting for resources:
![Apache YuniKorn dashboard](../images/yunikorn-dashboard-apps-pending.png)
The new cluster requires more CPU and RAM than the queue has left. Even though one of its Pods would fit in the remaining 1 CPU and 2GiB of RAM, Apache YuniKorn doesn't place any of the cluster's Pods until there's enough room for all of them. Without Apache YuniKorn gang scheduling, KubeRay would place that one Pod, leaving the cluster only partially allocated.

Delete the first RayCluster to free up resources in the queue:

```shell
kubectl delete raycluster test-yunikorn-0
```
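
You can watch the Pods of `test-yunikorn-1` transition to `Running` as YuniKorn replaces the placeholders:

```shell
kubectl get pods -w
```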

Enough resources are now available to schedule the entire set of Pods for the second cluster. Check the Pods again to see that it's up and running:

```shell
$ kubectl get pods
NAME                                  READY   STATUS    RESTARTS   AGE
test-yunikorn-1-head-xl2r5            1/1     Running   0          3m34s
test-yunikorn-1-worker-worker-l6ttz   1/1     Running   0          3m34s
test-yunikorn-1-worker-worker-vjsts   1/1     Running   0          3m34s
```

Clean up the resources:

```shell
kubectl delete raycluster test-yunikorn-1
```
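
If you created the KinD cluster only for this guide, you can delete it as well:

```shell
kind delete cluster
```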
4 changes: 2 additions & 2 deletions doc/source/ray-contribute/development.rst
@@ -92,8 +92,8 @@ RLlib, Tune, Autoscaler, and most Python files do not require you to build and c

.. code-block:: shell
# For example, for Python 3.8:
pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl
# For example, for Python 3.9:
pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp39-cp39-manylinux2014_x86_64.whl
4. Replace Python files in the installed package with your local editable copy. We provide a simple script to help you do this: ``python python/ray/setup-dev.py``. Running the script removes the ``ray/tune``, ``ray/rllib``, and ``ray/autoscaler`` directories (among others) bundled with the ``ray`` pip package and replaces them with links to your local code. This way, changing files in your Git clone directly affects the behavior of your installed Ray.

14 changes: 6 additions & 8 deletions doc/source/ray-core/handling-dependencies.rst
@@ -37,7 +37,7 @@ Concepts
Preparing an environment using the Ray Cluster launcher
-------------------------------------------------------

The first way to set up dependencies is to is to prepare a single environment across the cluster before starting the Ray runtime.
The first way to set up dependencies is to prepare a single environment across the cluster before starting the Ray runtime.

- You can build all your files and dependencies into a container image and specify this in your :ref:`Cluster YAML Configuration <cluster-config>`.

@@ -327,9 +327,7 @@ To ensure your local changes show up across all Ray workers and can be imported
# No need to import my_module inside this function.
my_module.test()

ray.get(f.remote())

Note: This feature is currently limited to modules that are packages with a single directory containing an ``__init__.py`` file. For single-file modules, you may use ``working_dir``.
ray.get(test_my_module.remote())

.. _runtime-environments-api-ref:

@@ -358,13 +356,15 @@ The ``runtime_env`` is a Python dictionary or a Python class :class:`ray.runtime
Note: If the local directory contains symbolic links, Ray follows the links and the files they point to are uploaded to the cluster.

- ``py_modules`` (List[str|module]): Specifies Python modules to be available for import in the Ray workers. (For more ways to specify packages, see also the ``pip`` and ``conda`` fields below.)
Each entry must be either (1) a path to a local directory, (2) a URI to a remote zip or wheel file (see :ref:`remote-uris` for details), (3) a Python module object, or (4) a path to a local `.whl` file.
Each entry must be either (1) a path to a local file or directory, (2) a URI to a remote zip or wheel file (see :ref:`remote-uris` for details), (3) a Python module object, or (4) a path to a local `.whl` file.

- Examples of entries in the list:

- ``"."``

- ``"/local_dependency/my_module"``
- ``"/local_dependency/my_dir_module"``

- ``"/local_dependency/my_file_module.py"``

- ``"s3://bucket/my_module.zip"``

@@ -380,8 +380,6 @@

Note: For option (1), if the local directory contains a ``.gitignore`` file, the files and paths specified there are not uploaded to the cluster. You can disable this by setting the environment variable `RAY_RUNTIME_ENV_IGNORE_GITIGNORE=1` on the machine doing the uploading.

Note: This feature is currently limited to modules that are packages with a single directory containing an ``__init__.py`` file. For single-file modules, you may use ``working_dir``.

- ``excludes`` (List[str]): When used with ``working_dir`` or ``py_modules``, specifies a list of files or paths to exclude from being uploaded to the cluster.
This field uses the pattern-matching syntax used by ``.gitignore`` files: see `<https://git-scm.com/docs/gitignore>`_ for details.
Note: In accordance with ``.gitignore`` syntax, if there is a separator (``/``) at the beginning or middle (or both) of the pattern, then the pattern is interpreted relative to the level of the ``working_dir``.
61 changes: 53 additions & 8 deletions doc/source/ray-observability/user-guides/configure-logging.md
@@ -62,7 +62,7 @@ System logs may include information about your applications. For example, ``runt
This is the log file of the agent containing logs of create or delete requests and cache hits and misses.
For the logs of the actual installations (for example, ``pip install`` logs), see the ``runtime_env_setup-[job_id].log`` file (see below).
- ``runtime_env_setup-ray_client_server_[port].log``: Logs from installing {ref}`Runtime Environments <runtime-environments>` for a job when connecting with {ref}`Ray Client <ray-client-ref>`.
- ``runtime_env_setup-[job_id].log``: Logs from installing {ref}`Runtime Environments <runtime-environments>` for a Task, Actor or Job. This file is only present if a Runtime Environment is installed.
- ``runtime_env_setup-[job_id].log``: Logs from installing {ref}`runtime environments <runtime-environments>` for a Task, Actor, or Job. This file is only present if you install a runtime environment.


(log-redirection-to-driver)=
@@ -136,13 +136,58 @@ The output is as follows:
(task pid=534174) Hello there, I am a task 0.17536720316370757 [repeated 99x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication)
```

This feature is especially useful when importing libraries such as `tensorflow` or `numpy`, which may emit many verbose warning messages when imported. Configure this feature as follows:

1. Set ``RAY_DEDUP_LOGS=0`` to disable this feature entirely.
2. Set ``RAY_DEDUP_LOGS_AGG_WINDOW_S=<int>`` to change the agggregation window.
3. Set ``RAY_DEDUP_LOGS_ALLOW_REGEX=<string>`` to specify log messages to never deduplicate.
4. Set ``RAY_DEDUP_LOGS_SKIP_REGEX=<string>`` to specify log messages to skip printing.

This feature is useful when importing libraries such as `tensorflow` or `numpy`, which may emit many verbose warning messages when you import them.

Configure the following environment variables on the driver process **before importing Ray** to customize log deduplication:

* Set ``RAY_DEDUP_LOGS=0`` to turn off this feature entirely.
* Set ``RAY_DEDUP_LOGS_AGG_WINDOW_S=<int>`` to change the aggregation window.
* Set ``RAY_DEDUP_LOGS_ALLOW_REGEX=<string>`` to specify log messages to never deduplicate.
* Example:
```python
import os
os.environ["RAY_DEDUP_LOGS_ALLOW_REGEX"] = "ABC"

import ray

@ray.remote
def f():
print("ABC")
print("DEF")

ray.init()
ray.get([f.remote() for _ in range(5)])

# 2024-10-10 17:54:19,095 INFO worker.py:1614 -- Connecting to existing Ray cluster at address: 172.31.13.10:6379...
# 2024-10-10 17:54:19,102 INFO worker.py:1790 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
# (f pid=1574323) ABC
# (f pid=1574323) DEF
# (f pid=1574321) ABC
# (f pid=1574318) ABC
# (f pid=1574320) ABC
# (f pid=1574322) ABC
# (f pid=1574322) DEF [repeated 4x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
```
* Set ``RAY_DEDUP_LOGS_SKIP_REGEX=<string>`` to specify log messages to skip printing.
* Example:
```python
import os
os.environ["RAY_DEDUP_LOGS_SKIP_REGEX"] = "ABC"

import ray

@ray.remote
def f():
print("ABC")
print("DEF")

ray.init()
ray.get([f.remote() for _ in range(5)])
# 2024-10-10 17:55:05,308 INFO worker.py:1614 -- Connecting to existing Ray cluster at address: 172.31.13.10:6379...
# 2024-10-10 17:55:05,314 INFO worker.py:1790 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
# (f pid=1574317) DEF
# (f pid=1575229) DEF [repeated 4x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
```


## Distributed progress bars (tqdm)
4 changes: 0 additions & 4 deletions doc/source/rllib/rllib-examples.rst
@@ -254,12 +254,8 @@ RLModules
- |old_stack| `How to use the "Repeated" space of RLlib for variable-length observations <https://github.com/ray-project/ray/blob/master/rllib/examples/_old_api_stack/complex_struct_space.py>`__:
How to use RLlib's `Repeated` space to handle variable length observations.
- |old_stack| `How to write a custom Keras model <https://github.com/ray-project/ray/blob/master/rllib/examples/_old_api_stack/custom_keras_model.py>`__:
Example of using a custom Keras model.
- |old_stack| `How to register a custom model with supervised loss <https://github.com/ray-project/ray/blob/master/rllib/examples/custom_model_loss_and_metrics.py>`__:
Example of defining and registering a custom model with a supervised loss.
- |old_stack| `How to train with batch normalization <https://github.com/ray-project/ray/blob/master/rllib/examples/_old_api_stack/models/batch_norm_model.py>`__:
Example of adding batch norm layers to a custom model.
- |old_stack| `How to write a custom model with its custom API <https://github.com/ray-project/ray/blob/master/rllib/examples/custom_model_api.py>`__:
Shows how to define a custom Model API in RLlib, such that it can be used inside certain algorithms.
- |old_stack| `How to write a "trajectory ciew API" utilizing model <https://github.com/ray-project/ray/blob/master/rllib/examples/_old_api_stack/models/trajectory_view_utilizing_models.py>`__:
An example on how a model can use the trajectory view API to specify its own input.