Commit

Merge branch 'master' of github.com:ray-project/ray into jjyao/rettttry

jjyao committed Oct 17, 2024
2 parents 4bf5f52 + d5fa9a0 commit 8ea5340

Showing 106 changed files with 2,957 additions and 1,584 deletions.
2 changes: 2 additions & 0 deletions .vale/styles/config/vocabularies/General/accept.txt
@@ -17,3 +17,5 @@ GKE
namespace
ARM
breakpoint
deduplicate[s]
deduplication
16 changes: 16 additions & 0 deletions BUILD.bazel
@@ -93,8 +93,10 @@ ray_cc_library(
"src/ray/rpc/common.cc",
"src/ray/rpc/grpc_server.cc",
"src/ray/rpc/server_call.cc",
"src/ray/rpc/rpc_chaos.cc",
],
hdrs = glob([
"src/ray/rpc/rpc_chaos.h",
"src/ray/rpc/client_call.h",
"src/ray/rpc/common.h",
"src/ray/rpc/grpc_client.h",
@@ -514,6 +516,7 @@ ray_cc_library(
"@boost//:bimap",
"@com_github_grpc_grpc//src/proto/grpc/health/v1:health_proto",
"@com_google_absl//absl/container:btree",
"//src/ray/util:thread_checker",
],
)

@@ -1551,6 +1554,19 @@ ray_cc_test(
    ],
)

ray_cc_test(
    name = "rpc_chaos_test",
    size = "small",
    srcs = [
        "src/ray/rpc/test/rpc_chaos_test.cc",
    ],
    tags = ["team:core"],
    deps = [
        ":grpc_common_lib",
        "@com_google_googletest//:gtest_main",
    ],
)
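
To run the new chaos-injection test locally, something like the following should work (a sketch, assuming a checked-out Ray repo with a configured Bazel workspace; the `//:rpc_chaos_test` label follows from the target living in the root `BUILD.bazel`):

```shell
# Build and run only the RPC chaos test; show output on failure.
bazel test //:rpc_chaos_test --test_output=errors
```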

ray_cc_test(
    name = "core_worker_client_pool_test",
    size = "small",
2 changes: 2 additions & 0 deletions doc/source/cluster/kubernetes/k8s-ecosystem.md
@@ -9,6 +9,7 @@ k8s-ecosystem/ingress
k8s-ecosystem/prometheus-grafana
k8s-ecosystem/pyspy
k8s-ecosystem/volcano
k8s-ecosystem/yunikorn
k8s-ecosystem/kubeflow
k8s-ecosystem/kueue
k8s-ecosystem/istio
@@ -18,6 +19,7 @@ k8s-ecosystem/istio
* {ref}`kuberay-prometheus-grafana`
* {ref}`kuberay-pyspy-integration`
* {ref}`kuberay-volcano`
* {ref}`kuberay-yunikorn`
* {ref}`kuberay-kubeflow-integration`
* {ref}`kuberay-kueue`
* {ref}`kuberay-istio`
190 changes: 190 additions & 0 deletions doc/source/cluster/kubernetes/k8s-ecosystem/yunikorn.md
@@ -0,0 +1,190 @@
(kuberay-yunikorn)=

# KubeRay integration with Apache YuniKorn

[Apache YuniKorn](https://yunikorn.apache.org/) is a lightweight, universal resource scheduler for container orchestrator systems. It performs fine-grained resource sharing efficiently across diverse workloads in large-scale, multi-tenant, cloud-native environments. YuniKorn brings a unified, cross-platform scheduling experience for mixed workloads that consist of stateless batch workloads and stateful services.

KubeRay's Apache YuniKorn integration enables more efficient scheduling of Ray Pods in multi-tenant Kubernetes environments.

:::{note}

This feature requires KubeRay version 1.2.2 or newer, and it's in alpha testing.

:::

## Step 1: Create a Kubernetes cluster with KinD
Run the following command in a terminal:

```shell
kind create cluster
```
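
`kind` prints the new context name once the cluster is ready. As a quick sanity check, confirm the cluster is reachable (assuming the default cluster name `kind`):

```shell
kubectl cluster-info --context kind-kind
```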

## Step 2: Install Apache YuniKorn

Install Apache YuniKorn on your Kubernetes cluster before enabling the Apache YuniKorn integration with KubeRay.
See [Get Started](https://yunikorn.apache.org/docs/) for the official installation instructions.
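
For a local test setup, the quick start amounts to a Helm install along these lines (a sketch; the chart repository URL and release layout follow the current YuniKorn docs and may change between versions):

```shell
# Add the official YuniKorn chart repository and install the scheduler.
helm repo add yunikorn https://apache.github.io/yunikorn-release
helm repo update
helm install yunikorn yunikorn/yunikorn --namespace yunikorn --create-namespace

# Wait for the scheduler and web Pods to become ready.
kubectl get pods -n yunikorn
```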

## Step 3: Install the KubeRay operator with Apache YuniKorn support

When installing the KubeRay operator with Helm, pass the `--set batchScheduler.name=yunikorn` flag on the command line:

```shell
helm install kuberay-operator kuberay/kuberay-operator --version 1.2.2 --set batchScheduler.name=yunikorn
```
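
To confirm that the operator started with YuniKorn support enabled, check its Pod and logs (a sketch; the label selector and Deployment name assume the defaults from the KubeRay Helm chart):

```shell
kubectl get pods -l app.kubernetes.io/name=kuberay-operator
kubectl logs deploy/kuberay-operator | grep -i yunikorn
```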

## Step 4: Use Apache YuniKorn for gang scheduling

This example uses gang scheduling with Apache YuniKorn and KubeRay.

First, create a queue with a capacity of 4 CPUs and 6Gi of RAM by editing the `yunikorn-defaults` ConfigMap, which Helm creates during the installation of the Apache YuniKorn Helm chart:

Run `kubectl edit configmap -n yunikorn yunikorn-defaults`

Add a `queues.yaml` config under the `data` key. The ConfigMap should look like the following:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # Metadata for the ConfigMap, skip for brevity.
data:
  queues.yaml: |
    partitions:
      - name: default
        queues:
          - name: root
            queues:
              - name: test
                submitacl: "*"
                parent: false
                resources:
                  guaranteed:
                    memory: 6G
                    vcore: 4
                  max:
                    memory: 6G
                    vcore: 4
```
Save the changes and exit the editor. This configuration creates a queue named `root.test` with a capacity of 4 CPUs and 6Gi of RAM.
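
If you prefer a non-interactive alternative to `kubectl edit`, a merge patch achieves the same result (a sketch; `kubectl patch` accepts YAML-formatted patches):

```shell
kubectl patch configmap yunikorn-defaults -n yunikorn --type merge --patch "$(cat <<'EOF'
data:
  queues.yaml: |
    partitions:
      - name: default
        queues:
          - name: root
            queues:
              - name: test
                submitacl: "*"
                parent: false
                resources:
                  guaranteed:
                    memory: 6G
                    vcore: 4
                  max:
                    memory: 6G
                    vcore: 4
EOF
)"
```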

Next, create a RayCluster with a head node with 1 CPU and 2GiB of RAM, and two workers with 1 CPU and 1GiB of RAM each, for a total of 3 CPUs and 4GiB of RAM:

```shell
# Path: kuberay/ray-operator/config/samples
# Configure the necessary labels on the RayCluster custom resource for Apache YuniKorn scheduler's gang scheduling:
# - `ray.io/gang-scheduling-enabled`: Set to `true` to enable gang scheduling.
# - `yunikorn.apache.org/app-id`: Set to a unique identifier for the application in Kubernetes, even across different namespaces.
# - `yunikorn.apache.org/queue`: Set to the name of one of the queues in Apache YuniKorn.
wget https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster.yunikorn-scheduler.yaml
kubectl apply -f ray-cluster.yunikorn-scheduler.yaml
```

Check the RayCluster that the KubeRay operator created:

```shell
$ kubectl describe raycluster test-yunikorn-0

Name:         test-yunikorn-0
Namespace:    default
Labels:       ray.io/gang-scheduling-enabled=true
              yunikorn.apache.org/app-id=test-yunikorn-0
              yunikorn.apache.org/queue=root.test
Annotations:  <none>
API Version:  ray.io/v1
Kind:         RayCluster
Metadata:
  Creation Timestamp:  2024-09-29T09:52:30Z
  Generation:          1
  Resource Version:    951
  UID:                 cae1dbc9-5a67-4b43-b0d9-be595f21ab85
# Other fields are skipped for brevity
```

Note the labels on the RayCluster: `ray.io/gang-scheduling-enabled=true`, `yunikorn.apache.org/app-id=test-yunikorn-0`, and `yunikorn.apache.org/queue=root.test`.

:::{note}

You only need the `ray.io/gang-scheduling-enabled` label when you require gang scheduling. If you don't set this label, YuniKorn schedules the Ray cluster without enforcing gang scheduling.
:::

Because the queue has a capacity of 4 CPUs and 6GiB of RAM, all of this RayCluster's Pods schedule successfully.
```shell
$ kubectl get pods
NAME                                  READY   STATUS    RESTARTS   AGE
test-yunikorn-0-head-98fmp            1/1     Running   0          67s
test-yunikorn-0-worker-worker-42tgg   1/1     Running   0          67s
test-yunikorn-0-worker-worker-467mn   1/1     Running   0          67s
```

Verify the scheduling by checking the [Apache YuniKorn dashboard](https://yunikorn.apache.org/docs/#access-the-web-ui):

```shell
kubectl port-forward svc/yunikorn-service 9889:9889 -n yunikorn
```
Go to `http://localhost:9889/#/applications` to see the running apps.
![Apache YuniKorn dashboard](../images/yunikorn-dashboard-apps-running.png)
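
If you prefer the command line, the data behind the dashboard is also available from YuniKorn's REST API through the same port-forward, because the web service proxies API requests (a sketch; the endpoint path is assumed from the YuniKorn REST API docs):

```shell
# Inspect the queue hierarchy and its current resource usage.
curl -s http://localhost:9889/ws/v1/partition/default/queues
```
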
Next, create a second RayCluster with the same head and worker configuration, but with a different name:
```shell
# Replace the name with `test-yunikorn-1`.
sed 's/test-yunikorn-0/test-yunikorn-1/' ray-cluster.yunikorn-scheduler.yaml | kubectl apply -f-
```
Now all the Pods for `test-yunikorn-1` are in the `Pending` state:
```shell
$ kubectl get pods
NAME                                      READY   STATUS    RESTARTS   AGE
test-yunikorn-0-head-98fmp                1/1     Running   0          4m22s
test-yunikorn-0-worker-worker-42tgg       1/1     Running   0          4m22s
test-yunikorn-0-worker-worker-467mn       1/1     Running   0          4m22s
test-yunikorn-1-head-xl2r5                0/1     Pending   0          71s
test-yunikorn-1-worker-worker-l6ttz       0/1     Pending   0          71s
test-yunikorn-1-worker-worker-vjsts       0/1     Pending   0          71s
tg-test-yunikorn-1-headgroup-vgzvoot0dh   0/1     Pending   0          69s
tg-test-yunikorn-1-worker-eyti2bn2jv      1/1     Running   0          69s
tg-test-yunikorn-1-worker-k8it0x6s73      0/1     Pending   0          69s
```
Apache YuniKorn creates the placeholder Pods with the `tg-` prefix for gang scheduling purposes.

Go to `http://localhost:9889/#/applications` to see `test-yunikorn-1` in the `Accepted` state, waiting for resources:
![Apache YuniKorn dashboard](../images/yunikorn-dashboard-apps-pending.png)
The new cluster requires more CPU and RAM than the queue has left. Even though one of its Pods would fit in the remaining 1 CPU and 2GiB of RAM, Apache YuniKorn doesn't place any of the cluster's Pods until there's enough room for all of them. Without Apache YuniKorn gang scheduling, KubeRay would place that one Pod, leaving the cluster only partially allocated.

Delete the first RayCluster to free up resources in the queue:

```shell
kubectl delete raycluster test-yunikorn-0
```
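
You can watch the Pods of `test-yunikorn-1` transition to `Running` as YuniKorn replaces the placeholders:

```shell
kubectl get pods -w
```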

Enough resources are now available to schedule the entire set of Pods for the second cluster. Check the Pods again to see that it's up and running:

```shell
$ kubectl get pods
NAME                                  READY   STATUS    RESTARTS   AGE
test-yunikorn-1-head-xl2r5            1/1     Running   0          3m34s
test-yunikorn-1-worker-worker-l6ttz   1/1     Running   0          3m34s
test-yunikorn-1-worker-worker-vjsts   1/1     Running   0          3m34s
```

Clean up the resources:

```shell
kubectl delete raycluster test-yunikorn-1
```
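
If you created the KinD cluster only for this guide, you can delete it as well:

```shell
kind delete cluster
```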
4 changes: 2 additions & 2 deletions doc/source/ray-contribute/development.rst
@@ -92,8 +92,8 @@ RLlib, Tune, Autoscaler, and most Python files do not require you to build and c

.. code-block:: shell
# For example, for Python 3.8:
pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl
# For example, for Python 3.9:
pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp39-cp39-manylinux2014_x86_64.whl
4. Replace Python files in the installed package with your local editable copy. We provide a simple script to help you do this: ``python python/ray/setup-dev.py``. Running the script removes the ``ray/tune``, ``ray/rllib``, and ``ray/autoscaler`` directories (among others) bundled with the ``ray`` pip package and replaces them with links to your local code. This way, changing files in your Git clone directly affects the behavior of your installed Ray.

14 changes: 6 additions & 8 deletions doc/source/ray-core/handling-dependencies.rst
@@ -37,7 +37,7 @@ Concepts
Preparing an environment using the Ray Cluster launcher
-------------------------------------------------------

The first way to set up dependencies is to is to prepare a single environment across the cluster before starting the Ray runtime.
The first way to set up dependencies is to prepare a single environment across the cluster before starting the Ray runtime.

- You can build all your files and dependencies into a container image and specify this in your :ref:`Cluster YAML Configuration <cluster-config>`.

@@ -327,9 +327,7 @@ To ensure your local changes show up across all Ray workers and can be imported
# No need to import my_module inside this function.
my_module.test()

ray.get(f.remote())

Note: This feature is currently limited to modules that are packages with a single directory containing an ``__init__.py`` file. For single-file modules, you may use ``working_dir``.
ray.get(test_my_module.remote())

.. _runtime-environments-api-ref:

@@ -358,13 +356,15 @@ The ``runtime_env`` is a Python dictionary or a Python class :class:`ray.runtime
Note: If the local directory contains symbolic links, Ray follows the links and the files they point to are uploaded to the cluster.

- ``py_modules`` (List[str|module]): Specifies Python modules to be available for import in the Ray workers. (For more ways to specify packages, see also the ``pip`` and ``conda`` fields below.)
Each entry must be either (1) a path to a local directory, (2) a URI to a remote zip or wheel file (see :ref:`remote-uris` for details), (3) a Python module object, or (4) a path to a local `.whl` file.
Each entry must be either (1) a path to a local file or directory, (2) a URI to a remote zip or wheel file (see :ref:`remote-uris` for details), (3) a Python module object, or (4) a path to a local `.whl` file.

- Examples of entries in the list:

- ``"."``

- ``"/local_dependency/my_module"``
- ``"/local_dependency/my_dir_module"``

- ``"/local_dependency/my_file_module.py"``

- ``"s3://bucket/my_module.zip"``

@@ -380,8 +380,6 @@

Note: For option (1), if the local directory contains a ``.gitignore`` file, the files and paths specified there are not uploaded to the cluster. You can disable this by setting the environment variable `RAY_RUNTIME_ENV_IGNORE_GITIGNORE=1` on the machine doing the uploading.

Note: This feature is currently limited to modules that are packages with a single directory containing an ``__init__.py`` file. For single-file modules, you may use ``working_dir``.

- ``excludes`` (List[str]): When used with ``working_dir`` or ``py_modules``, specifies a list of files or paths to exclude from being uploaded to the cluster.
This field uses the pattern-matching syntax used by ``.gitignore`` files: see `<https://git-scm.com/docs/gitignore>`_ for details.
Note: In accordance with ``.gitignore`` syntax, if there is a separator (``/``) at the beginning or middle (or both) of the pattern, then the pattern is interpreted relative to the level of the ``working_dir``.
61 changes: 53 additions & 8 deletions doc/source/ray-observability/user-guides/configure-logging.md
@@ -62,7 +62,7 @@ System logs may include information about your applications. For example, ``runt
This is the log file of the agent containing logs of create or delete requests and cache hits and misses.
For the logs of the actual installations (for example, ``pip install`` logs), see the ``runtime_env_setup-[job_id].log`` file (see below).
- ``runtime_env_setup-ray_client_server_[port].log``: Logs from installing {ref}`Runtime Environments <runtime-environments>` for a job when connecting with {ref}`Ray Client <ray-client-ref>`.
- ``runtime_env_setup-[job_id].log``: Logs from installing {ref}`Runtime Environments <runtime-environments>` for a Task, Actor or Job. This file is only present if a Runtime Environment is installed.
- ``runtime_env_setup-[job_id].log``: Logs from installing {ref}`runtime environments <runtime-environments>` for a Task, Actor, or Job. This file is only present if you install a runtime environment.


(log-redirection-to-driver)=
@@ -136,13 +136,58 @@ The output is as follows:
(task pid=534174) Hello there, I am a task 0.17536720316370757 [repeated 99x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication)
```

This feature is especially useful when importing libraries such as `tensorflow` or `numpy`, which may emit many verbose warning messages when imported. Configure this feature as follows:

1. Set ``RAY_DEDUP_LOGS=0`` to disable this feature entirely.
2. Set ``RAY_DEDUP_LOGS_AGG_WINDOW_S=<int>`` to change the agggregation window.
3. Set ``RAY_DEDUP_LOGS_ALLOW_REGEX=<string>`` to specify log messages to never deduplicate.
4. Set ``RAY_DEDUP_LOGS_SKIP_REGEX=<string>`` to specify log messages to skip printing.

This feature is useful when importing libraries such as `tensorflow` or `numpy`, which may emit many verbose warning messages when you import them.

Configure the following environment variables on the driver process **before importing Ray** to customize log deduplication:

* Set ``RAY_DEDUP_LOGS=0`` to turn off this feature entirely.
* Set ``RAY_DEDUP_LOGS_AGG_WINDOW_S=<int>`` to change the aggregation window.
* Set ``RAY_DEDUP_LOGS_ALLOW_REGEX=<string>`` to specify log messages to never deduplicate.
* Example:
```python
import os
os.environ["RAY_DEDUP_LOGS_ALLOW_REGEX"] = "ABC"

import ray

@ray.remote
def f():
print("ABC")
print("DEF")

ray.init()
ray.get([f.remote() for _ in range(5)])

# 2024-10-10 17:54:19,095 INFO worker.py:1614 -- Connecting to existing Ray cluster at address: 172.31.13.10:6379...
# 2024-10-10 17:54:19,102 INFO worker.py:1790 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
# (f pid=1574323) ABC
# (f pid=1574323) DEF
# (f pid=1574321) ABC
# (f pid=1574318) ABC
# (f pid=1574320) ABC
# (f pid=1574322) ABC
# (f pid=1574322) DEF [repeated 4x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
```
* Set ``RAY_DEDUP_LOGS_SKIP_REGEX=<string>`` to specify log messages to skip printing.
* Example:
```python
import os
os.environ["RAY_DEDUP_LOGS_SKIP_REGEX"] = "ABC"

import ray

@ray.remote
def f():
print("ABC")
print("DEF")

ray.init()
ray.get([f.remote() for _ in range(5)])
# 2024-10-10 17:55:05,308 INFO worker.py:1614 -- Connecting to existing Ray cluster at address: 172.31.13.10:6379...
# 2024-10-10 17:55:05,314 INFO worker.py:1790 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
# (f pid=1574317) DEF
# (f pid=1575229) DEF [repeated 4x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
```


## Distributed progress bars (tqdm)
4 changes: 0 additions & 4 deletions doc/source/rllib/rllib-examples.rst
@@ -254,12 +254,8 @@ RLModules
- |old_stack| `How to use the "Repeated" space of RLlib for variable-length observations <https://github.com/ray-project/ray/blob/master/rllib/examples/_old_api_stack/complex_struct_space.py>`__:
How to use RLlib's `Repeated` space to handle variable length observations.
- |old_stack| `How to write a custom Keras model <https://github.com/ray-project/ray/blob/master/rllib/examples/_old_api_stack/custom_keras_model.py>`__:
Example of using a custom Keras model.
- |old_stack| `How to register a custom model with supervised loss <https://github.com/ray-project/ray/blob/master/rllib/examples/custom_model_loss_and_metrics.py>`__:
Example of defining and registering a custom model with a supervised loss.
- |old_stack| `How to train with batch normalization <https://github.com/ray-project/ray/blob/master/rllib/examples/_old_api_stack/models/batch_norm_model.py>`__:
Example of adding batch norm layers to a custom model.
- |old_stack| `How to write a custom model with its custom API <https://github.com/ray-project/ray/blob/master/rllib/examples/custom_model_api.py>`__:
Shows how to define a custom Model API in RLlib, such that it can be used inside certain algorithms.
- |old_stack| `How to write a "trajectory ciew API" utilizing model <https://github.com/ray-project/ray/blob/master/rllib/examples/_old_api_stack/models/trajectory_view_utilizing_models.py>`__:
An example on how a model can use the trajectory view API to specify its own input.