Fix/job manager logger #48003

Closed
This pull request is big! We’re only showing the most recent 250 commits.

Commits on Oct 15, 2024

  1. [aDAG] support buffered input (ray-project#47272)

    Based on https://docs.google.com/document/d/1Ka_HFwPBNIY1u3kuroHOSZMEQ8AgwpYciZ4n08HJ0Xc/edit
    
    When there are many in-flight requests (pipelining inputs to the DAG), two problems occur:
    
    - Input submitter timeout. InputSubmitter.write() waits until downstream tasks have read the buffer. Because the timeout clock starts as soon as InputSubmitter.write() is called, later requests are likely to time out when many requests are in flight.
    - Pipeline bubble. The output fetcher doesn't read the channel until CompiledDagRef.get is called. The upstream task (actor 2) therefore stays blocked until the driver calls .get, even though it could otherwise execute tasks.
    
    This PR solves both problems by providing multiple buffers per shm channel. Note that buffering is not supported for NCCL yet (we can do it when we overlap compute/comm).
    
    Main changes
    
    - Introduce BufferedSharedMemoryChannel, which creates multiple buffers (10 by default). Reads and writes proceed in round-robin order.
    - When there are more in-flight requests than buffers, the DAG can still hit a timeout error. To make debugging easy and the behavior straightforward, we introduce the max_buffered_inputs_ argument. If more than max_buffered_inputs_ requests are submitted to the DAG without ray.get, it immediately raises an exception.
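
The round-robin buffering and the in-flight limit described above can be illustrated with a small pure-Python sketch. This is not the Ray implementation: `RoundRobinBuffers` and `TooManyInflightError` are hypothetical names, and where a real channel would block on a full slot, this toy version raises instead.

```python
class TooManyInflightError(RuntimeError):
    """Hypothetical error raised when too many inputs are buffered."""


class RoundRobinBuffers:
    """Toy stand-in for a channel backed by a fixed number of slots."""

    def __init__(self, num_buffers: int = 10, max_buffered_inputs: int = 10):
        self._slots = [None] * num_buffers
        self._full = [False] * num_buffers
        self._write_idx = 0
        self._read_idx = 0
        self._inflight = 0
        self._max_buffered_inputs = max_buffered_inputs

    def write(self, value) -> None:
        # Fail fast instead of letting a later request time out.
        if self._inflight >= self._max_buffered_inputs:
            raise TooManyInflightError(
                f"more than {self._max_buffered_inputs} inputs are in flight"
            )
        if self._full[self._write_idx]:
            # A real channel would block here until the reader catches up.
            raise BlockingIOError("next slot has not been read yet")
        self._slots[self._write_idx] = value
        self._full[self._write_idx] = True
        self._write_idx = (self._write_idx + 1) % len(self._slots)
        self._inflight += 1

    def read(self):
        if not self._full[self._read_idx]:
            raise LookupError("nothing to read")
        value = self._slots[self._read_idx]
        self._slots[self._read_idx] = None
        self._full[self._read_idx] = False
        self._read_idx = (self._read_idx + 1) % len(self._slots)
        self._inflight -= 1
        return value
```

Values are read back in the order they were written, and a write beyond the limit fails immediately rather than waiting for a timeout.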
    
    Signed-off-by: ujjawal-khare <[email protected]>
    rkooo567 authored and ujjawal-khare committed Oct 15, 2024
    42361bc
  2. [aDAG] Clean up arg_to_consumers in _get_or_compile() (ray-project#47514)
    
    Clean up the code.
    
    Signed-off-by: ujjawal-khare <[email protected]>
    ruisearch42 authored and ujjawal-khare committed Oct 15, 2024
    f1e2704
  3. 8d20388
  4. [Core][aDag] Support multi node multi reader (ray-project#47480)

    This PR supports multiple readers across multiple nodes. It also adds tests showing that the feature works with large gRPC payloads and buffer resizing.
    
    Multiple readers on multiple nodes didn't work because the code allowed only one remote reader reference to be registered, on one specific node. This PR fixes the issue by allowing remote reader references to be registered on multiple nodes.
    
    Signed-off-by: ujjawal-khare <[email protected]>
    rkooo567 authored and ujjawal-khare committed Oct 15, 2024
    6625ee2
  5. Allow control of some serve configuration via env vars (ray-project#4…

    …7533)
    
    When a serve app is launched, serve starts up automatically. In
    certain environments like k8s, it can be difficult to preconfigure serve (e.g.
    in the [ray-cluster helm
    chart](https://github.com/ray-project/kuberay/blob/master/helm-chart/ray-cluster/values.yaml)
    there is no ability to set the default serve arguments).
    
    This means you either need to be explicit when you start serve or, if it
    starts up automatically, you may need to shut it down and then restart it,
    which is inconvenient.
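
As a rough illustration of the idea (defaults that can be overridden from the environment without restarting anything), consider the sketch below. The variable names `EXAMPLE_SERVE_HTTP_HOST` and `EXAMPLE_SERVE_HTTP_PORT` and the `HttpDefaults` type are hypothetical and are not the names this PR introduces.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class HttpDefaults:
    # Illustrative default values only.
    host: str = "127.0.0.1"
    port: int = 8000


def load_http_defaults(env: Mapping[str, str] = os.environ) -> HttpDefaults:
    """Return defaults, letting (hypothetical) env vars override them."""
    base = HttpDefaults()
    return HttpDefaults(
        host=env.get("EXAMPLE_SERVE_HTTP_HOST", base.host),
        port=int(env.get("EXAMPLE_SERVE_HTTP_PORT", base.port)),
    )
```

With this pattern a deployment tool only has to set environment variables on the pod; the process that starts automatically picks them up on first launch.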
    
    Signed-off-by: Tim Paine <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    timkpaine authored and ujjawal-khare committed Oct 15, 2024
    290a14a
  6. Update incremental build troubleshooting tip with style nits (ray-pro…

    …ject#47592)
    
    Style nits.
    
    ## Checks
    
    - [ ] I've signed off every commit(by using the -s flag, i.e., `git
    commit -s`) in this PR.
    - [ ] I've run `scripts/format.sh` to lint the changes in this PR.
    - [ ] I've included any doc changes needed for
    https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I
    added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
               corresponding `.rst` file.
    - [ ] I've made sure the tests are passing. Note that there might be a
    few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    - Testing Strategy
       - [ ] Unit tests
       - [ ] Release tests
       - [ ] This PR is not tested :(
    
    Signed-off-by: angelinalg <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    angelinalg authored and ujjawal-khare committed Oct 15, 2024
    a0430bb
  7. [observability][export-api] Write driver job events (ray-project#47418)

    Write Driver Job events to file as part of the export API. This logic runs only if RayConfig::instance().enable_export_api_write() is true; the default value is false.
    Event write is called whenever a job table data value is modified. Typically this occurs before writing JobTableData to the GCS table.
    
    Signed-off-by: ujjawal-khare <[email protected]>
    nikitavemuri authored and ujjawal-khare committed Oct 15, 2024
    a6a63e2
  8. [core][dashboard] push down job_or_submission_id to GCS. (ray-project…

    …#47492)
    
    The GCS API GetAllJobInfo serves Dashboard APIs, even for a single job. This becomes slow when the number of jobs is high. This PR pushes the job filter down to GCS to reduce Dashboard workload.
    
    This API is somewhat unusual because the filter `job_or_submission_id` is either a job ID or a job_submission_id. We don't have an index on the latter, and some jobs don't have one. So we still GetAll from Redis, then filter by both IDs before making further RPC calls.
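
The "fetch everything, then filter by either ID before doing more work" step can be sketched as follows. `JobRecord` and `filter_jobs` are hypothetical names used only for illustration; the real filtering happens inside GCS.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class JobRecord:
    job_id: str
    # Some jobs have no submission ID at all.
    submission_id: Optional[str] = None


def filter_jobs(
    jobs: Iterable[JobRecord], job_or_submission_id: Optional[str]
) -> List[JobRecord]:
    """Keep jobs whose job ID or submission ID matches the filter."""
    if job_or_submission_id is None:
        return list(jobs)
    return [
        job
        for job in jobs
        if job_or_submission_id in (job.job_id, job.submission_id)
    ]
```

Filtering early means any expensive per-job follow-up calls are made only for the matching records.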
    
    ---------
    
    Signed-off-by: Ruiyang Wang <[email protected]>
    Signed-off-by: Jiajun Yao <[email protected]>
    Co-authored-by: Jiajun Yao <[email protected]>
    Co-authored-by: Alexey Kudinkin <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    3 people authored and ujjawal-khare committed Oct 15, 2024
    6e790d9
  9. [Doc][KubeRay] Add description tables for RayCluster Status in the ob…

    …servability doc (ray-project#47462)
    
    Signed-off-by: Rueian <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    rueian authored and ujjawal-khare committed Oct 15, 2024
    e591c40
  10. [aDAG] Fix ranks ordering for custom NCCL group (ray-project#47594)

    The ranks should be in the order of the actors.
    
    Signed-off-by: ujjawal-khare <[email protected]>
    ruisearch42 authored and ujjawal-khare committed Oct 15, 2024
    87519fa
  11. [RLlib] RLModule: InferenceOnlyAPI. (ray-project#47572)

    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    47d9b0d
  12. [Data] Remove _default_metadata_providers (ray-project#47575)

    _default_metadata_providers adds a layer of indirection.
    
    ---------
    
    Signed-off-by: Balaji Veeramani <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    bveeramani authored and ujjawal-khare committed Oct 15, 2024
    747b6f5
  13. [Serve] Remove unused Serve constants (ray-project#47593)

    Went through all the constants in the file and remove the ones that's no
    
    Signed-off-by: Gene Su <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    GeneDer authored and ujjawal-khare committed Oct 15, 2024
    15132d5
  14. Fix windows://:task_event_buffer_test (ray-project#47577)

    Move TestWriteTaskExportEvents to a separate file and skip it on Windows. This is acceptable for the export API feature because it is not currently supported on Windows (tests for other resource events written from GCS are also skipped on Windows).
    This test is failing in postmerge on Windows (CI test windows://:task_event_buffer_test is consistently_failing ray-project#47523). The failure occurs in the teardown step: unknown file: error: C++ exception with description "remove_all: The process cannot access the file because it is being used by another process.: "event_123"" thrown in TearDown().
    The same error is raised on Windows by other tests that clean up created directories with remove_all() (e.g. //src/ray/util/tests:event_test). Those tests are also skipped on Windows.
    
    Signed-off-by: Nikita Vemuri <[email protected]>
    Co-authored-by: Nikita Vemuri <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    2 people authored and ujjawal-khare committed Oct 15, 2024
    e038cb0
  15. [RLlib] RLModule API: SelfSupervisedLossAPI for RLModules that brin…

    …g their own loss (algo independent). (ray-project#47581)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    644874d
  16. [GCS] Optimize GetAllJobInfo API for performance (ray-project#47530)

    Signed-off-by: liuxsh9 <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    liuxsh9 authored and ujjawal-khare committed Oct 15, 2024
    b1c7caa
  17. [Serve] fix default serve logger behavior (ray-project#47600)

    Re: ray-project#47229
    
    The previous PR that set up the default serve logger had some unexpected
    consequences. Combined with Serve's stdout redirect feature (enabled when
    `RAY_SERVE_LOG_TO_STDERR=0` is set in the environment), it set up the default
    serve logger and redirected all stdout/stderr into Serve's log files instead
    of the console. As a result, the Anyscale platform could not detect that the
    ray start command was running successfully and could not start the cluster.
    This PR fixes the behavior by configuring Serve's default logger with only a
    stream handler and skipping the file handler altogether.
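
A minimal sketch of "stream handler only" configuration using the standard `logging` module is shown below. The logger name `example.serve`, the function name, and the format string are assumptions for illustration and do not reproduce Serve's actual configuration code.

```python
import logging
import sys


def configure_default_logger(name: str = "example.serve") -> logging.Logger:
    """Attach exactly one console handler and no file handler."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # FileHandler subclasses StreamHandler, so compare exact types.
    has_stream = any(type(h) is logging.StreamHandler for h in logger.handlers)
    if not has_stream:
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(levelname)s %(asctime)s %(name)s -- %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Because no file handler is installed by default, console output stays on the console, which is the property the fix relies on.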
    
    Signed-off-by: Gene Su <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    GeneDer authored and ujjawal-khare committed Oct 15, 2024
    4b38d57
  18. [core] Make is_gpu, is_actor, root_detached_id fields late bind to wo…

    …rkers. (ray-project#47212)
    
    Signed-off-by: Ruiyang Wang <[email protected]>
    Signed-off-by: Jiajun Yao <[email protected]>
    Co-authored-by: Jiajun Yao <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    2 people authored and ujjawal-khare committed Oct 15, 2024
    102ec9d
  19. [core][adag] Separate the outputs of execute and execute_async to mul…

    …tiple refs or futures to allow clients to retrieve them one at a time (ray-project#46908) (ray-project#47305)
    
    ## Why are these changes needed?
    Currently, if `MultiOutputNode` is used to wrap a DAG's output, you get
    back a single `CompiledDAGRef` or `CompiledDAGFuture`, depending on
    whether `execute` or `execute_async` is invoked, that points to a list
    of all of the outputs. To retrieve one of the outputs, you have to get
    and deserialize all of them at the same time.
    
    This PR separates the output of `execute` and `execute_async` to a list
    of `CompiledDAGRef` or `CompiledDAGFuture` when the output is wrapped by
    `MultiOutputNode`. This is particularly useful for vLLM tensor
    parallelism. Since all shards return the same results, we only need to
    fetch result from one of the workers.
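
The benefit of one reference per output can be illustrated without Ray. In the sketch below, `SharedResult`, `OutputRef`, and this `execute` function are hypothetical stand-ins, not the `CompiledDAGRef` implementation; the point is only that fetching one output deserializes one payload.

```python
import pickle
from typing import Any, List


class SharedResult:
    """Stores each output serialized separately and counts loads."""

    def __init__(self, outputs: List[Any]):
        self._payloads = [pickle.dumps(o) for o in outputs]
        self.deserialize_calls = 0

    def load(self, index: int) -> Any:
        self.deserialize_calls += 1
        return pickle.loads(self._payloads[index])


class OutputRef:
    """Hypothetical per-output handle."""

    def __init__(self, result: SharedResult, index: int):
        self.result = result
        self._index = index

    def get(self) -> Any:
        # Only this output's payload is deserialized.
        return self.result.load(self._index)


def execute(outputs: List[Any]) -> List[OutputRef]:
    """Return one handle per output instead of a single handle to a list."""
    result = SharedResult(outputs)
    return [OutputRef(result, i) for i in range(len(outputs))]
```

When every shard returns the same value, as in the tensor-parallel case mentioned above, a caller can call `get()` on the first handle and ignore the rest.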
    
    Closes ray-project#46908.
    
    ---------
    
    Signed-off-by: jeffreyjeffreywang <[email protected]>
    Signed-off-by: Jeffrey Wang <[email protected]>
    Co-authored-by: jeffreyjeffreywang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    2 people authored and ujjawal-khare committed Oct 15, 2024
    c47c430
  20. [serve] Faster detection of dead replicas (ray-project#47237)

    ## Why are these changes needed?
    
    Detect replica death earlier on handles/routers. Currently routers will
    process replica death if the actor death error is thrown during active
    probing or system message.
    1. cover one more case: process replica death if error is thrown _while_
    request was being processed on the replica.
    2. improved handling: if error is detected on the system message,
    meaning router found out replica is dead after assigning a request to
    that replica, retry the request.
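
The two cases above can be sketched with a toy router. Everything here (`ReplicaDiedError`, its `request_started` flag, and `Router`) is a hypothetical simplification, not Serve's router: a death seen before the request started is retried on another replica, while a death seen during processing only marks the replica dead and surfaces the error.

```python
from typing import Callable, Dict


class ReplicaDiedError(Exception):
    def __init__(self, request_started: bool):
        super().__init__("replica died")
        # True if the replica had already begun processing the request.
        self.request_started = request_started


class Router:
    def __init__(self, replicas: Dict[str, Callable[[str], str]]):
        self._replicas = dict(replicas)

    @property
    def live_replicas(self):
        return list(self._replicas)

    def assign(self, request: str) -> str:
        while self._replicas:
            replica_id = next(iter(self._replicas))
            try:
                return self._replicas[replica_id](request)
            except ReplicaDiedError as err:
                # Either way, stop routing to the dead replica right away.
                del self._replicas[replica_id]
                if err.request_started:
                    # The request may have had side effects; do not retry.
                    raise
        raise RuntimeError("no live replicas")
```

Whether a request that died mid-processing should ever be retried is a policy choice; this sketch follows the description above and retries only when the request had not started.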
    
    ### Performance evaluation
    (master results pulled from
    https://buildkite.com/ray-project/release/builds/21404#01917375-2b1e-4cba-9380-24e557a42a42)
    
    Latency:
    | metric | master | this PR | % change |
    | -- | -- | -- | -- |
    | http_p50_latency | 3.9672044999932154 | 3.9794859999986443 | 0.31 |
    | http_1mb_p50_latency | 4.283115999996312 | 4.1375990000034335 | -3.4 |
    | http_10mb_p50_latency | 8.212248500001351 | 8.056774499998198 | -1.89 |
    | grpc_p50_latency | 2.889802499964844 | 2.845889500008525 | -1.52 |
    | grpc_1mb_p50_latency | 6.320479999999407 | 9.85005449996379 | 55.84 |
    | grpc_10mb_p50_latency | 92.12763850001693 | 106.14903449999247 | 15.22 |
    | handle_p50_latency | 1.7775379999420693 | 1.6373455000575632 | -7.89 |
    | handle_1mb_p50_latency | 2.797253500034458 | 2.7225929999303844 | -2.67 |
    | handle_10mb_p50_latency | 11.619127000017215 | 11.39100950001648 | -1.96 |
    
    Throughput:
    | metric | master | this PR | % change |
    | -- | -- | -- | -- |
    | http_avg_rps | 359.14 | 357.81 | -0.37 |
    | http_100_max_ongoing_requests_avg_rps | 507.21 | 515.71 | 1.68 |
    | grpc_avg_rps | 506.16 | 485.92 | -4.0 |
    | grpc_100_max_ongoing_requests_avg_rps | 506.13 | 486.47 | -3.88 |
    | handle_avg_rps | 604.52 | 641.66 | 6.14 |
    | handle_100_max_ongoing_requests_avg_rps | 1003.45 | 1039.15 | 3.56 |
    
    Results: everything except for grpc results are within noise. As for
    grpc results, they have always been relatively noisy (see below), so the
    results are actually also within the noise that we've been seeing. There
    is also no reason why latency for a request would only increase for grpc
    and not http or handle for the changes in this PR, so IMO this is safe.
    ![Screenshot 2024-08-21 at 11 54 55 AM](https://github.com/user-attachments/assets/6c7caa40-ae3c-417b-a5bf-332e2d6ca378)
    
    ## Related issue number
    
    closes ray-project#47219
    
    ---------
    
    Signed-off-by: Cindy Zhang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    zcin authored and ujjawal-khare committed Oct 15, 2024
    21379f3
  21. [spark] Improve Ray-on-spark fault tolerance in case of Spark executo…

    …r being down (e.g. spot instance termination) (ray-project#47493)
    
    Signed-off-by: Weichen Xu <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    WeichenXu123 authored and ujjawal-khare committed Oct 15, 2024
    591a4d0
  22. [serve] skip failure test on windows (ray-project#47630)

    Skip test_replica_actor_died on windows.
    
    Signed-off-by: Cindy Zhang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    zcin authored and ujjawal-khare committed Oct 15, 2024
    60d78b1
  23. [serve] reorganize replica scheduler classes (ray-project#47615)

    ## Why are these changes needed?
    
    Pull replica scheduler and replica wrapper out from `common.py` into
    their own files.
    
    Signed-off-by: Cindy Zhang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    zcin authored and ujjawal-khare committed Oct 15, 2024
    0f9fa48
  24. [Core] Remove code accidently got in (ray-project#47612)

    Idk how this was generated
    
    Signed-off-by: ujjawal-khare <[email protected]>
    rkooo567 authored and ujjawal-khare committed Oct 15, 2024
    3c2b92c
  25. [Core][aDAG] support multi readers in multi node when dag is created …

    …from an actor (ray-project#47601)
    
    Currently, a DAG created from an actor uses a different mechanism than a DAG created from a driver: in a driver we create a ProxyActor, whereas in an actor we use the actor itself.
    
    This inconsistency is error-prone. For example, when supporting multiple readers on multiple nodes, I found a deadlock: the driver actor needs to call ray.get(a.allocate_channel.remote()) for a downstream actor while the downstream actor calls ray.get(driver_actor.create_ref.remote()).
    
    This PR fixes the issue by making ProxyActor the default mechanism, even when a DAG is created inside an actor.
    
    Signed-off-by: ujjawal-khare <[email protected]>
    rkooo567 authored and ujjawal-khare committed Oct 15, 2024
    35fe4ba
  26. [core] out of band serialization exception (ray-project#47544)

    - Introduce an env var to raise an exception when there's out-of-band serialization of an object ref.
    - Improve the error message for out-of-band serialization issues. There are two types of issue: 1. cloudpickle.dumps(ref); 2. implicit capture. See below for more details.
    - Update an anti-pattern doc.
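
One way to picture an opt-in "raise on out-of-band serialization" switch is a `__reduce__` hook guarded by an environment variable. The sketch below uses the standard `pickle` module; `FakeObjectRef`, `OutOfBandSerializationError`, and the variable `EXAMPLE_RAISE_ON_OUT_OF_BAND_REF` are hypothetical and are not the names used in Ray. It also treats every direct pickling as out of band, which a real system would have to distinguish from its own serialization path.

```python
import os
import pickle

# Hypothetical flag name, for illustration only.
ENV_FLAG = "EXAMPLE_RAISE_ON_OUT_OF_BAND_REF"


class OutOfBandSerializationError(RuntimeError):
    pass


class FakeObjectRef:
    """Toy reference whose pickling can be forbidden via ENV_FLAG."""

    def __init__(self, hex_id: str):
        self.hex_id = hex_id

    def __reduce__(self):
        if os.environ.get(ENV_FLAG) == "1":
            raise OutOfBandSerializationError(
                f"reference {self.hex_id} was pickled directly; "
                "pass it as an argument instead"
            )
        return (FakeObjectRef, (self.hex_id,))
```

With the flag unset, `pickle.dumps(ref)` round-trips as usual; with the flag set to `1`, the same call raises, which makes accidental captures easier to locate.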
    
    Signed-off-by: ujjawal-khare <[email protected]>
    rkooo567 authored and ujjawal-khare committed Oct 15, 2024
    0af4ca7
  27. [core][experimental] Allocate a channel for each InputAttributeNode (r…

    …ay-project#47564)
    
    Change 1: Remove class DAGInputAdapter.
    
    Without this PR, the entire input data is written to the channel, even if a reader only wants to retrieve partial input data via InputAttributeNode. The READ operation then reads the entire input data, and the partial input is retrieved during the COMPUTE operation.
    In this PR, each InputAttributeNode has its own channel, and only the corresponding input data is written to that channel. Therefore, we no longer need DAGInputAdapter to retrieve the partial input data during the COMPUTE operation.
    
    Change 2: If the DAG contains any InputAttributeNode, create a channel for each InputAttributeNode. Then, write the partial input data to the corresponding channel.
    
    Change 3: Several if/else statements handle InputNode and InputAttributeNode when creating CachedChannel. This PR unifies the logic because InputNode and the different InputAttributeNodes are no longer considered consumers of a single input channel. Each InputAttributeNode has its own channel.
    
    Change 4: Move RayDAGArgs from compiled_dag_node.py to common.py to avoid importing it inside _adapt.
    
    Without this change, this PR is about 5% slower than the baseline in the case "Benchmark: single actor, no InputAttributeNode". With it, the performance is almost the same as, or slightly better than, the baseline. See "Benchmark: single actor, no InputAttributeNode" below for more details.
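
The core of Changes 1 and 2, writing only the requested part of the input to each attribute's channel, can be sketched in plain Python. The function name and the dictionary standing in for channels are assumptions; an integer key selects a positional argument and a string key selects a keyword argument.

```python
from typing import Any, Dict, Iterable, Tuple, Union

Key = Union[int, str]


def write_partial_inputs(
    args: Tuple[Any, ...], kwargs: Dict[str, Any], keys: Iterable[Key]
) -> Dict[Key, Any]:
    """Return one 'channel' per requested key holding only that value."""
    channels: Dict[Key, Any] = {}
    for key in keys:
        # Each reader receives just its slice, never the whole input.
        channels[key] = args[key] if isinstance(key, int) else kwargs[key]
    return channels
```

A reader attached to key `0` never sees the value stored under `"x"`, so no adapter is needed later to pick its part out of the full input.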
    
    Signed-off-by: ujjawal-khare <[email protected]>
    kevin85421 authored and ujjawal-khare committed Oct 15, 2024
    ebb984e
  28. [Data] Add partitioning parameter to read_parquet (ray-project#47553)
    
    To extract path partition information with `read_parquet`, you pass a
    PyArrow `partitioning` object to `dataset_kwargs`. For example:
    ```
    schema = pa.schema([("one", pa.int32()), ("two", pa.string())])
    partitioning = pa.dataset.partitioning(schema, flavor="hive")
    ds = ray.data.read_parquet(... dataset_kwargs=dict(partitioning=partitioning))
    ```
    
    This is problematic for two reasons:
    1. It tightly couples the interface with the implementation;
    partitioning only works if we use `pyarrow.Dataset` in a specific way in
    the implementation.
    2. It's inconsistent with all of the other file-based APIs. All other
    APIs expose a top-level `partitioning` parameter (rather than
    `dataset_kwargs`) where you pass a Ray Data `Partitioning` object
    (rather than a PyArrow partitioning object).
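
For readers unfamiliar with hive-style path partitioning, the information being extracted looks like this. The helper below is a standalone illustration of the path convention only; it is not how Ray Data or PyArrow parses paths, and `parse_hive_partitions` is a hypothetical name.

```python
from typing import Dict
from urllib.parse import unquote


def parse_hive_partitions(path: str) -> Dict[str, str]:
    """Extract key=value directory segments from a file path."""
    partitions: Dict[str, str] = {}
    # Skip the final segment, which is the file name.
    for segment in path.split("/")[:-1]:
        if "=" in segment:
            key, value = segment.split("=", 1)
            partitions[key] = unquote(value)
    return partitions
```

The values come back as strings; converting them to typed columns is what a schema, such as the one in the snippet above, is for.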
    
    ---------
    
    Signed-off-by: Balaji Veeramani <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    bveeramani authored and ujjawal-khare committed Oct 15, 2024
    804b4f3
  29. [spark] Refine comment in Starting ray worker spark task (ray-project…

    …#47670)
    
    Signed-off-by: Weichen Xu <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    WeichenXu123 authored and ujjawal-khare committed Oct 15, 2024
    96175fb
  30. [Core][aDAG] Set buffer size to 1 for regression (ray-project#47639)

    There's a regression with buffer size 10. I will revert to buffer size 1 for now, pending further investigation.
    With buffer size 1, the regression seems to be gone: https://buildkite.com/ray-project/release/builds/22594#0191ed4b-5477-45ff-be9e-6e098b5fbb3c. The cause is probably some sort of contention.
    
    Signed-off-by: ujjawal-khare <[email protected]>
    rkooo567 authored and ujjawal-khare committed Oct 15, 2024
    2a7679d
  31. Add perf metrics for 2.36.0 (ray-project#47574)

    ```
    REGRESSION 12.66%: single_client_get_object_containing_10k_refs (THROUGHPUT) regresses from 13.204885454613315 to 11.533423619760748 in microbenchmark.json
    REGRESSION 9.50%: client__1_1_actor_calls_sync (THROUGHPUT) regresses from 523.3469473257671 to 473.62862729568997 in microbenchmark.json
    REGRESSION 6.76%: multi_client_put_gigabytes (THROUGHPUT) regresses from 45.440179854469804 to 42.368678421213005 in microbenchmark.json
    REGRESSION 4.92%: 1_n_actor_calls_async (THROUGHPUT) regresses from 8803.178389859915 to 8370.014425096557 in microbenchmark.json
    REGRESSION 3.89%: n_n_actor_calls_with_arg_async (THROUGHPUT) regresses from 2748.863962184806 to 2641.837605625889 in microbenchmark.json
    REGRESSION 3.45%: client__1_1_actor_calls_async (THROUGHPUT) regresses from 1019.3028285821217 to 984.156036006501 in microbenchmark.json
    REGRESSION 3.06%: client__1_1_actor_calls_concurrent (THROUGHPUT) regresses from 1007.6444648899972 to 976.8103650114274 in microbenchmark.json
    REGRESSION 0.65%: placement_group_create/removal (THROUGHPUT) regresses from 805.1759941825478 to 799.9345402492929 in microbenchmark.json
    REGRESSION 0.33%: single_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 5273.203424794718 to 5255.898134426729 in microbenchmark.json
    REGRESSION 0.02%: 1_1_actor_calls_async (THROUGHPUT) regresses from 9012.880467992636 to 9011.034048587637 in microbenchmark.json
    REGRESSION 0.01%: client__put_gigabytes (THROUGHPUT) regresses from 0.13947664668408546 to 0.13945791828216536 in microbenchmark.json
    REGRESSION 0.00%: client__put_calls (THROUGHPUT) regresses from 806.1974515278531 to 806.172478450918 in microbenchmark.json
    REGRESSION 70.55%: dashboard_p50_latency_ms (LATENCY) regresses from 104.211 to 177.731 in benchmarks/many_actors.json
    REGRESSION 13.13%: time_to_broadcast_1073741824_bytes_to_50_nodes (LATENCY) regresses from 18.961532712000007 to 21.451945214000006 in scalability/object_store.json
    REGRESSION 4.50%: 3000_returns_time (LATENCY) regresses from 5.680022101000006 to 5.935367576000004 in scalability/single_node.json
    REGRESSION 3.96%: avg_iteration_time (LATENCY) regresses from 0.9740754842758179 to 1.012664566040039 in stress_tests/stress_test_dead_actors.json
    REGRESSION 2.75%: stage_2_avg_iteration_time (LATENCY) regresses from 63.694758081436156 to 65.44879236221314 in stress_tests/stress_test_many_tasks.json
    REGRESSION 1.66%: 10000_args_time (LATENCY) regresses from 17.328640389999997 to 17.61703060299999 in scalability/single_node.json
    REGRESSION 1.40%: stage_4_spread (LATENCY) regresses from 0.45063567085147194 to 0.4569625792772166 in stress_tests/stress_test_many_tasks.json
    REGRESSION 0.69%: dashboard_p50_latency_ms (LATENCY) regresses from 3.347 to 3.37 in benchmarks/many_pgs.json
    REGRESSION 0.19%: 10000_get_time (LATENCY) regresses from 23.896780481999997 to 23.942006032999984 in scalability/single_node.json
    ```
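
The percentages in the report above appear consistent with a relative change measured against the old value, where a throughput metric regresses when it drops and a latency metric regresses when it rises. The helper below is an assumption about that calculation, not the release tooling itself.

```python
def regression_percent(old: float, new: float, higher_is_better: bool) -> float:
    """Relative worsening of a metric, as a percentage of the old value."""
    worsening = (old - new) if higher_is_better else (new - old)
    return 100.0 * worsening / old


# Throughput example from the report (higher is better).
throughput = regression_percent(
    13.204885454613315, 11.533423619760748, higher_is_better=True
)
# Latency example from the report (lower is better).
latency = regression_percent(104.211, 177.731, higher_is_better=False)

print(f"{throughput:.2f}% {latency:.2f}%")
```

A negative result would indicate an improvement rather than a regression.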
    
    Signed-off-by: kevin <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    khluu authored and ujjawal-khare committed Oct 15, 2024
    de9be8f
  32. [RLlib] Add "shuffle batch per epoch" option. (ray-project#47458)

    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    3af892c
  33. d738010
  34. [Core] Make JobSupervisor logs structured (ray-project#47699)

    Signed-off-by: Jiajun Yao <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    jjyao authored and ujjawal-khare committed Oct 15, 2024
    ca4be70
  35. [serve] wrap obj ref in result wrapper in deployment response (ray-pr…

    …oject#47655)
    
    ## Why are these changes needed?
    
    Abstract `ray.ObjectRef` and `ray.ObjectRefGenerator` in a result
    wrapper that the deployment response can directly call into.
    
    ---------
    
    Signed-off-by: Cindy Zhang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    zcin authored and ujjawal-khare committed Oct 15, 2024
    73b528b
  36. [Core] Fix broken dashboard worker page (ray-project#47714)

    Signed-off-by: Jiajun Yao <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    jjyao authored and ujjawal-khare committed Oct 15, 2024
    9dbbe38
  37. [core][experimental] Remove unused attr CompiledDAG._type_hints (ray-…

    …project#47706)
    
    CompiledDAG._type_hints is not used.
    
    Signed-off-by: ujjawal-khare <[email protected]>
    kevin85421 authored and ujjawal-khare committed Oct 15, 2024
    c1bdd25
  38. [Data] Re-phrase the streaming executor current usage string (ray-pro…

    …ject#47515)
    
    ## Why are these changes needed?
    
    The progress bar for Ray Data could still show higher
    utilization than the cluster currently has.
    ray-project#46729 was the first attempt to
    fix it and addressed the issue in static clusters, but the
    issue remains for clusters that autoscale. This change simply rephrases the
    string so it is less confusing.
    
    Before
    <img width="1249" alt="image"
    src="https://github.com/user-attachments/assets/049ea096-a87f-4767-ba04-6d00d7c2755d">
    
    After
    <img width="1248" alt="image"
    src="https://github.com/user-attachments/assets/cb74c0dc-1f33-4b22-b31c-e83df2a5d408">
    
    This comes from the fact that operators don't track the task state (and
    Ray Core does not currently provide that API). Ray Data
    operators therefore do not know whether a task is assigned to a node, so
    once a task is submitted to Ray it is marked active even if it is
    pending a node assignment. The dashboard does better here since it
    has extra information from the task.
    
    <img width="1493" alt="image"
    src="https://github.com/user-attachments/assets/9315b884-3e61-4b32-8400-7f76e15b6a4b">
    
    In the future we can revisit adding a Ray Core API for remote state
    reporting and allowing operators to provide more detailed state (active,
    pending_scheduled, pending_node_assignment).
    
    ## Related issue number
    
    ## Checks
    
    - [ ] I've signed off every commit(by using the -s flag, i.e., `git
    commit -s`) in this PR.
    - [ ] I've run `scripts/format.sh` to lint the changes in this PR.
    - [ ] I've included any doc changes needed for
    https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I
    added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
               corresponding `.rst` file.
    - [ ] I've made sure the tests are passing. Note that there might be a
    few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    - Testing Strategy
       - [ ] Unit tests
       - [ ] Release tests
       - [ ] This PR is not tested :(
    
    ---------
    
    Signed-off-by: Sofian Hnaide <[email protected]>
    Co-authored-by: scottjlee <[email protected]>
    Co-authored-by: matthewdeng <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    3 people authored and ujjawal-khare committed Oct 15, 2024
    Commit f966d2e
  39. [serve] improve tests (ray-project#47722)

    ## Why are these changes needed?
    
    - We can make some tests asynchronous instead of having to rely on
    `_to_object_ref`.
    - We can use `RayActorError` instead of `ActorDiedError`.
    
    Signed-off-by: Cindy Zhang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    zcin authored and ujjawal-khare committed Oct 15, 2024
    Commit 3f439f8
  40. [Core] Add test case where there is dead node for /nodes?view=summary endpoint (ray-project#47727)
    
    Signed-off-by: Jiajun Yao <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    jjyao authored and ujjawal-khare committed Oct 15, 2024
    Commit 9e28fb7
  41. [Dashboard] Optimizing performance of Ray Dashboard (ray-project#47617)

    Signed-off-by: Alexey Kudinkin <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    alexeykudinkin authored and ujjawal-khare committed Oct 15, 2024
    Commit 0426ee4
  42. [core][aDAG] Fix a bug where multi arg + exception doesn't work (ray-project#47704)
    
    Currently, when there's an exception, there's only one return value, but the multi-ref path assumes that the number of return values matches the number of output channels. This PR fixes the issue by duplicating the exception to match the number of output channels.
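
The fix described above can be sketched roughly as follows. This is a minimal illustration, not the actual aDAG code: `normalize_outputs` and its parameters are hypothetical names, and it assumes a failed task yields a single exception object while a successful one yields one value per output channel.

```python
from typing import Any, List


def normalize_outputs(result: Any, num_output_channels: int) -> List[Any]:
    """Return exactly one value per output channel (hypothetical helper)."""
    if isinstance(result, Exception):
        # A failure produces a single exception; repeat it so that every
        # output channel receives something to raise on the reader side.
        return [result] * num_output_channels
    if num_output_channels == 1:
        return [result]
    outputs = list(result)
    if len(outputs) != num_output_channels:
        raise ValueError(
            f"expected {num_output_channels} outputs, got {len(outputs)}"
        )
    return outputs


ok = normalize_outputs((1, 2), 2)
failed = normalize_outputs(RuntimeError("boom"), 2)
```

Here every channel sees the same exception object, which keeps the "one value per channel" invariant intact on the error path.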
    
    Signed-off-by: ujjawal-khare <[email protected]>
    rkooo567 authored and ujjawal-khare committed Oct 15, 2024
    Commit dd8ee01
  43. [fake autoscaler] use check_call in fake multi node test utils (ray-project#47772)
    
    so that output is printed to logs
    
    and also use "sys.executable" rather than "python"
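
As a small illustration of the two points above (not the actual test utility code), `subprocess.check_call` lets the child's output flow into the caller's stdout/stderr and raises on a non-zero exit code, while `sys.executable` guarantees the child runs under the same interpreter as the caller. The child commands below are placeholders.

```python
import subprocess
import sys

# The child inherits stdout/stderr, so its output lands in the caller's logs.
subprocess.check_call([sys.executable, "-c", "print('hello from child')"])

# A non-zero exit code surfaces as CalledProcessError instead of being ignored.
try:
    subprocess.check_call([sys.executable, "-c", "import sys; sys.exit(3)"])
except subprocess.CalledProcessError as err:
    exit_code = err.returncode
```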
    
    Signed-off-by: Lonnie Liu <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    aslonnie authored and ujjawal-khare committed Oct 15, 2024
    Commit 361c10e
  44. [RLlib] RLModule: Simplify defining custom distribution classes and add better defaults. (ray-project#47775)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    Commit 605c640
  45. [fake autoscaler] remove the redundant mkdir (ray-project#47786)

    - docker compose service volume short syntax uses bind (similar to `-v`)
    and will create the dir if it does not exist
    - the code was not mapping the dir to a host path, so it actually has no
    meaningful effect when it is running in a container, such as on CI
    
    Signed-off-by: Lonnie Liu <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    aslonnie authored and ujjawal-khare committed Oct 15, 2024
    Commit aaa3d8d
  46. [Data] Simplify and consolidate progress bar outputs (ray-project#47692)

    ## Why are these changes needed?
    
    Currently, the progress bar is pretty verbose because it is very
    information dense. This PR:
    - Reorganizes progress output to group by relevant concepts and
    clarifies labels
    - Standardizes global and operator-level progress bar outputs
    - Removes the use of all emojis (poor rendering on some platforms /
    external logging systems)
    
    Progress bar before this PR:
    <img width="1403" alt="Screenshot at Sep 16 13-00-17"
    src="https://github.com/user-attachments/assets/4f459b77-06ba-4395-b883-e4c9ac8ca2ef">
    
    Progress bar after this PR:
    <img width="1502" alt="Screenshot at Sep 23 13-48-32"
    src="https://github.com/user-attachments/assets/0c0f8c94-9439-4fd4-ae1a-2857b3a87b59">
    
    Will follow up with a docs PR once we merge this change, so that I don't
    need to continuously modify the docs.
    
    In the future, we should restructure the way progress bars are
    grouped/tracked, so that we can tabulate the op-level progress bar
    outputs.
    
    ## Related issue number
    
    ## Checks
    
    - [x] I've signed off every commit(by using the -s flag, i.e., `git
    commit -s`) in this PR.
    - [x] I've run `scripts/format.sh` to lint the changes in this PR.
    - [ ] I've included any doc changes needed for
    https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I
    added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
               corresponding `.rst` file.
    - [x] I've made sure the tests are passing. Note that there might be a
    few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    - Testing Strategy
       - [ ] Unit tests
       - [ ] Release tests
       - [ ] This PR is not tested :(
    
    ---------
    
    Signed-off-by: Scott Lee <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    scottjlee authored and ujjawal-khare committed Oct 15, 2024
    Commit c7ff9c8
  47. Add perf metrics for 2.37.0 (ray-project#47791)

    for release perf checking.
    
    Signed-off-by: kevin <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    khluu authored and ujjawal-khare committed Oct 15, 2024
    Commit 6aae543
  48. [Serve] add dependencies on openssl (ray-project#47738)

    Add `pyOpenSSL` dependency for Serve, and update the test Docker file to use
    ray[serve-grpc] dependencies.
    
    Signed-off-by: Gene Su <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    GeneDer authored and ujjawal-khare committed Oct 15, 2024
    Commit bc7f7b0
  49. [docker] Update latest Docker dependencies for 2.36.0 release (ray-project#47748)
    
    Created by release automation bot.
    
    Update with commit f298a75
    
    Signed-off-by: kevin <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    khluu authored and ujjawal-khare committed Oct 15, 2024
    Commit 247be0b
  50. [docker] Update latest Docker dependencies for 2.36.1 release (ray-project#47801)
    
    Created by release automation bot.
    
    Update with commit 18b2d94
    
    Signed-off-by: kevin <[email protected]>
    Signed-off-by: Kevin H. Luu <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    khluu authored and ujjawal-khare committed Oct 15, 2024
    Commit d4e7f7f
  51. [observability][export-api] Write submission job events (ray-project#47468)
    
    Add ExportEventLoggerAdapter, which will be used to write export events to file from Python files. Only a single ExportEventLoggerAdapter instance will exist per source type, so callers can create or get this instance using get_export_event_logger, which is thread safe.
    Write Submission Job export events to file from JobInfoStorageClient.put_info, which is called to update the JobInfo data in the internal KV store.
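
The "one instance per source type, obtained through a thread-safe getter" idea can be sketched as below. This is only an illustration of the pattern under assumed names: `ExportLogger`, `get_logger_for`, and the `"SUBMISSION_JOB"` label are hypothetical and do not reproduce the signatures of `ExportEventLoggerAdapter` or `get_export_event_logger`.

```python
import threading
from typing import Dict, List


class ExportLogger:
    """Hypothetical stand-in for a per-source-type export event writer."""

    def __init__(self, source_type: str) -> None:
        self.source_type = source_type
        self.events: List[dict] = []

    def send_event(self, event: dict) -> None:
        self.events.append(event)


_instances: Dict[str, ExportLogger] = {}
_instances_lock = threading.Lock()


def get_logger_for(source_type: str) -> ExportLogger:
    """Create the instance on first use; later calls return the same object."""
    with _instances_lock:
        if source_type not in _instances:
            _instances[source_type] = ExportLogger(source_type)
        return _instances[source_type]


# Several threads asking for the same source type all share one instance.
seen: List[ExportLogger] = []
threads = [
    threading.Thread(target=lambda: seen.append(get_logger_for("SUBMISSION_JOB")))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Holding the lock around both the lookup and the creation is what prevents two threads from each constructing their own instance for the same source type.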
    
    Signed-off-by: ujjawal-khare <[email protected]>
    nikitavemuri authored and ujjawal-khare committed Oct 15, 2024
    Commit f994475
  52. Move export events to separate folder (ray-project#47747)

    Move export events from session_latest/logs/events to session_latest/logs/export_events.
    Keeping both event types in the same folder doesn't cause any issue for Ray: export event files are already filtered out for the /events API in
    ray/python/ray/dashboard/modules/event/event_utils.py (line 22 in 1e48a03):
    
     all_source_types = set(event_consts.EVENT_SOURCE_ALL)
    
    However, moving these to a separate folder would be better for existing downstream consumers, who then avoid handling export events in the events folder if they turn the flag on.
    
    Signed-off-by: ujjawal-khare <[email protected]>
    nikitavemuri authored and ujjawal-khare committed Oct 15, 2024
    Commit c8f16e3
  53. [release] stream the full anyscale log to buildkite (ray-project#47808)

    Currently we only print the last 100 lines of the Anyscale job log to Buildkite.
    This PR removes that limit and prints everything instead. CC:
    @kouroshHakha
    
    Test:
    - CI
    
    Signed-off-by: can <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    can-anyscale authored and ujjawal-khare committed Oct 15, 2024
    Commit 1b8fcac
  54. Commit 575ee94
  55. [docker] Update latest Docker dependencies for 2.37.0 release (ray-project#47812)
    
    Created by release automation bot.
    
    Update with commit d2982b7
    
    Signed-off-by: kevin <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    khluu authored and ujjawal-khare committed Oct 15, 2024
    Commit 1d25a39
  56. [RLlib] Fix action masking example. (ray-project#47817)

    Signed-off-by: ujjawal-khare <[email protected]>
    simonsays1980 authored and ujjawal-khare committed Oct 15, 2024
    Commit ce75400
  57. [Core] Separate the attempt_number with the task_status in memory summary and object list (ray-project#47818)
    
    # Current status:
    * When we retrieve the information from GCS, the task_status and
    the attempt number are in 2 separate fields, and the task status is an enum.
    * Later, during reconstruction, the 2 fields are combined into 1: the
    number of attempts is added to the task_status field.
    * That's why, when displaying the objects, the function isn't able to
    convert the string back to the enum.
    
    # Proposed solution:
    * Instead of combining the 2 fields (task_status and attempt), we will
    keep the 2 fields and add an additional field (attempt_number) in the
    Object State.
    * In this way, we keep the task_status as an enum and put the attempt
    number information in a different field.
    
    # Changes in this PR:
    * Added `attempt_number` to `ObjectState` and
    `task_attempt_number_counts` to `ObjectSummaryPerKey`
      * Added logic to populate the fields as proposed above
    * Updated the logic for the memory summary function to display the
    attempt number in a new column
      * Corresponding tests added as well
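
To make the failure mode and the fix concrete, here is a minimal sketch. It does not use Ray's real `ObjectState` schema: `TaskStatus`, `ObjectRow`, `parse_row`, and the combined string `"FINISHED (attempt 2)"` are all hypothetical, chosen only to show why a status string with the attempt count mixed in cannot be converted back to an enum.

```python
from dataclasses import dataclass
from enum import Enum


class TaskStatus(Enum):
    # Illustrative subset; not the full set of statuses Ray defines.
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    FINISHED = "FINISHED"


@dataclass
class ObjectRow:
    object_id: str
    task_status: TaskStatus
    attempt_number: int


def parse_row(raw: dict) -> ObjectRow:
    """Keep status and attempt in separate fields so the enum lookup works."""
    return ObjectRow(
        object_id=raw["object_id"],
        task_status=TaskStatus(raw["task_status"]),
        attempt_number=int(raw["attempt_number"]),
    )


row = parse_row({"object_id": "abc", "task_status": "FINISHED", "attempt_number": 2})

# A combined string (hypothetical format) is not a valid enum value.
try:
    TaskStatus("FINISHED (attempt 2)")
    combined_parses = True
except ValueError:
    combined_parses = False
```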
    
    Signed-off-by: Mengjin Yan <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    MengjinYan authored and ujjawal-khare committed Oct 15, 2024
    Commit 788db07
  58. [RLlib; docs] New API stack migration guide. (ray-project#47779)

    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    Commit 55397ea
  59. [RLlib; new API stack by default] Switch on new API stack by default for SAC and DQN. (ray-project#47217)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    Commit 27985d4
  60. [Core] Fix a Typo in dict_to_state function parameter name (ray-project#47822)
    
    Signed-off-by: Mengjin Yan <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    MengjinYan authored and ujjawal-khare committed Oct 15, 2024
    Commit 90742fb
  61. [core] Introducing InstrumentedIOContextWithThread. (ray-project#47831)

    Previously we had several ad-hoc places implementing a "thread and io_context"
    pattern: create a thread dedicated to an asio io_context, so that workloads
    can post async tasks onto it. This duplicates code: everywhere we
    create threads, we also implement stop and join.
    
    This PR introduces InstrumentedIOContextWithThread, which does exactly this and
    replaces existing usages.
    
    It also fixes some absl::Time computations to follow best practice.
    
    This is a refactoring and should cause no runtime difference.
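
For readers less familiar with the pattern, the following is a rough Python analogue, using an asyncio event loop in place of an asio io_context. It is not Ray's C++ class: `LoopThread`, `post`, and `stop` are hypothetical names that only mirror the "dedicated thread, post work, stop and join" shape described above.

```python
import asyncio
import threading


class LoopThread:
    """Own an asyncio event loop running on a dedicated thread."""

    def __init__(self, name: str) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(
            target=self._loop.run_forever, name=name, daemon=True
        )
        self._thread.start()

    def post(self, coro):
        """Schedule a coroutine on the loop; returns a concurrent.futures.Future."""
        return asyncio.run_coroutine_threadsafe(coro, self._loop)

    def stop(self) -> None:
        """Stop the loop, join the thread, and release the loop's resources."""
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def add(a: int, b: int) -> int:
    return a + b


worker = LoopThread("io-worker")
result = worker.post(add(2, 3)).result(timeout=5)
worker.stop()
```

Wrapping start, post, stop, and join in one class is what removes the per-call-site duplication the commit message mentions.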
    
    Signed-off-by: Ruiyang Wang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    rynewang authored and ujjawal-khare committed Oct 15, 2024
    Commit 1383374
  62. [RLlib] Discontinue support for "hybrid" API stack (using RLModule + Learner, but still on RolloutWorker and Policy) (ray-project#46085)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    Commit 417cdd2
  63. [Core] Fix object reconstruction hang on arguments pending creation (ray-project#47645)
    
    Signed-off-by: Jiajun Yao <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    jjyao authored and ujjawal-khare committed Oct 15, 2024
    Commit 6c160b3
  64. [core][experimental] Fix test_execution_schedule_gpu (ray-project#47753)

    Pass a GPU tensor to execute, but it gets converted into a CPU tensor. The issue may be related to ray-project#46440.
    
    Signed-off-by: ujjawal-khare <[email protected]>
    kevin85421 authored and ujjawal-khare committed Oct 15, 2024
    Commit 3714afd
  65. [core] Change many Ray ID logs to WithField. (ray-project#47844)

    Use structured logging by changing more `<< node_id` to use
    `.WithField(node_id)`. This is not intended to be a complete work, but
    it should cover most of the cases. We did the work for NodeID, WorkerID,
    ActorID, JobID, TaskID, PlacementGroupID.
    
    Some logs have multiple IDs. To avoid confusion, for these we only use
    WithField(object_id) and don't use WithField on either of the Node IDs.
    
    This PR should have no change on Ray other than logs.
    
    Signed-off-by: Ruiyang Wang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    rynewang authored and ujjawal-khare committed Oct 15, 2024
    Commit 256c177
  66. [RLlib] Cleanup examples folder (vol 30): BC pretraining, then PPO finetuning (new API stack with RLModule checkpoints). (ray-project#47838)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    Commit a4f62b6
  67. [RLlib] MultiAgentEnv API enhancements (related to defining obs-/action spaces for agents). (ray-project#47830)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    Commit 1ac860f
  68. [RLlib] Add log-std clipping to 'MLPHead's. (ray-project#47827)

    Signed-off-by: ujjawal-khare <[email protected]>
    simonsays1980 authored and ujjawal-khare committed Oct 15, 2024
    Commit 43a8a1d
  69. Commit cfbda91
  70. [kuberay] Update docs for KubeRay v1.2.2 (ray-project#47867)

    change kuberay helm and branch reference versions to v1.2.2
    
    Signed-off-by: ujjawal-khare <[email protected]>
    kevin85421 authored and ujjawal-khare committed Oct 15, 2024
    Commit 71bb74b
  71. [Arrow] Adding ArrowTensorTypeV2 to support tensors larger than 2Gb (ray-project#47832)
    
    Currently, when using the tensor type in Ray Data, if a single tensor in a
    block grows above 2Gb (due to the use of signed `int32` offsets), this
    results in the following error:
    
    ```
    pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
    ```
    
    Consequently, this change adds support for tensors of > 2Gb in size,
    while maintaining compatibility with existing datasets already using
    tensors.
    
    This is done by forking `ArrowTensorType` in 2:
    
    - `ArrowTensorType` (v1) remains intact
    - `ArrowTensorTypeV2` is rebased on Arrow's `LargeListType` and
    now uses `int64` offsets
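
The 2Gb limit follows directly from the offset width. The back-of-the-envelope check below is a sketch under a simplifying assumption: it treats offsets as positions into one flattened buffer and measures its length in bytes, which the text above does not spell out. `fits_offsets` and the example tensor shape are hypothetical.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit offset can hold
INT64_MAX = 2**63 - 1  # largest value a signed 64-bit offset can hold


def fits_offsets(total_length: int, offset_max: int) -> bool:
    """True if every position in a buffer of this length is representable."""
    return total_length <= offset_max


# A float32 tensor of shape (1024, 1024, 600) has 4 bytes per element.
tensor_bytes = 1024 * 1024 * 600 * 4

overflows_v1 = not fits_offsets(tensor_bytes, INT32_MAX)
fits_v2 = fits_offsets(tensor_bytes, INT64_MAX)
```

With 32-bit offsets the example tensor overflows, while 64-bit offsets leave ample headroom, which is the motivation for a separate V2 type rather than changing the existing one in place.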
    
    ---------
    
    Signed-off-by: Peter Wang <[email protected]>
    Signed-off-by: Alexey Kudinkin <[email protected]>
    Co-authored-by: Peter Wang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    2 people authored and ujjawal-khare committed Oct 15, 2024
    Commit cd61cb3
  72. [Core] Fix check failure: sync_reactors_.find(reactor->GetRemoteNodeID()) == sync_reactors_.end() (ray-project#47861)
    
    Signed-off-by: Jiajun Yao <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    jjyao authored and ujjawal-khare committed Oct 15, 2024
    Commit 21af246
  73. [RLlib] New API stack: (Multi)RLModule overhaul vol 01 (some preparatory cleanups). (ray-project#47884)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    Commit 072d349
  74. [RLlib] New API stack: (Multi)RLModule overhaul vol 02 (VPG RLModule, Algo, and Learner example classes). (ray-project#47885)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    Commit 383d7ff
  75. [RLlib] New API stack: (Multi)RLModule overhaul vol 03 (Introduce generic `_forward` to further simplify the user experience). (ray-project#47889)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    Commit e4401e5
  76. [RLlib] Remove Tf support on new API stack for PPO/IMPALA/APPO (only DreamerV3 on new API stack remains with tf now). (ray-project#47892)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    Commit 759b0c8
  77. [core] Change debug_string from returning a string to streaming to an ostream. (ray-project#47893)
    
    We have a convenience function `debug_string` used in Ray logs: it
    prints printables (operator<<), containers, pairs. However, it returns a
    std::string which is fed into RAY_LOG(). This makes a copy.
    
    This PR changes the signature to return a `DebugStringWrapper`, which holds a const
    reference to the argument and is printable for all already supported
    types. It additionally supports std::tuple.
    
    This should only have marginal perf benefits since we typically don't
    debug_string a very big data structure.
    
    Signed-off-by: Ruiyang Wang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    rynewang authored and ujjawal-khare committed Oct 15, 2024
    Commit b50f7c1
  78. [Serve / Jobs] Check if conda env exists before removing (ray-project#47922)
    
    ## Why are these changes needed?
    Fixes some failing/flaky unit tests, which fail with errors like:
    ```
    EnvironmentLocationNotFound: Not a conda environment: /opt/miniconda/envs/jobs-backwards-compatibility-cc452d926b8748a1ab6b4fbf6a6dba2b
    ```
    - TestBackwardsCompatibility.test_cli
    - test_failed_driver_exit_code
    
    Previously failing test now passes with this PR applied:
    https://buildkite.com/ray-project/postmerge/builds/6479#0192693b-1b8f-4dbc-a497-26d163b52c70/181-934
    
    ## Related issue number
    
    ## Checks
    
    - [x] I've signed off every commit(by using the -s flag, i.e., `git
    commit -s`) in this PR.
    - [x] I've run `scripts/format.sh` to lint the changes in this PR.
    - [ ] I've included any doc changes needed for
    https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I
    added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
               corresponding `.rst` file.
    - [x] I've made sure the tests are passing. Note that there might be a
    few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    - Testing Strategy
       - [x] Unit tests
       - [ ] Release tests
       - [ ] This PR is not tested :(
    
    Signed-off-by: Scott Lee <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    scottjlee authored and ujjawal-khare committed Oct 15, 2024
    Commit 8844a78
  79. [job] don't continue on test setup (ray-project#47927)

    When the conda env exists, the test setup should just remove it and continue
    the testing.
    
    Signed-off-by: Lonnie Liu <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    aslonnie authored and ujjawal-khare committed Oct 15, 2024
    Commit 122b382
  80. [core][experimental] Avoid false positives in deadlock detection (ray-project#47912)
    
    Signed-off-by: Kai-Hsun Chen <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    kevin85421 authored and ujjawal-khare committed Oct 15, 2024
    Commit b16a782
  81. [serve] Stop scheduling task early when requests have been cancelled (ray-project#47847)
    
    In `fulfill_pending_requests`, there are two nested loops:
    - the outer loop greedily fulfills more requests so that if backoff
    doesn't occur, it's not necessary for new asyncio tasks to be started to
    fulfill each request
    - the inner loop handles backoff if replicas can't be found to fulfill
    the next request
    
    The outer loop will be stopped if there are enough tasks to handle all
    pending requests. However if all replicas are at max capacity, it's
    possible for the inner loop to continue to loop even when the task is no
    longer needed (e.g. when a request has been cancelled), because the
    inner loop simply continues to try to find an available replica without
    checking if the current task is even necessary.
    
    This PR makes sure that at the end of each iteration of the inner loop,
    it clears out requests in `pending_requests_to_fulfill` that have been
    cancelled, and then breaks out of the loop if there are enough tasks to
    handle the remaining requests.
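
The end-of-iteration check described above can be sketched as follows. This is a simplified illustration rather than the scheduler's code: `drop_cancelled` and `task_still_needed` are hypothetical helpers, pending requests are modelled as bare asyncio futures, and the exact exit condition (a task exits once tasks outnumber the remaining requests) is an assumption.

```python
import asyncio
from typing import List


def drop_cancelled(pending: List[asyncio.Future]) -> List[asyncio.Future]:
    """Keep only requests that are still waiting for a replica."""
    return [fut for fut in pending if not fut.done()]


def task_still_needed(num_scheduling_tasks: int, pending: List[asyncio.Future]) -> bool:
    """A scheduling task is only useful while tasks do not outnumber requests."""
    return num_scheduling_tasks <= len(pending)


async def demo() -> dict:
    loop = asyncio.get_running_loop()
    pending = [loop.create_future() for _ in range(3)]
    num_tasks = 3

    # Two callers give up while all replicas are at max capacity.
    pending[0].cancel()
    pending[1].cancel()

    # End of one inner-loop iteration: prune, then decide whether to break.
    pending = drop_cancelled(pending)
    return {
        "remaining": len(pending),
        "keep_running": task_still_needed(num_tasks, pending),
    }


summary = asyncio.run(demo())
```

Without the pruning step, the cancelled futures would keep the request count inflated and the inner loop would go on backing off for work nobody is waiting on.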
    
    Tests:
    - Added a test that tests for the scenario where a request is cancelled
    while it's trying to find an available replica
    - Also modified the tests in `test_pow_2_scheduler.py` so that the
    backoff sequence uses small values (1ms) and the timeouts in the tests
    are also low (10ms), so that the unit tests run much faster (~5s now
    compared to ~30s before).
    
    ## Related issue number
    
    related: ray-project#47585
    
    ---------
    
    Signed-off-by: Cindy Zhang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    zcin authored and ujjawal-khare committed Oct 15, 2024
    Commit d360d45
  82. [RLlib] New API stack: (Multi)RLModule overhaul vol 05 (deprecate Specs, SpecDict, TensorSpec). (ray-project#47915)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    Commit b2a8acf
  83. [RLlib; fault-tolerance] Fix spot node preemption problem (RLlib does not catch correct `ObjectLostError`). (ray-project#47940)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    Commit fe2aea0
  84. [RLlib] New API stack: (Multi)RLModule overhaul vol 04 (deprecate RLModuleConfig; cleanups, DefaultModelConfig dataclass). (ray-project#47908)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    Commit 80824d0
  85. [Core] Fix check failure RAY_CHECK(it != current_tasks_.end()); (ray-project#47659)
    
    Signed-off-by: Jiajun Yao <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    jjyao authored and ujjawal-khare committed Oct 15, 2024
    Commit da339ad
  86. Commit 315bdf1
  87. [core] Add more debug string types (ray-project#47928)

    Followup on ray-project#47893, add more
    "blessed container types" to debug string function.
    
    Signed-off-by: dentiny <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    dentiny authored and ujjawal-khare committed Oct 15, 2024
    Commit c5bbfe8
  88. [deps] add grpcio-tools into anyscale dependencies (ray-project#47955)

    so that it participates in the dependency resolving process
    
    Signed-off-by: Lonnie Liu <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    aslonnie authored and ujjawal-khare committed Oct 15, 2024
    Commit 364ee39
  89. [RLlib] Quick-fix for default RLModules in combination with a user-provided config-sub-dict (instead of a full `DefaultModelConfig`). (ray-project#47965)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    Commit d04f8d3
  90. [RLlib] Cleanup examples folder vol. 25: Remove some old API stack examples. (ray-project#47970)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    Commit ca5d29b
  91. Commit 3c1aa3b
  92. [serve] Fix failing test pow 2 scheduler on windows (ray-project#47975)

    ## Why are these changes needed?
    
    Fix `test_pow_2_replica_scheduler.py` on Windows. Best guess is that asyncio
    is slower on Windows, so the shortened timeouts for some tests cause the
    tests to fail because tasks didn't get a chance to start/finish
    executing.
    
    Failing tests on Windows:
    - `test_multiple_queries_with_different_model_ids`
    - `test_queue_len_cache_replica_at_capacity_is_probed`
    - `test_queue_len_cache_background_probing`
    
    ## Related issue number
    
    Closes ray-project#47950
    
    Signed-off-by: Cindy Zhang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    zcin authored and ujjawal-khare committed Oct 15, 2024
    Commit 7585842
  93. [data] fix reading multiple parquet files with ragged ndarrays (ray-project#47961)
    
    ## Why are these changes needed?
    
    PyArrow infers parquet schema only based on the first file. This will
    cause errors when reading multiple files with ragged ndarrays.
    
    This PR fixes this issue by not using the inferred schema for reading.
    
    ## Related issue number
    Fixes ray-project#47960
    
    ---------
    
    Signed-off-by: Hao Chen <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    raulchen authored and ujjawal-khare committed Oct 15, 2024
    ca871bc
  94. [core] Decouple create worker vs pop worker request. (ray-project#47694)

    Currently, when you call PopWorker(), it finds an idle worker or creates
    one. If a new worker is created, the worker is associated with the
    request and can only be used by it.
    
    This PR decouples worker creation from worker-to-task assignment by
    adding an abstraction, PopWorkerRequest. With this change, if a request
    triggers a worker creation, the request is put into a queue. When a
    worker becomes ready, that is, when PushWorker is called by either a
    newly started worker or a released worker, Ray matches it to the first
    fitting request in the queue. This reduces latency.
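    
    The queue-and-match behaviour described above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the Ray implementation (which is C++): `WorkerPoolSketch`, `Worker`, the `language` field used for "fitting", and the `starts_requested` counter are all hypothetical names.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List


@dataclass
class Worker:
    worker_id: str
    language: str  # hypothetical attribute used to decide whether a worker "fits"


@dataclass
class PopWorkerRequest:
    language: str
    callback: Callable[[Worker], None]  # invoked once a fitting worker is assigned


class WorkerPoolSketch:
    def __init__(self) -> None:
        self.idle_workers: List[Worker] = []
        self.pending_requests: Deque[PopWorkerRequest] = deque()
        self.starts_requested = 0  # stand-in for actually spawning a process

    def pop_worker(self, request: PopWorkerRequest) -> None:
        # Serve the request from an idle worker when one fits.
        for i, worker in enumerate(self.idle_workers):
            if worker.language == request.language:
                request.callback(self.idle_workers.pop(i))
                return
        # Otherwise queue the request and ask for a new worker. The new
        # worker is not reserved for this particular request.
        self.pending_requests.append(request)
        self.starts_requested += 1

    def push_worker(self, worker: Worker) -> None:
        # Called for newly started and released workers alike: hand the
        # worker to the first queued request it fits, else keep it idle.
        for i, request in enumerate(self.pending_requests):
            if request.language == worker.language:
                del self.pending_requests[i]
                request.callback(worker)
                return
        self.idle_workers.append(worker)
```

    In this sketch a released worker can satisfy a queued request before the worker started on its behalf is ready, which is where the latency reduction would come from.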
    
    Later it can also be used to pre-start workers more meaningfully.
    
    Signed-off-by: Ruiyang Wang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    rynewang authored and ujjawal-khare committed Oct 15, 2024
    6a38914
  95. [core] Add metrics for gcs jobs (ray-project#47793)

    This PR adds metrics for job states within the job manager.
    
    In detail, a gauge stat is sent via the OpenCensus exporter, so running
    Ray jobs can be tracked and alerts can be created later on.
    
    Fault tolerance is not considered; according to the
    [doc](https://docs.ray.io/en/latest/ray-core/fault_tolerance/gcs.html),
    state is reconstructed at restart.
    
    On testing: the best way is to observe the metrics via an OpenCensus
    backend (i.e. the Google monitoring dashboard), but that is not easy for
    open-source contributors. The alternative is a mock or fake exporter
    implementation, which I could not find in the code base.
    
    Signed-off-by: dentiny <[email protected]>
    Co-authored-by: Ruiyang Wang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    2 people authored and ujjawal-khare committed Oct 15, 2024
    669d699
  96. upgrade grpcio version (ray-project#47982)

    to at least 1.66.1
    
    this is already being overwritten to 1.66.1+ during release tests
    
    Signed-off-by: Lonnie Liu <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    aslonnie authored and ujjawal-khare committed Oct 15, 2024
    f2b09d4
  97. [Chore][Core] Address PR 47807 comments (ray-project#48002)

    PR 47807 was auto-merged without applying the doc reviews, so this
    commit addresses them.
    
    Signed-off-by: Chi-Sheng Liu <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    MortalHappiness authored and ujjawal-khare committed Oct 15, 2024
    aed856b
  98. [core] Add thread check to job mgr callback (ray-project#48005)

    This PR follows up on the comment in
    ray-project#47793 (comment),
    and adds a thread check to the GCS job manager callback to make sure
    there is no concurrent access to data members.
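    
    As a rough illustration of what such a thread check does, here is a minimal Python sketch. It is not the code from the PR (the GCS job manager is C++); `ThreadCheckerSketch`, `JobCounterSketch`, and `on_job_finished` are hypothetical names chosen for this example.

```python
import threading
from typing import Optional


class ThreadCheckerSketch:
    """Remembers the first calling thread and reports whether later calls match it."""

    def __init__(self) -> None:
        self._owner: Optional[int] = None
        self._lock = threading.Lock()

    def is_on_same_thread(self) -> bool:
        current = threading.get_ident()
        with self._lock:
            if self._owner is None:
                self._owner = current  # the first caller becomes the owner
            return self._owner == current


class JobCounterSketch:
    """Hypothetical callback owner whose state must only be touched by one thread."""

    def __init__(self) -> None:
        self._checker = ThreadCheckerSketch()
        self.finished_jobs = 0

    def on_job_finished(self) -> None:
        # Fail loudly instead of silently racing on `finished_jobs`.
        if not self._checker.is_on_same_thread():
            raise RuntimeError("callback invoked from an unexpected thread")
        self.finished_jobs += 1
```

    The check does not make the data thread-safe; it only turns an accidental cross-thread call into an immediate, visible failure.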
    
    Signed-off-by: dentiny <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    dentiny authored and ujjawal-khare committed Oct 15, 2024
    4cf016c
  99. unnecessary file removed

    Signed-off-by: ujjawal-khare <[email protected]>
    ujjawal-khare committed Oct 15, 2024
    155a415
  100. unnecessary files removed

    Signed-off-by: ujjawal-khare <[email protected]>
    ujjawal-khare committed Oct 15, 2024
    e7c94c5
  101. lint error fixed

    Signed-off-by: ujjawal-khare <[email protected]>
    ujjawal-khare committed Oct 15, 2024
    b83e7ad
  102. lint error fixed

    Signed-off-by: ujjawal-khare <[email protected]>
    ujjawal-khare committed Oct 15, 2024
    b77c5ad
  103. [Serve] fix grpc performance issue (ray-project#47338)

    This PR fixes part of the problem by creating the payload message once
    and reusing it throughout the benchmark.
    
    Ran the release test on this change
    [build](https://buildkite.com/ray-project/release/builds/21663#01918fe1-853b-46f2-9699-c4045b182b8c);
    `grpc_10mb_p50_latency` dropped to ~58ms from ~80ms previously.
    
    The rest of the issue comes from the existing gRPC server implementation,
    which has to wait for the entire unary request before it can continue
    its work on the replica. We will need to create a new HTTP2 proxy and
    pass the request transparently between the replica and the proxy to
    speed things up. Will follow up in the future on
    ray-project#47370
    
    Closes ray-project#47371
    
    Signed-off-by: Gene Su <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    GeneDer authored and ujjawal-khare committed Oct 15, 2024
    7ad9a3b
  104. [observability][export-api] Write node events (ray-project#47221)

    Write node events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false.
    Event write is called whenever a value in the node event data schema is modified. Typically this occurs in the callback after writing NodeTable to the GCS table
    
    Signed-off-by: ujjawal-khare <[email protected]>
    nikitavemuri authored and ujjawal-khare committed Oct 15, 2024
    cdc86fa
  105. [RLlib] Cleanup examples folder (vol 23): Float16 training support an…

    …d new example script. (ray-project#47362)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    1ea718b
  106. [core][dashboard] Update nodes on delta. (ray-project#47367)

    Like actor_head.py, we now update DataSource.nodes on delta. It first
    queries all node infos, then subscribes to node deltas. Each delta updates:
    
    1. DataSource.nodes[node_id]
    2. DataSource.agents[node_id]
    3. a warning generated after
    RAY_DASHBOARD_HEAD_NODE_REGISTRATION_TIMEOUT = 10s
    
    Note on (2) agents: the agent port is read from internal kv and is not
    readily available until agent.py is spawned and writes its own port to
    internal kv. So we make an async task for each node that polls this port
    every 1s.
    
    It turns out that the get-all-then-subscribe code has a TOCTOU problem,
    so this PR also updates actor_head.py to first subscribe and then get all
    actors.
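    
    The TOCTOU gap between "get all" and "subscribe" can be shown with a small, single-threaded Python sketch. Everything here is an assumption made for illustration: `FakeSource` stands in for the GCS, and the `race` callback simulates an update landing between the two steps. It is not the dashboard code.

```python
from typing import Callable, Dict, List, Tuple

Delta = Tuple[str, str]  # (node_id, state)


class FakeSource:
    """Hypothetical stand-in for a store that offers a snapshot plus a delta feed."""

    def __init__(self) -> None:
        self._nodes: Dict[str, str] = {}
        self._subscribers: List[Callable[[Delta], None]] = []

    def subscribe(self, callback: Callable[[Delta], None]) -> None:
        self._subscribers.append(callback)

    def get_all(self) -> Dict[str, str]:
        return dict(self._nodes)

    def update(self, node_id: str, state: str) -> None:
        self._nodes[node_id] = state
        for callback in self._subscribers:
            callback((node_id, state))


def get_then_subscribe(source: FakeSource, race: Callable[[], None]) -> Dict[str, str]:
    view = source.get_all()
    race()  # an update in this gap is in neither the snapshot nor the feed
    source.subscribe(lambda d: view.__setitem__(d[0], d[1]))
    return view


def subscribe_then_get(source: FakeSource, race: Callable[[], None]) -> Dict[str, str]:
    view: Dict[str, str] = {}
    source.subscribe(lambda d: view.__setitem__(d[0], d[1]))
    race()  # the same update is now delivered through the feed
    for node_id, state in source.get_all().items():
        # Keep entries already set by a delta; real code would need
        # versioning to decide which of the two values is newer.
        view.setdefault(node_id, state)
    return view
```

    With the first ordering the racing update is silently lost; with the second it shows up in the view, at the cost of having to reconcile the snapshot with deltas that arrived first.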
    
    Signed-off-by: Ruiyang Wang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    rynewang authored and ujjawal-khare committed Oct 15, 2024
    ddec4a5
  107. [RLlib] Cleanup examples folder (vol 24): Mixed-precision training (a…

    …nd float16 inference) through new example script. (ray-project#47116)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    8820918
  108. Split python/ray/tests/test_actor_retry over two files (ray-project#4…

    …7188)
    
    The `test_actor_retry` tests are failing/flaky on Windows. They pass locally. I have not been able to access the CI logs to see what is going wrong. To narrow down the problem (is it an overall timeout? Is one of the tests failing?), we can start by splitting the tests into two files.
    
    Toward solving ray-project#43845.
    
    Signed-off-by: ujjawal-khare <[email protected]>
    mattip authored and ujjawal-khare committed Oct 15, 2024
    9f76655
  109. [RLlib; Offline RL] - Enable reading old-stack SampleBatch data in …

    …new stack Offline RL. (ray-project#47359)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    simonsays1980 authored and ujjawal-khare committed Oct 15, 2024
    05fad3f
  110. [serve] redeploy in between each microbenchmark (ray-project#47404)

    Redeploy in between each microbenchmark.
    
    ---------
    
    Signed-off-by: Cindy Zhang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    zcin authored and ujjawal-khare committed Oct 15, 2024
    fa17d58
  111. 28d7347
  112. [doc] Instruction for troubleshooting side nav when building incremen…

    …tally (ray-project#47372)
    
    Signed-off-by: khluu <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    khluu authored and ujjawal-khare committed Oct 15, 2024
    e6c08e1
  113. [Doc] Run pre-commit on cluster docs (ray-project#47342)

    Currently we have no linting on any part of the docs code. This PR runs
    pre-commit on the cluster docs.
    
    This PR fixes the following issues:
    
    ```
    trim trailing whitespace.................................................Failed
    - hook id: trailing-whitespace
    - exit code: 1
    - files were modified by this hook
    
    Fixing doc/source/cluster/kubernetes/user-guides/aws-eks-gpu-cluster.md
    Fixing doc/source/cluster/running-applications/job-submission/cli.rst
    Fixing doc/source/cluster/configure-manage-dashboard.md
    Fixing doc/source/cluster/kubernetes/user-guides/pod-security.md
    Fixing doc/source/cluster/vms/user-guides/launching-clusters/vsphere.md
    Fixing doc/source/cluster/kubernetes/user-guides/helm-chart-rbac.md
    Fixing doc/source/cluster/vms/references/ray-cluster-configuration.rst
    Fixing doc/source/cluster/running-applications/job-submission/quickstart.rst
    Fixing doc/source/cluster/kubernetes/examples/stable-diffusion-rayservice.md
    Fixing doc/source/cluster/kubernetes/getting-started/raycluster-quick-start.md
    Fixing doc/source/cluster/kubernetes/examples/rayjob-kueue-gang-scheduling.md
    Fixing doc/source/cluster/kubernetes/k8s-ecosystem/ingress.md
    Fixing doc/source/cluster/kubernetes/user-guides/kuberay-gcs-ft.md
    Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster-networkpolicy.yaml
    Fixing doc/source/cluster/kubernetes/k8s-ecosystem/pyspy.md
    Fixing doc/source/cluster/kubernetes/k8s-ecosystem/volcano.md
    Fixing doc/source/cluster/running-applications/job-submission/sdk.rst
    Fixing doc/source/cluster/running-applications/job-submission/ray-client.rst
    Fixing doc/source/cluster/kubernetes/troubleshooting/troubleshooting.md
    Fixing doc/source/cluster/kubernetes/getting-started/rayjob-quick-start.md
    Fixing doc/source/cluster/kubernetes/configs/ray-cluster.gpu.yaml
    Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster.with-fault-tolerance.yaml
    Fixing doc/source/cluster/kubernetes/examples/mnist-training-example.md
    Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster.tls.yaml
    Fixing doc/source/cluster/kubernetes/user-guides/gcp-gke-gpu-cluster.md
    Fixing doc/source/cluster/kubernetes/examples/distributed-checkpointing-with-gcsfuse.md
    Fixing doc/source/cluster/kubernetes/user-guides/gke-gcs-bucket.md
    Fixing doc/source/cluster/kubernetes/user-guides/logging.md
    Fixing doc/source/cluster/kubernetes/examples/text-summarizer-rayservice.md
    Fixing doc/source/cluster/kubernetes/examples/rayjob-batch-inference-example.md
    Fixing doc/source/cluster/metrics.md
    Fixing doc/source/cluster/kubernetes/k8s-ecosystem/kubeflow.md
    Fixing doc/source/cluster/kubernetes/k8s-ecosystem/kueue.md
    Fixing doc/source/cluster/kubernetes/examples/rayjob-kueue-priority-scheduling.md
    Fixing doc/source/cluster/faq.rst
    Fixing doc/source/cluster/running-applications/job-submission/openapi.yml
    Fixing doc/source/cluster/kubernetes/user-guides/configuring-autoscaling.md
    Fixing doc/source/cluster/kubernetes/getting-started/rayservice-quick-start.md
    Fixing doc/source/cluster/kubernetes/user-guides/static-ray-cluster-without-kuberay.md
    Fixing doc/source/cluster/kubernetes/user-guides/config.md
    Fixing doc/source/cluster/kubernetes/user-guides/pod-command.md
    
    fix end of files.........................................................Failed
    - hook id: end-of-file-fixer
    - exit code: 1
    - files were modified by this hook
    
    Fixing doc/source/cluster/kubernetes/images/rbac-clusterrole.svg
    Fixing doc/source/cluster/running-applications/job-submission/cli.rst
    Fixing doc/source/cluster/vms/user-guides/community/slurm.rst
    Fixing doc/source/cluster/kubernetes/benchmarks/memory-scalability-benchmark.md
    Fixing doc/source/cluster/images/ray-job-diagram.svg
    Fixing doc/source/cluster/kubernetes/user-guides/observability.md
    Fixing doc/source/cluster/kubernetes/examples/stable-diffusion-rayservice.md
    Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster-networkpolicy.yaml
    Fixing doc/source/cluster/kubernetes/images/rbac-role-one-namespace.svg
    Fixing doc/source/cluster/kubernetes/examples/mnist-training-example.md
    Fixing doc/source/cluster/cli.rst
    Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster.tls.yaml
    Fixing doc/source/cluster/kubernetes/user-guides/gcp-gke-gpu-cluster.md
    Fixing doc/source/cluster/kubernetes/user-guides/logging.md
    Fixing doc/source/cluster/kubernetes/examples/text-summarizer-rayservice.md
    Fixing doc/source/cluster/kubernetes/images/rbac-role-multi-namespaces.svg
    Fixing doc/source/cluster/kubernetes/images/kubeflow-architecture.svg
    Fixing doc/source/cluster/faq.rst
    Fixing doc/source/cluster/running-applications/job-submission/openapi.yml
    Fixing doc/source/cluster/kubernetes/images/AutoscalerOperator.svg
    
    check for added large files..............................................Passed
    check python ast.........................................................Passed
    check json...........................................(no files to check)Skipped
    check toml...........................................(no files to check)Skipped
    black....................................................................Passed
    flake8...................................................................Passed
    prettier.............................................(no files to check)Skipped
    mypy.................................................(no files to check)Skipped
    isort (python)...........................................................Passed
    rst directives end with two colons.......................................Passed
    rst ``inline code`` next to normal text..................................Passed
    use logger.warning(......................................................Passed
    check for not-real mock methods..........................................Passed
    ShellCheck v0.9.0........................................................Passed
    clang-format.........................................(no files to check)Skipped
    Google Java Formatter................................(no files to check)Skipped
    Check for Ray docstyle violations........................................Passed
    Check for Ray import order violations....................................Passed
    ```
    
    Signed-off-by: pdmurray <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    peytondmurray authored and ujjawal-khare committed Oct 15, 2024
    8438af2
  114. [RLlib] Examples folder cleanup: ModelV2 -> RLModule wrapper for migr…

    …ating to new API stack. (ray-project#47425)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    51be505
  115. [RLlib] Remove 2nd Learner ConnectorV2 pass from PPO (add new GAE Con…

    …nector piece). Fix: "State-connector" would use `seq_len=20`. (ray-project#47401)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    f2c5415
  116. [RLlib; Offline RL] CQL: Support multi-GPU/CPU setup and different le…

    …arning rates for actor, critic, and alpha. (ray-project#47402)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    simonsays1980 authored and ujjawal-khare committed Oct 15, 2024
    be99650
  117. [aDAG] Support multi-read of the same shm channel (ray-project#47311)

    If the same method of the same actor is bound to the same node (i.e., reads from the same shared memory channel), aDAG execution hangs. This PR adds support for this case by caching results read from the channel.
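    
    The caching idea can be sketched as a thin wrapper around a channel whose value may be consumed only once per write. This is a simplified model under assumptions, not the aDAG code: `SingleReadChannel`, `CachedReaderSketch`, and the `num_reads` parameter are hypothetical, and the toy channel raises where a real shared-memory channel would block.

```python
class SingleReadChannel:
    """Toy channel: each written value can be read exactly once."""

    def __init__(self) -> None:
        self._value = None
        self._ready = False

    def write(self, value) -> None:
        self._value, self._ready = value, True

    def read(self):
        if not self._ready:
            # A real channel would block here, which is how the hang arises.
            raise RuntimeError("no unread value in channel")
        self._ready = False
        return self._value


class CachedReaderSketch:
    """Reads the channel once per value and serves the remaining reads from a cache."""

    def __init__(self, channel: SingleReadChannel, num_reads: int) -> None:
        self._channel = channel
        self._num_reads = num_reads  # how many consumers read each value
        self._remaining = 0
        self._cached = None

    def read(self):
        if self._remaining == 0:
            self._cached = self._channel.read()
            self._remaining = self._num_reads
        self._remaining -= 1
        return self._cached
```

    Once all `num_reads` consumers have been served, the next `read()` goes back to the underlying channel for a fresh value.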
    
    Signed-off-by: ujjawal-khare <[email protected]>
    ruisearch42 authored and ujjawal-khare committed Oct 15, 2024
    ed38e38
  118. c13190c
  119. [serve] Fix broken microbenchmarks (ray-project#47430)

    With serve shutdown in between every microbenchmark, serve needs to be
    started with grpc options every time for the grpc microbenchmarks.
    
    ## Related issue number
    
    closes ray-project#47424
    
    ---------
    
    Signed-off-by: Cindy Zhang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    zcin authored and ujjawal-khare committed Oct 15, 2024
    8edb0e3
  120. [ADAG] Support tasks with multiple return values in aDAG (ray-project…

    …#47024)
    
    aDAG currently does not support multiple return values. We would like to
    add general support for multiple return values.
    
    This PR supports multiple returns by returning a separate
    `ClassMethodNode` for each return value of the tuple. It is an
    incremental change for `ClassMethodNode`, adding
    `_is_class_method_output`, `_class_method_call`, `_output_idx`.
    `_output_idx` is used to guide channel allocation and output writes.
    The user needs to specify `num_returns > 1` to hint multiple return values.
    The upstream task allocates a separate output channel for each return
    value. A downstream task reads from one of the output channels.
    
    ## What is done?
    
    We modify `ClassMethodNode` to handle two cases: a class method call,
    which is the original semantics (`self.is_class_method_call == True`),
    and a class method output, which is responsible for one of the multiple
    return values (`self.is_class_method_output == True`).
    
    We modify `WriterInterface` to support writes to multiple
    `output_channels` with `output_idxs`. If an output index is None, it
    means the complete return value is written to the output channel.
    Otherwise, the return value is a tuple and the index is used to extract
    the value to be written to the output channel.
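    
    The `output_idxs` rule described above (None means write the whole value, an integer means write that element of the returned tuple) could look roughly like the following. This is a sketch under assumptions, not the `WriterInterface` code: `ListChannel` and `write_outputs` are hypothetical names.

```python
from typing import Any, List, Optional


class ListChannel:
    """Toy output channel that records what was written to it."""

    def __init__(self) -> None:
        self.written: List[Any] = []

    def write(self, value: Any) -> None:
        self.written.append(value)


def write_outputs(
    value: Any,
    output_channels: List[ListChannel],
    output_idxs: List[Optional[int]],
) -> None:
    if len(output_channels) != len(output_idxs):
        raise ValueError("need exactly one output index per output channel")
    for channel, idx in zip(output_channels, output_idxs):
        # None: the complete return value goes to this channel.
        # int: the return value is a tuple and only element `idx` is written.
        channel.write(value if idx is None else value[idx])
```

    For a task declared with two return values, each downstream reader would then see only its own element, while a channel with index None still receives the full tuple.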
    
    We allocate separate output channels to different readers. The
    downstream tasks of a `ClassMethodNode` with
    `self.is_class_method_output == True` are the readers of an output
    channel of its upstream `ClassMethodNode`. The example below
    demonstrates this.
    
    ```
    upstream ClassMethodNode (self.is_class_method_call == True, self.output_channels = [c1, c2])
    --> downstream ClassMethodNode (self.is_class_method_output == True, self.output_channels[c1])
    --> ...
    ```
    
    Closes ray-project#45569
    
    ---------
    
    Signed-off-by: Weixin Deng <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    dengwxn authored and ujjawal-khare committed Oct 15, 2024
    150c8ba
  121. 8c990d8
  122. [RLlib] Add option to use torch.lr_scheduler classes for learning r…

    …ate schedules. (ray-project#47453)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    simonsays1980 authored and ujjawal-khare committed Oct 15, 2024
    105a904
  123. [observability][export-api] Write node events (ray-project#47422)

    Same code changes as [observability][export-api] Write node events ray-project#47221
    Move test into a separate file to create a separate bazel target that can be skipped on Windows
    
    Signed-off-by: ujjawal-khare <[email protected]>
    nikitavemuri authored and ujjawal-khare committed Oct 15, 2024
    9e0a00d
  124. 637c16c
  125. [RLlib] Examples folder cleanup: ModelV2 -> RLModule wrapper for migr…

    …ating to new API stack (by config). (ray-project#47427)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    68c117a
  126. [serve] add streaming to microbenchmarks (ray-project#47466)

    Add streaming microbenchmark to release tests. Only HTTP, intermediate
    router, and handle for now (no grpc).
    
    ---------
    
    Signed-off-by: Cindy Zhang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    zcin authored and ujjawal-khare committed Oct 15, 2024
    716f314
  127. feat: quickstart install button (ray-project#47479)

    ![CleanShot 2024-09-04 at 11 12 44@2x](https://github.com/user-attachments/assets/9c8dfd64-c565-4285-a1ce-774c6fce2997)
    
    Signed-off-by: Saihajpreet Singh <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    saihaj authored and ujjawal-khare committed Oct 15, 2024
    383b47a
  128. Revert "[Doc] Add Algolia search to docs" (ray-project#47483)

    Reverts ray-project#46477
    
    Signed-off-by: ujjawal-khare <[email protected]>
    can-anyscale authored and ujjawal-khare committed Oct 15, 2024
    dcb8d6d
  129. [release] simplify the process of getting job logs (ray-project#47470)

    The current logic to parse logs from an Anyscale job is very complicated.
    It first downloads all the logs from the cluster, then tries to guess the
    main job logs and the error job logs. The logic for getting the error job
    log is no longer necessary.
    
    The new API offers a much simpler way to get the log, so this PR updates
    to that API.
    
    Test:
    - CI
    - so much cleaner:
    https://buildkite.com/ray-project/release/builds/22057#0191ba75-2f0b-4a0b-9bad-8603003eba4c/741-742
    
    ---------
    
    Signed-off-by: can <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    can-anyscale authored and ujjawal-khare committed Oct 15, 2024
    3125db2
  130. [Core] Fix runtime env race condition when uploading the same package…

    … concurrently (ray-project#47482)
    
    Signed-off-by: Jiajun Yao <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    jjyao authored and ujjawal-khare committed Oct 15, 2024
    b42f473
  131. [core][dashboard] Pass in cluster ID in hex for dashboard, dash agent…

    …, rt env agent. (ray-project#47490)
    
    This saves 1 RPC for each GcsClient, which can be O(#nodes).
    
    Signed-off-by: Ruiyang Wang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    rynewang authored and ujjawal-khare committed Oct 15, 2024
    a4621ce
  132. [core][experimental] Correct num_input_consumers for CachedChannel (r…

    …ay-project#47489)
    
    Without this PR, the num_input_consumers would be 1 because both inp[0] and inp[1] are only referred to in one task on the actor, so CachedChannel will not be created. The read will eventually time out because the mutable object is being read by the same actor twice.
    
    Signed-off-by: ujjawal-khare <[email protected]>
    kevin85421 authored and ujjawal-khare committed Oct 15, 2024
    8a530a2
  133. Revert Revert "[Doc] Add Algolia search to docs" (ray-project#47487)

    Redo https://github.com/ray-project/ray/pull/47483/files. The previous
    PR was based on too old a base, so it merged successfully without
    re-compiling the dependencies.
    
    Also allow the dry-run of generating build cache to run on premerge, to
    block changes that can break it.
    
    Test:
    - CI
    
    Signed-off-by: can <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    can-anyscale authored and ujjawal-khare committed Oct 15, 2024
    98f6186
  134. [observability][export-api] Write actor events (ray-project#47303)

    Write actor events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false.
    Event write is called whenever a value in the actor event data schema is modified. Typically this occurs before writing ActorTableData to the GCS table or publishing the data for the dashboard
    
    Signed-off-by: ujjawal-khare <[email protected]>
    nikitavemuri authored and ujjawal-khare committed Oct 15, 2024
    0c75290
  135. [ADAG] Log Executable Task Events (ray-project#47345)

    Support logging events for executable tasks for better observability.
    
    Users can turn on event profiling by setting RAY_ADAG_ENABLE_PROFILING to True.
    
    The event tracks the following metadata of a task:
    
    Signed-off-by: ujjawal-khare <[email protected]>
    woshiyyya authored and ujjawal-khare committed Oct 15, 2024
    e14400f
  136. [Core] Fix test_runtime_env_working_dir_4 for Windows (ray-project#47505)
    
    Windows path needs to be escaped.
    
    Signed-off-by: Jiajun Yao <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    jjyao authored and ujjawal-khare committed Oct 15, 2024
    692f9df
  137. [observability][export-api] Write task events (ray-project#47193)

    Write task events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false.
    All tasks that are added to the task event buffer will be written to file. In addition, keep a dropped_status_events_for_export_ buffer which stores status events that were dropped from the buffer to send to GCS, and write these dropped events to file as well.
    The size of dropped_status_events_for_export_ is 10x larger than task_events_max_num_status_events_buffer_on_worker to prioritize recording data. The tradeoff here is memory on each worker, but this is a relatively small overhead, and it is unlikely the dropped events buffer will fill given the sink for export events (write to file) will succeed on each flush.
    Task events are converted to the export API proto and written to file in a separate thread, which runs this flush operation periodically (every second).
    Individual task events will be aggregated by task attempt before being written. This is consistent with the final event sent to GCS, and also helps reduce the number of events written to file.
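    
    Aggregating individual events by task attempt before writing can be pictured with the sketch below. The event shape (`task_id`, `attempt`, `state` keys) and the function name `aggregate_by_attempt` are assumptions made for illustration; the actual export schema and the C++ implementation are not shown in this commit message.

```python
from typing import Dict, List, Tuple


def aggregate_by_attempt(events: List[dict]) -> List[dict]:
    """Merge per-state events into one record per (task_id, attempt)."""
    merged: Dict[Tuple[str, int], dict] = {}
    for event in events:
        key = (event["task_id"], event["attempt"])
        record = merged.setdefault(
            key,
            {"task_id": event["task_id"], "attempt": event["attempt"], "states": []},
        )
        # Preserve the order in which state changes were observed.
        record["states"].append(event["state"])
    # Dicts keep insertion order, so records come out in first-seen order.
    return list(merged.values())
```

    Three state changes of one attempt thus become a single record, which is how aggregation would reduce the number of events written to file.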
    
    Signed-off-by: ujjawal-khare <[email protected]>
    nikitavemuri authored and ujjawal-khare committed Oct 15, 2024
    eca534a
  138. 184e293
  139. 950ad18
  140. fix quickstart image path (ray-project#47535)

    | Before | After |
    |--------|------|
    | ![CleanShot 2024-09-06 at 10 33 56@2x](https://github.com/user-attachments/assets/0b8dff77-3a7f-4bc7-b117-39fcd4edd69f) | ![CleanShot 2024-09-06 at 10 33 18@2x](https://github.com/user-attachments/assets/ef4c67ba-df95-48c9-8c70-273b75ed5296) |
    
    Signed-off-by: Saihajpreet Singh <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    saihaj authored and ujjawal-khare committed Oct 15, 2024
    246e395
  141. 1e4e4d0
  142. [aDAG] Allow custom NCCL group for aDAG (ray-project#47141)

    Allow custom NCCL group for aDAG so that we can reuse what the user already created.
    
    Marking NcclGroupInterface as DeveloperAPI for now. After validation by using it in vLLM we can change to alpha stability.
    
    vLLM prototype: vllm-project/vllm#7568
    
    Signed-off-by: ujjawal-khare <[email protected]>
    ruisearch42 authored and ujjawal-khare committed Oct 15, 2024
    4792e1d
  143. 8b89a9d
  144. [Core] Remove ray._raylet.check_health (ray-project#47526)

    Signed-off-by: Jiajun Yao <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    jjyao authored and ujjawal-khare committed Oct 15, 2024
    bb015e4
  145. [observability][export-api] Write actor events (ray-project#47529)

    - Add back code changes from [observability][export-api] Write actor events ray-project#47303
    - Separate out actor manager export event test into a separate file so we can skip on windows. Update BUILD rule so all tests in src/ray/gcs/gcs_server/test/export_api are skipped on windows
    
    Signed-off-by: Nikita Vemuri <[email protected]>
    Co-authored-by: Nikita Vemuri <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    2 people authored and ujjawal-khare committed Oct 15, 2024
    9cf02de
  146. [observability][export-api] Write task events (ray-project#47538)

    - Re-add code changes from [observability][export-api] Write task events ray-project#47193, which was previously reverted due to CI test linux://:task_event_buffer_test is consistently_failing ray-project#47519, CI test windows://:task_event_buffer_test is consistently_failing ray-project#47523 and CI test darwin://:task_event_buffer_test is consistently_failing ray-project#47525
    - Reproduced the failures locally and fixed the test in 07efa6f. The failure was due to a logical merge conflict (the previous PR wasn't rebased onto the latest master after other event PRs were merged).
    
    Signed-off-by: Nikita Vemuri <[email protected]>
    Co-authored-by: Nikita Vemuri <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    2 people authored and ujjawal-khare committed Oct 15, 2024
    6f4aaf6
  147. [RLlib; Offline RL] - Replace GAE in MARWILOfflinePreLearner with `…

    …GeneralAdvantageEstimation` connector in learner pipeline. (ray-project#47532)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    simonsays1980 authored and ujjawal-khare committed Oct 15, 2024
    8e61bab
  148. [data] Change fixture from shutdown_only to `ray_start_regular_shar…

    …ed` for `test_csv_read_filter_non_csv_file` (ray-project#47513)
    
    ## Why are these changes needed?
    ray-project#47467 appears to have broken some niche setup for this
    test. Changing the fixture from `shutdown_only` to
    `ray_start_regular_shared` gets the test passing again.
    
    ## Related issue number
    
    ## Checks
    
    - [x] I've signed off every commit(by using the -s flag, i.e., `git
    commit -s`) in this PR.
    - [x] I've run `scripts/format.sh` to lint the changes in this PR.
    - [ ] I've included any doc changes needed for
    https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I
    added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
               corresponding `.rst` file.
    - [x] I've made sure the tests are passing. Note that there might be a
    few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    - Testing Strategy
       - [x] Unit tests
       - [ ] Release tests
       - [ ] This PR is not tested :(
    
    Signed-off-by: Matthew Owen <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    omatthew98 authored and ujjawal-khare committed Oct 15, 2024
    9eef3b5
  149. Add perf metrics for 2.35.0 (ray-project#47283)

    ```
    REGRESSION 13.65%: client__get_calls (THROUGHPUT) regresses from 1119.7725751916082 to 966.9141307622872 in microbenchmark.json
    REGRESSION 9.23%: single_client_put_gigabytes (THROUGHPUT) regresses from 20.184014305625574 to 18.32083810818594 in microbenchmark.json
    REGRESSION 8.40%: multi_client_tasks_async (THROUGHPUT) regresses from 23311.858831941317 to 21353.682091539627 in microbenchmark.json
    REGRESSION 6.66%: 1_1_async_actor_calls_with_args_async (THROUGHPUT) regresses from 3038.941703794114 to 2836.601104413851 in microbenchmark.json
    REGRESSION 4.39%: 1_1_async_actor_calls_async (THROUGHPUT) regresses from 4456.606860484332 to 4261.050694056448 in microbenchmark.json
    REGRESSION 3.77%: actors_per_second (THROUGHPUT) regresses from 627.338335492887 to 603.6854672610009 in benchmarks/many_actors.json
    REGRESSION 3.47%: single_client_get_object_containing_10k_refs (THROUGHPUT) regresses from 13.679337230724197 to 13.204885454613315 in microbenchmark.json
    REGRESSION 3.38%: 1_1_actor_calls_sync (THROUGHPUT) regresses from 2055.7051275912527 to 1986.177233156469 in microbenchmark.json
    REGRESSION 2.44%: 1_1_actor_calls_concurrent (THROUGHPUT) regresses from 5167.9800954515 to 5041.760637338739 in microbenchmark.json
    REGRESSION 2.33%: placement_group_create/removal (THROUGHPUT) regresses from 824.4108502776797 to 805.1759941825478 in microbenchmark.json
    REGRESSION 1.64%: single_client_wait_1k_refs (THROUGHPUT) regresses from 5.485273551888224 to 5.39514490847805 in microbenchmark.json
    REGRESSION 1.28%: single_client_tasks_sync (THROUGHPUT) regresses from 986.5998779605792 to 973.959307673384 in microbenchmark.json
    REGRESSION 0.95%: pgs_per_second (THROUGHPUT) regresses from 22.249430148995714 to 22.037557767422825 in benchmarks/many_pgs.json
    REGRESSION 0.66%: n_n_actor_calls_async (THROUGHPUT) regresses from 26545.931713712664 to 26370.461840482538 in microbenchmark.json
    REGRESSION 0.53%: 1_1_actor_calls_async (THROUGHPUT) regresses from 9060.701663275304 to 9012.880467992636 in microbenchmark.json
    REGRESSION 0.28%: single_client_tasks_async (THROUGHPUT) regresses from 8011.455682416454 to 7988.9069673790045 in microbenchmark.json
    REGRESSION 0.19%: 1_1_async_actor_calls_sync (THROUGHPUT) regresses from 1486.2327104183764 to 1483.4703793760418 in microbenchmark.json
    REGRESSION 107.66%: dashboard_p95_latency_ms (LATENCY) regresses from 34.039 to 70.687 in benchmarks/many_nodes.json
    REGRESSION 30.19%: stage_0_time (LATENCY) regresses from 8.773437261581421 to 11.421970844268799 in stress_tests/stress_test_many_tasks.json
    REGRESSION 27.05%: dashboard_p50_latency_ms (LATENCY) regresses from 3.87 to 4.917 in benchmarks/many_nodes.json
    REGRESSION 9.72%: dashboard_p99_latency_ms (LATENCY) regresses from 119.573 to 131.198 in benchmarks/many_nodes.json
    REGRESSION 9.58%: stage_1_avg_iteration_time (LATENCY) regresses from 23.938837790489195 to 26.23279986381531 in stress_tests/stress_test_many_tasks.json
    REGRESSION 9.41%: stage_3_time (LATENCY) regresses from 3035.906775712967 to 3321.615835428238 in stress_tests/stress_test_many_tasks.json
    REGRESSION 6.37%: dashboard_p95_latency_ms (LATENCY) regresses from 3542.989 to 3768.817 in benchmarks/many_actors.json
    REGRESSION 4.93%: dashboard_p99_latency_ms (LATENCY) regresses from 358.789 to 376.468 in benchmarks/many_pgs.json
    REGRESSION 3.70%: time_to_broadcast_1073741824_bytes_to_50_nodes (LATENCY) regresses from 18.28579454300001 to 18.961532712000007 in scalability/object_store.json
    REGRESSION 3.56%: avg_pg_create_time_ms (LATENCY) regresses from 0.9371462897900398 to 0.9705077387385862 in stress_tests/stress_test_placement_group.json
    REGRESSION 3.24%: stage_2_avg_iteration_time (LATENCY) regresses from 61.69442081451416 to 63.694758081436156 in stress_tests/stress_test_many_tasks.json
    REGRESSION 2.07%: 10000_get_time (LATENCY) regresses from 23.411743029999997 to 23.896780481999997 in scalability/single_node.json
    REGRESSION 1.74%: dashboard_p50_latency_ms (LATENCY) regresses from 167.38 to 170.294 in benchmarks/many_tasks.json
    REGRESSION 1.51%: 1000000_queued_time (LATENCY) regresses from 186.319367591 to 189.12986922100004 in scalability/single_node.json
    REGRESSION 1.39%: avg_pg_remove_time_ms (LATENCY) regresses from 0.9081441951950084 to 0.9207600330309926 in stress_tests/stress_test_placement_group.json
    REGRESSION 0.59%: dashboard_p95_latency_ms (LATENCY) regresses from 12.055 to 12.126 in benchmarks/many_pgs.json
    ```
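
The percentages in the listing above are consistent with a relative change measured against the old value. A short sketch of that arithmetic follows; the formula is inferred from the numbers in the listing, not taken from the benchmark tooling, and the function name is hypothetical.

```python
def relative_change_pct(old, new):
    """Relative change of `new` against `old`, as a positive percentage."""
    return abs(new - old) / old * 100.0


# client__get_calls (THROUGHPUT): lower is worse.
throughput_regression = relative_change_pct(1119.7725751916082, 966.9141307622872)

# dashboard_p95_latency_ms in benchmarks/many_nodes.json (LATENCY): higher is worse.
latency_regression = relative_change_pct(34.039, 70.687)

print(f"{throughput_regression:.2f}% {latency_regression:.2f}%")
```

Both values round to the figures reported in the listing (13.65% and 107.66%).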
    
    Signed-off-by: kevin <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    khluu authored and ujjawal-khare committed Oct 15, 2024
    c475f45
  150. [Core] Reconstruct actor to run lineage reconstruction triggered acto…

    …r task (ray-project#47396)
    
    Currently if we need to rerun an actor task to recover a lost object but the actor is dead, the actor task will fail immediately. This PR allows the actor to be restarted (if it doesn't violate max_restarts) so that the actor task can run to recover lost objects.
    
    In terms of the state machine, we add a state transition from DEAD to RESTARTING.
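
A minimal sketch of the added transition, under stated assumptions: the enum, the function name, and the restart counter are hypothetical, and the real actor state machine has more states and conditions than shown here. Treating `max_restarts == -1` as unlimited mirrors Ray's documented convention for that option.

```python
from enum import Enum


class ActorState(Enum):
    ALIVE = "ALIVE"
    RESTARTING = "RESTARTING"
    DEAD = "DEAD"


def state_for_lineage_reconstruction(state, num_restarts, max_restarts):
    """Return the state an actor moves to when a lost object needs its task rerun.

    Hypothetical rule: a DEAD actor may go back to RESTARTING only while the
    restart budget is not exhausted; every other state is left unchanged.
    """
    has_budget = max_restarts == -1 or num_restarts < max_restarts
    if state is ActorState.DEAD and has_budget:
        return ActorState.RESTARTING
    return state
```

With a budget of one restart, an actor that has never restarted moves from DEAD to RESTARTING, while one that has already used its restart stays DEAD and the task still fails.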
    
    Signed-off-by: Jiajun Yao <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    jjyao authored and ujjawal-khare committed Oct 15, 2024
    03e5832
  151. [aDAG] support buffered input (ray-project#47272)

    Based on https://docs.google.com/document/d/1Ka_HFwPBNIY1u3kuroHOSZMEQ8AgwpYciZ4n08HJ0Xc/edit
    
    When there are many in-flight requests (pipelining inputs to the DAG), two problems occur.
    
    1. Input submitter timeout. InputSubmitter.write() waits until the buffer is read by downstream tasks. Since the timeout clock starts as soon as InputSubmitter.write() is called, later requests are likely to time out when there are many in-flight requests.
    2. Pipeline bubble. The output fetcher doesn't read the channel until CompiledDagRef.get is called. This means the upstream task (actor 2) is blocked until the driver calls .get, even though it could otherwise execute tasks.
    
    This PR solves both problems by providing multiple buffers per shm channel. Note that buffering is not supported for NCCL yet (we can add it when we overlap compute/comm).
    
    Main changes
    
    - Introduce BufferedSharedMemoryChannel, which allows creating multiple buffers (10 by default). Reads and writes are done in a round-robin manner.
    - When there are more in-flight requests than the buffer size, the DAG can still hit a timeout error. To make debugging easy and the behavior straightforward, we introduce the max_buffered_inputs_ argument. If more than max_buffered_inputs_ requests are submitted to the DAG without ray.get, it immediately raises an exception.
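
The round-robin buffering and the fail-fast limit can be sketched in plain Python. This is an illustration only: the class name, method names, and exception type are hypothetical, and the real channel uses shared memory rather than a Python list.

```python
class RoundRobinBuffers:
    """Toy model of a channel backed by a fixed ring of buffers."""

    def __init__(self, num_buffers=10):
        self._slots = [None] * num_buffers
        self._write_idx = 0
        self._read_idx = 0
        self._in_flight = 0  # values written but not yet read

    def write(self, value):
        # Fail immediately instead of blocking until a timeout fires.
        if self._in_flight == len(self._slots):
            raise RuntimeError("too many buffered inputs; read a result first")
        self._slots[self._write_idx] = value
        self._write_idx = (self._write_idx + 1) % len(self._slots)
        self._in_flight += 1

    def read(self):
        if self._in_flight == 0:
            raise RuntimeError("nothing to read")
        value = self._slots[self._read_idx]
        self._read_idx = (self._read_idx + 1) % len(self._slots)
        self._in_flight -= 1
        return value
```

Because the writer only fails once every slot is occupied, up to `num_buffers` inputs can be pipelined before any result is read, which is what removes the pipeline bubble in this toy model.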
    
    Signed-off-by: ujjawal-khare <[email protected]>
    rkooo567 authored and ujjawal-khare committed Oct 15, 2024
    7625128
  152. [aDAG] Clean up arg_to_consumers in _get_or_compile() (ray-project#47514)
    
    Clean up the code.
    
    Signed-off-by: ujjawal-khare <[email protected]>
    ruisearch42 authored and ujjawal-khare committed Oct 15, 2024
    cbe6687
  153. 53e641a
  154. [Core][aDag] Support multi node multi reader (ray-project#47480)

    This PR supports multiple readers across multiple nodes. It also adds tests verifying that the feature works with large gRPC payloads and buffer resizing.
    
    Multiple readers across multiple nodes didn't work because the code allowed registering only one remote reader reference on one specific node. This PR fixes the issue by allowing remote reader references to be registered on multiple nodes.
    
    Signed-off-by: ujjawal-khare <[email protected]>
    rkooo567 authored and ujjawal-khare committed Oct 15, 2024
    eb14e06
  155. Allow control of some serve configuration via env vars (ray-project#4…

    …7533)
    
    When a Serve app is launched, Serve starts up automatically. In
    certain environments like k8s, it can be difficult to preconfigure Serve (e.g.
    in the [ray-cluster helm
    chart](https://github.com/ray-project/kuberay/blob/master/helm-chart/ray-cluster/values.yaml)
    there is no ability to set the default Serve arguments).
    
    This means you need to either be explicit when you start serve, or if it
    starts up automatically you may need to shut it down, then restart it,
    which is inconvenient.
    
    Signed-off-by: Tim Paine <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    timkpaine authored and ujjawal-khare committed Oct 15, 2024
    50bd27a
  156. Update incremental build troubleshooting tip with style nits (ray-pro…

    …ject#47592)
    
    Style nits.
    
    Signed-off-by: angelinalg <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    angelinalg authored and ujjawal-khare committed Oct 15, 2024
    12afcd1
  157. [observability][export-api] Write driver job events (ray-project#47418)

    Write Driver Job events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false.
    The event write is called whenever a job table data value is modified. Typically this occurs before writing JobTableData to the GCS table.
    
    Signed-off-by: ujjawal-khare <[email protected]>
    nikitavemuri authored and ujjawal-khare committed Oct 15, 2024
    a0d3355
  158. [core][dashboard] push down job_or_submission_id to GCS. (ray-project…

    …#47492)
    
    The GCS API GetAllJobInfo serves Dashboard APIs, even when only one job is requested. This becomes slow when the number of jobs is high. This PR pushes the job filter down to GCS to reduce the Dashboard's workload.
    
    This API is somewhat unusual because the filter `job_or_submission_id` is either a job ID or a job_submission_id. We don't have an index on the latter, and some jobs don't have one. So we still GetAll from Redis, then filter by both IDs before making further RPC calls.
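
The "fetch everything, then filter by either ID" step can be sketched as follows. The record layout (`job_id`, optional `submission_id`) and the function name are assumptions made for illustration; the real filter runs inside GCS over JobTableData.

```python
def filter_jobs(jobs, job_or_submission_id=None):
    """Keep jobs whose job ID or submission ID matches the filter.

    Jobs without a submission ID simply never match on that field.
    With no filter, every job is returned.
    """
    if job_or_submission_id is None:
        return list(jobs)
    return [
        job
        for job in jobs
        if job["job_id"] == job_or_submission_id
        or job.get("submission_id") == job_or_submission_id
    ]


all_jobs = [
    {"job_id": "01000000", "submission_id": "raysubmit_abc"},
    {"job_id": "02000000"},  # a job with no submission ID
]
by_job_id = filter_jobs(all_jobs, "02000000")
by_submission_id = filter_jobs(all_jobs, "raysubmit_abc")
```

Filtering before the follow-up RPC calls means the expensive per-job work is done only for the matching jobs.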
    
    ---------
    
    Signed-off-by: Ruiyang Wang <[email protected]>
    Signed-off-by: Jiajun Yao <[email protected]>
    Co-authored-by: Jiajun Yao <[email protected]>
    Co-authored-by: Alexey Kudinkin <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    3 people authored and ujjawal-khare committed Oct 15, 2024
    f541305
  159. [Doc][KubeRay] Add description tables for RayCluster Status in the ob…

    …servability doc (ray-project#47462)
    
    Signed-off-by: Rueian <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    rueian authored and ujjawal-khare committed Oct 15, 2024
    54ce249
  160. [aDAG] Fix ranks ordering for custom NCCL group (ray-project#47594)

    The ranks should be in the order of the actors.
    
    Signed-off-by: ujjawal-khare <[email protected]>
    ruisearch42 authored and ujjawal-khare committed Oct 15, 2024
    a0fb580
  161. [RLlib] RLModule: InferenceOnlyAPI. (ray-project#47572)

    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    6e90110
  162. [Data] Remove _default_metadata_providers (ray-project#47575)

    _default_metadata_providers adds a layer of indirection.
    
    ---------
    
    Signed-off-by: Balaji Veeramani <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    bveeramani authored and ujjawal-khare committed Oct 15, 2024
    69ca5c5
  163. [Serve] Remove unused Serve constants (ray-project#47593)

    Went through all the constants in the file and remove the ones that's no
    
    Signed-off-by: Gene Su <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    GeneDer authored and ujjawal-khare committed Oct 15, 2024
    8f9236c
  164. Fix windows://:task_event_buffer_test (ray-project#47577)

    Move TestWriteTaskExportEvents to a separate file and skip it on Windows. This is OK for the export API feature because we currently don't support it on Windows (tests for other resource events written from GCS are also skipped on Windows).
    This test is failing in postmerge on Windows (CI test windows://:task_event_buffer_test is consistently_failing ray-project#47523). The teardown step fails with: unknown file: error: C++ exception with description "remove_all: The process cannot access the file because it is being used by another process.: "event_123"" thrown in TearDown().
    This is the same error raised by other tests that clean up created directories with remove_all() on Windows (e.g. //src/ray/util/tests:event_test). Those tests are also skipped on Windows.
    
    Signed-off-by: Nikita Vemuri <[email protected]>
    Co-authored-by: Nikita Vemuri <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    2 people authored and ujjawal-khare committed Oct 15, 2024
    bd0d6eb
  165. [RLlib] RLModule API: SelfSupervisedLossAPI for RLModules that brin…

    …g their own loss (algo independent). (ray-project#47581)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    6293a1f
  166. [GCS] Optimize GetAllJobInfo API for performance (ray-project#47530)

    Signed-off-by: liuxsh9 <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    liuxsh9 authored and ujjawal-khare committed Oct 15, 2024
    649148c
  167. [Serve] fix default serve logger behavior (ray-project#47600)

    Re: ray-project#47229
    
    The previous PR that set up the default Serve logger had some unexpected
    consequences. Combined with Serve's stdout redirect feature (when
    `RAY_SERVE_LOG_TO_STDERR=0` is set in the env), it set up the default Serve
    logger and redirected all stdout/stderr into Serve's log files instead of
    the console. As a result, the Anyscale platform was unable to detect
    that the ray start command was running successfully and could not start
    the cluster. This PR fixes this behavior by configuring Serve's
    default logger with only a stream handler and skipping the file handler
    altogether.
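
The idea of a default logger that has a stream handler but no file handler can be sketched with the standard `logging` module. The logger name, format string, and function name are assumptions for illustration and are not Serve's actual configuration code.

```python
import logging
import sys


def configure_stream_only_logger(name="sketch.serve"):
    """Attach a single stderr stream handler and no file handler."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger

    # FileHandler subclasses StreamHandler, so exclude it explicitly.
    has_stream_handler = any(
        isinstance(h, logging.StreamHandler)
        and not isinstance(h, logging.FileHandler)
        for h in logger.handlers
    )
    if not has_stream_handler:
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(levelname)s %(asctime)s %(name)s -- %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Calling the function twice leaves exactly one handler attached, so repeated setup does not duplicate console output, and nothing is written to a log file by default.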
    
    Signed-off-by: Gene Su <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    GeneDer authored and ujjawal-khare committed Oct 15, 2024
    92f0741
  168. [core] Make is_gpu, is_actor, root_detached_id fields late bind to wo…

    …rkers. (ray-project#47212)
    
    Signed-off-by: Ruiyang Wang <[email protected]>
    Signed-off-by: Jiajun Yao <[email protected]>
    Co-authored-by: Jiajun Yao <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    2 people authored and ujjawal-khare committed Oct 15, 2024
    7e0d054
  169. [core][adag] Separate the outputs of execute and execute_async to mul…

    …tiple refs or futures to allow clients to retrieve them one at a time (ray-project#46908) (ray-project#47305)
    
    ## Why are these changes needed?
    Currently, if `MultiOutputNode` is used to wrap a DAG's output, you get
    back a single `CompiledDAGRef` or `CompiledDAGFuture`, depending on
    whether `execute` or `execute_async` is invoked, that points to a list
    of all of the outputs. To retrieve one of the outputs, you have to get
    and deserialize all of them at the same time.
    
    This PR separates the output of `execute` and `execute_async` to a list
    of `CompiledDAGRef` or `CompiledDAGFuture` when the output is wrapped by
    `MultiOutputNode`. This is particularly useful for vLLM tensor
    parallelism. Since all shards return the same results, we only need to
    fetch result from one of the workers.
    
    Closes ray-project#46908.
    
    ---------
    
    Signed-off-by: jeffreyjeffreywang <[email protected]>
    Signed-off-by: Jeffrey Wang <[email protected]>
    Co-authored-by: jeffreyjeffreywang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    2 people authored and ujjawal-khare committed Oct 15, 2024
    7ec6491
  170. [serve] Faster detection of dead replicas (ray-project#47237)

    ## Why are these changes needed?
    
    Detect replica death earlier on handles/routers. Currently routers
    process replica death only if the actor death error is thrown during
    active probing or on a system message.
    1. Cover one more case: process replica death if the error is thrown
    _while_ a request was being processed on the replica.
    2. Improved handling: if the error is detected on the system message,
    meaning the router found out the replica is dead after assigning a
    request to that replica, retry the request.
    
    ### Performance evaluation
    (master results pulled from
    https://buildkite.com/ray-project/release/builds/21404#01917375-2b1e-4cba-9380-24e557a42a42)
    
    Latency:
    | metric | master | this PR | % change |
    | -- | -- | -- | -- |
    | http_p50_latency | 3.9672044999932154 | 3.9794859999986443 | 0.31 |
    | http_1mb_p50_latency | 4.283115999996312 | 4.1375990000034335 | -3.4 |
    | http_10mb_p50_latency | 8.212248500001351 | 8.056774499998198 | -1.89 |
    | grpc_p50_latency | 2.889802499964844 | 2.845889500008525 | -1.52 |
    | grpc_1mb_p50_latency | 6.320479999999407 | 9.85005449996379 | 55.84 |
    | grpc_10mb_p50_latency | 92.12763850001693 | 106.14903449999247 | 15.22 |
    | handle_p50_latency | 1.7775379999420693 | 1.6373455000575632 | -7.89 |
    | handle_1mb_p50_latency | 2.797253500034458 | 2.7225929999303844 | -2.67 |
    | handle_10mb_p50_latency | 11.619127000017215 | 11.39100950001648 | -1.96 |
    
    Throughput:
    | metric | master | this PR | % change |
    | -- | -- | -- | -- |
    | http_avg_rps | 359.14 | 357.81 | -0.37 |
    | http_100_max_ongoing_requests_avg_rps | 507.21 | 515.71 | 1.68 |
    | grpc_avg_rps | 506.16 | 485.92 | -4.0 |
    | grpc_100_max_ongoing_requests_avg_rps | 506.13 | 486.47 | -3.88 |
    | handle_avg_rps | 604.52 | 641.66 | 6.14 |
    | handle_100_max_ongoing_requests_avg_rps | 1003.45 | 1039.15 | 3.56 |
    
    Results: everything except the grpc results is within noise. The grpc
    results have always been relatively noisy (see below), so they are
    also within the noise that we've been seeing. There is also no reason
    why the changes in this PR would increase request latency only for grpc
    and not for http or handle, so IMO this is safe.
    ![Screenshot 2024-08-21 at 11 54 55 AM](https://github.com/user-attachments/assets/6c7caa40-ae3c-417b-a5bf-332e2d6ca378)
    
    ## Related issue number
    
    closes ray-project#47219
    
    ---------
    
    Signed-off-by: Cindy Zhang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    zcin authored and ujjawal-khare committed Oct 15, 2024
    2ba70de
  171. [spark] Improve Ray-on-spark fault tolerance in case of Spark executo…

    …r being down (e.g. spot instance termination) (ray-project#47493)
    
    Signed-off-by: Weichen Xu <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    WeichenXu123 authored and ujjawal-khare committed Oct 15, 2024
    f2ef047
  172. [serve] skip failure test on windows (ray-project#47630)

    Skip test_replica_actor_died on windows.
    
    Signed-off-by: Cindy Zhang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    zcin authored and ujjawal-khare committed Oct 15, 2024
    72b643b
  173. [serve] reorganize replica scheduler classes (ray-project#47615)

    ## Why are these changes needed?
    
    Pull replica scheduler and replica wrapper out from `common.py` into
    their own files.
    
    Signed-off-by: Cindy Zhang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    zcin authored and ujjawal-khare committed Oct 15, 2024
    7b136f9
  174. [Core] Remove code accidently got in (ray-project#47612)

    Idk how this was generated
    
    Signed-off-by: ujjawal-khare <[email protected]>
    rkooo567 authored and ujjawal-khare committed Oct 15, 2024
    309a86c
  175. [Core][aDAG] support multi readers in multi node when dag is created …

    …from an actor (ray-project#47601)
    
    Currently, when a DAG is created from an actor, we use a different mechanism than when it is created from a driver. In a driver we create a ProxyActor, whereas in an actor we just use the actor itself.
    
    This inconsistency is error-prone. For example, with multi-reader support across multiple nodes I found a deadlock: the driver actor needs to call ray.get(a.allocate_channel.remote()) for a downstream actor while the downstream actor calls ray.get(driver_actor.create_ref.remote()).
    
    This PR fixes the issue by making ProxyActor the default mechanism even when a DAG is created inside an actor.
    
    Signed-off-by: ujjawal-khare <[email protected]>
    rkooo567 authored and ujjawal-khare committed Oct 15, 2024
    b7b5c51
  176. [core] out of band serialization exception (ray-project#47544)

    - Introduce an env var to raise an exception when there's out-of-band serialization of an object ref.
    - Improve the error message for out-of-band serialization issues. There are 2 types of issues: 1. cloudpickle.dumps(ref). 2. implicit capture. See below for more details.
    - Update an anti-pattern doc.
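
The first kind of issue, pickling a reference directly, can be illustrated with a stand-in class and the standard `pickle` module. Everything named here is hypothetical: `RefSketch` is not Ray's `ObjectRef`, and `SKETCH_RAISE_ON_OUT_OF_BAND_REF` is an invented variable name, since the commit message does not state the real one.

```python
import os
import pickle

# Hypothetical variable name, chosen only for this sketch.
RAISE_ENV_VAR = "SKETCH_RAISE_ON_OUT_OF_BAND_REF"


class OutOfBandSerializationError(RuntimeError):
    """Raised when a reference is pickled outside the framework's own path."""


class RefSketch:
    """Stand-in for an object reference; not Ray's ObjectRef."""

    def __init__(self, object_id):
        self.object_id = object_id

    def __reduce__(self):
        # pickle calls this hook, so direct pickling can be detected here.
        if os.environ.get(RAISE_ENV_VAR) == "1":
            raise OutOfBandSerializationError(
                f"reference {self.object_id} was pickled directly"
            )
        return (RefSketch, (self.object_id,))
```

With the variable set to `1`, `pickle.dumps(RefSketch("a"))` raises the error; otherwise serialization proceeds as usual.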
    
    Signed-off-by: ujjawal-khare <[email protected]>
    rkooo567 authored and ujjawal-khare committed Oct 15, 2024
    a2b0cc3
  177. [core][experimental] Allocate a channel for each InputAttributeNode (r…

    …ay-project#47564)
    
    Change 1: Remove class DAGInputAdapter.
    
    Without this PR, the entire input data will be written to the channel, even if a reader only wants to retrieve partial input data via InputAttributeNode. Then, the entire input data will be read by the READ operation, and the partial input will be retrieved during the COMPUTE operation (code)
    In this PR, each InputAttributeNode has its own channel, and only the corresponding input data will be written to the channel. Therefore, we no longer need to use DAGInputAdapter to retrieve the partial input data during the COMPUTE operation.
    Change 2: If the DAG contains any InputAttributeNode, create a channel for each InputAttributeNode. Then, write the partial input data to the corresponding channel (code).
    
    Change 3: There are some if/else statements to handle InputNode and InputAttributeNode for creating CachedChannel. This PR unifies the logic because InputNode and different InputAttributeNode are no longer considered consumers of only one input channel. Each InputAttributeNode has its own channel.
    
    Change 4: Move RayDAGArgs from compiled_dag_node.py to common.py to avoid importing it inside _adapt.
    
    Without this, this PR is about 5% slower than the baseline in the case "Benchmark: single actor, no InputAttributeNode". With this change, the performance is almost the same as, or slightly better than, the baseline. See "Benchmark: single actor, no InputAttributeNode" below for more details.
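
The per-attribute channel idea described in Change 1 and Change 2 can be sketched in plain Python. This is only an illustration under stated assumptions: `ListChannel` and `write_partial_inputs` are hypothetical names, not Ray APIs, and the real channels are shared-memory objects rather than lists.

```python
from typing import Any, Dict, List


class ListChannel:
    """Hypothetical stand-in for a channel: it just records what was written."""

    def __init__(self) -> None:
        self.items: List[Any] = []

    def write(self, value: Any) -> None:
        self.items.append(value)


def write_partial_inputs(
    args: tuple, kwargs: Dict[str, Any], channels: Dict[Any, ListChannel]
) -> None:
    """Write only the requested attribute of the input to each channel.

    Integer keys select positional arguments and string keys select keyword
    arguments, so no reader ever receives the entire input.
    """
    for key, channel in channels.items():
        value = args[key] if isinstance(key, int) else kwargs[key]
        channel.write(value)


# One channel per requested attribute (first positional arg and "scale").
channels = {0: ListChannel(), "scale": ListChannel()}
write_partial_inputs(("payload", "unused"), {"scale": 2}, channels)
```

With this layout a reader of channel `0` sees only `"payload"`, so nothing has to be extracted from the full input later.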
    
    Signed-off-by: ujjawal-khare <[email protected]>
    kevin85421 authored and ujjawal-khare committed Oct 15, 2024
    ffa2d34
  178. [Data] Add partitioning parameter to read_parquet (ray-project#47553

    )
    
    To extract path partition information with `read_parquet`, you pass a
    PyArrow `partitioning` object to `dataset_kwargs`. For example:
    ```
    schema = pa.schema([("one", pa.int32()), ("two", pa.string())])
    partitioning = pa.dataset.partitioning(schema, flavor="hive")
    ds = ray.data.read_parquet(... dataset_kwargs=dict(partitioning=partitioning))
    ```
    
    This is problematic for two reasons:
    1. It tightly couples the interface with the implementation;
    partitioning only works if we use `pyarrow.Dataset` in a specific way in
    the implementation.
    2. It's inconsistent with all of the other file-based APIs. All other
    APIs expose a top-level `partitioning` parameter (rather than
    `dataset_kwargs`) where you pass a Ray Data `Partitioning` object
    (rather than a PyArrow partitioning object).
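
For context, a "hive" flavor encodes partition values in directory names as `key=value` segments. A minimal sketch of extracting them from a path, independent of Ray Data and PyArrow (the helper name and the example path are hypothetical; the field names `one` and `two` are borrowed from the schema in the snippet above):

```python
from typing import Dict


def parse_hive_partitions(path: str) -> Dict[str, str]:
    """Collect key=value directory segments from a file path."""
    partitions = {}
    for segment in path.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            key, value = segment.split("=", 1)
            partitions[key] = value
    return partitions


result = parse_hive_partitions("data/one=1/two=a/part-0.parquet")
```

The values come back as strings here; casting them to the schema types (for example `int32` for `one`) is left out of this sketch.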
    
    ---------
    
    Signed-off-by: Balaji Veeramani <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    bveeramani authored and ujjawal-khare committed Oct 15, 2024
    8839ad4
  179. [spark] Refine comment in Starting ray worker spark task (ray-project…

    …#47670)
    
    Signed-off-by: Weichen Xu <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    WeichenXu123 authored and ujjawal-khare committed Oct 15, 2024
    2af394f
  180. [Core][aDAG] Set buffer size to 1 for regression (ray-project#47639)

    There's a regression with buffer size 10. I am going to investigate, but for now I will revert to buffer size 1.
    With buffer size 1, the regression seems to be gone (https://buildkite.com/ray-project/release/builds/22594#0191ed4b-5477-45ff-be9e-6e098b5fbb3c). The cause is probably some sort of contention.
    
    Signed-off-by: ujjawal-khare <[email protected]>
    rkooo567 authored and ujjawal-khare committed Oct 15, 2024
    27c71b6
  181. Add perf metrics for 2.36.0 (ray-project#47574)

    ```
    REGRESSION 12.66%: single_client_get_object_containing_10k_refs (THROUGHPUT) regresses from 13.204885454613315 to 11.533423619760748 in microbenchmark.json
    REGRESSION 9.50%: client__1_1_actor_calls_sync (THROUGHPUT) regresses from 523.3469473257671 to 473.62862729568997 in microbenchmark.json
    REGRESSION 6.76%: multi_client_put_gigabytes (THROUGHPUT) regresses from 45.440179854469804 to 42.368678421213005 in microbenchmark.json
    REGRESSION 4.92%: 1_n_actor_calls_async (THROUGHPUT) regresses from 8803.178389859915 to 8370.014425096557 in microbenchmark.json
    REGRESSION 3.89%: n_n_actor_calls_with_arg_async (THROUGHPUT) regresses from 2748.863962184806 to 2641.837605625889 in microbenchmark.json
    REGRESSION 3.45%: client__1_1_actor_calls_async (THROUGHPUT) regresses from 1019.3028285821217 to 984.156036006501 in microbenchmark.json
    REGRESSION 3.06%: client__1_1_actor_calls_concurrent (THROUGHPUT) regresses from 1007.6444648899972 to 976.8103650114274 in microbenchmark.json
    REGRESSION 0.65%: placement_group_create/removal (THROUGHPUT) regresses from 805.1759941825478 to 799.9345402492929 in microbenchmark.json
    REGRESSION 0.33%: single_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 5273.203424794718 to 5255.898134426729 in microbenchmark.json
    REGRESSION 0.02%: 1_1_actor_calls_async (THROUGHPUT) regresses from 9012.880467992636 to 9011.034048587637 in microbenchmark.json
    REGRESSION 0.01%: client__put_gigabytes (THROUGHPUT) regresses from 0.13947664668408546 to 0.13945791828216536 in microbenchmark.json
    REGRESSION 0.00%: client__put_calls (THROUGHPUT) regresses from 806.1974515278531 to 806.172478450918 in microbenchmark.json
    REGRESSION 70.55%: dashboard_p50_latency_ms (LATENCY) regresses from 104.211 to 177.731 in benchmarks/many_actors.json
    REGRESSION 13.13%: time_to_broadcast_1073741824_bytes_to_50_nodes (LATENCY) regresses from 18.961532712000007 to 21.451945214000006 in scalability/object_store.json
    REGRESSION 4.50%: 3000_returns_time (LATENCY) regresses from 5.680022101000006 to 5.935367576000004 in scalability/single_node.json
    REGRESSION 3.96%: avg_iteration_time (LATENCY) regresses from 0.9740754842758179 to 1.012664566040039 in stress_tests/stress_test_dead_actors.json
    REGRESSION 2.75%: stage_2_avg_iteration_time (LATENCY) regresses from 63.694758081436156 to 65.44879236221314 in stress_tests/stress_test_many_tasks.json
    REGRESSION 1.66%: 10000_args_time (LATENCY) regresses from 17.328640389999997 to 17.61703060299999 in scalability/single_node.json
    REGRESSION 1.40%: stage_4_spread (LATENCY) regresses from 0.45063567085147194 to 0.4569625792772166 in stress_tests/stress_test_many_tasks.json
    REGRESSION 0.69%: dashboard_p50_latency_ms (LATENCY) regresses from 3.347 to 3.37 in benchmarks/many_pgs.json
    REGRESSION 0.19%: 10000_get_time (LATENCY) regresses from 23.896780481999997 to 23.942006032999984 in scalability/single_node.json
    ```
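
The percentages in the list above appear to be relative changes against the earlier value. A minimal sketch of that arithmetic, assuming a throughput regression is a drop and a latency regression is an increase (the function name is hypothetical, and this is not the release tooling itself):

```python
def regression_pct(old: float, new: float, higher_is_better: bool) -> float:
    """Relative regression in percent; positive means the metric got worse."""
    change = (old - new) if higher_is_better else (new - old)
    return 100.0 * change / old


# Values copied from the first THROUGHPUT and first LATENCY entries above.
throughput = regression_pct(
    13.204885454613315, 11.533423619760748, higher_is_better=True
)
latency = regression_pct(104.211, 177.731, higher_is_better=False)
print(f"{throughput:.2f}% {latency:.2f}%")
```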
    
    Signed-off-by: kevin <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    khluu authored and ujjawal-khare committed Oct 15, 2024
    b2ccedc
  182. [RLlib] Add "shuffle batch per epoch" option. (ray-project#47458)

    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    5c66fac
  183. 6c165c2
  184. [Core] Make JobSupervisor logs structured (ray-project#47699)

    Signed-off-by: Jiajun Yao <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    jjyao authored and ujjawal-khare committed Oct 15, 2024
    6439db3
  185. [serve] wrap obj ref in result wrapper in deployment response (ray-pr…

    …oject#47655)
    
    ## Why are these changes needed?
    
    Abstract `ray.ObjectRef` and `ray.ObjectRefGenerator` in a result
    wrapper that the deployment response can directly call into.
    
    ---------
    
    Signed-off-by: Cindy Zhang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    zcin authored and ujjawal-khare committed Oct 15, 2024
    3f63c45
  186. [Core] Fix broken dashboard worker page (ray-project#47714)

    Signed-off-by: Jiajun Yao <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    jjyao authored and ujjawal-khare committed Oct 15, 2024
    94b5e06
  187. [core][experimental] Remove unused attr CompiledDAG._type_hints (ray-…

    …project#47706)
    
    CompiledDAG._type_hints is not used.
    
    Signed-off-by: ujjawal-khare <[email protected]>
    kevin85421 authored and ujjawal-khare committed Oct 15, 2024
    47de542
  188. [Data] Re-phrase the streaming executor current usage string (ray-pro…

    …ject#47515)
    
    ## Why are these changes needed?
    
    The progress bar for Ray Data could still end up showing higher
    utilization than what the cluster currently has.
    ray-project#46729 was the first attempt to
    fix it, which addressed the issue in static clusters, but we still have
    that issue for clusters that autoscale. This change simply rephrases the
    string so it is less confusing.
    
    Before
    <img width="1249" alt="image"
    src="https://github.com/user-attachments/assets/049ea096-a87f-4767-ba04-6d00d7c2755d">
    
    After
    <img width="1248" alt="image"
    src="https://github.com/user-attachments/assets/cb74c0dc-1f33-4b22-b31c-e83df2a5d408">
    
    This comes from the fact that operators don't track the task state (and
    currently Ray Core does not even provide that API). This means Ray Data
    operators do not know whether a task is assigned to a node or not, so
    once the task is submitted to Ray it is marked active even if it is
    pending a node assignment. The dashboard does better here since it
    has extra information from the task.
    
    <img width="1493" alt="image"
    src="https://github.com/user-attachments/assets/9315b884-3e61-4b32-8400-7f76e15b6a4b">
    
    In the future we can look into adding a core API for remote state
    reporting and allowing operators to provide more detailed state (active,
    pending_scheduled, pending_node_assignment).
    
    ## Related issue number
    
    ## Checks
    
    - [ ] I've signed off every commit(by using the -s flag, i.e., `git
    commit -s`) in this PR.
    - [ ] I've run `scripts/format.sh` to lint the changes in this PR.
    - [ ] I've included any doc changes needed for
    https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I
    added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
               corresponding `.rst` file.
    - [ ] I've made sure the tests are passing. Note that there might be a
    few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    - Testing Strategy
       - [ ] Unit tests
       - [ ] Release tests
       - [ ] This PR is not tested :(
    
    ---------
    
    Signed-off-by: Sofian Hnaide <[email protected]>
    Co-authored-by: scottjlee <[email protected]>
    Co-authored-by: matthewdeng <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    3 people authored and ujjawal-khare committed Oct 15, 2024
    e9e4d7e
  189. [serve] improve tests (ray-project#47722)

    ## Why are these changes needed?
    
    - We can make some tests asynchronous instead of having to rely on
    `_to_object_ref`.
    - We can use `RayActorError` instead of `ActorDiedError`.
    
    Signed-off-by: Cindy Zhang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    zcin authored and ujjawal-khare committed Oct 15, 2024
    27ee9f1
  190. [Core] Add test case where there is dead node for /nodes?view=summary…

    … endpoint (ray-project#47727)
    
    Signed-off-by: Jiajun Yao <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    jjyao authored and ujjawal-khare committed Oct 15, 2024
    50016c0
  191. [Dashboard] Optimizing performance of Ray Dashboard (ray-project#47617)

    Signed-off-by: Alexey Kudinkin <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    alexeykudinkin authored and ujjawal-khare committed Oct 15, 2024
    4012314
  192. [core][aDAG] Fix a bug where multi arg + exception doesn't work (ray-…

    …project#47704)
    
    Currently, when there's an exception, there's only 1 return value, but multi ref assumes that the number of return values matches the number of output channels. This PR fixes the issue by duplicating the exception to match the number of output channels.
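
The fix described above can be illustrated with a small sketch. It assumes a task result is either a sequence with one value per output channel or a single exception object; `fan_out_result` is a hypothetical helper, not the function changed in the PR.

```python
from typing import Any, List


def fan_out_result(result: Any, num_output_channels: int) -> List[Any]:
    """Return one value per output channel.

    A normal multi-output result is already a sequence of the right length.
    An exception is a single object, so it is repeated for every channel.
    """
    if isinstance(result, Exception):
        return [result] * num_output_channels
    values = list(result)
    if len(values) != num_output_channels:
        raise ValueError("result does not match the number of output channels")
    return values


err = ValueError("task failed")
outputs = fan_out_result(err, 3)
```

Every reader then receives the same exception object instead of one channel getting the error and the others getting nothing.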
    
    Signed-off-by: ujjawal-khare <[email protected]>
    rkooo567 authored and ujjawal-khare committed Oct 15, 2024
    605221a
  193. [fake autoscaler] use check_call in fake multi node test utils (ray-p…

    …roject#47772)
    
    so that output is printed to logs
    
    and also use "sys.executable" rather than "python"
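
Both points can be shown with the standard library alone. This is a generic sketch, not the test utility's actual code; the child command is made up for illustration.

```python
import subprocess
import sys

# check_call raises CalledProcessError on a non-zero exit code, and the child
# inherits the parent's stdout/stderr unless they are redirected, so its
# output lands in the same logs. sys.executable is the interpreter running
# this script, which avoids picking up a different "python" from PATH.
exit_code = subprocess.check_call(
    [sys.executable, "-c", "print('hello from child')"]
)
```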
    
    Signed-off-by: Lonnie Liu <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    aslonnie authored and ujjawal-khare committed Oct 15, 2024
    2e739d8
  194. [RLlib] RLModule: Simplify defining custom distribution classes and a…

    …dd better defaults. (ray-project#47775)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    b413593
  195. [fake autoscaler] remove the redundant mkdir (ray-project#47786)

    - docker compose service volume short syntax uses bind (similar to `-v`)
    and will create the dir if it does not exist
    - the code was not mapping the dir to a host path, so it actually has no
    meaningful effect when it is running in a container, such as on CI
    
    Signed-off-by: Lonnie Liu <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    aslonnie authored and ujjawal-khare committed Oct 15, 2024
    fee22c2
  196. [Data] Simplify and consolidate progress bar outputs (ray-project#47692)

    ## Why are these changes needed?
    
    Currently, the progress bar is pretty verbose because it is very
    information dense. This PR:
    - Reorganizes progress output to group by relevant concepts and
    clarifies labels
    - Standardizes global and operator-level progress bar outputs
    - Removes the use of all emojis (poor rendering on some platforms /
    external logging systems)
    
    Progress bar before this PR:
    <img width="1403" alt="Screenshot at Sep 16 13-00-17"
    src="https://github.com/user-attachments/assets/4f459b77-06ba-4395-b883-e4c9ac8ca2ef">
    
    Progress bar after this PR:
    <img width="1502" alt="Screenshot at Sep 23 13-48-32"
    src="https://github.com/user-attachments/assets/0c0f8c94-9439-4fd4-ae1a-2857b3a87b59">
    
    Will follow up with a docs PR once we merge this change, so that I don't
    need to continuously modify the docs.
    
    In the future, we should restructure the way progress bars are
    grouped/tracked, so that we can tabulate the op-level progress bar
    outputs.
    
    ## Related issue number
    
    ## Checks
    
    - [x] I've signed off every commit(by using the -s flag, i.e., `git
    commit -s`) in this PR.
    - [x] I've run `scripts/format.sh` to lint the changes in this PR.
    - [ ] I've included any doc changes needed for
    https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I
    added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
               corresponding `.rst` file.
    - [x] I've made sure the tests are passing. Note that there might be a
    few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    - Testing Strategy
       - [ ] Unit tests
       - [ ] Release tests
       - [ ] This PR is not tested :(
    
    ---------
    
    Signed-off-by: Scott Lee <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    scottjlee authored and ujjawal-khare committed Oct 15, 2024
    c20e3b1
  197. Add perf metrics for 2.37.0 (ray-project#47791)

    for release perf checking.
    
    Signed-off-by: kevin <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    khluu authored and ujjawal-khare committed Oct 15, 2024
    e438357
  198. [docker] Update latest Docker dependencies for 2.36.0 release (ray-pr…

    …oject#47748)
    
    Created by release automation bot.
    
    Update with commit f298a75
    
    Signed-off-by: kevin <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    khluu authored and ujjawal-khare committed Oct 15, 2024
    8c73745
  199. [docker] Update latest Docker dependencies for 2.36.1 release (ray-pr…

    …oject#47801)
    
    Created by release automation bot.
    
    Update with commit 18b2d94
    
    Signed-off-by: kevin <[email protected]>
    Signed-off-by: Kevin H. Luu <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    khluu authored and ujjawal-khare committed Oct 15, 2024
    8b0a597
  200. [observability][export-api] Write submission job events (ray-project#…

    …47468)
    
    Add ExportEventLoggerAdapter, which will be used to write export events to file from Python code. Only a single ExportEventLoggerAdapter instance will exist per source type, so callers can create or get this instance using get_export_event_logger, which is thread safe.
    Write Submission Job export events to file from JobInfoStorageClient.put_info which is called to update the JobInfo data in the internal KV store.
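
The "one instance per source type behind a thread-safe getter" pattern can be sketched as follows. The class, function, and source-type names below are hypothetical placeholders, and the real adapter wraps a file logger rather than holding a string.

```python
import threading
from typing import Dict


class ExportLoggerSketch:
    """Hypothetical stand-in for a per-source-type logger adapter."""

    def __init__(self, source_type: str) -> None:
        self.source_type = source_type


_loggers: Dict[str, ExportLoggerSketch] = {}
_loggers_lock = threading.Lock()


def get_logger_sketch(source_type: str) -> ExportLoggerSketch:
    """Create the instance for a source type on first use, then reuse it."""
    with _loggers_lock:
        if source_type not in _loggers:
            _loggers[source_type] = ExportLoggerSketch(source_type)
        return _loggers[source_type]


first = get_logger_sketch("SUBMISSION_JOB")
second = get_logger_sketch("SUBMISSION_JOB")
other = get_logger_sketch("SOME_OTHER_SOURCE")
```

Holding the lock around both the check and the creation is what prevents two threads from each building their own instance for the same source type.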
    
    Signed-off-by: ujjawal-khare <[email protected]>
    nikitavemuri authored and ujjawal-khare committed Oct 15, 2024
    5681a4a
  201. Move export events to separate folder (ray-project#47747)

    Move export events from session_latest/logs/events to session_latest/logs/export_events
    Keeping both event types in the same folder doesn't cause any issue for Ray -- export event files are already filtered out for /events API in
    ray/python/ray/dashboard/modules/event/event_utils.py
    
    Line 22 in 1e48a03
    
     all_source_types = set(event_consts.EVENT_SOURCE_ALL)
    However, moving these to a separate folder would be better for existing downstream consumers, so they can avoid handling export events in the events folder if they turn the flag on.
    
    Signed-off-by: ujjawal-khare <[email protected]>
    nikitavemuri authored and ujjawal-khare committed Oct 15, 2024
    2b21a08
  202. [release] stream the full anyscale log to buildkite (ray-project#47808)

    Currently we only print the last 100 lines of the anyscale job log to buildkite.
    This PR removes that limit and prints everything instead. CC:
    @kouroshHakha
    
    Test:
    - CI
    
    Signed-off-by: can <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    can-anyscale authored and ujjawal-khare committed Oct 15, 2024
    a665c67
  203. 68bd111
  204. [docker] Update latest Docker dependencies for 2.37.0 release (ray-pr…

    …oject#47812)
    
    Created by release automation bot.
    
    Update with commit d2982b7
    
    Signed-off-by: kevin <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    khluu authored and ujjawal-khare committed Oct 15, 2024
    c432e5c
  205. [RLlib] Fix action masking example. (ray-project#47817)

    Signed-off-by: ujjawal-khare <[email protected]>
    simonsays1980 authored and ujjawal-khare committed Oct 15, 2024
    e5c2bd4
  206. [Core] Separate the attempt_number with the task_status in memory sum…

    …mary and object list (ray-project#47818)
    
    # Current status:
    * When we retrieve the information from GCS, the task_status as well as
    the attempts are in 2 fields and the task status is an enum.
    * Later during reconstruction, the 2 fields are combined into 1 and the
    number of attempts is added to the task_status field.
    * That's why when displaying the objects, the function isn't able to
    convert the string back to enum.
    
    # Proposed solution:
    * Instead of combining the 2 fields (task_status and attempt), we will
    keep the 2 fields and add an additional field (attempt_number) to the
    Object State
    * In this way, we will keep the task_status as enum and put the attempt
    number information in a different field
    # Changes in this PR:
    * Added the `attempt_number` in `ObjectState` and
    `task_attempt_number_counts` in `ObjectSummaryPerKey`
      * Added logic to populate the fields as proposed above
    * Updated the logic for the memory summary function to display the
    attempt number in a new column
      * Corresponding tests added as well
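
Why the combined field breaks the enum conversion, and how separate fields avoid it, can be sketched like this. The enum members, the dataclass, and the combined string format are all hypothetical; only the field names `task_status` and `attempt_number` come from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class TaskStatusSketch(Enum):
    """Hypothetical subset of task states, for illustration only."""

    PENDING = "PENDING"
    FINISHED = "FINISHED"


@dataclass
class ObjectStateSketch:
    task_status: TaskStatusSketch  # stays a plain enum value
    attempt_number: int  # retry count lives in its own field


# Folding the attempt into the status string breaks the enum lookup ...
combined = "FINISHED (attempt 2)"  # hypothetical combined format
try:
    TaskStatusSketch(combined)
    lookup_failed = False
except ValueError:
    lookup_failed = True

# ... whereas separate fields keep the status convertible.
state = ObjectStateSketch(TaskStatusSketch("FINISHED"), attempt_number=2)
```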
    
    Signed-off-by: Mengjin Yan <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    MengjinYan authored and ujjawal-khare committed Oct 15, 2024
    75bc8fe
  207. [RLlib; docs] New API stack migration guide. (ray-project#47779)

    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    8b67dc6
  208. [RLlib; new API stack by default] Switch on new API stack by default …

    …for SAC and DQN. (ray-project#47217)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    bcda013
  209. [Core] Fix a Typo in dict_to_state function parameter name (ray-proje…

    …ct#47822)
    
    Signed-off-by: Mengjin Yan <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    MengjinYan authored and ujjawal-khare committed Oct 15, 2024
    a335bdc
  210. [core] Introducing InstrumentedIOContextWithThread. (ray-project#47831)

    Previously we had several ad-hoc places to do a "thread and io_context"
    pattern: create a thread dedicated to an asio io_context, then the workload
    can post async tasks onto it. This leads to duplicated code: everywhere we
    create threads, we also implement stop and join.
    
    Introducing InstrumentedIOContextWithThread that does exactly this and
    replaces existing usages.
    
    Also fixes some absl::Time computations with best practice.
    
    This is refactoring. Should have no runtime difference.
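
The actual change is in C++ with asio. As a rough Python analog of the same "one thread dedicated to one event loop" pattern, here is a sketch using `asyncio`; the class name and its methods are hypothetical and do not mirror the C++ interface.

```python
import asyncio
import threading
from concurrent.futures import Future
from typing import Any, Callable


class LoopWithThread:
    """Own one event loop and the single thread that runs it."""

    def __init__(self, name: str) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(
            target=self._loop.run_forever, name=name, daemon=True
        )
        self._thread.start()

    def post(self, fn: Callable[[], Any]) -> Future:
        """Schedule fn on the loop thread and return a future for its result."""
        future: Future = Future()

        def run() -> None:
            try:
                future.set_result(fn())
            except Exception as exc:
                future.set_exception(exc)

        self._loop.call_soon_threadsafe(run)
        return future

    def stop(self) -> None:
        """Stop the loop, join the thread, and release loop resources."""
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


worker = LoopWithThread("io-worker")
thread_name = worker.post(lambda: threading.current_thread().name).result(timeout=5)
worker.stop()
```

Bundling start, post, stop, and join in one class is the point: callers no longer repeat the thread bookkeeping at every call site.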
    
    Signed-off-by: Ruiyang Wang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    rynewang authored and ujjawal-khare committed Oct 15, 2024
    62802bd
  211. [RLlib] Discontinue support for "hybrid" API stack (using RLModule + …

    …Learner, but still on RolloutWorker and Policy) (ray-project#46085)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    5e7601b
  212. [Core] Fix object reconstruction hang on arguments pending creation (r…

    …ay-project#47645)
    
    Signed-off-by: Jiajun Yao <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    jjyao authored and ujjawal-khare committed Oct 15, 2024
    4e54d89
  213. [core][experimental] Fix test_execution_schedule_gpu (ray-project#47753)

    Pass a GPU tensor to execute, but it gets converted into a CPU tensor. The issue may be related to ray-project#46440.
    
    Signed-off-by: ujjawal-khare <[email protected]>
    kevin85421 authored and ujjawal-khare committed Oct 15, 2024
    00cc0a5
  214. [core] Change many Ray ID logs to WithField. (ray-project#47844)

    Use structured logging by changing more `<< node_id` to use
    `.WithField(node_id)`. This is not intended to be a complete work, but
    it should cover most of the cases. We did the work for NodeID, WorkerID,
    ActorID, JobID, TaskID, PlacementGroupID.
    
    Some logs have multiple IDs. To avoid confusion, for these we only use
    WithField(object_id) and don't use WithField on either of the Node IDs.
    
    This PR should have no change on Ray other than logs.
    
    Signed-off-by: Ruiyang Wang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    rynewang authored and ujjawal-khare committed Oct 15, 2024
    f3d8e46
  215. [RLlib] Cleanup examples folder (vol 30): BC pretraining, then PPO fi…

    …netuning (new API stack with RLModule checkpoints). (ray-project#47838)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    80f4941
  216. [RLlib] MultiAgentEnv API enhancements (related to defining obs-/acti…

    …on spaces for agents). (ray-project#47830)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    f4a1d5c
  217. [RLlib] Add log-std clipping to 'MLPHead's. (ray-project#47827)

    Signed-off-by: ujjawal-khare <[email protected]>
    simonsays1980 authored and ujjawal-khare committed Oct 15, 2024
    554195d
  218. 041874d
  219. [kuberay] Update docs for KubeRay v1.2.2 (ray-project#47867)

    change kuberay helm and branch reference versions to v1.2.2
    
    Signed-off-by: ujjawal-khare <[email protected]>
    kevin85421 authored and ujjawal-khare committed Oct 15, 2024
    9900778
  220. [Arrow] Adding ArrowTensorTypeV2 to support tensors larger than 2Gb (

    …ray-project#47832)
    
    Currently, when using the tensor type in Ray Data, if a single tensor in a
    block grows above 2Gb (due to the use of signed `int32` offsets), this
    results in the following issue:
    
    ```
    pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
    ```
    
    Consequently, this change adds support for tensors of > 2Gb in size,
    while maintaining compatibility with existing datasets already using
    tensors.
    
    This is done by splitting `ArrowTensorType` in two:
    
    - `ArrowTensorType` (v1) remains intact
    - `ArrowTensorTypeV2` is rebased on Arrow's `LargeListType` and
    now uses `int64` offsets
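
The 2 GiB ceiling follows directly from the offset width. A small worked example, assuming for simplicity that offsets count bytes (as they would for single-byte elements) and using a made-up 3 GiB tensor:

```python
INT32_MAX = 2**31 - 1  # largest offset a signed 32-bit integer can hold
INT64_MAX = 2**63 - 1  # largest offset a signed 64-bit integer can hold

GIB = 2**30
tensor_bytes = 3 * GIB  # hypothetical 3 GiB tensor

fits_int32 = tensor_bytes <= INT32_MAX
fits_int64 = tensor_bytes <= INT64_MAX
print(INT32_MAX / GIB)  # just under 2.0
```

With `int32` offsets the 3 GiB tensor cannot be addressed, while `int64` offsets leave ample headroom.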
    
    ---------
    
    Signed-off-by: Peter Wang <[email protected]>
    Signed-off-by: Alexey Kudinkin <[email protected]>
    Co-authored-by: Peter Wang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    2 people authored and ujjawal-khare committed Oct 15, 2024
    4d35582
  221. [Core] Fix check failure: sync_reactors_.find(reactor->GetRemoteNodeI…

    …D()) == sync_reactors_.end() (ray-project#47861)
    
    Signed-off-by: Jiajun Yao <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    jjyao authored and ujjawal-khare committed Oct 15, 2024
    01e7634
  222. [RLlib] New API stack: (Multi)RLModule overhaul vol 01 (some preparat…

    …ory cleanups). (ray-project#47884)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    a7aba20
  223. [RLlib] New API stack: (Multi)RLModule overhaul vol 02 (VPG RLModule,…

    … Algo, and Learner example classes). (ray-project#47885)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    d2f8737
  224. [RLlib] New API stack: (Multi)RLModule overhaul vol 03 (Introduce gen…

    …eric `_forward` to further simplify the user experience). (ray-project#47889)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    b7e3789
  225. [RLlib] Remove Tf support on new API stack for PPO/IMPALA/APPO (only …

    …DreamerV3 on new API stack remains with tf now). (ray-project#47892)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    bbb59bb
  226. [core] Change debug_string from returning a string to streaming to an ostream. (ray-project#47893)
    
    We have a convenience function `debug_string` used in Ray logs: it
    prints printables (operator<<), containers, and pairs. However, it returns a
    std::string, which is then fed into RAY_LOG(). This makes a copy.
    
    This PR changes the signature to return a `DebugStringWrapper`, which holds
    a const reference to the argument and is printable for all already
    supported types. It additionally supports std::tuple.
    
    This should have only marginal perf benefits, since we typically don't
    call debug_string on very big data structures.
    
    Signed-off-by: Ruiyang Wang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    rynewang authored and ujjawal-khare committed Oct 15, 2024
    815a9e4
  227. [Serve / Jobs] Check if conda env exists before removing (ray-project#47922)
    
    ## Why are these changes needed?
    Fixes some failing/flaky unit tests, which fail with errors like:
    ```
    EnvironmentLocationNotFound: Not a conda environment: /opt/miniconda/envs/jobs-backwards-compatibility-cc452d926b8748a1ab6b4fbf6a6dba2b
    ```
    - TestBackwardsCompatibility.test_cli
    - test_failed_driver_exit_code
    
    Previously failing test now passes with this PR applied:
    https://buildkite.com/ray-project/postmerge/builds/6479#0192693b-1b8f-4dbc-a497-26d163b52c70/181-934
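
The guard named in the commit title can be sketched as follows. This is a minimal illustration and not the code from the PR: the helper names `list_conda_envs` and `remove_env_if_exists` are hypothetical, and the sketch assumes the `conda` CLI is available on `PATH`.

```python
import json
import os
import subprocess
from typing import Callable, List


def list_conda_envs() -> List[str]:
    """Return environment names, taken from the last path component.

    Assumption: named environments live in directories named after them.
    The base environment therefore shows up under its install directory name.
    """
    out = subprocess.check_output(["conda", "env", "list", "--json"], text=True)
    return [os.path.basename(path) for path in json.loads(out)["envs"]]


def remove_env_if_exists(
    name: str,
    list_envs: Callable[[], List[str]] = list_conda_envs,
    run: Callable[[List[str]], object] = subprocess.check_call,
) -> bool:
    """Remove the environment only if it exists; return True if it was removed."""
    if name not in list_envs():
        # Nothing to remove, so skip the command that would otherwise fail
        # with EnvironmentLocationNotFound.
        return False
    run(["conda", "env", "remove", "--name", name, "--yes"])
    return True
```

Passing `list_envs` and `run` as parameters keeps the check testable on machines without conda installed.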
    
    ## Related issue number
    
    ## Checks
    
    - [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
    - [x] I've run `scripts/format.sh` to lint the changes in this PR.
    - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
    - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    - Testing Strategy
       - [x] Unit tests
       - [ ] Release tests
       - [ ] This PR is not tested :(
    
    Signed-off-by: Scott Lee <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    scottjlee authored and ujjawal-khare committed Oct 15, 2024
    f07ef31
  228. [job] don't continue on test setup (ray-project#47927)

    When the conda env already exists, the test setup should just remove it
    and continue with the testing.
    
    Signed-off-by: Lonnie Liu <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    aslonnie authored and ujjawal-khare committed Oct 15, 2024
    ff382fa
  229. [core][experimental] Avoid false positives in deadlock detection (ray-project#47912)
    
    Signed-off-by: Kai-Hsun Chen <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    kevin85421 authored and ujjawal-khare committed Oct 15, 2024
    80a7ef7
  230. [serve] Stop scheduling task early when requests have been cancelled (ray-project#47847)
    
    In `fulfill_pending_requests`, there are two nested loops:
    - the outer loop greedily fulfills more requests so that if backoff
    doesn't occur, it's not necessary for new asyncio tasks to be started to
    fulfill each request
    - the inner loop handles backoff if replicas can't be found to fulfill
    the next request
    
    The outer loop will be stopped if there are enough tasks to handle all
    pending requests. However if all replicas are at max capacity, it's
    possible for the inner loop to continue to loop even when the task is no
    longer needed (e.g. when a request has been cancelled), because the
    inner loop simply continues to try to find an available replica without
    checking if the current task is even necessary.
    
    This PR makes sure that at the end of each iteration of the inner loop,
    it clears out requests in `pending_requests_to_fulfill` that have been
    cancelled, and then breaks out of the loop if there are enough tasks to
    handle the remaining requests.
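
The inner-loop check described above can be sketched roughly as follows. This is a simplified illustration, not the Serve scheduler itself: the function name `find_replica_with_backoff`, its parameters, and the use of bare `asyncio.Future` objects as pending requests are all assumptions made for the example.

```python
import asyncio
from collections import deque
from typing import Callable, Deque, Iterable, Optional


async def find_replica_with_backoff(
    pending: Deque[asyncio.Future],
    try_choose_replica: Callable[[], Optional[str]],
    backoff_s: Iterable[float],
    num_scheduling_tasks: int = 1,
) -> Optional[str]:
    """Retry until a replica is found or this task is no longer needed."""
    for delay in backoff_s:
        replica = try_choose_replica()
        if replica is not None:
            return replica

        # Drop requests that were cancelled (or otherwise completed)
        # while this task was looking for a replica.
        still_waiting = [fut for fut in pending if not fut.done()]
        pending.clear()
        pending.extend(still_waiting)

        # If there are more scheduling tasks than remaining requests,
        # this task is surplus and can stop instead of backing off again.
        if num_scheduling_tasks > len(pending):
            return None

        await asyncio.sleep(delay)
    return None
```

Without the pruning step, a task whose only request was cancelled would keep backing off for as long as every replica stays at capacity.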
    
    Tests:
    - Added a test for the scenario where a request is cancelled
    while the scheduler is trying to find an available replica
    - Also modified the tests in `test_pow_2_scheduler.py` so that the
    backoff sequence uses small values (1ms) and the test timeouts
    are also low (10ms), so that the unit tests run much faster (~5s now
    compared to ~30s before).
    
    ## Related issue number
    
    related: ray-project#47585
    
    ---------
    
    Signed-off-by: Cindy Zhang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    zcin authored and ujjawal-khare committed Oct 15, 2024
    71d5ad4
  231. [RLlib] New API stack: (Multi)RLModule overhaul vol 05 (deprecate Specs, SpecDict, TensorSpec). (ray-project#47915)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    c4d884b
  232. [RLlib; fault-tolerance] Fix spot node preemption problem (RLlib does not catch correct `ObjectLostError`). (ray-project#47940)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    a994eec
  233. [RLlib] New API stack: (Multi)RLModule overhaul vol 04 (deprecate RLModuleConfig; cleanups, DefaultModelConfig dataclass). (ray-project#47908)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    08fc41e
  234. [Core] Fix check failure RAY_CHECK(it != current_tasks_.end()); (ray-project#47659)
    
    Signed-off-by: Jiajun Yao <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    jjyao authored and ujjawal-khare committed Oct 15, 2024
    90160e6
  235. 269b9ad
  236. [core] Add more debug string types (ray-project#47928)

    Follow-up to ray-project#47893: adds more
    "blessed container types" to the debug string function.
    
    Signed-off-by: dentiny <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    dentiny authored and ujjawal-khare committed Oct 15, 2024
    7260fdf
  237. [deps] add grpcio-tools into anyscale dependencies (ray-project#47955)

    so that it participates in the dependency resolving process
    
    Signed-off-by: Lonnie Liu <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    aslonnie authored and ujjawal-khare committed Oct 15, 2024
    20e1cad
  238. [RLlib] Quick-fix for default RLModules in combination with a user-provided config-sub-dict (instead of a full `DefaultModelConfig`). (ray-project#47965)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    611b645
  239. [RLlib] Cleanup examples folder vol. 25: Remove some old API stack examples. (ray-project#47970)
    
    Signed-off-by: ujjawal-khare <[email protected]>
    sven1977 authored and ujjawal-khare committed Oct 15, 2024
    0dcefc0
  240. f218402
  241. [serve] Fix failing test pow 2 scheduler on windows (ray-project#47975)

    ## Why are these changes needed?
    
    Fix `test_pow_2_replica_scheduler.py` on Windows. The best guess is that
    asyncio is slower on Windows, so the shortened timeouts cause some tests
    to fail because tasks don't get a chance to start or finish
    executing.
    
    Failing tests on windows:
    - `test_multiple_queries_with_different_model_ids`
    - `test_queue_len_cache_replica_at_capacity_is_probed`
    - `test_queue_len_cache_background_probing`
    
    ## Related issue number
    
    Closes ray-project#47950
    
    Signed-off-by: Cindy Zhang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    zcin authored and ujjawal-khare committed Oct 15, 2024
    7a6cfe0
  242. [data] fix reading multiple parquet files with ragged ndarrays (ray-project#47961)
    
    ## Why are these changes needed?
    
    PyArrow infers the Parquet schema based only on the first file. This
    causes errors when reading multiple files with ragged ndarrays.
    
    This PR fixes the issue by not using the inferred schema for reading.
    
    ## Related issue number
    Fixes ray-project#47960
    
    ---------
    
    Signed-off-by: Hao Chen <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    raulchen authored and ujjawal-khare committed Oct 15, 2024
    e2f7c91
  243. [core] Decouple create worker vs pop worker request. (ray-project#47694)

    Before this PR, when you call PopWorker(), it finds an idle worker or
    creates a new one. If a new worker is created, it is associated with the
    request and can only be used by it.
    
    This PR decouples worker creation from worker-to-task assignment
    by adding an abstraction, PopWorkerRequest. Now, if a request triggers
    a worker creation, the request is put into a queue. When a worker becomes
    ready, that is, when PushWorker is called by either a newly started
    worker or a released worker, Ray matches it with the first fitting
    request in the queue. This reduces latency.
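
The queue-based matching described above can be sketched in a few lines. This is a toy model for illustration only; the real logic lives in Ray's C++ worker pool. The `WorkerPool` class, the `language` field used as the fitting criterion, and the `start_worker` hook are assumptions of this sketch.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Tuple


@dataclass
class PopWorkerRequest:
    language: str                    # hypothetical "fits this worker" criterion
    callback: Callable[[str], None]  # invoked with the assigned worker id


class WorkerPool:
    def __init__(self, start_worker: Callable[[str], None]) -> None:
        self._idle: List[Tuple[str, str]] = []  # (worker_id, language)
        self._pending: Deque[PopWorkerRequest] = deque()
        self._start_worker = start_worker

    def pop_worker(self, request: PopWorkerRequest) -> None:
        # Prefer an idle worker that fits the request.
        for i, (worker_id, language) in enumerate(self._idle):
            if language == request.language:
                del self._idle[i]
                request.callback(worker_id)
                return
        # Otherwise queue the request and trigger a worker start. The new
        # worker is not reserved for this particular request.
        self._pending.append(request)
        self._start_worker(request.language)

    def push_worker(self, worker_id: str, language: str) -> None:
        # Called for newly started and for released workers alike:
        # hand the worker to the first fitting request in the queue.
        for i, request in enumerate(self._pending):
            if request.language == language:
                del self._pending[i]
                request.callback(worker_id)
                return
        self._idle.append((worker_id, language))
```

In this model, a released worker can serve a queued request before the worker started on that request's behalf is ready, which is where the latency reduction would come from.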
    
    Later it can also be used to pre-start workers more meaningfully.
    
    Signed-off-by: Ruiyang Wang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    rynewang authored and ujjawal-khare committed Oct 15, 2024
    3eff78e
  244. [core] Add metrics for gcs jobs (ray-project#47793)

    This PR adds metrics for job states within job manager.
    
    In detail, a gauge stat is sent via the OpenCensus exporter, so that
    running Ray jobs can be tracked and alerts can be created later on.
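
As a rough illustration of the gauge idea, the sketch below reports the current number of running jobs on every state change. It is plain Python, not the GCS implementation: `JobStateTracker`, the metric name `running_jobs`, and the `record_gauge` callback standing in for an exporter are all hypothetical.

```python
from typing import Callable, List, Set, Tuple


class JobStateTracker:
    """Tracks running jobs and reports their count as a gauge value."""

    def __init__(self, record_gauge: Callable[[str, int], None]) -> None:
        self._running: Set[str] = set()
        self._record_gauge = record_gauge

    def on_job_started(self, job_id: str) -> None:
        self._running.add(job_id)
        self._report()

    def on_job_finished(self, job_id: str) -> None:
        self._running.discard(job_id)
        self._report()

    def _report(self) -> None:
        # A gauge carries the current value, not a running total.
        self._record_gauge("running_jobs", len(self._running))


# A fake exporter for tests: it simply remembers every reported sample.
samples: List[Tuple[str, int]] = []
tracker = JobStateTracker(lambda name, value: samples.append((name, value)))
tracker.on_job_started("job-1")
tracker.on_job_started("job-2")
tracker.on_job_finished("job-1")
```

Injecting the recording function is one way to get the kind of mock exporter that makes such a metric testable without a monitoring backend.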
    
    Fault tolerance is not considered: according to the
    [doc](https://docs.ray.io/en/latest/ray-core/fault_tolerance/gcs.html),
    state is reconstructed at restart.
    
    As for testing, the best way is to observe the metric via an OpenCensus
    backend (i.e. the Google monitoring dashboard), but that is not easy for
    open-source contributors. The alternative is a mock / fake exporter
    implementation, which I could not find in the code base.
    
    Signed-off-by: dentiny <[email protected]>
    Co-authored-by: Ruiyang Wang <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    2 people authored and ujjawal-khare committed Oct 15, 2024
    2597701
  245. upgrade grpcio version (ray-project#47982)

    to at least 1.66.1
    
    this is already being overwritten to 1.66.1+ during release tests
    
    Signed-off-by: Lonnie Liu <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    aslonnie authored and ujjawal-khare committed Oct 15, 2024
    b644b30
  246. [Feat][Core] Implement single file module for runtime_env (ray-project#47807)
    
    Supports single-file modules in the `py_modules` field of runtime_env.
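
A minimal configuration sketch of what this enables, assuming a hypothetical file path; the `ray.init` call is left commented out so the fragment stands on its own without a Ray installation.

```python
# Hypothetical path to a single .py file that should be importable on workers.
runtime_env = {"py_modules": ["/path/to/my_module.py"]}

# With Ray installed, the dict would be passed at startup, for example:
#   import ray
#   ray.init(runtime_env=runtime_env)
```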
    
    Signed-off-by: Chi-Sheng Liu <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    MortalHappiness authored and ujjawal-khare committed Oct 15, 2024
    eac1cb6
  247. [Chore][Core] Address PR 47807 comments (ray-project#48002)

    PR 47807 was auto-merged without applying the doc reviews, so this
    commit addresses them.
    
    Signed-off-by: Chi-Sheng Liu <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    MortalHappiness authored and ujjawal-khare committed Oct 15, 2024
    f84cd6b
  248. [core] Add thread check to job mgr callback (ray-project#48005)

    This PR follows up on the comment in
    ray-project#47793 (comment)
    and adds a thread check to the GCS job manager callback to make sure
    there is no concurrent access to data members.
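
The idea of a thread check can be illustrated with a small Python analogue. The actual change is in Ray's C++ GCS code; the `ThreadChecker` class below and its method name are assumptions of this sketch.

```python
import threading
from typing import Optional


class ThreadChecker:
    """Remembers the first calling thread and flags calls from any other."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._owner: Optional[int] = None

    def is_on_same_thread(self) -> bool:
        me = threading.get_ident()
        with self._lock:
            if self._owner is None:
                # The first caller becomes the owning thread.
                self._owner = me
            return self._owner == me


checker = ThreadChecker()


def on_job_finished(job_id: str) -> None:
    # Guard at the top of the callback: data members touched below are
    # only safe to use from the owning thread.
    assert checker.is_on_same_thread(), "callback ran on an unexpected thread"
```

A check like this does not add synchronization; it only turns an accidental cross-thread call into an immediate, visible failure.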
    
    Signed-off-by: dentiny <[email protected]>
    Signed-off-by: ujjawal-khare <[email protected]>
    dentiny authored and ujjawal-khare committed Oct 15, 2024
    5d7ab4b
  249. remove handler released

    Signed-off-by: ujjawal-khare <[email protected]>
    ujjawal-khare committed Oct 15, 2024
    d5193c9
  250. Merge branch 'fix/job-manager-logger' of github.com:ujjawal-khare-27/ray into fix/job-manager-logger
    ujjawal-khare committed Oct 15, 2024
    f68bfa7