Fix/job manager logger #48003

Based on https://docs.google.com/document/d/1Ka_HFwPBNIY1u3kuroHOSZMEQ8AgwpYciZ4n08HJ0Xc/edit

When there are many in-flight requests (pipelining inputs to the DAG), 2 problems occur:

1. Input submitter timeout. InputSubmitter.write() waits until the buffer is read by downstream tasks. Because the timeout clock starts as soon as InputSubmitter.write() is called, later requests are likely to time out when there are many in-flight requests.
2. Pipeline bubble. The output fetcher doesn't read the channel until CompiledDagRef.get is called. This means the upstream task (actor 2) is blocked until the driver calls .get, even though it could otherwise execute tasks.

This PR solves the problem by providing multiple buffers per shm channel. Note that buffering is not supported for NCCL yet (we can do it when we overlap compute/comm).

Main changes

- Introduce BufferedSharedMemoryChannel, which allows creating multiple buffers (10 by default). Reads and writes are done in a round-robin manner.
- When there are more in-flight requests than the buffer size, the DAG can still hit a timeout error. To make debugging easy and behavior straightforward, we introduce the max_buffered_inputs_ argument. If more than max_buffered_inputs_ requests are submitted to the DAG without ray.get, it immediately raises an exception.

Signed-off-by: ujjawal-khare <[email protected]>
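The round-robin buffering and the in-flight limit described above can be illustrated with a toy model. This is only a sketch in plain Python: `ToyBufferedChannel` and `ToyDag` are hypothetical names, no shared memory is involved, and where a real channel would block or time out this model simply raises.

```python
class ToyBufferedChannel:
    """Toy stand-in for a channel backed by several fixed buffers."""

    def __init__(self, num_buffers: int = 10):
        self._buffers = [None] * num_buffers
        self._full = [False] * num_buffers
        self._next_write = 0  # index of the buffer the next write uses
        self._next_read = 0   # index of the buffer the next read uses

    def write(self, value) -> None:
        idx = self._next_write
        if self._full[idx]:
            # A real channel would block here until the reader catches up.
            raise RuntimeError("all buffers are in use")
        self._buffers[idx] = value
        self._full[idx] = True
        self._next_write = (idx + 1) % len(self._buffers)

    def read(self):
        idx = self._next_read
        if not self._full[idx]:
            raise RuntimeError("nothing to read")
        value = self._buffers[idx]
        self._buffers[idx], self._full[idx] = None, False
        self._next_read = (idx + 1) % len(self._buffers)
        return value


class ToyDag:
    """Fails fast once too many inputs are submitted without a get()."""

    def __init__(self, num_buffers: int = 10, max_buffered_inputs: int = 10):
        self._channel = ToyBufferedChannel(num_buffers)
        self._max_buffered_inputs = max_buffered_inputs
        self._in_flight = 0

    def execute(self, value) -> None:
        if self._in_flight >= self._max_buffered_inputs:
            raise RuntimeError("too many in-flight requests; call get() first")
        self._channel.write(value)
        self._in_flight += 1

    def get(self):
        value = self._channel.read()
        self._in_flight -= 1
        return value
```

With three buffers, a fourth `execute` without an intervening `get` raises immediately instead of waiting for a timeout, and results still come back in submission order.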

) Clean up the code. Signed-off-by: ujjawal-khare <[email protected]>

Signed-off-by: ujjawal-khare <[email protected]>

This PR supports multiple readers across multiple nodes. It also adds tests verifying that the feature works with large gRPC payloads and buffer resizing. Multiple readers across multiple nodes didn't work because the code only allowed registering 1 remote reader reference on 1 specific node. This PR fixes the issue by allowing remote reader references to be registered on multiple nodes. Signed-off-by: ujjawal-khare <[email protected]>

…7533) When a Serve app is launched, Serve starts up automatically. In certain environments like k8s, it can be difficult to preconfigure Serve (e.g. in the [ray-cluster helm chart](https://github.com/ray-project/kuberay/blob/master/helm-chart/ray-cluster/values.yaml) there is no way to set the default Serve arguments). This means you either need to be explicit when you start Serve, or, if it starts up automatically, you may need to shut it down and restart it, which is inconvenient. Signed-off-by: Tim Paine <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…ject#47592) Style nits. Signed-off-by: angelinalg <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Write Driver Job events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false. Event write is called whenever a job table data value is modified. Typically this occurs before writing JobTableData to the GCS table Signed-off-by: ujjawal-khare <[email protected]>

…#47492) The GCS API GetAllJobInfo serves Dashboard APIs, even for requests about only 1 job. This becomes slow when the number of jobs is high. This PR pushes the job filter down to GCS to reduce Dashboard workload. This API is somewhat unusual because the filter `job_or_submission_id` is either a Job ID or a job_submission_id. We don't have an index on the latter, and some jobs don't have one. So we still GetAll from Redis, then filter by both IDs before doing more RPC calls. --------- Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: Jiajun Yao <[email protected]> Co-authored-by: Jiajun Yao <[email protected]> Co-authored-by: Alexey Kudinkin <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
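The "fetch everything, then filter by either ID" step can be sketched as follows. This is an illustration only: `JobRecord` and `filter_jobs` are hypothetical names, and the real filtering happens inside GCS rather than in Python.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobRecord:
    job_id: str
    submission_id: Optional[str] = None  # some jobs have no submission ID


def filter_jobs(
    all_jobs: List[JobRecord], job_or_submission_id: Optional[str]
) -> List[JobRecord]:
    """Keep jobs whose job ID or submission ID equals the filter value."""
    if job_or_submission_id is None:
        # No filter: behave like the unfiltered "get all" call.
        return list(all_jobs)
    return [
        job
        for job in all_jobs
        if job_or_submission_id in (job.job_id, job.submission_id)
    ]


jobs = [
    JobRecord("01000000"),
    JobRecord("02000000", submission_id="raysubmit_abc"),
]
```

Filtering before any follow-up RPCs means the expensive per-job work only runs for the jobs that match.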

…servability doc (ray-project#47462) Signed-off-by: Rueian <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

The ranks should be in the order of the actors. Signed-off-by: ujjawal-khare <[email protected]>

Signed-off-by: ujjawal-khare <[email protected]>

_default_metadata_providers adds a layer of indirection. --------- Signed-off-by: Balaji Veeramani <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Went through all the constants in the file and remove the ones that's no Signed-off-by: Gene Su <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Move TestWriteTaskExportEvents to a separate file and skip it on Windows. This is OK for the export API feature because we currently don't support it on Windows (tests for other resource events written from GCS are also skipped on Windows). This test is failing in postmerge on Windows (CI test windows://:task_event_buffer_test is consistently_failing ray-project#47523) with the following error in the teardown step: unknown file: error: C++ exception with description "remove_all: The process cannot access the file because it is being used by another process.: "event_123"" thrown in TearDown(). This is the same error raised by other tests that clean up created directories with remove_all() on Windows (e.g. //src/ray/util/tests:event_test). These tests are also skipped on Windows. Signed-off-by: Nikita Vemuri <[email protected]> Co-authored-by: Nikita Vemuri <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…g their own loss (algo independent). (ray-project#47581) Signed-off-by: ujjawal-khare <[email protected]>

Signed-off-by: liuxsh9 <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Re: ray-project#47229 The previous PR to set up the default Serve logger had an unexpected consequence. When combined with Serve's stdout redirect feature (when `RAY_SERVE_LOG_TO_STDERR=0` is set in the environment), it sets up the default Serve logger and redirects all stdout/stderr into Serve's log files instead of the console. As a result, the Anyscale platform was unable to detect that the ray start command was running successfully and could not start the cluster. This PR fixes this behavior by configuring Serve's default logger with only a stream handler and skipping the file handler altogether. Signed-off-by: Gene Su <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…rkers. (ray-project#47212) Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: Jiajun Yao <[email protected]> Co-authored-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…tiple refs or futures to allow clients to retrieve them one at a time (ray-project#46908) (ray-project#47305) ## Why are these changes needed? Currently, if `MultiOutputNode` is used to wrap a DAG's output, you get back a single `CompiledDAGRef` or `CompiledDAGFuture`, depending on whether `execute` or `execute_async` is invoked, that points to a list of all of the outputs. To retrieve one of the outputs, you have to get and deserialize all of them at the same time. This PR separates the output of `execute` and `execute_async` to a list of `CompiledDAGRef` or `CompiledDAGFuture` when the output is wrapped by `MultiOutputNode`. This is particularly useful for vLLM tensor parallelism. Since all shards return the same results, we only need to fetch result from one of the workers. Closes ray-project#46908. --------- Signed-off-by: jeffreyjeffreywang <[email protected]> Signed-off-by: Jeffrey Wang <[email protected]> Co-authored-by: jeffreyjeffreywang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

## Why are these changes needed?

Detect replica death earlier on handles/routers. Currently routers will process replica death if the actor death error is thrown during active probing or system message.

1. Cover one more case: process replica death if the error is thrown _while_ the request was being processed on the replica.
2. Improved handling: if the error is detected on the system message, meaning the router found out the replica is dead after assigning a request to that replica, retry the request.

### Performance evaluation

(master results pulled from https://buildkite.com/ray-project/release/builds/21404#01917375-2b1e-4cba-9380-24e557a42a42)

Latency:

| metric | master | this PR | % change |
| -- | -- | -- | -- |
| http_p50_latency | 3.9672044999932154 | 3.9794859999986443 | 0.31 |
| http_1mb_p50_latency | 4.283115999996312 | 4.1375990000034335 | -3.4 |
| http_10mb_p50_latency | 8.212248500001351 | 8.056774499998198 | -1.89 |
| grpc_p50_latency | 2.889802499964844 | 2.845889500008525 | -1.52 |
| grpc_1mb_p50_latency | 6.320479999999407 | 9.85005449996379 | 55.84 |
| grpc_10mb_p50_latency | 92.12763850001693 | 106.14903449999247 | 15.22 |
| handle_p50_latency | 1.7775379999420693 | 1.6373455000575632 | -7.89 |
| handle_1mb_p50_latency | 2.797253500034458 | 2.7225929999303844 | -2.67 |
| handle_10mb_p50_latency | 11.619127000017215 | 11.39100950001648 | -1.96 |

Throughput:

| metric | master | this PR | % change |
| -- | -- | -- | -- |
| http_avg_rps | 359.14 | 357.81 | -0.37 |
| http_100_max_ongoing_requests_avg_rps | 507.21 | 515.71 | 1.68 |
| grpc_avg_rps | 506.16 | 485.92 | -4.0 |
| grpc_100_max_ongoing_requests_avg_rps | 506.13 | 486.47 | -3.88 |
| handle_avg_rps | 604.52 | 641.66 | 6.14 |
| handle_100_max_ongoing_requests_avg_rps | 1003.45 | 1039.15 | 3.56 |

Results: everything except for the grpc results is within noise. As for the grpc results, they have always been relatively noisy (see below), so the results are actually also within the noise that we've been seeing. There is also no reason why latency for a request would only increase for grpc and not http or handle for the changes in this PR, so IMO this is safe.

![Screenshot 2024-08-21 at 11 54 55 AM](https://github.com/user-attachments/assets/6c7caa40-ae3c-417b-a5bf-332e2d6ca378)

## Related issue number

closes ray-project#47219

---------

Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…r being down (e.g. spot instance termination) (ray-project#47493) Signed-off-by: Weichen Xu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Skip test_replica_actor_died on windows. Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

## Why are these changes needed? Pull replica scheduler and replica wrapper out from `common.py` into their own files. Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Idk how this was generated Signed-off-by: ujjawal-khare <[email protected]>

…from an actor (ray-project#47601) Currently, when a DAG is created from an actor, we use a different mechanism than when it is created from a driver: in a driver we create a ProxyActor, whereas in an actor we just use the actor itself. This inconsistency is error-prone. For example, when we added support for multiple readers across multiple nodes, we hit a deadlock because the driver actor needs to call ray.get(a.allocate_channel.remote()) for a downstream actor while the downstream actor calls ray.get(driver_actor.create_ref.remote()). This PR fixes the issue by making ProxyActor the default mechanism even when a DAG is created inside an actor. Signed-off-by: ujjawal-khare <[email protected]>

Introduce an env var to raise an exception when there's out-of-band serialization of an object ref. Improve the error message for out-of-band serialization issues. There are 2 types of issues: 1. cloudpickle.dumps(ref). 2. Implicit capture. See below for more details. Update the anti-pattern doc. Signed-off-by: ujjawal-khare <[email protected]>

…ay-project#47564)

Change 1: Remove class DAGInputAdapter. Without this PR, the entire input data is written to the channel, even if a reader only wants to retrieve partial input data via InputAttributeNode. The entire input data is then read by the READ operation, and the partial input is retrieved during the COMPUTE operation (code). In this PR, each InputAttributeNode has its own channel, and only the corresponding input data is written to the channel. Therefore, we no longer need DAGInputAdapter to retrieve the partial input data during the COMPUTE operation.

Change 2: If the DAG contains any InputAttributeNode, create a channel for each InputAttributeNode. Then, write the partial input data to the corresponding channel (code).

Change 3: There are some if/else statements to handle InputNode and InputAttributeNode for creating CachedChannel. This PR unifies the logic because InputNode and different InputAttributeNode are no longer considered consumers of only one input channel. Each InputAttributeNode has its own channel.

Change 4: Move RayDAGArgs from compiled_dag_node.py to common.py to avoid importing it inside _adapt. Without this, this PR is about 5% slower than the baseline in the case "Benchmark: single actor, no InputAttributeNode". With this change, the performance is almost the same as, or slightly better than, the baseline. See "Benchmark: single actor, no InputAttributeNode" below for more details.

Signed-off-by: ujjawal-khare <[email protected]>
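The idea behind Changes 1 and 2 can be sketched as follows. This is an illustration only: `route_input_to_channels` is a hypothetical helper, plain lists stand in for channels, and integer keys are assumed to index positional arguments while string keys index keyword arguments.

```python
from typing import Any, Dict, Hashable, List, Tuple


def route_input_to_channels(
    args: Tuple[Any, ...],
    kwargs: Dict[str, Any],
    requested_keys: List[Hashable],
) -> Dict[Hashable, List[Any]]:
    """Write only the requested piece of the input to each per-key channel."""
    channels: Dict[Hashable, List[Any]] = {key: [] for key in requested_keys}
    for key in requested_keys:
        # Integer keys select a positional arg, other keys select a kwarg.
        source = args if isinstance(key, int) else kwargs
        channels[key].append(source[key])
    return channels


channels = route_input_to_channels(
    args=("small", "unused positional"),
    kwargs={"big": [0] * 1000},
    requested_keys=[0, "big"],
)
```

A reader of key `0` receives only `"small"` and never has to deserialize the large `"big"` payload, which is the cost the adapter-based approach paid on every read.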

) To extract path partition information with `read_parquet`, you pass a PyArrow `partitioning` object to `dataset_kwargs`. For example:

```
import pyarrow as pa
import pyarrow.dataset
import ray

schema = pa.schema([("one", pa.int32()), ("two", pa.string())])
partitioning = pa.dataset.partitioning(schema, flavor="hive")
ds = ray.data.read_parquet(..., dataset_kwargs=dict(partitioning=partitioning))
```

This is problematic for two reasons:

1. It tightly couples the interface with the implementation; partitioning only works if we use `pyarrow.Dataset` in a specific way in the implementation.
2. It's inconsistent with all of the other file-based APIs. All other APIs expose a top-level `partitioning` parameter (rather than `dataset_kwargs`) where you pass a Ray Data `Partitioning` object (rather than a PyArrow partitioning object).

---------

Signed-off-by: Balaji Veeramani <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…#47670) Signed-off-by: Weichen Xu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

There's a regression with buffer size 10. I am going to investigate, but I will revert it to buffer size 1 for now until further investigation. With buffer size 1, the regression seems to be gone https://buildkite.com/ray-project/release/builds/22594#0191ed4b-5477-45ff-be9e-6e098b5fbb3c. It is probably some sort of contention or something like that. Signed-off-by: ujjawal-khare <[email protected]>

```
REGRESSION 12.66%: single_client_get_object_containing_10k_refs (THROUGHPUT) regresses from 13.204885454613315 to 11.533423619760748 in microbenchmark.json
REGRESSION 9.50%: client__1_1_actor_calls_sync (THROUGHPUT) regresses from 523.3469473257671 to 473.62862729568997 in microbenchmark.json
REGRESSION 6.76%: multi_client_put_gigabytes (THROUGHPUT) regresses from 45.440179854469804 to 42.368678421213005 in microbenchmark.json
REGRESSION 4.92%: 1_n_actor_calls_async (THROUGHPUT) regresses from 8803.178389859915 to 8370.014425096557 in microbenchmark.json
REGRESSION 3.89%: n_n_actor_calls_with_arg_async (THROUGHPUT) regresses from 2748.863962184806 to 2641.837605625889 in microbenchmark.json
REGRESSION 3.45%: client__1_1_actor_calls_async (THROUGHPUT) regresses from 1019.3028285821217 to 984.156036006501 in microbenchmark.json
REGRESSION 3.06%: client__1_1_actor_calls_concurrent (THROUGHPUT) regresses from 1007.6444648899972 to 976.8103650114274 in microbenchmark.json
REGRESSION 0.65%: placement_group_create/removal (THROUGHPUT) regresses from 805.1759941825478 to 799.9345402492929 in microbenchmark.json
REGRESSION 0.33%: single_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 5273.203424794718 to 5255.898134426729 in microbenchmark.json
REGRESSION 0.02%: 1_1_actor_calls_async (THROUGHPUT) regresses from 9012.880467992636 to 9011.034048587637 in microbenchmark.json
REGRESSION 0.01%: client__put_gigabytes (THROUGHPUT) regresses from 0.13947664668408546 to 0.13945791828216536 in microbenchmark.json
REGRESSION 0.00%: client__put_calls (THROUGHPUT) regresses from 806.1974515278531 to 806.172478450918 in microbenchmark.json
REGRESSION 70.55%: dashboard_p50_latency_ms (LATENCY) regresses from 104.211 to 177.731 in benchmarks/many_actors.json
REGRESSION 13.13%: time_to_broadcast_1073741824_bytes_to_50_nodes (LATENCY) regresses from 18.961532712000007 to 21.451945214000006 in scalability/object_store.json
REGRESSION 4.50%: 3000_returns_time (LATENCY) regresses from 5.680022101000006 to 5.935367576000004 in scalability/single_node.json
REGRESSION 3.96%: avg_iteration_time (LATENCY) regresses from 0.9740754842758179 to 1.012664566040039 in stress_tests/stress_test_dead_actors.json
REGRESSION 2.75%: stage_2_avg_iteration_time (LATENCY) regresses from 63.694758081436156 to 65.44879236221314 in stress_tests/stress_test_many_tasks.json
REGRESSION 1.66%: 10000_args_time (LATENCY) regresses from 17.328640389999997 to 17.61703060299999 in scalability/single_node.json
REGRESSION 1.40%: stage_4_spread (LATENCY) regresses from 0.45063567085147194 to 0.4569625792772166 in stress_tests/stress_test_many_tasks.json
REGRESSION 0.69%: dashboard_p50_latency_ms (LATENCY) regresses from 3.347 to 3.37 in benchmarks/many_pgs.json
REGRESSION 0.19%: 10000_get_time (LATENCY) regresses from 23.896780481999997 to 23.942006032999984 in scalability/single_node.json
```

Signed-off-by: kevin <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
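The percentages in the report above are consistent with a relative change against the old value, with the sign flipped depending on whether higher is better (throughput) or lower is better (latency). The reporting script itself is not shown here, so `regression_percent` below is a hypothetical helper that merely reproduces the printed figures.

```python
def regression_percent(old: float, new: float, higher_is_better: bool) -> float:
    """Relative worsening of a metric, in percent of the old value."""
    # Throughput regresses when it drops; latency regresses when it rises.
    worsening = (old - new) if higher_is_better else (new - old)
    return round(100.0 * worsening / old, 2)


# First THROUGHPUT entry and first LATENCY entry from the report.
throughput_regression = regression_percent(
    13.204885454613315, 11.533423619760748, higher_is_better=True
)
latency_regression = regression_percent(104.211, 177.731, higher_is_better=False)
```

Under this reading, `throughput_regression` matches the reported 12.66% and `latency_regression` matches the reported 70.55%.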

Signed-off-by: ujjawal-khare <[email protected]>

Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…oject#47655) ## Why are these changes needed? Abstract `ray.ObjectRef` and `ray.ObjectRefGenerator` in a result wrapper that the deployment response can directly call into. --------- Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…project#47706) CompiledDAG._type_hints is not used. Signed-off-by: ujjawal-khare <[email protected]>

…ject#47515) ## Why are these changes needed? The progress bar for Ray Data could still end up showing higher utilization than what the cluster currently has. ray-project#46729 was the first attempt to fix it and addressed the issue in static clusters, but we still have the issue for clusters that autoscale. This change simply rephrases the string so it is less confusing. Before <img width="1249" alt="image" src="https://github.com/user-attachments/assets/049ea096-a87f-4767-ba04-6d00d7c2755d"> After <img width="1248" alt="image" src="https://github.com/user-attachments/assets/cb74c0dc-1f33-4b22-b31c-e83df2a5d408"> This comes from the fact that operators don't track the task state (and currently Ray Core does not even provide that API). This means Ray Data operators do not know whether a task is assigned to a node or not, so once the task is submitted to Ray it is marked active even if it is pending a node assignment. The dashboard does better here since it has extra information about the task. <img width="1493" alt="image" src="https://github.com/user-attachments/assets/9315b884-3e61-4b32-8400-7f76e15b6a4b"> In the future we can look into adding a core API for remote state reporting and allowing operators to provide more detailed state (active, pending_scheduled, pending_node_assignment). --------- Signed-off-by: Sofian Hnaide <[email protected]> Co-authored-by: scottjlee <[email protected]> Co-authored-by: matthewdeng <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

## Why are these changes needed? - We can make some tests asynchronous instead of having to rely on `_to_object_ref`. - we can use `RayActorError` instead of `ActorDiedError` Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

… endpoint (ray-project#47727) Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Signed-off-by: Alexey Kudinkin <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…project#47704) Currently, when there's an exception, there's only 1 return value, but multi ref assumes that the number of return values matches the number of output channels. This PR fixes the issue by duplicating the exception to match the number of output channels. Signed-off-by: ujjawal-khare <[email protected]>
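A minimal sketch of the duplication step, assuming a task either returns one value per output channel or a single exception object. `expand_task_result` is a hypothetical helper, not the function used in the compiled DAG code.

```python
from typing import Any, List


def expand_task_result(result: Any, num_output_channels: int) -> List[Any]:
    """Return exactly one item per output channel."""
    if isinstance(result, Exception):
        # A failure produces a single value, so repeat it for every channel
        # and let each reader observe the same error.
        return [result] * num_output_channels
    outputs = list(result)
    if len(outputs) != num_output_channels:
        raise ValueError(
            f"expected {num_output_channels} outputs, got {len(outputs)}"
        )
    return outputs


error = RuntimeError("task failed")
per_channel = expand_task_result(error, num_output_channels=3)
```

Each of the three readers then gets the same exception object instead of two of them waiting on values that never arrive.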

…roject#47772) so that output is printed to logs and also use "sys.executable" rather than "python" Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…dd better defaults. (ray-project#47775) Signed-off-by: ujjawal-khare <[email protected]>

- docker compose service volume short syntax uses bind (similar to `-v`) and will create the dir if it does not exist - the code was not mapping the dir to a host path, so it actually has no meaningful effect when it is running in a container, such as on CI Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

## Why are these changes needed? Currently, the progress bar is pretty verbose because it is very information dense. This PR: - Reorganizes progress output to group by relevant concepts and clarifies labels - Standardizes global and operator-level progress bar outputs - Removes the use of all emojis (poor rendering on some platforms / external logging systems) Progress bar before this PR: <img width="1403" alt="Screenshot at Sep 16 13-00-17" src="https://github.com/user-attachments/assets/4f459b77-06ba-4395-b883-e4c9ac8ca2ef"> Progress bar after this PR: <img width="1502" alt="Screenshot at Sep 23 13-48-32" src="https://github.com/user-attachments/assets/0c0f8c94-9439-4fd4-ae1a-2857b3a87b59"> Will follow up with a docs PR once we merge this change, so that I don't need to continuously modify the docs. In the future, we should restructure the way progress bars are grouped/tracked, so that we can tabulate the op-level progress bar outputs. ## Related issue number ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Scott Lee <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

for release perf checking. Signed-off-by: kevin <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Add `pyOpenSSL` dependency for Serve. And update test docker file to use ray[serve-grpc] dependencies. Signed-off-by: Gene Su <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…oject#47748) Created by release automation bot. Update with commit f298a75 Signed-off-by: kevin <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…oject#47801) Created by release automation bot. Update with commit 18b2d94 Signed-off-by: kevin <[email protected]> Signed-off-by: Kevin H. Luu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…47468) Add ExportEventLoggerAdapter which will be used to write export events to file from python files. Only a single ExportEventLoggerAdapter instance will exist per source type, so callers can create or get this instance using get_export_event_logger which is thread safe. Write Submission Job export events to file from JobInfoStorageClient.put_info which is called to update the JobInfo data in the internal KV store. Signed-off-by: ujjawal-khare <[email protected]>
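The "one instance per source type, created or fetched through a thread-safe getter" pattern can be sketched like this. It is an illustration only: `ToyExportLogger` and `get_toy_export_logger` are hypothetical stand-ins, events are kept in memory instead of being written to a file, and the source type strings are placeholders.

```python
import threading
from typing import Dict, List


class ToyExportLogger:
    """Collects events for one source type (in memory, for illustration)."""

    def __init__(self, source_type: str):
        self.source_type = source_type
        self.events: List[dict] = []

    def send_event(self, event: dict) -> None:
        self.events.append(event)


_loggers: Dict[str, ToyExportLogger] = {}
_loggers_lock = threading.Lock()


def get_toy_export_logger(source_type: str) -> ToyExportLogger:
    """Create the logger for source_type on first use, then reuse it."""
    # The lock makes check-then-create atomic, so two threads racing on the
    # first call cannot end up with two different instances.
    with _loggers_lock:
        if source_type not in _loggers:
            _loggers[source_type] = ToyExportLogger(source_type)
        return _loggers[source_type]


job_logger = get_toy_export_logger("EXPORT_SUBMISSION_JOB")
job_logger.send_event({"submission_id": "raysubmit_abc", "status": "RUNNING"})
```

Callers never construct the logger directly; they always go through the getter, which is what keeps the per-source-type instance unique.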

Move export events from session_latest/logs/events to session_latest/logs/export_events. Keeping both event types in the same folder doesn't cause any issue for Ray: export event files are already filtered out for the /events API in ray/python/ray/dashboard/modules/event/event_utils.py (line 22 in 1e48a03: all_source_types = set(event_consts.EVENT_SOURCE_ALL)). However, moving these to a separate folder is better for existing downstream consumers, who can then avoid handling export events in the events folder if they turn the flag on. Signed-off-by: ujjawal-khare <[email protected]>

@kouroshHakha

Currently we only print the last 100 lines of the anyscale job log to buildkite. This PR removes that limit and prints everything instead. CC: @kouroshHakha Test: - CI Signed-off-by: can <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Signed-off-by: ujjawal-khare <[email protected]>

…oject#47812) Created by release automation bot. Update with commit d2982b7 Signed-off-by: kevin <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Signed-off-by: ujjawal-khare <[email protected]>

…mary and object list (ray-project#47818)

# Current status:

* When we retrieve the information from GCS, the task_status and the attempts are in 2 fields, and the task status is an enum.
* Later, during reconstruction, the 2 fields are combined into 1 and the number of attempts is added to the task_status field.
* That's why, when displaying the objects, the function isn't able to convert the string back to an enum.

# Proposed solution:

* Instead of combining the 2 fields (task_status and attempt), we keep the 2 fields and add an additional field (attempt_number) in the Object State.
* This way, we keep the task_status as an enum and put the attempt number information in a different field.

# Changes in this PR:

* Added `attempt_number` in `ObjectState` and `task_attempt_number_counts` in `ObjectSummaryPerKey`
* Added logic to populate the fields as proposed above
* Updated the logic for the memory summary function to display the attempt number in a new column
* Corresponding tests added as well

Signed-off-by: Mengjin Yan <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
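Why the combined field breaks enum conversion, and how a separate field avoids it, can be shown with a small model. Everything here is hypothetical: `ToyTaskStatus` and `ToyObjectState` are stand-ins for the real state classes, and the combined string format is an assumption made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class ToyTaskStatus(Enum):
    PENDING_NODE_ASSIGNMENT = "PENDING_NODE_ASSIGNMENT"
    FINISHED = "FINISHED"


@dataclass
class ToyObjectState:
    object_id: str
    task_status: ToyTaskStatus  # stays a pure enum value
    attempt_number: int         # attempt info lives in its own field


def parse_status(raw: str) -> ToyTaskStatus:
    """Look up an enum member by name; raises KeyError for unknown names."""
    return ToyTaskStatus[raw]


# Separate fields: the status string converts back to the enum cleanly.
state = ToyObjectState("abc", parse_status("FINISHED"), attempt_number=2)

# Combined field (assumed format): the extra text is not a valid member name.
try:
    parse_status("FINISHED (attempt 2)")
    combined_parsed = True
except KeyError:
    combined_parsed = False
```

Keeping the attempt count out of the status string also lets a summary view render it as its own column without any string parsing.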

Signed-off-by: ujjawal-khare <[email protected]>

…for SAC and DQN. (ray-project#47217) Signed-off-by: ujjawal-khare <[email protected]>

…ct#47822) Signed-off-by: Mengjin Yan <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Previously we had several ad-hoc places to do a "thread and io_context" pattern: create a thread dedicated to an asio io_context, then workload can post async tasks onto it. This makes duplicate code: everywhere we create threads, implement stop and join. Introducing InstrumentedIOContextWithThread that does exactly this and replaces existing usages. Also fixes some absl::Time computations with best practice. This is refactoring. Should have no runtime difference. Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…Learner, but still on RolloutWorker and Policy) (ray-project#46085) Signed-off-by: ujjawal-khare <[email protected]>

…ay-project#47645) Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Pass a GPU tensor to execute, but it gets converted into a CPU tensor. The issue may be related to ray-project#46440. Signed-off-by: ujjawal-khare <[email protected]>

Use structured logging by changing more `<< node_id` to use `.WithField(node_id)`. This is not intended to be complete, but it should cover most of the cases. We did the work for NodeID, WorkerID, ActorID, JobID, TaskID, PlacementGroupID. Some logs have multiple IDs. To avoid confusion, for these we only use WithField(object_id) and don't use WithField on either of the Node IDs. This PR should have no effect on Ray other than logs. Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…netuning (new API stack with RLModule checkpoints). (ray-project#47838) Signed-off-by: ujjawal-khare <[email protected]>

…on spaces for agents). (ray-project#47830) Signed-off-by: ujjawal-khare <[email protected]>

Signed-off-by: ujjawal-khare <[email protected]>

change kuberay helm and branch reference versions to v1.2.2 Signed-off-by: ujjawal-khare <[email protected]>

…ray-project#47832) Currently, when using the tensor type in Ray Data, if a single tensor in a block grows above 2Gb (due to the use of signed `int32` offsets), this results in the following issue:

```
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
```

Consequently, this change adds support for tensors of > 4Gb in size, while maintaining compatibility with existing datasets already using tensors. This is done by splitting `ArrowTensorType` into 2:

- `ArrowTensorType` (v1) remains intact
- `ArrowTensorTypeV2` is rebased on Arrow's `LargeListType` and now uses `int64` offsets

---------

Signed-off-by: Peter Wang <[email protected]> Signed-off-by: Alexey Kudinkin <[email protected]> Co-authored-by: Peter Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
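The overflow can be reasoned about with plain arithmetic. The sketch below assumes that list offsets index elements of the flattened values, so the last offset equals the total element count of all tensors in a block; for a 1-byte dtype that puts the `int32` limit at roughly 2 GiB. The helper names are hypothetical and no Arrow code is involved.

```python
import math
from typing import Sequence, Tuple

INT32_MAX = 2**31 - 1  # largest offset a signed 32-bit list type can hold
INT64_MAX = 2**63 - 1  # largest offset a signed 64-bit (large list) type can hold


def last_offset(shapes: Sequence[Tuple[int, ...]]) -> int:
    """Total number of elements when all tensors are flattened back to back."""
    return sum(math.prod(shape) for shape in shapes)


def offsets_fit(shapes: Sequence[Tuple[int, ...]], max_offset: int) -> bool:
    return last_offset(shapes) <= max_offset


one_tensor = [(1024, 1024, 1024)]         # 2**30 elements
three_tensors = [(1024, 1024, 1024)] * 3  # 3 * 2**30 elements
```

One such tensor still fits in `int32` offsets, while three of them in the same block exceed `INT32_MAX` but stay far below `INT64_MAX`, which is the gap the 64-bit offsets close.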

…D()) == sync_reactors_.end() (ray-project#47861) Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…ory cleanups). (ray-project#47884) Signed-off-by: ujjawal-khare <[email protected]>

… Algo, and Learner example classes). (ray-project#47885) Signed-off-by: ujjawal-khare <[email protected]>

…eric `_forward` to further simplify the user experience). (ray-project#47889) Signed-off-by: ujjawal-khare <[email protected]>

…DreamerV3 on new API stack remains with tf now). (ray-project#47892) Signed-off-by: ujjawal-khare <[email protected]>

… ostream. (ray-project#47893) We have a convenience function `debug_string` used in Ray logs: it prints printables (operator<<), containers, and pairs. However, it returns a std::string which is fed into RAY_LOG(), which makes a copy. This PR changes the signature to return a `DebugStringWrapper`, which holds a const reference to the argument and is printable for all already supported types. It additionally supports std::tuple. This should only have marginal perf benefits since we typically don't debug_string a very big data structure. Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…#47922) ## Why are these changes needed? Fixes some failing/flaky unit tests, which fail with errors like: ``` EnvironmentLocationNotFound: Not a conda environment: /opt/miniconda/envs/jobs-backwards-compatibility-cc452d926b8748a1ab6b4fbf6a6dba2b ``` - TestBackwardsCompatibility.test_cli - test_failed_driver_exit_code Previously failing test now passes with this PR applied: https://buildkite.com/ray-project/postmerge/builds/6479#0192693b-1b8f-4dbc-a497-26d163b52c70/181-934 ## Related issue number ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: Scott Lee <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

When the conda env already exists, just remove it and continue testing. Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…-project#47912) Signed-off-by: Kai-Hsun Chen <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…ay-project#47847) In `fulfill_pending_requests`, there are two nested loops: - the outer loop greedily fulfills more requests so that if backoff doesn't occur, it's not necessary for new asyncio tasks to be started to fulfill each request - the inner loop handles backoff if replicas can't be found to fulfill the next request The outer loop will be stopped if there are enough tasks to handle all pending requests. However if all replicas are at max capacity, it's possible for the inner loop to continue to loop even when the task is no longer needed (e.g. when a request has been cancelled), because the inner loop simply continues to try to find an available replica without checking if the current task is even necessary. This PR makes sure that at the end of each iteration of the inner loop, it clears out requests in `pending_requests_to_fulfill` that have been cancelled, and then breaks out of the loop if there are enough tasks to handle the remaining requests. Tests: - Added a test that tests for the scenario where a request is cancelled while it's trying to find an available replica - Also modified the tests in `test_pow_2_scheduler.py` so that the backoff sequence is small values (1ms), and the timeouts in the tests are also low `10ms`, so that the unit tests run much faster (~5s now compared to ~30s before). ## Related issue number related: ray-project#47585 --------- Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
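
The two nested loops and the added cancellation check can be sketched roughly as follows. This is a simplified illustration, not the Serve implementation: `scheduling_task`, `drop_cancelled`, `try_choose_replica`, and the `state` dict are hypothetical names, and "this task is still needed" is assumed to mean that the number of running tasks does not exceed the number of pending requests.

```python
import asyncio
from collections import deque


def drop_cancelled(pending: deque) -> None:
    """Remove requests whose futures were cancelled while they were queued."""
    live = [fut for fut in pending if not fut.cancelled()]
    pending.clear()
    pending.extend(live)


async def scheduling_task(pending, state, try_choose_replica, backoff_s=0.001):
    """One scheduling task (hypothetical). state["num_tasks"] counts running tasks."""
    state["num_tasks"] += 1
    try:
        # Outer loop: greedily fulfill requests while this task is still needed.
        while pending and state["num_tasks"] <= len(pending):
            # Inner loop: back off until some replica has capacity.
            while (replica := try_choose_replica()) is None:
                await asyncio.sleep(backoff_s)
                # The fix: clear out cancelled requests after each iteration,
                # then stop if there are already enough tasks for the rest.
                drop_cancelled(pending)
                if state["num_tasks"] > len(pending):
                    return
            drop_cancelled(pending)
            if pending:
                pending.popleft().set_result(replica)
    finally:
        state["num_tasks"] -= 1


async def demo():
    loop = asyncio.get_running_loop()
    pending = deque([loop.create_future()])
    state = {"num_tasks": 0}
    # Every replica is at capacity, so the inner loop keeps backing off.
    task = asyncio.create_task(scheduling_task(pending, state, lambda: None))
    await asyncio.sleep(0.01)
    pending[0].cancel()  # the only request is cancelled while the task is waiting
    await asyncio.wait_for(task, timeout=1.0)  # the task exits instead of spinning
    return len(pending), state["num_tasks"]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Without the check inside the inner loop, the task in `demo()` would keep retrying for a request that no longer exists and `wait_for` would time out.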

…cs, SpecDict, TensorSpec). (ray-project#47915) Signed-off-by: ujjawal-khare <[email protected]>

… not catch correct `ObjectLostError`). (ray-project#47940) Signed-off-by: ujjawal-khare <[email protected]>

…oduleConfig; cleanups, DefaultModelConfig dataclass). (ray-project#47908) Signed-off-by: ujjawal-khare <[email protected]>

…project#47659) Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…'. (ray-project#47914) Signed-off-by: ujjawal-khare <[email protected]>

Followup on ray-project#47893, add more "blessed container types" to debug string function. Signed-off-by: dentiny <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

so that it participates in the dependency resolving process Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…ovided config-sub-dict (instead of a full `DefaultModelConfig`). (ray-project#47965) Signed-off-by: ujjawal-khare <[email protected]>

…amples. (ray-project#47970) Signed-off-by: ujjawal-khare <[email protected]>

…ject#47973) Signed-off-by: ujjawal-khare <[email protected]>

## Why are these changes needed? Fix `test_pow_2_replica_scheduler.py` on windows. Best guess is asyncio is slower on windows, so the shortened timeouts for some tests cause the tests to fail because tasks didn't get a chance to start/finish executing. Failing tests on windows: - `test_multiple_queries_with_different_model_ids` - `test_queue_len_cache_replica_at_capacity_is_probed` - `test_queue_len_cache_background_probing` ## Related issue number Closes ray-project#47950 Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…roject#47961) ## Why are these changes needed? PyArrow infers parquet schema only based on the first file. This will cause errors when reading multiple files with ragged ndarrays. This PR fixes this issue by not using the inferred schema for reading.  ## Related issue number Fixes ray-project#47960 --------- Signed-off-by: Hao Chen <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Currently, when you call PopWorker(), it finds an idle worker or creates one. If a new worker is created, the worker is associated with the request and can only be used by it. This PR decouples worker creation from worker-to-task assignment by adding an abstraction, PopWorkerRequest. Now, if a request triggers a worker creation, the request is put into a queue. When a worker becomes ready, that is, when PushWorker is called for either a newly started worker or a released worker, Ray matches it to the first fitting request in the queue. This reduces latency. Later it can also be used to pre-start workers more meaningfully. Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
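
A rough Python sketch of the queue-based matching described above (the real worker pool is C++). The class layout, the `language` matching key, and `started_workers` are assumptions made only for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Tuple


@dataclass
class PopWorkerRequest:
    language: str                    # hypothetical matching criterion
    callback: Callable[[str], None]  # invoked with the assigned worker id


class WorkerPool:
    def __init__(self) -> None:
        self.idle_workers: List[Tuple[str, str]] = []  # (worker_id, language)
        self.pending: Deque[PopWorkerRequest] = deque()
        self.started_workers: List[str] = []  # languages of workers being started

    def pop_worker(self, request: PopWorkerRequest) -> None:
        # Use an idle worker if one fits.
        for i, (worker_id, language) in enumerate(self.idle_workers):
            if language == request.language:
                del self.idle_workers[i]
                request.callback(worker_id)
                return
        # Otherwise queue the request and start a worker. The new worker is
        # NOT bound to this request.
        self.pending.append(request)
        self.started_workers.append(request.language)

    def push_worker(self, worker_id: str, language: str) -> None:
        # Called for newly started and released workers alike: hand the
        # worker to the first fitting request in the queue.
        for i, request in enumerate(self.pending):
            if request.language == language:
                del self.pending[i]
                request.callback(worker_id)
                return
        self.idle_workers.append((worker_id, language))
```

With this shape, a released worker can satisfy a queued request before the worker that the request triggered has finished starting, which is where the latency reduction would come from.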

This PR adds metrics for job states within the job manager. In detail, a gauge stat is sent via the opencensus exporter, so running Ray jobs can be tracked and alerts can be created later on. Fault tolerance is not considered; according to the [doc](https://docs.ray.io/en/latest/ray-core/fault_tolerance/gcs.html), state is reconstructed at restart. On testing: the best way is to observe via an opencensus backend (i.e. Google monitoring dashboard), but that is not easy for open-source contributors; the alternative is a mock / fake exporter implementation, which I couldn't find in the code base. Signed-off-by: dentiny <[email protected]> Co-authored-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

to at least 1.66.1. This is already being overwritten to 1.66.1+ during release tests. Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

PR 47807 was auto-merged without applying the doc reviews, so this commit addresses them. Signed-off-by: Chi-Sheng Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

This PR follows up on the comment in ray-project#47793 (comment) and adds a thread check to the GCS job manager callback to make sure there is no concurrent access to data members. Signed-off-by: dentiny <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Signed-off-by: ujjawal-khare <[email protected]>

This PR fixes part of the problem by creating the payload message once and reusing it throughout the benchmark. Ran the release test on this change ([build](https://buildkite.com/ray-project/release/builds/21663#01918fe1-853b-46f2-9699-c4045b182b8c)); `grpc_10mb_p50_latency` has now dropped to ~58ms from ~80ms previously. The rest of the issue comes from the existing gRPC server implementation, which has to wait for the entirety of the unary request before it can continue its work on the replica. We will need to create a new HTTP2 proxy and pass the request transparently between the replica and the proxy to speed things up. Will follow up in the future on ray-project#47370 Closes ray-project#47371 Signed-off-by: Gene Su <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Write node events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false. Event write is called whenever a value in the node event data schema is modified. Typically this occurs in the callback after writing NodeTable to the GCS table Signed-off-by: ujjawal-khare <[email protected]>

…d new example script. (ray-project#47362) Signed-off-by: ujjawal-khare <[email protected]>

Like actor_head.py, we now update DataSource.nodes on delta. It first queries all node infos, then subscribes to node deltas. Each delta updates: 1. DataSource.nodes[node_id] 2. DataSource.agents[node_id] 3. a warning generated after RAY_DASHBOARD_HEAD_NODE_REGISTRATION_TIMEOUT = 10s Note on (2) agents: it's read from internal kv, and is not readily available until agent.py is spawned and writes its own port to internal kv. So we make an async task for each node to poll this port every 1s. It turns out that the get-all-then-subscribe code has a TOCTOU problem, so this PR also updates actor_head.py to first subscribe and then get all actors. Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
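
To illustrate the TOCTOU problem mentioned above, here is a minimal synchronous sketch. `FakeGcs` and both helper functions are hypothetical stand-ins rather than dashboard code; `between` simulates an update that lands in the gap between the two calls.

```python
from typing import Callable, Dict, List


class FakeGcs:
    """Hypothetical stand-in for a store that supports get-all and subscribe."""

    def __init__(self) -> None:
        self.nodes: Dict[str, str] = {}
        self.subscribers: List[Callable[[str, str], None]] = []

    def update(self, node_id: str, state: str) -> None:
        self.nodes[node_id] = state
        for callback in self.subscribers:
            callback(node_id, state)

    def subscribe(self, callback: Callable[[str, str], None]) -> None:
        self.subscribers.append(callback)

    def get_all(self) -> Dict[str, str]:
        return dict(self.nodes)


def get_all_then_subscribe(gcs: FakeGcs, between: Callable[[], None]) -> Dict[str, str]:
    cache = gcs.get_all()
    between()  # an update arriving here is in neither the snapshot nor a delta
    gcs.subscribe(cache.__setitem__)
    return cache


def subscribe_then_get_all(gcs: FakeGcs, between: Callable[[], None]) -> Dict[str, str]:
    cache: Dict[str, str] = {}
    gcs.subscribe(cache.__setitem__)
    between()  # this update is delivered as a delta and is also in the snapshot
    cache.update(gcs.get_all())
    return cache
```

In an asynchronous setting the ordering of late deltas against the snapshot still needs care; the sketch only shows why subscribing first closes the gap.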

…nd float16 inference) through new example script. (ray-project#47116) Signed-off-by: ujjawal-khare <[email protected]>

…7188) The `test_actor_retry` tests are failing/flaky on Windows. They pass locally. I have not been able to access the CI logs to see what is going wrong. In order to narrow down the problem (is it an overall timeout? Is one of the tests failing?) we can start by splitting the tests into two files. Toward solving ray-project#43845. Signed-off-by: ujjawal-khare <[email protected]>

…new stack Offline RL. (ray-project#47359) Signed-off-by: ujjawal-khare <[email protected]>

Redeploy in between each microbenchmark. --------- Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…7405) Reverts ray-project#47221 This broke ray-project#47395 Signed-off-by: ujjawal-khare <[email protected]>

…tally (ray-project#47372) Signed-off-by: khluu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Currently we have no linting on any part of the docs code. This PR runs pre-commit on the cluster docs. This PR fixes the following issues: ``` trim trailing whitespace.................................................Failed - hook id: trailing-whitespace - exit code: 1 - files were modified by this hook Fixing doc/source/cluster/kubernetes/user-guides/aws-eks-gpu-cluster.md Fixing doc/source/cluster/running-applications/job-submission/cli.rst Fixing doc/source/cluster/configure-manage-dashboard.md Fixing doc/source/cluster/kubernetes/user-guides/pod-security.md Fixing doc/source/cluster/vms/user-guides/launching-clusters/vsphere.md Fixing doc/source/cluster/kubernetes/user-guides/helm-chart-rbac.md Fixing doc/source/cluster/vms/references/ray-cluster-configuration.rst Fixing doc/source/cluster/running-applications/job-submission/quickstart.rst Fixing doc/source/cluster/kubernetes/examples/stable-diffusion-rayservice.md Fixing doc/source/cluster/kubernetes/getting-started/raycluster-quick-start.md Fixing doc/source/cluster/kubernetes/examples/rayjob-kueue-gang-scheduling.md Fixing doc/source/cluster/kubernetes/k8s-ecosystem/ingress.md Fixing doc/source/cluster/kubernetes/user-guides/kuberay-gcs-ft.md Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster-networkpolicy.yaml Fixing doc/source/cluster/kubernetes/k8s-ecosystem/pyspy.md Fixing doc/source/cluster/kubernetes/k8s-ecosystem/volcano.md Fixing doc/source/cluster/running-applications/job-submission/sdk.rst Fixing doc/source/cluster/running-applications/job-submission/ray-client.rst Fixing doc/source/cluster/kubernetes/troubleshooting/troubleshooting.md Fixing doc/source/cluster/kubernetes/getting-started/rayjob-quick-start.md Fixing doc/source/cluster/kubernetes/configs/ray-cluster.gpu.yaml Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster.with-fault-tolerance.yaml Fixing doc/source/cluster/kubernetes/examples/mnist-training-example.md Fixing 
doc/source/cluster/kubernetes/configs/static-ray-cluster.tls.yaml Fixing doc/source/cluster/kubernetes/user-guides/gcp-gke-gpu-cluster.md Fixing doc/source/cluster/kubernetes/examples/distributed-checkpointing-with-gcsfuse.md Fixing doc/source/cluster/kubernetes/user-guides/gke-gcs-bucket.md Fixing doc/source/cluster/kubernetes/user-guides/logging.md Fixing doc/source/cluster/kubernetes/examples/text-summarizer-rayservice.md Fixing doc/source/cluster/kubernetes/examples/rayjob-batch-inference-example.md Fixing doc/source/cluster/metrics.md Fixing doc/source/cluster/kubernetes/k8s-ecosystem/kubeflow.md Fixing doc/source/cluster/kubernetes/k8s-ecosystem/kueue.md Fixing doc/source/cluster/kubernetes/examples/rayjob-kueue-priority-scheduling.md Fixing doc/source/cluster/faq.rst Fixing doc/source/cluster/running-applications/job-submission/openapi.yml Fixing doc/source/cluster/kubernetes/user-guides/configuring-autoscaling.md Fixing doc/source/cluster/kubernetes/getting-started/rayservice-quick-start.md Fixing doc/source/cluster/kubernetes/user-guides/static-ray-cluster-without-kuberay.md Fixing doc/source/cluster/kubernetes/user-guides/config.md Fixing doc/source/cluster/kubernetes/user-guides/pod-command.md fix end of files.........................................................Failed - hook id: end-of-file-fixer - exit code: 1 - files were modified by this hook Fixing doc/source/cluster/kubernetes/images/rbac-clusterrole.svg Fixing doc/source/cluster/running-applications/job-submission/cli.rst Fixing doc/source/cluster/vms/user-guides/community/slurm.rst Fixing doc/source/cluster/kubernetes/benchmarks/memory-scalability-benchmark.md Fixing doc/source/cluster/images/ray-job-diagram.svg Fixing doc/source/cluster/kubernetes/user-guides/observability.md Fixing doc/source/cluster/kubernetes/examples/stable-diffusion-rayservice.md Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster-networkpolicy.yaml Fixing 
doc/source/cluster/kubernetes/images/rbac-role-one-namespace.svg Fixing doc/source/cluster/kubernetes/examples/mnist-training-example.md Fixing doc/source/cluster/cli.rst Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster.tls.yaml Fixing doc/source/cluster/kubernetes/user-guides/gcp-gke-gpu-cluster.md Fixing doc/source/cluster/kubernetes/user-guides/logging.md Fixing doc/source/cluster/kubernetes/examples/text-summarizer-rayservice.md Fixing doc/source/cluster/kubernetes/images/rbac-role-multi-namespaces.svg Fixing doc/source/cluster/kubernetes/images/kubeflow-architecture.svg Fixing doc/source/cluster/faq.rst Fixing doc/source/cluster/running-applications/job-submission/openapi.yml Fixing doc/source/cluster/kubernetes/images/AutoscalerOperator.svg check for added large files..............................................Passed check python ast.........................................................Passed check json...........................................(no files to check)Skipped check toml...........................................(no files to check)Skipped black....................................................................Passed flake8...................................................................Passed prettier.............................................(no files to check)Skipped mypy.................................................(no files to check)Skipped isort (python)...........................................................Passed rst directives end with two colons.......................................Passed rst ``inline code`` next to normal text..................................Passed use logger.warning(......................................................Passed check for not-real mock methods..........................................Passed ShellCheck v0.9.0........................................................Passed clang-format.........................................(no files to check)Skipped Google Java 
Formatter................................(no files to check)Skipped Check for Ray docstyle violations........................................Passed Check for Ray import order violations....................................Passed ``` Signed-off-by: pdmurray <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…ating to new API stack. (ray-project#47425) Signed-off-by: ujjawal-khare <[email protected]>

…nector piece). Fix: "State-connector" would use `seq_len=20`. (ray-project#47401) Signed-off-by: ujjawal-khare <[email protected]>

…arning rates for actor, critic, and alpha. (ray-project#47402) Signed-off-by: ujjawal-khare <[email protected]>

If the same method of the same actor is bound to the same node (i.e., reads from the same shared memory channel), aDAG execution hangs. This PR adds support for this case by caching results read from the channel. Signed-off-by: ujjawal-khare <[email protected]>
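
The caching idea can be sketched as below. This is a hedged illustration only: `CachedReader`, `read_fn`, and `num_consumers` are hypothetical names, and the sketch assumes every consumer on the actor reads exactly once per DAG execution.

```python
from typing import Any, Callable


class CachedReader:
    """Read the underlying channel once, then serve the cached value to the
    remaining consumers on the same actor (hypothetical sketch)."""

    def __init__(self, read_fn: Callable[[], Any], num_consumers: int) -> None:
        self._read_fn = read_fn
        self._num_consumers = num_consumers
        self._remaining = 0  # consumers that still need the cached value
        self._value: Any = None

    def read(self) -> Any:
        if self._remaining == 0:
            # First consumer of this execution: do the real channel read.
            self._value = self._read_fn()
            self._remaining = self._num_consumers
        self._remaining -= 1
        return self._value
```

A second direct read of the same channel would wait for a value that was already consumed, which matches the hang described above; serving it from the cache avoids that.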

…ents. (ray-project#47384) Signed-off-by: ujjawal-khare <[email protected]>

With serve shutdown in between every microbenchmark, serve needs to be started with grpc options every time for the grpc microbenchmarks. ## Related issue number closes ray-project#47424 --------- Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…#47024) aDAG currently does not support multiple return values. We would like to add general support for multiple return values. This PR supports multiple returns by returning a separate `ClassMethodNode` for each return value of the tuple. It is an incremental change for `ClassMethodNode`, adding `_is_class_method_output`, `_class_method_call`, `_output_idx`. `_output_idx` is used to guide channel allocation and output writes. The user needs to specify `num_returns > 1` to hint multiple return values. The upstream task allocates a separate output channel for each return value. A downstream task reads from one of the output channels. ## What is done? We modify `ClassMethodNode` to handle two cases: one is a class method call, which is the original semantics (`self.is_class_method_call == True`); the other is a class method output, which is responsible for one of the multiple return values (`self.is_class_method_output == True`). We modify `WriterInterface` to support writes to multiple `output_channels` with `output_idxs`. If an output index is None, it means the complete return value is written to the output channel. Otherwise, the return value is a tuple and the index is used to extract the value to be written to the output channel. We allocate separate output channels to different readers. The downstream tasks of a `ClassMethodNode` with `self.is_class_method_output == True` are the readers of an output channel of its upstream `ClassMethodNode`. The example below demonstrates this. ``` upstream ClassMethodNode (self.is_class_method_call == True, self.output_channels = [c1, c2]) --> downstream ClassMethodNode (self.is_class_method_output == True, self.output_channels[c1]) --> ... ``` Closes ray-project#45569 --------- Signed-off-by: Weixin Deng <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
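
The `output_idxs` rule described above (None writes the whole return value, an integer index extracts one element of the tuple) can be sketched as follows. `ListChannel` and `write_outputs` are hypothetical stand-ins and not the actual `WriterInterface`.

```python
from typing import Any, List, Optional, Sequence


class ListChannel:
    """Hypothetical in-memory channel that records what was written."""

    def __init__(self) -> None:
        self.values: List[Any] = []

    def write(self, value: Any) -> None:
        self.values.append(value)


def write_outputs(
    result: Any,
    output_channels: Sequence[ListChannel],
    output_idxs: Sequence[Optional[int]],
) -> None:
    for channel, idx in zip(output_channels, output_idxs):
        if idx is None:
            # The complete return value goes to this channel.
            channel.write(result)
        else:
            # The return value is a tuple; write only element `idx`.
            channel.write(result[idx])
```

For a method declared with `num_returns=2`, the upstream task would call something like `write_outputs((a, b), [c1, c2], [0, 1])`, so each downstream reader sees only its own element.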

…`. (ray-project#47452) Signed-off-by: ujjawal-khare <[email protected]>

…ate schedules. (ray-project#47453) Signed-off-by: ujjawal-khare <[email protected]>

Same code changes as [observability][export-api] Write node events ray-project#47221 Move test into a separate file to create a separate bazel target that can be skipped on Windows Signed-off-by: ujjawal-khare <[email protected]>

Signed-off-by: ujjawal-khare <[email protected]>

…ating to new API stack (by config). (ray-project#47427) Signed-off-by: ujjawal-khare <[email protected]>

Add streaming microbenchmark to release tests. Only HTTP, intermediate router, and handle for now (no grpc). --------- Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

![CleanShot 2024-09-04 at 11 12 44@2x](https://github.com/user-attachments/assets/9c8dfd64-c565-4285-a1ce-774c6fce2997) Signed-off-by: Saihajpreet Singh <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Reverts ray-project#46477 Signed-off-by: ujjawal-khare <[email protected]>

The current logic to parse logs from an anyscale job is very complicated. It first downloads all the logs from the cluster, then tries to guess the main job logs and error job logs. The logic for getting the error job log is no longer necessary. The new API offers a much simpler way to get the log, so update to that API. Test: - CI - so much cleaner: https://buildkite.com/ray-project/release/builds/22057#0191ba75-2f0b-4a0b-9bad-8603003eba4c/741-742 --------- Signed-off-by: can <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

… concurrently (ray-project#47482) Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…, rt env agent. (ray-project#47490) This saves 1 RPC for each GcsClient, which can be O(#nodes). Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…ay-project#47489) Without this PR, the num_input_consumers would be 1 because both inp[0] and inp[1] are only referred to in one task on the actor, so CachedChannel will not be created. The read will eventually time out because the mutable object is being read by the same actor twice. Signed-off-by: ujjawal-khare <[email protected]>

Redo https://github.com/ray-project/ray/pull/47483/files. The previous PR was based on too old a base, so it was merged successfully without re-compiling the dependencies. Also allow the dry run of generating the build cache to run on premerge, to block changes that can break it. Test: - CI Signed-off-by: can <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Write actor events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false. Event write is called whenever a value in the actor event data schema is modified. Typically this occurs before writing ActorTableData to the GCS table or publishing the data for the dashboard Signed-off-by: ujjawal-khare <[email protected]>

Support logging events for task execution for better observability. Users can turn on event profiling by setting RAY_ADAG_ENABLE_PROFILING to True. The event tracks the following metadata of a task: Signed-off-by: ujjawal-khare <[email protected]>

) Windows path needs to be escaped. Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Write task events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false. All tasks that are added to the task event buffer will be written to file. In addition, keep a dropped_status_events_for_export_ buffer which stores status events that were dropped from the buffer to send to GCS, and write these dropped events to file as well. The size of dropped_status_events_for_export_ is 10x larger than task_events_max_num_status_events_buffer_on_worker to prioritize recording data. The tradeoff here is memory on each worker, but this is a relatively small overhead, and it is unlikely the dropped events buffer will fill given the sink for export events (write to file) will succeed on each flush. Task events are converted to the export API proto and written to file in a separate thread, which runs this flush operation periodically (every second). Individual task events will be aggregated by task attempt before being written. This is consistent with the final event sent to GCS, and also helps reduce the number of events written to file. Signed-off-by: ujjawal-khare <[email protected]>
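
A minimal sketch of the "aggregate by task attempt" step, written in Python for illustration (the actual buffer is C++). The event dictionaries and the field names `task_id`, `attempt_number`, and `state_ts` are assumptions, not the export API schema.

```python
from typing import Any, Dict, List, Tuple


def aggregate_by_attempt(events: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Merge individual task events that share (task_id, attempt_number)."""
    merged: Dict[Tuple[str, int], Dict[str, Any]] = {}
    for event in events:
        key = (event["task_id"], event["attempt_number"])
        entry = merged.setdefault(
            key,
            {"task_id": key[0], "attempt_number": key[1], "state_ts": {}},
        )
        # Later events add state timestamps to the same aggregated record.
        entry["state_ts"].update(event["state_ts"])
    return list(merged.values())
```

Each flush would then write one record per task attempt instead of one per status update, which is how aggregation reduces the number of events written to file.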

…47516) Reverts ray-project#47303 Signed-off-by: ujjawal-khare <[email protected]>

…7536) Reverts ray-project#47193 Signed-off-by: ujjawal-khare <[email protected]>

| Before | After | |--------|------| |![CleanShot 2024-09-06 at 10 33 56@2x](https://github.com/user-attachments/assets/0b8dff77-3a7f-4bc7-b117-39fcd4edd69f) | ![CleanShot 2024-09-06 at 10 33 18@2x](https://github.com/user-attachments/assets/ef4c67ba-df95-48c9-8c70-273b75ed5296) | Signed-off-by: Saihajpreet Singh <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…ay-project#47500) Signed-off-by: ujjawal-khare <[email protected]>

Allow custom NCCL group for aDAG so that we can reuse what the user already created. Marking NcclGroupInterface as DeveloperAPI for now. After validation by using it in vLLM we can change to alpha stability. vLLM prototype: vllm-project/vllm#7568 Signed-off-by: ujjawal-khare <[email protected]>

Fix CI regression: https://buildkite.com/ray-project/postmerge/builds/6157#0191c4aa-1897-4d42-93c7-5403b67bc5cc https://buildkite.com/ray-project/postmerge/builds/6165#0191c819-53f7-4605-805f-824e85951fde Signed-off-by: ujjawal-khare <[email protected]>

Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

- Add back code changes from [observability][export-api] Write actor events ray-project#47303 - Separate out actor manager export event test into a separate file so we can skip on windows. Update BUILD rule so all tests in src/ray/gcs/gcs_server/test/export_api are skipped on windows Signed-off-by: Nikita Vemuri <[email protected]> Co-authored-by: Nikita Vemuri <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

- Re-add code changes from [observability][export-api] Write task events ray-project#47193, which was previously reverted due to CI test linux://:task_event_buffer_test is consistently_failing ray-project#47519, CI test windows://:task_event_buffer_test is consistently_failing ray-project#47523 and CI test darwin://:task_event_buffer_test is consistently_failing ray-project#47525 - Was able to reproduce the failures locally and fixed the test in 07efa6f. The failure was due to a logical merge conflict (the previous PR wasn't rebased onto the latest master after other event PRs were merged). Signed-off-by: Nikita Vemuri <[email protected]> Co-authored-by: Nikita Vemuri <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…GeneralAdvantageEstimation` connector in learner pipeline. (ray-project#47532) Signed-off-by: ujjawal-khare <[email protected]>

…ed` for `test_csv_read_filter_non_csv_file` (ray-project#47513) ## Why are these changes needed? Seems that ray-project#47467 ended up breaking some niche setup for this test, by changing the fixture from `shutdown_only` to `ray_start_regular_shared` we are able to get the test passing again. ## Related issue number ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: Matthew Owen <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

``` REGRESSION 13.65%: client__get_calls (THROUGHPUT) regresses from 1119.7725751916082 to 966.9141307622872 in microbenchmark.json REGRESSION 9.23%: single_client_put_gigabytes (THROUGHPUT) regresses from 20.184014305625574 to 18.32083810818594 in microbenchmark.json REGRESSION 8.40%: multi_client_tasks_async (THROUGHPUT) regresses from 23311.858831941317 to 21353.682091539627 in microbenchmark.json REGRESSION 6.66%: 1_1_async_actor_calls_with_args_async (THROUGHPUT) regresses from 3038.941703794114 to 2836.601104413851 in microbenchmark.json REGRESSION 4.39%: 1_1_async_actor_calls_async (THROUGHPUT) regresses from 4456.606860484332 to 4261.050694056448 in microbenchmark.json REGRESSION 3.77%: actors_per_second (THROUGHPUT) regresses from 627.338335492887 to 603.6854672610009 in benchmarks/many_actors.json REGRESSION 3.47%: single_client_get_object_containing_10k_refs (THROUGHPUT) regresses from 13.679337230724197 to 13.204885454613315 in microbenchmark.json REGRESSION 3.38%: 1_1_actor_calls_sync (THROUGHPUT) regresses from 2055.7051275912527 to 1986.177233156469 in microbenchmark.json REGRESSION 2.44%: 1_1_actor_calls_concurrent (THROUGHPUT) regresses from 5167.9800954515 to 5041.760637338739 in microbenchmark.json REGRESSION 2.33%: placement_group_create/removal (THROUGHPUT) regresses from 824.4108502776797 to 805.1759941825478 in microbenchmark.json REGRESSION 1.64%: single_client_wait_1k_refs (THROUGHPUT) regresses from 5.485273551888224 to 5.39514490847805 in microbenchmark.json REGRESSION 1.28%: single_client_tasks_sync (THROUGHPUT) regresses from 986.5998779605792 to 973.959307673384 in microbenchmark.json REGRESSION 0.95%: pgs_per_second (THROUGHPUT) regresses from 22.249430148995714 to 22.037557767422825 in benchmarks/many_pgs.json REGRESSION 0.66%: n_n_actor_calls_async (THROUGHPUT) regresses from 26545.931713712664 to 26370.461840482538 in microbenchmark.json REGRESSION 0.53%: 1_1_actor_calls_async (THROUGHPUT) regresses from 9060.701663275304 to 
9012.880467992636 in microbenchmark.json REGRESSION 0.28%: single_client_tasks_async (THROUGHPUT) regresses from 8011.455682416454 to 7988.9069673790045 in microbenchmark.json REGRESSION 0.19%: 1_1_async_actor_calls_sync (THROUGHPUT) regresses from 1486.2327104183764 to 1483.4703793760418 in microbenchmark.json REGRESSION 107.66%: dashboard_p95_latency_ms (LATENCY) regresses from 34.039 to 70.687 in benchmarks/many_nodes.json REGRESSION 30.19%: stage_0_time (LATENCY) regresses from 8.773437261581421 to 11.421970844268799 in stress_tests/stress_test_many_tasks.json REGRESSION 27.05%: dashboard_p50_latency_ms (LATENCY) regresses from 3.87 to 4.917 in benchmarks/many_nodes.json REGRESSION 9.72%: dashboard_p99_latency_ms (LATENCY) regresses from 119.573 to 131.198 in benchmarks/many_nodes.json REGRESSION 9.58%: stage_1_avg_iteration_time (LATENCY) regresses from 23.938837790489195 to 26.23279986381531 in stress_tests/stress_test_many_tasks.json REGRESSION 9.41%: stage_3_time (LATENCY) regresses from 3035.906775712967 to 3321.615835428238 in stress_tests/stress_test_many_tasks.json REGRESSION 6.37%: dashboard_p95_latency_ms (LATENCY) regresses from 3542.989 to 3768.817 in benchmarks/many_actors.json REGRESSION 4.93%: dashboard_p99_latency_ms (LATENCY) regresses from 358.789 to 376.468 in benchmarks/many_pgs.json REGRESSION 3.70%: time_to_broadcast_1073741824_bytes_to_50_nodes (LATENCY) regresses from 18.28579454300001 to 18.961532712000007 in scalability/object_store.json REGRESSION 3.56%: avg_pg_create_time_ms (LATENCY) regresses from 0.9371462897900398 to 0.9705077387385862 in stress_tests/stress_test_placement_group.json REGRESSION 3.24%: stage_2_avg_iteration_time (LATENCY) regresses from 61.69442081451416 to 63.694758081436156 in stress_tests/stress_test_many_tasks.json REGRESSION 2.07%: 10000_get_time (LATENCY) regresses from 23.411743029999997 to 23.896780481999997 in scalability/single_node.json REGRESSION 1.74%: dashboard_p50_latency_ms (LATENCY) regresses from 
167.38 to 170.294 in benchmarks/many_tasks.json REGRESSION 1.51%: 1000000_queued_time (LATENCY) regresses from 186.319367591 to 189.12986922100004 in scalability/single_node.json REGRESSION 1.39%: avg_pg_remove_time_ms (LATENCY) regresses from 0.9081441951950084 to 0.9207600330309926 in stress_tests/stress_test_placement_group.json REGRESSION 0.59%: dashboard_p95_latency_ms (LATENCY) regresses from 12.055 to 12.126 in benchmarks/many_pgs.json ``` Signed-off-by: kevin <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…r task (ray-project#47396) Currently if we need to rerun an actor task to recover a lost object but the actor is dead, the actor task will fail immediately. This PR allows the actor to be restarted (if it doesn't violate max_restarts) so that the actor task can run to recover lost objects. In terms of the state machine, we add a state transition from DEAD to RESTARTING. Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Based on https://docs.google.com/document/d/1Ka_HFwPBNIY1u3kuroHOSZMEQ8AgwpYciZ4n08HJ0Xc/edit When there are many in-flight requests (pipelining inputs to the DAG), 2 problems occur. Input submitter timeout: InputSubmitter.write() waits until the buffer is read by downstream tasks. Since the timeout count starts as soon as InputSubmitter.write() is called, when there are many in-flight requests, the later requests are likely to time out. Pipeline bubble: the output fetcher doesn't read the channel until CompiledDagRef.get is called. This means the upstream task (actor 2) is blocked until .get is called from the driver, although it could otherwise execute tasks. This PR solves the problem by providing multiple buffers per shm channel. Note that buffering is not supported for nccl yet (we can do it when we overlap compute/comm). Main changes: Introduce BufferedSharedMemoryChannel, which allows creating multiple buffers (10 by default). Reads/writes are done in a round-robin manner. When you have more in-flight requests than the buffer size, the DAG can still hit a timeout error. To make debugging easy and behavior straightforward, we introduce the max_buffered_inputs_ argument. If more than max_buffered_inputs_ requests are submitted to the DAG without ray.get, it immediately raises an exception. Signed-off-by: ujjawal-khare <[email protected]>
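
The round-robin buffering and the in-flight limit can be sketched as below. This is an in-process illustration, not the shared-memory implementation: `RoundRobinBuffers` and `Driver` are hypothetical classes, and where a real channel would block (and possibly time out) the sketch raises `BlockingIOError` instead.

```python
from typing import Any, List


class RoundRobinBuffers:
    """Hypothetical stand-in for a channel backed by several buffers."""

    def __init__(self, num_buffers: int = 10) -> None:
        self._slots: List[Any] = [None] * num_buffers
        self._occupied = [False] * num_buffers
        self._write_idx = 0
        self._read_idx = 0

    def write(self, value: Any) -> None:
        i = self._write_idx
        if self._occupied[i]:
            # A real channel would block here until the reader frees the slot.
            raise BlockingIOError("all buffers are full")
        self._slots[i] = value
        self._occupied[i] = True
        self._write_idx = (i + 1) % len(self._slots)

    def read(self) -> Any:
        i = self._read_idx
        if not self._occupied[i]:
            raise BlockingIOError("nothing to read")
        self._occupied[i] = False
        self._read_idx = (i + 1) % len(self._slots)
        return self._slots[i]


class Driver:
    """Fails fast once too many inputs are submitted without a get."""

    def __init__(self, channel: RoundRobinBuffers, max_buffered_inputs: int) -> None:
        self._channel = channel
        self._max_buffered_inputs = max_buffered_inputs
        self._in_flight = 0

    def submit(self, value: Any) -> None:
        if self._in_flight >= self._max_buffered_inputs:
            raise RuntimeError("too many in-flight inputs; call get() first")
        self._channel.write(value)
        self._in_flight += 1

    def get(self) -> Any:
        value = self._channel.read()
        self._in_flight -= 1
        return value
```

Raising immediately at the submit call makes the failure point explicit, rather than surfacing later as a write timeout once every buffer is occupied.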

) Clean up the code. Signed-off-by: ujjawal-khare <[email protected]>

Signed-off-by: ujjawal-khare <[email protected]>

This PR supports multi readers in multi nodes. It also adds tests that the feature works with large gRPC payloads and buffer resizing. multi readers in multi node didn't work because the code allows to only register 1 remote reader reference on 1 specific node. This fixes the issues by allowing to register remote reader references in multi nodes. Signed-off-by: ujjawal-khare <[email protected]>

…7533) When a serve app is launched, serve will startup automatically. In certain places like k8s, it can be difficult to preconfigure serve (e.g. in the [ray-cluster helm chart](https://github.com/ray-project/kuberay/blob/master/helm-chart/ray-cluster/values.yaml) there is no ability to set the default serve arguments). This means you need to either be explicit when you start serve, or if it starts up automatically you may need to shut it down, then restart it, which is inconvenient. Signed-off-by: Tim Paine <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…ject#47592) Style nits.

## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
    - [ ] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(

Signed-off-by: angelinalg <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>

Write Driver Job events to file as part of the export API. This logic only runs if RayConfig::instance().enable_export_api_write() is true; the default value is false. The event write is called whenever a job table data value is modified, which typically occurs before writing JobTableData to the GCS table. Signed-off-by: ujjawal-khare <[email protected]>

…#47492) The GCS API GetAllJobInfo serves Dashboard APIs, even when only 1 job is requested. This becomes slow when the number of jobs is high. This PR pushes the job filter down to GCS to reduce the Dashboard workload. This API is somewhat unusual because the filter `job_or_submission_id` is either a job ID or a job_submission_id. We don't have an index on the latter, and some jobs don't have one. So we still GetAll from Redis, then filter by both IDs before doing more RPC calls. --------- Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: Jiajun Yao <[email protected]> Co-authored-by: Jiajun Yao <[email protected]> Co-authored-by: Alexey Kudinkin <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
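The "fetch everything, then filter by either ID" step can be sketched as below. `JobRecord` and `filter_jobs` are hypothetical names used only for illustration; the real filtering happens inside GCS, not in Python.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class JobRecord:
    # Hypothetical stand-in for one row of job table data.
    job_id: str
    submission_id: Optional[str] = None  # some jobs don't have one


def filter_jobs(jobs: Iterable[JobRecord], job_or_submission_id: Optional[str] = None) -> List[JobRecord]:
    """Keep jobs whose job ID or submission ID equals the filter.

    With no filter, all jobs are returned. There is no index on the
    submission ID, so this is a linear scan over the fetched rows.
    """
    if job_or_submission_id is None:
        return list(jobs)
    return [
        job
        for job in jobs
        if job.job_id == job_or_submission_id
        or job.submission_id == job_or_submission_id
    ]
```

Filtering before any further per-job RPCs is what saves work: only the matching rows need the follow-up calls.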

…servability doc (ray-project#47462) Signed-off-by: Rueian <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

The ranks should be in the order of the actors. Signed-off-by: ujjawal-khare <[email protected]>

Signed-off-by: ujjawal-khare <[email protected]>

_default_metadata_providers adds a layer of indirection. --------- Signed-off-by: Balaji Veeramani <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Went through all the constants in the file and remove the ones that's no Signed-off-by: Gene Su <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Move TestWriteTaskExportEvents to a separate file and skip it on Windows. This is OK for the export API feature because it is currently not supported on Windows (tests for other resource events written from GCS are also skipped on Windows). This test is failing in postmerge on Windows (CI test windows://:task_event_buffer_test is consistently_failing ray-project#47523) with the following error in the teardown step: unknown file: error: C++ exception with description "remove_all: The process cannot access the file because it is being used by another process.: "event_123"" thrown in TearDown(). This is the same error raised by other tests that clean up created directories with remove_all() on Windows (e.g. //src/ray/util/tests:event_test). Those tests are also skipped on Windows. Signed-off-by: Nikita Vemuri <[email protected]> Co-authored-by: Nikita Vemuri <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…g their own loss (algo independent). (ray-project#47581) Signed-off-by: ujjawal-khare <[email protected]>

Signed-off-by: liuxsh9 <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Re: ray-project#47229 The previous PR that set up the default Serve logger had an unexpected consequence. Combined with Serve's stdout redirect feature (when `RAY_SERVE_LOG_TO_STDERR=0` is set in the environment), it set up the default Serve logger and redirected all stdout/stderr into Serve's log files instead of the console. As a result, the Anyscale platform could not tell that the ray start command was running successfully and could not start the cluster. This PR fixes the behavior by configuring Serve's default logger with only a stream handler and skipping the file handler altogether. Signed-off-by: Gene Su <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…rkers. (ray-project#47212) Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: Jiajun Yao <[email protected]> Co-authored-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…tiple refs or futures to allow clients to retrieve them one at a time (ray-project#46908) (ray-project#47305) ## Why are these changes needed? Currently, if `MultiOutputNode` is used to wrap a DAG's output, you get back a single `CompiledDAGRef` or `CompiledDAGFuture`, depending on whether `execute` or `execute_async` is invoked, that points to a list of all of the outputs. To retrieve one of the outputs, you have to get and deserialize all of them at the same time. This PR separates the output of `execute` and `execute_async` to a list of `CompiledDAGRef` or `CompiledDAGFuture` when the output is wrapped by `MultiOutputNode`. This is particularly useful for vLLM tensor parallelism. Since all shards return the same results, we only need to fetch result from one of the workers. Closes ray-project#46908. --------- Signed-off-by: jeffreyjeffreywang <[email protected]> Signed-off-by: Jeffrey Wang <[email protected]> Co-authored-by: jeffreyjeffreywang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

## Why are these changes needed?

Detect replica death earlier on handles/routers. Currently routers will process replica death if the actor death error is thrown during active probing or system message.

1. cover one more case: process replica death if error is thrown _while_ request was being processed on the replica.
2. improved handling: if error is detected on the system message, meaning router found out replica is dead after assigning a request to that replica, retry the request.

### Performance evaluation

(master results pulled from https://buildkite.com/ray-project/release/builds/21404#01917375-2b1e-4cba-9380-24e557a42a42)

Latency:

| metric | master | this PR | % change |
| -- | -- | -- | -- |
| http_p50_latency | 3.9672044999932154 | 3.9794859999986443 | 0.31 |
| http_1mb_p50_latency | 4.283115999996312 | 4.1375990000034335 | -3.4 |
| http_10mb_p50_latency | 8.212248500001351 | 8.056774499998198 | -1.89 |
| grpc_p50_latency | 2.889802499964844 | 2.845889500008525 | -1.52 |
| grpc_1mb_p50_latency | 6.320479999999407 | 9.85005449996379 | 55.84 |
| grpc_10mb_p50_latency | 92.12763850001693 | 106.14903449999247 | 15.22 |
| handle_p50_latency | 1.7775379999420693 | 1.6373455000575632 | -7.89 |
| handle_1mb_p50_latency | 2.797253500034458 | 2.7225929999303844 | -2.67 |
| handle_10mb_p50_latency | 11.619127000017215 | 11.39100950001648 | -1.96 |

Throughput:

| metric | master | this PR | % change |
| -- | -- | -- | -- |
| http_avg_rps | 359.14 | 357.81 | -0.37 |
| http_100_max_ongoing_requests_avg_rps | 507.21 | 515.71 | 1.68 |
| grpc_avg_rps | 506.16 | 485.92 | -4.0 |
| grpc_100_max_ongoing_requests_avg_rps | 506.13 | 486.47 | -3.88 |
| handle_avg_rps | 604.52 | 641.66 | 6.14 |
| handle_100_max_ongoing_requests_avg_rps | 1003.45 | 1039.15 | 3.56 |

Results: everything except for grpc results are within noise. As for grpc results, they have always been relatively noisy (see below), so the results are actually also within the noise that we've been seeing. There is also no reason why latency for a request would only increase for grpc and not http or handle for the changes in this PR, so IMO this is safe.

![Screenshot 2024-08-21 at 11 54 55 AM](https://github.com/user-attachments/assets/6c7caa40-ae3c-417b-a5bf-332e2d6ca378)

## Related issue number

closes ray-project#47219

---------

Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>

…r being down (e.g. spot instance termination) (ray-project#47493) Signed-off-by: Weichen Xu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Skip test_replica_actor_died on windows. Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

## Why are these changes needed? Pull replica scheduler and replica wrapper out from `common.py` into their own files. Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Idk how this was generated Signed-off-by: ujjawal-khare <[email protected]>

…from an actor (ray-project#47601) Currently, when a DAG is created from an actor, we use a different mechanism than when it is created from a driver: in a driver we create a ProxyActor, whereas in an actor we just use the actor itself. This inconsistency is error-prone. For example, when we added support for multiple readers across multiple nodes, we hit a deadlock because the driver actor needs to call ray.get(a.allocate_channel.remote()) for a downstream actor while the downstream actor calls ray.get(driver_actor.create_ref.remote()). This PR fixes the issue by making ProxyActor the default mechanism even when a DAG is created inside an actor. Signed-off-by: ujjawal-khare <[email protected]>

Introduce an env var to raise an exception when there's out-of-band serialization of an object ref. Improve the error message for out-of-band serialization issues. There are 2 types of issues: 1. cloudpickle.dumps(ref). 2. implicit capture. See below for more details. Update the anti-pattern doc. Signed-off-by: ujjawal-khare <[email protected]>

…ay-project#47564)

Change 1: Remove class DAGInputAdapter. Without this PR, the entire input data is written to the channel, even if a reader only wants to retrieve partial input data via InputAttributeNode. The entire input data is then read by the READ operation, and the partial input is retrieved during the COMPUTE operation. In this PR, each InputAttributeNode has its own channel, and only the corresponding input data is written to that channel. Therefore, we no longer need DAGInputAdapter to retrieve the partial input data during the COMPUTE operation.

Change 2: If the DAG contains any InputAttributeNode, create a channel for each InputAttributeNode. Then, write the partial input data to the corresponding channel.

Change 3: There are some if/else statements to handle InputNode and InputAttributeNode when creating CachedChannel. This PR unifies the logic because InputNode and the different InputAttributeNodes are no longer considered consumers of a single input channel. Each InputAttributeNode has its own channel.

Change 4: Move RayDAGArgs from compiled_dag_node.py to common.py to avoid importing it inside _adapt. Without this, this PR is about 5% slower than the baseline in the case "Benchmark: single actor, no InputAttributeNode". With this change, the performance is almost the same as, or slightly better than, the baseline. See "Benchmark: single actor, no InputAttributeNode" below for more details.

Signed-off-by: ujjawal-khare <[email protected]>

) To extract path partition information with `read_parquet`, you pass a PyArrow `partitioning` object to `dataset_kwargs`. For example:

```
schema = pa.schema([("one", pa.int32()), ("two", pa.string())])
partitioning = pa.dataset.partitioning(schema, flavor="hive")
ds = ray.data.read_parquet(... dataset_kwargs=dict(partitioning=partitioning))
```

This is problematic for two reasons:

1. It tightly couples the interface with the implementation; partitioning only works if we use `pyarrow.Dataset` in a specific way in the implementation.
2. It's inconsistent with all of the other file-based APIs. All other APIs expose a top-level `partitioning` parameter (rather than `dataset_kwargs`) where you pass a Ray Data `Partitioning` object (rather than a PyArrow partitioning object).

---------

Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>

…#47670) Signed-off-by: Weichen Xu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

There's a regression with buffer size 10. I am going to investigate, but for now I am reverting to buffer size 1. With buffer size 1, the regression seems to be gone (see https://buildkite.com/ray-project/release/builds/22594#0191ed4b-5477-45ff-be9e-6e098b5fbb3c). The cause is probably some sort of contention. Signed-off-by: ujjawal-khare <[email protected]>

```
REGRESSION 12.66%: single_client_get_object_containing_10k_refs (THROUGHPUT) regresses from 13.204885454613315 to 11.533423619760748 in microbenchmark.json
REGRESSION 9.50%: client__1_1_actor_calls_sync (THROUGHPUT) regresses from 523.3469473257671 to 473.62862729568997 in microbenchmark.json
REGRESSION 6.76%: multi_client_put_gigabytes (THROUGHPUT) regresses from 45.440179854469804 to 42.368678421213005 in microbenchmark.json
REGRESSION 4.92%: 1_n_actor_calls_async (THROUGHPUT) regresses from 8803.178389859915 to 8370.014425096557 in microbenchmark.json
REGRESSION 3.89%: n_n_actor_calls_with_arg_async (THROUGHPUT) regresses from 2748.863962184806 to 2641.837605625889 in microbenchmark.json
REGRESSION 3.45%: client__1_1_actor_calls_async (THROUGHPUT) regresses from 1019.3028285821217 to 984.156036006501 in microbenchmark.json
REGRESSION 3.06%: client__1_1_actor_calls_concurrent (THROUGHPUT) regresses from 1007.6444648899972 to 976.8103650114274 in microbenchmark.json
REGRESSION 0.65%: placement_group_create/removal (THROUGHPUT) regresses from 805.1759941825478 to 799.9345402492929 in microbenchmark.json
REGRESSION 0.33%: single_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 5273.203424794718 to 5255.898134426729 in microbenchmark.json
REGRESSION 0.02%: 1_1_actor_calls_async (THROUGHPUT) regresses from 9012.880467992636 to 9011.034048587637 in microbenchmark.json
REGRESSION 0.01%: client__put_gigabytes (THROUGHPUT) regresses from 0.13947664668408546 to 0.13945791828216536 in microbenchmark.json
REGRESSION 0.00%: client__put_calls (THROUGHPUT) regresses from 806.1974515278531 to 806.172478450918 in microbenchmark.json
REGRESSION 70.55%: dashboard_p50_latency_ms (LATENCY) regresses from 104.211 to 177.731 in benchmarks/many_actors.json
REGRESSION 13.13%: time_to_broadcast_1073741824_bytes_to_50_nodes (LATENCY) regresses from 18.961532712000007 to 21.451945214000006 in scalability/object_store.json
REGRESSION 4.50%: 3000_returns_time (LATENCY) regresses from 5.680022101000006 to 5.935367576000004 in scalability/single_node.json
REGRESSION 3.96%: avg_iteration_time (LATENCY) regresses from 0.9740754842758179 to 1.012664566040039 in stress_tests/stress_test_dead_actors.json
REGRESSION 2.75%: stage_2_avg_iteration_time (LATENCY) regresses from 63.694758081436156 to 65.44879236221314 in stress_tests/stress_test_many_tasks.json
REGRESSION 1.66%: 10000_args_time (LATENCY) regresses from 17.328640389999997 to 17.61703060299999 in scalability/single_node.json
REGRESSION 1.40%: stage_4_spread (LATENCY) regresses from 0.45063567085147194 to 0.4569625792772166 in stress_tests/stress_test_many_tasks.json
REGRESSION 0.69%: dashboard_p50_latency_ms (LATENCY) regresses from 3.347 to 3.37 in benchmarks/many_pgs.json
REGRESSION 0.19%: 10000_get_time (LATENCY) regresses from 23.896780481999997 to 23.942006032999984 in scalability/single_node.json
```

Signed-off-by: kevin <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>

Signed-off-by: ujjawal-khare <[email protected]>

Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…oject#47655) ## Why are these changes needed? Abstract `ray.ObjectRef` and `ray.ObjectRefGenerator` in a result wrapper that the deployment response can directly call into. --------- Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…project#47706) CompiledDAG._type_hints is not used. Signed-off-by: ujjawal-khare <[email protected]>

…ject#47515)

## Why are these changes needed?

The progress bar for Ray Data could still end up showing higher utilization than what the cluster currently has. ray-project#46729 was the first attempt to fix it and addressed the issue in static clusters, but we still have the issue for clusters that autoscale. This change simply rephrases the string so it is less confusing.

Before

<img width="1249" alt="image" src="https://github.com/user-attachments/assets/049ea096-a87f-4767-ba04-6d00d7c2755d">

After

<img width="1248" alt="image" src="https://github.com/user-attachments/assets/cb74c0dc-1f33-4b22-b31c-e83df2a5d408">

This comes from the fact that operators don't track the task state (and currently Ray Core does not even provide that API). This means Ray Data operators do not know whether a task is assigned to a node or not, so once the task is submitted to Ray it is marked active even if it is pending node assignment. The dashboard does better here since it has extra information from the task.

<img width="1493" alt="image" src="https://github.com/user-attachments/assets/9315b884-3e61-4b32-8400-7f76e15b6a4b">

In the future we can consider adding the core API for remote state reporting and allowing operators to provide more detailed state (active, pending_scheduled, pending_node_assignment).

## Related issue number

## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
    - [ ] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(

---------

Signed-off-by: Sofian Hnaide <[email protected]>
Co-authored-by: scottjlee <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>

## Why are these changes needed? - We can make some tests asynchronous instead of having to rely on `_to_object_ref`. - we can use `RayActorError` instead of `ActorDiedError` Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

… endpoint (ray-project#47727) Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Signed-off-by: Alexey Kudinkin <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…project#47704) Currently, when there's an exception, there's only 1 return value, but the multi-ref logic assumes that the number of return values matches the number of output channels. This PR fixes the issue by duplicating the exception to match the number of output channels. Signed-off-by: ujjawal-khare <[email protected]>
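A minimal sketch of the idea, assuming a hypothetical helper `fan_out_result` (not the function used in the PR): a single exception is repeated once per output channel so that every reader sees it, while normal results must already match the channel count.

```python
from typing import Any, List


def fan_out_result(result: Any, num_output_channels: int) -> List[Any]:
    """Return exactly one value per output channel.

    If the task raised, `result` is a single exception object; it is
    duplicated so each output channel receives the same exception.
    """
    if isinstance(result, Exception):
        return [result] * num_output_channels
    outputs = list(result)
    if len(outputs) != num_output_channels:
        raise ValueError(
            f"expected {num_output_channels} outputs, got {len(outputs)}"
        )
    return outputs
```

Because the same exception object is written to every channel, fetching any one of the returned refs surfaces the failure.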

…roject#47772) so that output is printed to logs and also use "sys.executable" rather than "python" Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…dd better defaults. (ray-project#47775) Signed-off-by: ujjawal-khare <[email protected]>

- docker compose service volume short syntax uses bind (similar to `-v`) and will create the dir if it does not exist
- the code was not mapping the dir to a host path, so it has no meaningful effect when running in a container, such as on CI

Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>

## Why are these changes needed?

Currently, the progress bar is pretty verbose because it is very information dense. This PR:

- Reorganizes progress output to group by relevant concepts and clarifies labels
- Standardizes global and operator-level progress bar outputs
- Removes the use of all emojis (poor rendering on some platforms / external logging systems)

Progress bar before this PR:

<img width="1403" alt="Screenshot at Sep 16 13-00-17" src="https://github.com/user-attachments/assets/4f459b77-06ba-4395-b883-e4c9ac8ca2ef">

Progress bar after this PR:

<img width="1502" alt="Screenshot at Sep 23 13-48-32" src="https://github.com/user-attachments/assets/0c0f8c94-9439-4fd4-ae1a-2857b3a87b59">

Will follow up with a docs PR once we merge this change, so that I don't need to continuously modify the docs.

In the future, we should restructure the way progress bars are grouped/tracked, so that we can tabulate the op-level progress bar outputs.

## Related issue number

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
    - [ ] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(

---------

Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>

for release perf checking. Signed-off-by: kevin <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…oject#47748) Created by release automation bot. Update with commit f298a75 Signed-off-by: kevin <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…oject#47801) Created by release automation bot. Update with commit 18b2d94 Signed-off-by: kevin <[email protected]> Signed-off-by: Kevin H. Luu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…47468) Add ExportEventLoggerAdapter which will be used to write export events to file from python files. Only a single ExportEventLoggerAdapter instance will exist per source type, so callers can create or get this instance using get_export_event_logger which is thread safe. Write Submission Job export events to file from JobInfoStorageClient.put_info which is called to update the JobInfo data in the internal KV store. Signed-off-by: ujjawal-khare <[email protected]>
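The "one instance per source type, obtained through a thread-safe getter" pattern can be sketched as follows. `ExportLoggerSketch` and `get_export_logger_sketch` are hypothetical names for illustration; they do not reproduce the signature of the real `get_export_event_logger`, and the sketch keeps events in memory instead of writing files.

```python
import threading
from typing import Dict, List


class ExportLoggerSketch:
    """Hypothetical stand-in for a per-source-type export event logger."""

    def __init__(self, source_type: str) -> None:
        self.source_type = source_type
        self.events: List[dict] = []  # the real adapter writes to a file

    def send_event(self, event: dict) -> None:
        self.events.append(event)


_loggers: Dict[str, ExportLoggerSketch] = {}
_loggers_lock = threading.Lock()


def get_export_logger_sketch(source_type: str) -> ExportLoggerSketch:
    """Create the logger for `source_type` on first use, then reuse it.

    The lock makes the check-then-create step atomic, so concurrent
    callers always receive the same instance.
    """
    with _loggers_lock:
        logger = _loggers.get(source_type)
        if logger is None:
            logger = ExportLoggerSketch(source_type)
            _loggers[source_type] = logger
        return logger
```

Callers never construct the logger directly; they always go through the getter, which is what keeps a single writer per source type.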

Move export events from session_latest/logs/events to session_latest/logs/export_events. Keeping both event types in the same folder doesn't cause any issue for Ray: export event files are already filtered out for the /events API in ray/python/ray/dashboard/modules/event/event_utils.py (line 22 in 1e48a03): `all_source_types = set(event_consts.EVENT_SOURCE_ALL)`. However, moving these to a separate folder is better for existing downstream consumers, who then don't have to handle export events in the events folder if they turn the flag on. Signed-off-by: ujjawal-khare <[email protected]>

Currently we only print 100 last lines of anyscale job log to buildkite. This PR removes that limit and prints everything instead. CC: @kouroshHakha Test: - CI Signed-off-by: can <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Signed-off-by: ujjawal-khare <[email protected]>

…oject#47812) Created by release automation bot. Update with commit d2982b7 Signed-off-by: kevin <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Signed-off-by: ujjawal-khare <[email protected]>

…mary and object list (ray-project#47818)

# Current status:

* When we retrieve the information from GCS, the task_status as well as the attempts are in 2 fields and the task status is an enum.
* Later during reconstruction, the 2 fields are combined into 1 and the number of attempts is added to the task_status field.
* That's why when displaying the objects, the function isn't able to convert the string back to enum.

# Proposed solution:

* Instead of combining the 2 fields (task_status and attempt), we will keep the 2 fields and add an additional field (attempt_number) in the Object State
* In this way, we will keep the task_status as enum and put the attempt number information in a different field

# Changes in this PR:

* Added the `attempt_number` in `ObjectState` and `task_attempt_number_counts` in `ObjectSummaryPerKey`
* Added logic to populate the fields as proposed above
* Updated the logic for the memory summary function to display the attempt number in a new column
* Corresponding tests added as well

Signed-off-by: Mengjin Yan <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
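The proposed layout can be sketched as below. `TaskStatusSketch`, `ObjectStateSketch`, and `attempt_number_counts` are simplified, hypothetical stand-ins; the real `ObjectState` and `ObjectSummaryPerKey` have more fields and may use different types.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable


class TaskStatusSketch(Enum):
    # Hypothetical subset of task statuses.
    RUNNING = "RUNNING"
    FINISHED = "FINISHED"


@dataclass
class ObjectStateSketch:
    object_id: str
    task_status: TaskStatusSketch  # stays a plain enum value
    attempt_number: int            # kept separate instead of being merged into the status


def attempt_number_counts(objects: Iterable[ObjectStateSketch]) -> Dict[int, int]:
    """Count how many objects were produced by each attempt number."""
    return dict(Counter(obj.attempt_number for obj in objects))
```

Since the status is never concatenated with the attempt number, converting it back to the enum for display cannot fail on a mixed string.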

Signed-off-by: ujjawal-khare <[email protected]>

…for SAC and DQN. (ray-project#47217) Signed-off-by: ujjawal-khare <[email protected]>

…ct#47822) Signed-off-by: Mengjin Yan <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Previously we had several ad-hoc places that implemented a "thread and io_context" pattern: create a thread dedicated to an asio io_context, so that workloads can post async tasks onto it. This leads to duplicated code: everywhere we create threads, we also implement stop and join. This PR introduces InstrumentedIOContextWithThread, which does exactly this, and replaces existing usages. It also fixes some absl::Time computations to follow best practice. This is a refactoring and should cause no runtime difference. Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…Learner, but still on RolloutWorker and Policy) (ray-project#46085) Signed-off-by: ujjawal-khare <[email protected]>

…ay-project#47645) Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

When a GPU tensor is passed to execute, it gets converted into a CPU tensor. The issue may be related to ray-project#46440. Signed-off-by: ujjawal-khare <[email protected]>

Use structured logging by changing more `<< node_id` to use `.WithField(node_id)`. This is not intended to be complete, but it should cover most of the cases. We did the work for NodeID, WorkerID, ActorID, JobID, TaskID, PlacementGroupID. Some logs have multiple IDs. To avoid confusion, for these we only use WithField(object_id) and don't use WithField on either of the Node IDs. This PR should not change anything in Ray other than logs. Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…netuning (new API stack with RLModule checkpoints). (ray-project#47838) Signed-off-by: ujjawal-khare <[email protected]>

…on spaces for agents). (ray-project#47830) Signed-off-by: ujjawal-khare <[email protected]>

Signed-off-by: ujjawal-khare <[email protected]>

change kuberay helm and branch reference versions to v1.2.2 Signed-off-by: ujjawal-khare <[email protected]>

…ray-project#47832) Currently, when using the tensor type in Ray Data, if a single tensor in a block grows above 2Gb, the use of signed `int32` offsets results in the following error:

```
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
```

Consequently, this change adds support for tensors of > 4Gb in size, while maintaining compatibility with existing datasets already using tensors. This is done by forking `ArrowTensorType` in two:

- `ArrowTensorType` (v1) remains intact
- `ArrowTensorTypeV2` is rebased on Arrow's `LargeListType` and now uses `int64` offsets

---------

Signed-off-by: Peter Wang <[email protected]>
Signed-off-by: Alexey Kudinkin <[email protected]>
Co-authored-by: Peter Wang <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
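The arithmetic behind the overflow can be checked with a short sketch. `max_offset` and `needs_int64_offsets` are hypothetical helpers, and the sketch assumes list offsets index elements of the flattened child array, so for one-byte elements the `int32` ceiling sits at roughly 2Gb.

```python
INT32_OFFSET_BITS = 32
INT64_OFFSET_BITS = 64


def max_offset(offset_bits: int) -> int:
    """Largest value a signed offset of the given width can hold."""
    return 2 ** (offset_bits - 1) - 1


def needs_int64_offsets(total_elements: int) -> bool:
    """True if the flattened element count no longer fits in int32 offsets."""
    return total_elements > max_offset(INT32_OFFSET_BITS)


# 2**31 - 1 = 2147483647 is the last offset int32 can represent.
assert max_offset(INT32_OFFSET_BITS) == 2_147_483_647
```

Switching to `int64` offsets raises the ceiling to `2**63 - 1`, which is why a large-list layout removes the limit in practice.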

…D()) == sync_reactors_.end() (ray-project#47861) Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…ory cleanups). (ray-project#47884) Signed-off-by: ujjawal-khare <[email protected]>

… Algo, and Learner example classes). (ray-project#47885) Signed-off-by: ujjawal-khare <[email protected]>

…eric `_forward` to further simplify the user experience). (ray-project#47889) Signed-off-by: ujjawal-khare <[email protected]>

…DreamerV3 on new API stack remains with tf now). (ray-project#47892) Signed-off-by: ujjawal-khare <[email protected]>

… ostream. (ray-project#47893) We have a convenience function `debug_string` used in Ray logs: it prints printables (operator<<), containers, and pairs. However, it returns a std::string which is fed into RAY_LOG(), and this makes a copy. This PR changes the signature to return a `DebugStringWrapper` which holds a const reference to the argument and is printable for all already-supported types. It additionally supports std::tuple. This should only have marginal perf benefits since we typically don't debug_string a very big data structure. Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…#47922)

## Why are these changes needed?

Fixes some failing/flaky unit tests, which fail with errors like:

```
EnvironmentLocationNotFound: Not a conda environment: /opt/miniconda/envs/jobs-backwards-compatibility-cc452d926b8748a1ab6b4fbf6a6dba2b
```

- TestBackwardsCompatibility.test_cli
- test_failed_driver_exit_code

Previously failing test now passes with this PR applied: https://buildkite.com/ray-project/postmerge/builds/6479#0192693b-1b8f-4dbc-a497-26d163b52c70/181-934

## Related issue number

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
    - [x] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(

Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>

When the conda env already exists, just remove it and continue testing. Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…-project#47912) Signed-off-by: Kai-Hsun Chen <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…ay-project#47847) In `fulfill_pending_requests`, there are two nested loops:

- the outer loop greedily fulfills more requests so that if backoff doesn't occur, it's not necessary for new asyncio tasks to be started to fulfill each request
- the inner loop handles backoff if replicas can't be found to fulfill the next request

The outer loop will be stopped if there are enough tasks to handle all pending requests. However if all replicas are at max capacity, it's possible for the inner loop to continue to loop even when the task is no longer needed (e.g. when a request has been cancelled), because the inner loop simply continues to try to find an available replica without checking if the current task is even necessary.

This PR makes sure that at the end of each iteration of the inner loop, it clears out requests in `pending_requests_to_fulfill` that have been cancelled, and then breaks out of the loop if there are enough tasks to handle the remaining requests.

Tests:

- Added a test that tests for the scenario where a request is cancelled while it's trying to find an available replica
- Also modified the tests in `test_pow_2_scheduler.py` so that the backoff sequence is small values (1ms), and the timeouts in the tests are also low `10ms`, so that the unit tests run much faster (~5s now compared to ~30s before).

## Related issue number

related: ray-project#47585

---------

Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
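The inner-loop change can be sketched roughly as follows. This is a simplified illustration, not the scheduler's code: `find_replica_with_backoff`, its parameters, and the fixed backoff are all assumptions, and pending requests are modelled as plain `asyncio.Future` objects.

```python
import asyncio
from typing import Callable, List, Optional


async def find_replica_with_backoff(
    pending_requests: List[asyncio.Future],
    try_find_replica: Callable[[], Optional[str]],
    num_scheduling_tasks: int,
    backoff_s: float = 0.001,
    max_attempts: int = 50,
) -> Optional[str]:
    """Retry until a replica is available or this task is no longer needed."""
    for _ in range(max_attempts):
        replica = try_find_replica()
        if replica is not None:
            return replica
        # End of iteration: drop requests that were cancelled while waiting.
        pending_requests[:] = [r for r in pending_requests if not r.cancelled()]
        # If there are more scheduling tasks than remaining requests,
        # this task is surplus and can stop instead of backing off again.
        if num_scheduling_tasks > len(pending_requests):
            return None
        await asyncio.sleep(backoff_s)
    return None
```

Without the two middle steps, a task whose only request was cancelled would keep sleeping and probing for as long as every replica stays at capacity.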

…cs, SpecDict, TensorSpec). (ray-project#47915) Signed-off-by: ujjawal-khare <[email protected]>

… not catch correct `ObjectLostError`). (ray-project#47940) Signed-off-by: ujjawal-khare <[email protected]>

…oduleConfig; cleanups, DefaultModelConfig dataclass). (ray-project#47908) Signed-off-by: ujjawal-khare <[email protected]>

…project#47659) Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…'. (ray-project#47914) Signed-off-by: ujjawal-khare <[email protected]>

Followup on ray-project#47893, add more "blessed container types" to debug string function. Signed-off-by: dentiny <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

so that it participates in the dependency resolving process Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…ovided config-sub-dict (instead of a full `DefaultModelConfig`). (ray-project#47965) Signed-off-by: ujjawal-khare <[email protected]>

…amples. (ray-project#47970) Signed-off-by: ujjawal-khare <[email protected]>

…ject#47973) Signed-off-by: ujjawal-khare <[email protected]>

## Why are these changes needed?

Fix `test_pow_2_replica_scheduler.py` on Windows. Best guess is that asyncio is slower on Windows, so the shortened timeouts for some tests cause them to fail because tasks didn't get a chance to start/finish executing.

Failing tests on Windows:

- `test_multiple_queries_with_different_model_ids`
- `test_queue_len_cache_replica_at_capacity_is_probed`
- `test_queue_len_cache_background_probing`

## Related issue number

Closes ray-project#47950

Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…roject#47961)

## Why are these changes needed?

PyArrow infers the Parquet schema from the first file only. This causes errors when reading multiple files with ragged ndarrays. This PR fixes the issue by not using the inferred schema for reading.

## Related issue number

Fixes ray-project#47960

---------

Signed-off-by: Hao Chen <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

Currently, when you call PopWorker(), it finds an idle worker or creates a new one. If a new worker is created, the worker is associated with the request and can only be used by it. This PR decouples worker creation from worker-to-task assignment by adding an abstraction, PopWorkerRequest. With this PR, if a request triggers a worker creation, the request is put into a queue. When a worker becomes ready, that is, when PushWorker is called by either a newly started worker or a released worker, Ray matches it to the first fitting request in the queue. This reduces latency. Later it can also be used to pre-start workers more meaningfully. Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
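
The decoupling described above can be sketched in a few lines. This is an illustrative Python model, not the worker pool implementation: `WorkerPoolSketch`, the `runtime_env_hash` field used as the "fits" criterion, and the callback signature are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Tuple


@dataclass
class PopWorkerRequest:
    """Hypothetical request: what kind of worker is needed, and whom to notify."""
    runtime_env_hash: int
    callback: Callable[[str], None]


class WorkerPoolSketch:
    def __init__(self) -> None:
        self.idle: List[Tuple[str, int]] = []  # (worker_id, runtime_env_hash)
        self.pending: Deque[PopWorkerRequest] = deque()
        self.workers_started = 0

    def pop_worker(self, request: PopWorkerRequest) -> None:
        # Reuse an idle worker if one fits.
        for i, (worker_id, env_hash) in enumerate(self.idle):
            if env_hash == request.runtime_env_hash:
                del self.idle[i]
                request.callback(worker_id)
                return
        # Otherwise queue the request and start a worker. The new worker is
        # NOT bound to this request; whichever fitting worker is pushed first
        # will serve it.
        self.pending.append(request)
        self.workers_started += 1

    def push_worker(self, worker_id: str, env_hash: int) -> None:
        # Called by a newly started worker or by a released worker.
        for i, request in enumerate(self.pending):
            if request.runtime_env_hash == env_hash:
                del self.pending[i]
                request.callback(worker_id)
                return
        self.idle.append((worker_id, env_hash))
```

In this model a queued request can be served by a worker that was released by another task before the newly started worker is ready, which is where the latency reduction would come from.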

This PR adds metrics for job states within the job manager. Specifically, a gauge stat is sent via the opencensus exporter, so that running Ray jobs can be tracked and alerts can be created later on. Fault tolerance is not considered; according to the [doc](https://docs.ray.io/en/latest/ray-core/fault_tolerance/gcs.html), state is reconstructed at restart. As for testing, the best way is to observe the metric via an opencensus backend (e.g. the Google monitoring dashboard), but that is not easy for open-source contributors; the alternative is a mock / fake exporter implementation, which I couldn't find in the code base. Signed-off-by: dentiny <[email protected]> Co-authored-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
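
As an illustration of the fake-exporter testing idea mentioned above, a gauge over job states could be checked roughly as follows. Everything here is hypothetical: `JobState`, `FakeGaugeExporter`, `report_job_states`, and the metric name `job_state_count` are invented for the sketch and do not correspond to names in the Ray code base.

```python
from collections import Counter
from enum import Enum
from typing import Dict, List, Tuple


class JobState(Enum):
    # Hypothetical states; the real job manager may track different ones.
    RUNNING = "running"
    FINISHED = "finished"


class FakeGaugeExporter:
    """In-memory stand-in for a metrics exporter, for tests only."""

    def __init__(self) -> None:
        self.records: List[Tuple[str, int, Dict[str, str]]] = []

    def record(self, name: str, value: int, tags: Dict[str, str]) -> None:
        self.records.append((name, value, tags))


def report_job_states(jobs: Dict[str, JobState], exporter: FakeGaugeExporter) -> None:
    """Emit one gauge sample per state, including states with zero jobs."""
    counts = Counter(jobs.values())
    for state in JobState:
        exporter.record("job_state_count", counts.get(state, 0), {"state": state.name})
```

Emitting a sample for every state, including zero, keeps the gauge from holding a stale non-zero value after the last job in a state goes away.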

to at least 1.66.1; this is already being overridden to 1.66.1+ during release tests. Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

…t#47807) Supports single-file modules in the `py_modules` field of `runtime_env`. Signed-off-by: Chi-Sheng Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

PR 47807 was auto-merged without applying the doc reviews, so this commit addresses them. Signed-off-by: Chi-Sheng Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>

This PR follows up on the comment in ray-project#47793 (comment) and adds a thread check to the GCS job manager callback to make sure there is no concurrent access to data members. Signed-off-by: dentiny <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
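
The idea of a thread check guarding a callback can be sketched as below. This is a Python illustration only; the actual change is in the GCS job manager, and `ThreadChecker`, `JobManagerSketch`, and `on_job_finished` are hypothetical names.

```python
import threading


class ThreadChecker:
    """Remembers the first thread that calls it and flags any other thread."""

    def __init__(self) -> None:
        self._owner = None
        self._lock = threading.Lock()

    def is_on_same_thread(self) -> bool:
        current = threading.get_ident()
        with self._lock:
            if self._owner is None:
                self._owner = current  # first caller becomes the owner
            return self._owner == current


class JobManagerSketch:
    def __init__(self) -> None:
        self._thread_checker = ThreadChecker()
        self.finished_jobs = []  # data member that must not be accessed concurrently

    def on_job_finished(self, job_id: str) -> None:
        # Fail loudly if the callback ever runs on a different thread.
        assert self._thread_checker.is_on_same_thread(), "callback ran on another thread"
        self.finished_jobs.append(job_id)
```

The check does not make the data members thread-safe; it only turns an accidental cross-thread call into an immediate, visible failure.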

Signed-off-by: ujjawal-khare <[email protected]>

…ray into fix/job-manager-logger


Commits on Oct 15, 2024