Fix/job manager logger #48003
Commits on Oct 15, 2024
-
[aDAG] support buffered input (ray-project#47272)
Based on https://docs.google.com/document/d/1Ka_HFwPBNIY1u3kuroHOSZMEQ8AgwpYciZ4n08HJ0Xc/edit. When there are many in-flight requests (pipelining inputs to the DAG), two problems occur. 1. Input submitter timeout: InputSubmitter.write() waits until the buffer is read by downstream tasks. Since the timeout clock starts as soon as InputSubmitter.write() is called, later requests are likely to time out when many requests are in flight. 2. Pipeline bubble: the output fetcher doesn't read the channel until CompiledDagRef.get is called, which means the upstream task (actor 2) is blocked until .get is called from the driver even though it could be executing tasks. This PR solves both problems by providing multiple buffers per shm channel. Note that buffering is not yet supported for nccl (we can add it when we overlap compute/comm). Main changes: introduce BufferedSharedMemoryChannel, which creates multiple buffers (10 by default); reads and writes proceed in a round-robin manner. When there are more in-flight requests than the buffer size, the DAG can still hit a timeout error. To make debugging easy and the behavior straightforward, we introduce a max_buffered_inputs_ argument: if more than max_buffered_inputs_ requests are submitted to the DAG without ray.get, it immediately raises an exception. Signed-off-by: ujjawal-khare <[email protected]>
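The round-robin buffering and the max_buffered_inputs_ guard described above can be sketched in plain Python. This is a toy model with hypothetical names; the real BufferedSharedMemoryChannel wraps serialized shared-memory buffers, not a Python list:

```python
class BufferedChannelSketch:
    """Toy model of a round-robin buffered channel (illustration only)."""

    def __init__(self, num_buffers: int = 10, max_buffered_inputs: int = 10):
        self.buffers = [None] * num_buffers
        self.next_write = 0
        self.next_read = 0
        self.in_flight = 0
        self.max_buffered_inputs = max_buffered_inputs

    def write(self, value):
        # Fail fast instead of timing out when too many inputs are buffered.
        if self.in_flight >= self.max_buffered_inputs:
            raise RuntimeError(
                "too many in-flight inputs; consume outputs (ray.get) first"
            )
        self.buffers[self.next_write] = value
        self.next_write = (self.next_write + 1) % len(self.buffers)
        self.in_flight += 1

    def read(self):
        # Reads proceed in the same round-robin order as writes.
        value = self.buffers[self.next_read]
        self.next_read = (self.next_read + 1) % len(self.buffers)
        self.in_flight -= 1
        return value
```

With more buffers than one, a writer can stay several requests ahead of the reader, which is what removes the pipeline bubble.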
Commit: 42361bc
-
[aDAG] Clean up arg_to_consumers in _get_or_compile() (ray-project#47514)
Clean up the code. Signed-off-by: ujjawal-khare <[email protected]>
Commit: f1e2704
-
[RLlib; Offline RL] Store episodes in state form. (ray-project#47294)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: 8d20388
-
[Core][aDag] Support multi node multi reader (ray-project#47480)
This PR supports multi readers in multi nodes. It also adds tests that the feature works with large gRPC payloads and buffer resizing. Multi readers in multi node didn't work because the code only allowed registering one remote reader reference on one specific node. This fixes the issue by allowing remote reader references to be registered on multiple nodes. Signed-off-by: ujjawal-khare <[email protected]>
Commit: 6625ee2
-
Allow control of some serve configuration via env vars (ray-project#47533)
When a serve app is launched, Serve starts up automatically. In certain places like k8s, it can be difficult to preconfigure Serve (e.g. in the [ray-cluster helm chart](https://github.com/ray-project/kuberay/blob/master/helm-chart/ray-cluster/values.yaml) there is no way to set the default Serve arguments). This means you either need to be explicit when you start Serve, or, if it starts up automatically, you may need to shut it down and then restart it, which is inconvenient. Signed-off-by: Tim Paine <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
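The pattern this change enables can be sketched as environment variables overriding hardcoded defaults. The variable names below are hypothetical placeholders for illustration, not necessarily the ones the PR introduced; the point is that a platform like KubeRay can preconfigure Serve without restarting it:

```python
import os

def default_http_options() -> dict:
    """Build default HTTP options, letting env vars override the defaults.

    RAY_SERVE_DEFAULT_HTTP_HOST / RAY_SERVE_DEFAULT_HTTP_PORT are
    illustrative names, not confirmed Serve configuration knobs.
    """
    return {
        "host": os.environ.get("RAY_SERVE_DEFAULT_HTTP_HOST", "127.0.0.1"),
        "port": int(os.environ.get("RAY_SERVE_DEFAULT_HTTP_PORT", "8000")),
    }
```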
Commit: 290a14a
-
Update incremental build troubleshooting tip with style nits (ray-project#47592)
Style nits. Signed-off-by: angelinalg <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: a0430bb
-
[observability][export-api] Write driver job events (ray-project#47418)
Write Driver Job events to file as part of the export API. This logic only runs if RayConfig::instance().enable_export_api_write() is true; the default value is false. The event write is called whenever a job table data value is modified. Typically this occurs before writing JobTableData to the GCS table. Signed-off-by: ujjawal-khare <[email protected]>
Commit: a6a63e2
-
[core][dashboard] push down job_or_submission_id to GCS. (ray-project#47492)
The GCS API GetAllJobInfo serves Dashboard APIs, even when only one job is needed. This becomes slow when the number of jobs is high. This PR pushes the job filter down to GCS to reduce Dashboard workload. The API is somewhat unusual because the filter `job_or_submission_id` is either a Job ID or a job_submission_id. We don't have an index on the latter, and some jobs don't have one, so we still GetAll from Redis, then filter by both IDs before making more RPC calls. Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: Jiajun Yao <[email protected]> Co-authored-by: Jiajun Yao <[email protected]> Co-authored-by: Alexey Kudinkin <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 6e790d9
-
[Doc][KubeRay] Add description tables for RayCluster Status in the observability doc (ray-project#47462)
Signed-off-by: Rueian <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: e591c40
-
[aDAG] Fix ranks ordering for custom NCCL group (ray-project#47594)
The ranks should be in the order of the actors. Signed-off-by: ujjawal-khare <[email protected]>
Commit: 87519fa
-
[RLlib] RLModule: InferenceOnlyAPI. (ray-project#47572)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: 47d9b0d
-
[Data] Remove _default_metadata_providers (ray-project#47575)
_default_metadata_providers adds a layer of indirection. Signed-off-by: Balaji Veeramani <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 747b6f5
-
[Serve] Remove unused Serve constants (ray-project#47593)
Went through all the constants in the file and removed the ones that are no longer used. Signed-off-by: Gene Su <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 15132d5
-
Fix windows://:task_event_buffer_test (ray-project#47577)
Move TestWriteTaskExportEvents to a separate file and skip it on Windows. This is OK for the export API feature because we currently don't support it on Windows (tests for other resource events written from GCS are also skipped on Windows). This test is failing in postmerge (CI test windows://:task_event_buffer_test is consistently_failing, ray-project#47523) on Windows due to `unknown file: error: C++ exception with description "remove_all: The process cannot access the file because it is being used by another process.: "event_123"" thrown in TearDown().` in the tear-down step. This is the same error raised by other tests that clean up created directories with remove_all() on Windows (e.g. //src/ray/util/tests:event_test); those tests are also skipped on Windows. Signed-off-by: Nikita Vemuri <[email protected]> Co-authored-by: Nikita Vemuri <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: e038cb0
-
[RLlib] RLModule API: SelfSupervisedLossAPI for RLModules that bring their own loss (algo independent). (ray-project#47581)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: 644874d
-
[GCS] Optimize GetAllJobInfo API for performance (ray-project#47530)
Signed-off-by: liuxsh9 <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: b1c7caa
-
[Serve] fix default serve logger behavior (ray-project#47600)
Re: ray-project#47229. The previous PR to set up a default Serve logger had an unexpected consequence: combined with Serve's stdout redirect feature (when `RAY_SERVE_LOG_TO_STDERR=0` is set in the env), it set up the default Serve logger and redirected all stdout/stderr into Serve's log files instead of the console. On the Anyscale platform this made it impossible to tell whether the ray start command was running successfully, and the cluster could not start. This PR fixes the behavior by configuring Serve's default logger with only a stream handler and skipping the file handler altogether. Signed-off-by: Gene Su <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
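The "stream handler only, no file handler" shape of the fix can be sketched with the standard logging module. The helper and logger names here are illustrative, not Serve's actual internals:

```python
import logging
import sys

def configure_default_logger_sketch(name: str = "serve_sketch") -> logging.Logger:
    """Attach only a stream handler so console output is never redirected
    into a log file (the failure mode the commit describes)."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(levelname)s %(asctime)s %(name)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```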
Commit: 4b38d57
-
[core] Make is_gpu, is_actor, root_detached_id fields late bind to workers. (ray-project#47212)
Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: Jiajun Yao <[email protected]> Co-authored-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 102ec9d
-
[core][adag] Separate the outputs of execute and execute_async to multiple refs or futures to allow clients to retrieve them one at a time (ray-project#46908) (ray-project#47305)
## Why are these changes needed? Currently, if `MultiOutputNode` is used to wrap a DAG's output, you get back a single `CompiledDAGRef` or `CompiledDAGFuture` (depending on whether `execute` or `execute_async` is invoked) that points to a list of all of the outputs. To retrieve one of the outputs, you have to get and deserialize all of them at the same time. This PR separates the output of `execute` and `execute_async` into a list of `CompiledDAGRef` or `CompiledDAGFuture` objects when the output is wrapped by `MultiOutputNode`. This is particularly useful for vLLM tensor parallelism: since all shards return the same results, we only need to fetch the result from one of the workers. Closes ray-project#46908. Signed-off-by: jeffreyjeffreywang <[email protected]> Signed-off-by: Jeffrey Wang <[email protected]> Co-authored-by: jeffreyjeffreywang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
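The before/after difference can be modeled with plain futures, no Ray required (names hypothetical): previously `execute` returned one handle resolving to the whole output list; after this change it returns one handle per output, so a caller can fetch a single shard without deserializing the rest.

```python
from concurrent.futures import ThreadPoolExecutor

def execute_split(pool: ThreadPoolExecutor, shard_fns, dag_input):
    """Return one future per output, mirroring the list of refs
    a MultiOutputNode-wrapped DAG now produces."""
    return [pool.submit(fn, dag_input) for fn in shard_fns]
```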
Commit: c47c430
-
[serve] Faster detection of dead replicas (ray-project#47237)
## Why are these changes needed? Detect replica death earlier on handles/routers. Currently routers process replica death if the actor death error is thrown during active probing or a system message. 1. Cover one more case: process replica death if the error is thrown _while_ the request was being processed on the replica. 2. Improved handling: if the error is detected on the system message, meaning the router found out the replica is dead after assigning a request to it, retry the request. ### Performance evaluation (master results pulled from https://buildkite.com/ray-project/release/builds/21404#01917375-2b1e-4cba-9380-24e557a42a42) Latency:

| metric | master | this PR | % change |
| -- | -- | -- | -- |
| http_p50_latency | 3.9672044999932154 | 3.9794859999986443 | 0.31 |
| http_1mb_p50_latency | 4.283115999996312 | 4.1375990000034335 | -3.4 |
| http_10mb_p50_latency | 8.212248500001351 | 8.056774499998198 | -1.89 |
| grpc_p50_latency | 2.889802499964844 | 2.845889500008525 | -1.52 |
| grpc_1mb_p50_latency | 6.320479999999407 | 9.85005449996379 | 55.84 |
| grpc_10mb_p50_latency | 92.12763850001693 | 106.14903449999247 | 15.22 |
| handle_p50_latency | 1.7775379999420693 | 1.6373455000575632 | -7.89 |
| handle_1mb_p50_latency | 2.797253500034458 | 2.7225929999303844 | -2.67 |
| handle_10mb_p50_latency | 11.619127000017215 | 11.39100950001648 | -1.96 |

Throughput:

| metric | master | this PR | % change |
| -- | -- | -- | -- |
| http_avg_rps | 359.14 | 357.81 | -0.37 |
| http_100_max_ongoing_requests_avg_rps | 507.21 | 515.71 | 1.68 |
| grpc_avg_rps | 506.16 | 485.92 | -4.0 |
| grpc_100_max_ongoing_requests_avg_rps | 506.13 | 486.47 | -3.88 |
| handle_avg_rps | 604.52 | 641.66 | 6.14 |
| handle_100_max_ongoing_requests_avg_rps | 1003.45 | 1039.15 | 3.56 |

Results: everything except the grpc results is within noise. The grpc results have always been relatively noisy (see below), so these numbers are also within the noise we've been seeing. There is also no reason why latency would increase only for grpc and not http or handle with the changes in this PR, so IMO this is safe. ![Screenshot 2024-08-21 at 11 54 55 AM](https://github.com/user-attachments/assets/6c7caa40-ae3c-417b-a5bf-332e2d6ca378) ## Related issue number closes ray-project#47219 Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 21379f3
-
[spark] Improve Ray-on-spark fault tolerance in case of Spark executor being down (e.g. spot instance termination) (ray-project#47493)
Signed-off-by: Weichen Xu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 591a4d0
-
[serve] skip failure test on windows (ray-project#47630)
Skip test_replica_actor_died on windows. Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 60d78b1
-
[serve] reorganize replica scheduler classes (ray-project#47615)
## Why are these changes needed? Pull replica scheduler and replica wrapper out from `common.py` into their own files. Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 0f9fa48
-
[Core] Remove code that accidentally got in (ray-project#47612)
Not sure how this was generated. Signed-off-by: ujjawal-khare <[email protected]>
Commit: 3c2b92c
-
[Core][aDAG] support multi readers in multi node when dag is created from an actor (ray-project#47601)
Currently, when a DAG is created from an actor, we use a different mechanism than from a driver: in a driver we create a ProxyActor, whereas in an actor we just use the actor itself. This inconsistent mechanism is error-prone. For example, when supporting multi readers in multi node, a deadlock occurs because the driver actor needs to call ray.get(a.allocate_channel.remote()) for a downstream actor while the downstream actor calls ray.get(driver_actor.create_ref.remote()). This PR fixes the issue by making ProxyActor the default mechanism even when a DAG is created inside an actor. Signed-off-by: ujjawal-khare <[email protected]>
Commit: 35fe4ba
-
[core] out of band serialization exception (ray-project#47544)
Introduce an env var to raise an exception when there is out-of-band serialization of an object ref, and improve the error message for this issue. There are two types of issues: 1. cloudpickle.dumps(ref); 2. implicit capture of a ref in a closure. Also update an anti-pattern doc. Signed-off-by: ujjawal-khare <[email protected]>
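The guard mechanism can be sketched with pickle's `__reduce__` hook. The class and env var names below are illustrative stand-ins, not Ray's actual identifiers:

```python
import os
import pickle

class ObjectRefSketch:
    """Toy stand-in for an object ref, illustrating an env-var-gated guard
    against out-of-band serialization."""

    def __reduce__(self):
        # Serializing a ref outside of Ray's control (cloudpickle.dumps(ref),
        # or implicitly capturing a ref in a closure) can leak the underlying
        # object. With the guard enabled, fail loudly instead.
        if os.environ.get("ALLOW_OUT_OF_BAND_REF_SERIALIZATION", "1") != "1":
            raise RuntimeError(
                "out-of-band serialization of an object ref is disabled; "
                "pass the ref as a task argument instead"
            )
        return (ObjectRefSketch, ())
```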
Commit: 0af4ca7
-
[core][experimental] Allocate a channel for each InputAttributeNode (ray-project#47564)
Change 1: Remove class DAGInputAdapter. Without this PR, the entire input data is written to the channel even if a reader only wants to retrieve partial input data via InputAttributeNode; the entire input is then read by the READ operation, and the partial input is retrieved during the COMPUTE operation. In this PR, each InputAttributeNode has its own channel, and only the corresponding input data is written to it, so we no longer need DAGInputAdapter to retrieve partial input data during COMPUTE. Change 2: If the DAG contains any InputAttributeNode, create a channel for each InputAttributeNode, then write the partial input data to the corresponding channel. Change 3: Unify the if/else logic that handled InputNode and InputAttributeNode when creating CachedChannel, since InputNode and the different InputAttributeNodes are no longer considered consumers of a single input channel; each InputAttributeNode has its own channel. Change 4: Move RayDAGArgs from compiled_dag_node.py to common.py to avoid importing it inside _adapt. Without this, the PR is about 5% slower than the baseline in the "single actor, no InputAttributeNode" benchmark; with it, performance is the same as, or slightly better than, the baseline. Signed-off-by: ujjawal-khare <[email protected]>
Commit: ebb984e
-
[Data] Add partitioning parameter to read_parquet (ray-project#47553)
To extract path partition information with `read_parquet`, you pass a PyArrow `partitioning` object to `dataset_kwargs`. For example:
```
schema = pa.schema([("one", pa.int32()), ("two", pa.string())])
partitioning = pa.dataset.partitioning(schema, flavor="hive")
ds = ray.data.read_parquet(... dataset_kwargs=dict(partitioning=partitioning))
```
This is problematic for two reasons: 1. It tightly couples the interface with the implementation; partitioning only works if we use `pyarrow.Dataset` in a specific way in the implementation. 2. It's inconsistent with all of the other file-based APIs. All other APIs expose a top-level `partitioning` parameter (rather than `dataset_kwargs`) where you pass a Ray Data `Partitioning` object (rather than a PyArrow partitioning object). Signed-off-by: Balaji Veeramani <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
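The hive-flavored layout both APIs describe is just `key=value` path segments. A minimal pure-Python sketch of how such partition fields are recovered from a file path (no Ray or PyArrow required):

```python
def parse_hive_partitions(path: str) -> dict:
    """Extract hive-style partition fields (key=value path segments).

    All values come back as strings; a schema (like the pa.schema above)
    is what lets a reader cast them to typed columns.
    """
    fields = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            fields[key] = value
    return fields
```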
Commit: 804b4f3
-
[spark] Refine comment in Starting ray worker spark task (ray-project#47670)
Signed-off-by: Weichen Xu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 96175fb
-
[Core][aDAG] Set buffer size to 1 for regression (ray-project#47639)
There's a regression with buffer size 10. I am going to investigate, but I will revert to buffer size 1 for now until further investigation. With buffer size 1, the regression seems to be gone: https://buildkite.com/ray-project/release/builds/22594#0191ed4b-5477-45ff-be9e-6e098b5fbb3c. Probably some sort of contention or similar. Signed-off-by: ujjawal-khare <[email protected]>
Commit: 2a7679d
-
Add perf metrics for 2.36.0 (ray-project#47574)
```
REGRESSION 12.66%: single_client_get_object_containing_10k_refs (THROUGHPUT) regresses from 13.204885454613315 to 11.533423619760748 in microbenchmark.json
REGRESSION 9.50%: client__1_1_actor_calls_sync (THROUGHPUT) regresses from 523.3469473257671 to 473.62862729568997 in microbenchmark.json
REGRESSION 6.76%: multi_client_put_gigabytes (THROUGHPUT) regresses from 45.440179854469804 to 42.368678421213005 in microbenchmark.json
REGRESSION 4.92%: 1_n_actor_calls_async (THROUGHPUT) regresses from 8803.178389859915 to 8370.014425096557 in microbenchmark.json
REGRESSION 3.89%: n_n_actor_calls_with_arg_async (THROUGHPUT) regresses from 2748.863962184806 to 2641.837605625889 in microbenchmark.json
REGRESSION 3.45%: client__1_1_actor_calls_async (THROUGHPUT) regresses from 1019.3028285821217 to 984.156036006501 in microbenchmark.json
REGRESSION 3.06%: client__1_1_actor_calls_concurrent (THROUGHPUT) regresses from 1007.6444648899972 to 976.8103650114274 in microbenchmark.json
REGRESSION 0.65%: placement_group_create/removal (THROUGHPUT) regresses from 805.1759941825478 to 799.9345402492929 in microbenchmark.json
REGRESSION 0.33%: single_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 5273.203424794718 to 5255.898134426729 in microbenchmark.json
REGRESSION 0.02%: 1_1_actor_calls_async (THROUGHPUT) regresses from 9012.880467992636 to 9011.034048587637 in microbenchmark.json
REGRESSION 0.01%: client__put_gigabytes (THROUGHPUT) regresses from 0.13947664668408546 to 0.13945791828216536 in microbenchmark.json
REGRESSION 0.00%: client__put_calls (THROUGHPUT) regresses from 806.1974515278531 to 806.172478450918 in microbenchmark.json
REGRESSION 70.55%: dashboard_p50_latency_ms (LATENCY) regresses from 104.211 to 177.731 in benchmarks/many_actors.json
REGRESSION 13.13%: time_to_broadcast_1073741824_bytes_to_50_nodes (LATENCY) regresses from 18.961532712000007 to 21.451945214000006 in scalability/object_store.json
REGRESSION 4.50%: 3000_returns_time (LATENCY) regresses from 5.680022101000006 to 5.935367576000004 in scalability/single_node.json
REGRESSION 3.96%: avg_iteration_time (LATENCY) regresses from 0.9740754842758179 to 1.012664566040039 in stress_tests/stress_test_dead_actors.json
REGRESSION 2.75%: stage_2_avg_iteration_time (LATENCY) regresses from 63.694758081436156 to 65.44879236221314 in stress_tests/stress_test_many_tasks.json
REGRESSION 1.66%: 10000_args_time (LATENCY) regresses from 17.328640389999997 to 17.61703060299999 in scalability/single_node.json
REGRESSION 1.40%: stage_4_spread (LATENCY) regresses from 0.45063567085147194 to 0.4569625792772166 in stress_tests/stress_test_many_tasks.json
REGRESSION 0.69%: dashboard_p50_latency_ms (LATENCY) regresses from 3.347 to 3.37 in benchmarks/many_pgs.json
REGRESSION 0.19%: 10000_get_time (LATENCY) regresses from 23.896780481999997 to 23.942006032999984 in scalability/single_node.json
```
Signed-off-by: kevin <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: de9be8f
-
[RLlib] Add "shuffle batch per epoch" option. (ray-project#47458)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: 3af892c
-
[RLlib; Offline RL] Enable buffering episodes. (ray-project#47501)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: d738010
-
[Core] Make JobSupervisor logs structured (ray-project#47699)
Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: ca4be70
-
[serve] wrap obj ref in result wrapper in deployment response (ray-project#47655)
## Why are these changes needed? Abstract `ray.ObjectRef` and `ray.ObjectRefGenerator` in a result wrapper that the deployment response can directly call into. Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 73b528b
-
[Core] Fix broken dashboard worker page (ray-project#47714)
Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 9dbbe38
-
[core][experimental] Remove unused attr CompiledDAG._type_hints (ray-project#47706)
CompiledDAG._type_hints is not used. Signed-off-by: ujjawal-khare <[email protected]>
Commit: c1bdd25
-
[Data] Re-phrase the streaming executor current usage string (ray-project#47515)
## Why are these changes needed? The progress bar for Ray Data could still end up showing higher utilization than what the cluster currently has. ray-project#46729 was the first attempt to fix this and addressed the issue for static clusters, but the issue remains for clusters that autoscale. This change simply rephrases the string so it is less confusing. Before <img width="1249" alt="image" src="https://github.com/user-attachments/assets/049ea096-a87f-4767-ba04-6d00d7c2755d"> After <img width="1248" alt="image" src="https://github.com/user-attachments/assets/cb74c0dc-1f33-4b22-b31c-e83df2a5d408"> This comes from the fact that operators don't track task state (and Ray Core does not currently provide that API), so Ray Data operators do not know whether a task is assigned to a node; once a task is submitted to Ray, it is marked active even if it is pending node assignment. The dashboard does better here since it has extra information about the task. <img width="1493" alt="image" src="https://github.com/user-attachments/assets/9315b884-3e61-4b32-8400-7f76e15b6a4b"> In the future we can look at adding a core API for remote state reporting and allowing operators to report more detailed state (active, pending_scheduled, pending_node_assignment). Signed-off-by: Sofian Hnaide <[email protected]> Co-authored-by: scottjlee <[email protected]> Co-authored-by: matthewdeng <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: f966d2e
-
[serve] improve tests (ray-project#47722)
## Why are these changes needed? - We can make some tests asynchronous instead of having to rely on `_to_object_ref`. - we can use `RayActorError` instead of `ActorDiedError` Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 3f439f8
-
[Core] Add test case where there is dead node for /nodes?view=summary endpoint (ray-project#47727)
Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 9e28fb7
-
[Dashboard] Optimizing performance of Ray Dashboard (ray-project#47617)
Signed-off-by: Alexey Kudinkin <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 0426ee4
-
[core][aDAG] Fix a bug where multi arg + exception doesn't work (ray-project#47704)
Currently, when there's an exception, there is only one return value, but multi-ref assumes the number of return values matches the number of output channels. This fixes the issue by duplicating the exception to match the number of output channels. Signed-off-by: ujjawal-khare <[email protected]>
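The duplication logic is simple enough to sketch in plain Python (hypothetical helper name, illustrating the fix rather than the actual aDAG code):

```python
def split_across_channels(result, num_channels: int, is_exception: bool) -> list:
    """A normal result carries one value per output channel, but an
    exception is a single value, so replicate it to every channel."""
    if is_exception:
        return [result] * num_channels  # same exception behind every ref
    assert len(result) == num_channels
    return list(result)
```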
Commit: dd8ee01
-
[fake autoscaler] use check_call in fake multi node test utils (ray-project#47772)
So that output is printed to logs; also use `sys.executable` rather than "python". Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 361c10e
-
[RLlib] RLModule: Simplify defining custom distribution classes and add better defaults. (ray-project#47775)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: 605c640
-
[fake autoscaler] remove the redundant mkdir (ray-project#47786)
- docker compose's service volume short syntax uses a bind mount (similar to `-v`) and will create the directory if it does not exist
- the code was not mapping the directory to a host path, so it had no meaningful effect when running in a container, such as on CI
Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: aaa3d8d
[Data] Simplify and consolidate progress bar outputs (ray-project#47692)
## Why are these changes needed?
Currently, the progress bar is pretty verbose because it is very information dense. This PR:
- Reorganizes progress output to group by relevant concepts and clarifies labels
- Standardizes global and operator-level progress bar outputs
- Removes the use of all emojis (poor rendering on some platforms / external logging systems)

[Screenshot: progress bar before this PR]
[Screenshot: progress bar after this PR]

Will follow up with a docs PR once we merge this change, so that I don't need to continuously modify the docs. In the future, we should restructure the way progress bars are grouped/tracked, so that we can tabulate the op-level progress bar outputs.

## Checks
- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

Signed-off-by: Scott Lee <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: c7ff9c8
Add perf metrics for 2.37.0 (ray-project#47791)
for release perf checking. Signed-off-by: kevin <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: 6aae543
[Serve] add dependencies on openssl (ray-project#47738)
Add `pyOpenSSL` dependency for Serve. And update test docker file to use ray[serve-grpc] dependencies. Signed-off-by: Gene Su <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: bc7f7b0
[docker] Update latest Docker dependencies for 2.36.0 release (ray-project#47748)
Created by release automation bot. Update with commit f298a75
Signed-off-by: kevin <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: 247be0b
[docker] Update latest Docker dependencies for 2.36.1 release (ray-project#47801)
Created by release automation bot. Update with commit 18b2d94
Signed-off-by: kevin <[email protected]> Signed-off-by: Kevin H. Luu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: d4e7f7f
[observability][export-api] Write submission job events (ray-project#47468)
Add ExportEventLoggerAdapter, which will be used to write export events to file from Python code. Only a single ExportEventLoggerAdapter instance will exist per source type, so callers can create or get this instance using get_export_event_logger, which is thread safe. Write Submission Job export events to file from JobInfoStorageClient.put_info, which is called to update the JobInfo data in the internal KV store.
Signed-off-by: ujjawal-khare <[email protected]>
SHA: f994475
Move export events to separate folder (ray-project#47747)
Move export events from session_latest/logs/events to session_latest/logs/export_events. Keeping both event types in the same folder doesn't cause any issue for Ray -- export event files are already filtered out for the /events API (see `all_source_types = set(event_consts.EVENT_SOURCE_ALL)` in ray/python/ray/dashboard/modules/event/event_utils.py, line 22 at 1e48a03). However, moving these to a separate folder is better for existing downstream consumers, to avoid handling export events in the events folder if they turn the flag on.
Signed-off-by: ujjawal-khare <[email protected]>
SHA: c8f16e3
[release] stream the full anyscale log to buildkite (ray-project#47808)
Currently we only print the last 100 lines of the Anyscale job log to Buildkite. This PR removes that limit and prints everything instead. CC: @kouroshHakha
Test:
- CI
Signed-off-by: can <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: 1b8fcac
[RLlib; Offline RL] Offline performance cleanup. (ray-project#47731)
Signed-off-by: ujjawal-khare <[email protected]>
SHA: 575ee94
[docker] Update latest Docker dependencies for 2.37.0 release (ray-project#47812)
Created by release automation bot. Update with commit d2982b7
Signed-off-by: kevin <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: 1d25a39
[RLlib] Fix action masking example. (ray-project#47817)
Signed-off-by: ujjawal-khare <[email protected]>
SHA: ce75400
[Core] Separate the attempt_number from the task_status in memory summary and object list (ray-project#47818)
# Current status:
* When we retrieve the information from GCS, the task_status and the attempts are in 2 fields, and the task status is an enum.
* Later during reconstruction, the 2 fields are combined into 1 and the number of attempts is added to the task_status field.
* That's why, when displaying the objects, the function isn't able to convert the string back to the enum.
# Proposed solution:
* Instead of combining the 2 fields (task_status and attempt), we will keep the 2 fields and add an additional field (attempt_number) in the Object State.
* In this way, we keep the task_status as an enum and put the attempt number information in a different field.
# Changes in this PR:
* Added the `attempt_number` in `ObjectState` and `task_attempt_number_counts` in `ObjectSummaryPerKey`
* Added logic to populate the fields as proposed above
* Updated the logic for the memory summary function to display the attempt number in a new column
* Corresponding tests added as well
Signed-off-by: Mengjin Yan <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: 788db07
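The separation proposed in the commit above can be illustrated with a small sketch. The dict-based object records and `summarize_objects` are hypothetical simplifications of `ObjectState`/`ObjectSummaryPerKey`, used only to show why keeping the attempt number in its own field helps.

```python
from collections import defaultdict

def summarize_objects(objects):
    """Aggregate object records per (task_status, attempt_number).

    Because attempt_number lives in its own field, task_status stays a
    clean enum-like value instead of a combined "STATUS|attempt" string
    that can no longer be parsed back into an enum.
    """
    counts = defaultdict(int)
    for obj in objects:
        counts[(obj["task_status"], obj["attempt_number"])] += 1
    return dict(counts)
```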
[RLlib; docs] New API stack migration guide. (ray-project#47779)
Signed-off-by: ujjawal-khare <[email protected]>
SHA: 55397ea
[RLlib; new API stack by default] Switch on new API stack by default for SAC and DQN. (ray-project#47217)
Signed-off-by: ujjawal-khare <[email protected]>
SHA: 27985d4
[Core] Fix a Typo in dict_to_state function parameter name (ray-project#47822)
Signed-off-by: Mengjin Yan <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: 90742fb
[core] Introducing InstrumentedIOContextWithThread. (ray-project#47831)
Previously we had several ad-hoc places that implement a "thread and io_context" pattern: create a thread dedicated to an asio io_context, onto which workloads can post async tasks. This duplicated code: everywhere we create threads, we implement stop and join. This introduces InstrumentedIOContextWithThread, which does exactly this and replaces the existing usages. Also fixes some absl::Time computations with best practice. This is refactoring and should have no runtime difference.
Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: 1383374
[RLlib] Discontinue support for "hybrid" API stack (using RLModule + Learner, but still on RolloutWorker and Policy) (ray-project#46085)
Signed-off-by: ujjawal-khare <[email protected]>
SHA: 417cdd2
[Core] Fix object reconstruction hang on arguments pending creation (ray-project#47645)
Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: 6c160b3
[core][experimental] Fix test_execution_schedule_gpu (ray-project#47753)
Pass a GPU tensor to execute, but it gets converted into a CPU tensor. The issue may be related to ray-project#46440. Signed-off-by: ujjawal-khare <[email protected]>
SHA: 3714afd
[core] Change many Ray ID logs to WithField. (ray-project#47844)
Use structured logging by changing more `<< node_id` uses to `.WithField(node_id)`. This is not intended to be complete work, but it should cover most of the cases. We did the work for NodeID, WorkerID, ActorID, JobID, TaskID, PlacementGroupID. Some logs have multiple IDs; to avoid confusion, for these we only use WithField(object_id) and don't use WithField on either of the node IDs. This PR should have no change on Ray other than logs.
Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: 256c177
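The commit above is C++, but the `.WithField(...)` idea has a close Python analogue: attach the ID as a structured attribute on the log record rather than interpolating it into the message text. The `with_field` helper below is an invented name for illustration, built on the standard library's `logging.LoggerAdapter`.

```python
import logging

def with_field(logger, **fields):
    # Every record emitted through the returned adapter carries `fields`
    # as attributes on the LogRecord, so log pipelines can filter on
    # e.g. node_id without parsing the message string.
    return logging.LoggerAdapter(logger, extra=dict(fields))

log = logging.getLogger("raylet")
```

Usage: `with_field(log, node_id="abc123").info("marking node as dead")` emits a record whose `node_id` attribute is `"abc123"`.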
[RLlib] Cleanup examples folder (vol 30): BC pretraining, then PPO finetuning (new API stack with RLModule checkpoints). (ray-project#47838)
Signed-off-by: ujjawal-khare <[email protected]>
SHA: a4f62b6
[RLlib] MultiAgentEnv API enhancements (related to defining obs-/action spaces for agents). (ray-project#47830)
Signed-off-by: ujjawal-khare <[email protected]>
SHA: 1ac860f
[RLlib] Add log-std clipping to 'MLPHead's. (ray-project#47827)
Signed-off-by: ujjawal-khare <[email protected]>
SHA: 43a8a1d
[RLlib] Update autoregressive actions example. (ray-project#47829)
Signed-off-by: ujjawal-khare <[email protected]>
SHA: cfbda91
[kuberay] Update docs for KubeRay v1.2.2 (ray-project#47867)
Change the KubeRay Helm chart and branch reference versions to v1.2.2.
Signed-off-by: ujjawal-khare <[email protected]>
SHA: 71bb74b
[Arrow] Adding `ArrowTensorTypeV2` to support tensors larger than 2Gb (ray-project#47832)
Currently, when using tensor type in Ray Data, if a single tensor in a block grows above 2Gb (due to the use of signed `int32` as offsets), this results in the following issue:
```
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
```
Consequently, this change adds support for tensors of > 4Gb in size, while maintaining compatibility with existing datasets already using tensors. This is done by forking off `ArrowTensorType` in 2:
- `ArrowTensorType` (v1) remaining intact
- `ArrowTensorTypeV2` is rebased on Arrow's `LargeListType` as well as now using `int64` offsets
Signed-off-by: Peter Wang <[email protected]> Signed-off-by: Alexey Kudinkin <[email protected]> Co-authored-by: Peter Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: cd61cb3
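The 2Gb limit in the commit above comes from signed 32-bit offsets capping the cumulative byte length of a variable-size column. A small sketch of the arithmetic (the helper name is invented; this is not pyarrow API):

```python
INT32_MAX = 2**31 - 1  # max cumulative byte offset with signed int32 offsets

def needs_large_offsets(chunk_byte_sizes):
    """Return True if concatenating these chunks would overflow int32
    offsets, i.e. when an int64-offset ("large") layout is required."""
    total = 0
    for size in chunk_byte_sizes:
        total += size
        if total > INT32_MAX:
            return True
    return False
```

This is why a type rebased on `LargeListType` with `int64` offsets removes the cap: the running total can no longer overflow at ~2 GiB.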
[Core] Fix check failure: sync_reactors_.find(reactor->GetRemoteNodeID()) == sync_reactors_.end() (ray-project#47861)
Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: 21af246
[RLlib] New API stack: (Multi)RLModule overhaul vol 01 (some preparatory cleanups). (ray-project#47884)
Signed-off-by: ujjawal-khare <[email protected]>
SHA: 072d349
[RLlib] New API stack: (Multi)RLModule overhaul vol 02 (VPG RLModule, Algo, and Learner example classes). (ray-project#47885)
Signed-off-by: ujjawal-khare <[email protected]>
SHA: 383d7ff
[RLlib] New API stack: (Multi)RLModule overhaul vol 03 (Introduce generic `_forward` to further simplify the user experience). (ray-project#47889)
Signed-off-by: ujjawal-khare <[email protected]>
SHA: e4401e5
[RLlib] Remove Tf support on new API stack for PPO/IMPALA/APPO (only DreamerV3 on new API stack remains with tf now). (ray-project#47892)
Signed-off-by: ujjawal-khare <[email protected]>
SHA: 759b0c8
[core] Change debug_string from returning a string to streaming to an ostream. (ray-project#47893)
We have a convenience function `debug_string` used in Ray logs: it prints printables (operator<<), containers, and pairs. However, it returns a std::string which is fed into RAY_LOG(), and this makes a copy. This changes the signature to return a `DebugStringWrapper`, which holds a const reference to the argument and is printable for all already supported types. Additionally supports std::tuple. This should only have marginal perf benefits, since we typically don't debug_string a very big data structure.
Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: b50f7c1
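The wrapper idea in the commit above translates naturally to Python: hold a reference and defer formatting until the log line is actually rendered. This is a hypothetical analogue, not Ray's C++ `DebugStringWrapper`.

```python
class DebugStringWrapper:
    """Hold a reference to the argument and format it only when the value
    is actually rendered, instead of eagerly building (and copying) a
    string at the call site."""

    def __init__(self, obj):
        self._obj = obj  # reference only; no formatting, no copy yet

    def __str__(self):
        return repr(self._obj)  # formatting happens here, on demand

def debug_string(obj):
    return DebugStringWrapper(obj)
```

If the log statement is filtered out by level, `__str__` never runs, which is exactly the saving the C++ change aims for.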
[Serve / Jobs] Check if conda env exists before removing (ray-project#47922)
## Why are these changes needed?
Fixes some failing/flaky unit tests, which fail with errors like:
```
EnvironmentLocationNotFound: Not a conda environment: /opt/miniconda/envs/jobs-backwards-compatibility-cc452d926b8748a1ab6b4fbf6a6dba2b
```
- TestBackwardsCompatibility.test_cli
- test_failed_driver_exit_code

Previously failing test now passes with this PR applied: https://buildkite.com/ray-project/postmerge/builds/6479#0192693b-1b8f-4dbc-a497-26d163b52c70/181-934

## Checks
- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [x] Unit tests

Signed-off-by: Scott Lee <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: 8844a78
[job] don't continue on test setup (ray-project#47927)
When the conda env exists, just remove it and continue the testing.
Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: 122b382
[core][experimental] Avoid false positives in deadlock detection (ray-project#47912)
Signed-off-by: Kai-Hsun Chen <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: b16a782
[serve] Stop scheduling task early when requests have been cancelled (ray-project#47847)
In `fulfill_pending_requests`, there are two nested loops:
- the outer loop greedily fulfills more requests, so that if backoff doesn't occur, it's not necessary for new asyncio tasks to be started to fulfill each request
- the inner loop handles backoff if replicas can't be found to fulfill the next request

The outer loop is stopped if there are enough tasks to handle all pending requests. However, if all replicas are at max capacity, it's possible for the inner loop to continue looping even when the task is no longer needed (e.g. when a request has been cancelled), because the inner loop simply continues to try to find an available replica without checking if the current task is even necessary. This PR makes sure that at the end of each iteration of the inner loop, it clears out requests in `pending_requests_to_fulfill` that have been cancelled, and then breaks out of the loop if there are enough tasks to handle the remaining requests.

Tests:
- Added a test for the scenario where a request is cancelled while the scheduler is trying to find an available replica
- Also modified the tests in `test_pow_2_scheduler.py` so that the backoff sequence uses small values (1ms) and the timeouts in the tests are also low (10ms), so that the unit tests run much faster (~5s now compared to ~30s before)

Related issue number: ray-project#47585
Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: d360d45
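The inner-loop fix described above can be sketched in miniature. `PendingRequest` and `prune_and_should_stop` are hypothetical simplifications of the Serve scheduler's actual data structures:

```python
from collections import deque

class PendingRequest:
    def __init__(self):
        self.cancelled = False

def prune_and_should_stop(pending, num_scheduling_tasks):
    """Drop cancelled requests from the queue, then report whether this
    scheduling task is redundant (enough tasks already exist to cover
    the surviving requests) and should break out of its loop."""
    alive = deque(r for r in pending if not r.cancelled)
    pending.clear()
    pending.extend(alive)
    return num_scheduling_tasks >= len(pending)
```

Running this check at the end of each backoff iteration is what prevents a task from spinning on behalf of a request that no longer exists.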
[RLlib] New API stack: (Multi)RLModule overhaul vol 05 (deprecate Specs, SpecDict, TensorSpec). (ray-project#47915)
Signed-off-by: ujjawal-khare <[email protected]>
SHA: b2a8acf
[RLlib; fault-tolerance] Fix spot node preemption problem (RLlib does not catch correct `ObjectLostError`). (ray-project#47940)
Signed-off-by: ujjawal-khare <[email protected]>
SHA: fe2aea0
[RLlib] New API stack: (Multi)RLModule overhaul vol 04 (deprecate RLModuleConfig; cleanups, DefaultModelConfig dataclass). (ray-project#47908)
Signed-off-by: ujjawal-khare <[email protected]>
SHA: 80824d0
[Core] Fix check failure RAY_CHECK(it != current_tasks_.end()); (ray-project#47659)
Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: da339ad
[RLlib] Fix small bug in 'InfiniteLookBackBuffer.get_state/from_state'. (ray-project#47914)
Signed-off-by: ujjawal-khare <[email protected]>
SHA: 315bdf1
[core] Add more debug string types (ray-project#47928)
Followup on ray-project#47893, add more "blessed container types" to debug string function. Signed-off-by: dentiny <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: c5bbfe8
[deps] add grpcio-tools into anyscale dependencies (ray-project#47955)
so that it participates in the dependency resolving process Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: 364ee39
[RLlib] Quick-fix for default RLModules in combination with a user-provided config-sub-dict (instead of a full `DefaultModelConfig`). (ray-project#47965)
Signed-off-by: ujjawal-khare <[email protected]>
SHA: d04f8d3
[RLlib] Cleanup examples folder vol. 25: Remove some old API stack examples. (ray-project#47970)
Signed-off-by: ujjawal-khare <[email protected]>
SHA: ca5d29b
[RLlib] Add framework-check to `MultiRLModule.add_module()`. (ray-project#47973)
Signed-off-by: ujjawal-khare <[email protected]>
SHA: 3c1aa3b
[serve] Fix failing test pow 2 scheduler on windows (ray-project#47975)
## Why are these changes needed?
Fix `test_pow_2_replica_scheduler.py` on windows. Best guess is that asyncio is slower on windows, so the shortened timeouts for some tests cause the tests to fail because tasks didn't get a chance to start/finish executing. Failing tests on windows:
- `test_multiple_queries_with_different_model_ids`
- `test_queue_len_cache_replica_at_capacity_is_probed`
- `test_queue_len_cache_background_probing`

## Related issue number
Closes ray-project#47950
Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: 7585842
[data] fix reading multiple parquet files with ragged ndarrays (ray-project#47961)
## Why are these changes needed?
PyArrow infers the parquet schema based only on the first file. This causes errors when reading multiple files with ragged ndarrays. This PR fixes the issue by not using the inferred schema for reading.

## Related issue number
Fixes ray-project#47960
Signed-off-by: Hao Chen <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: ca871bc
[core] Decouple create worker vs pop worker request. (ray-project#47694)
Now, when you call PopWorker(), it finds an idle worker or creates one. If a new worker is created, the worker is associated with the request and can only be used by it. This PR decouples worker creation from worker-to-task assignment by adding an abstraction, PopWorkerRequest. Now, if a request triggers a worker creation, the request is put into a queue. When a worker becomes ready (that is, PushWorker is called, either from a newly started worker or a released worker), Ray matches the first fitting request in the queue. This reduces latency. Later it can also be used to pre-start workers more meaningfully.
Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: 6a38914
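The decoupling above can be sketched with a toy pool. `WorkerPool` here is a hypothetical simplification of the raylet's worker pool, using plain callbacks in place of PopWorkerRequest:

```python
from collections import deque

class WorkerPool:
    """Decouple worker creation from worker-to-request assignment:
    requests queue up, and any pushed worker (newly started or released)
    is matched to the first waiting request instead of being tied to the
    request that triggered its creation."""

    def __init__(self):
        self._idle = deque()
        self._pending = deque()  # queued PopWorkerRequest-style callbacks

    def pop_worker(self, callback):
        if self._idle:
            callback(self._idle.popleft())
        else:
            # In the real system a worker-process start would be
            # triggered here; the request just waits in the queue.
            self._pending.append(callback)

    def push_worker(self, worker):
        if self._pending:
            self._pending.popleft()(worker)  # first queued request wins
        else:
            self._idle.append(worker)
```

Because any arriving worker satisfies the head of the queue, a request no longer has to wait for "its own" worker to finish starting up, which is where the latency win comes from.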
[core] Add metrics for gcs jobs (ray-project#47793)
This PR adds metrics for job states within the job manager. In detail, a gauge stat is sent via the opencensus exporter, so running Ray jobs can be tracked and alerts can be created later on. Fault tolerance is not considered; according to the [doc](https://docs.ray.io/en/latest/ray-core/fault_tolerance/gcs.html), state is re-constructed at restart. On testing, the best way is to observe via an opencensus backend (i.e. a Google monitoring dashboard), but that's not easy for open-source contributors; the alternative is a mock / fake exporter implementation, which I don't find in the code base.
Signed-off-by: dentiny <[email protected]> Co-authored-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: 669d699
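The gauge-per-state idea above reduces to a simple aggregation. `job_state_gauges` is a hypothetical sketch; in the real code the resulting values would be emitted through an OpenCensus exporter rather than returned:

```python
from collections import Counter

def job_state_gauges(jobs):
    """Compute one gauge value per job state (RUNNING, FAILED, ...),
    suitable for periodic export to a metrics backend."""
    return Counter(job["status"] for job in jobs)
```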
upgrade grpcio version (ray-project#47982)
Upgrade to at least 1.66.1; this is already being overridden to 1.66.1+ during release tests.
Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: f2b09d4
[Chore][Core] Address PR 47807 comments (ray-project#48002)
PR 47807 was auto-merged without applying the doc reviews, so this commit addresses them. Signed-off-by: Chi-Sheng Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: aed856b
[core] Add thread check to job mgr callback (ray-project#48005)
This PR follows up on a comment in ray-project#47793 and adds a thread check to the GCS job manager callback to make sure there is no concurrent access to data members.
Signed-off-by: dentiny <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: 4cf016c
Signed-off-by: ujjawal-khare <[email protected]>
SHA: 155a415
Signed-off-by: ujjawal-khare <[email protected]>
SHA: e7c94c5
Signed-off-by: ujjawal-khare <[email protected]>
SHA: b83e7ad
Signed-off-by: ujjawal-khare <[email protected]>
SHA: b77c5ad
[Serve] fix grpc performance issue (ray-project#47338)
This PR fixes part of the problem by creating the payload message once and reusing it throughout the benchmark. Ran the release test on this change ([build](https://buildkite.com/ray-project/release/builds/21663#01918fe1-853b-46f2-9699-c4045b182b8c)); the `grpc_10mb_p50_latency` has now dropped to ~58ms from ~80ms previously. The rest of the issue comes from the existing gRPC server implementation, which requires waiting on the entirety of the unary request before it's able to continue its work on the replica. We will need to create a new HTTP2 proxy and pass the request transparently between the replica and the proxy to speed things up. Will follow up in the future on ray-project#47370. Closes ray-project#47371
Signed-off-by: Gene Su <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: 7ad9a3b
[observability][export-api] Write node events (ray-project#47221)
Write node events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false. Event write is called whenever a value in the node event data schema is modified. Typically this occurs in the callback after writing NodeTable to the GCS table Signed-off-by: ujjawal-khare <[email protected]>
SHA: cdc86fa
[RLlib] Cleanup examples folder (vol 23): Float16 training support and new example script. (ray-project#47362)
Signed-off-by: ujjawal-khare <[email protected]>
SHA: 1ea718b
[core][dashboard] Update nodes on delta. (ray-project#47367)
Like actor_head.py, we now update DataSource.nodes on delta. It first queries all node infos, then subscribes to node deltas. Each delta updates:
1. DataSource.nodes[node_id]
2. DataSource.agents[node_id]
3. a warning generated after RAY_DASHBOARD_HEAD_NODE_REGISTRATION_TIMEOUT = 10s

Note on (2) agents: the port is read from internal kv, and is not readily available until agent.py is spawned and writes its own port to internal kv. So we make an async task for each node to poll this port every 1s. It turns out that the get-all-then-subscribe code has a TOCTOU problem, so we also updated actor_head.py to first subscribe and then get all actors.
Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
SHA: ddec4a5
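The subscribe-then-get ordering that fixes the TOCTOU race above can be sketched as follows. `NodeCache` is a hypothetical toy, not the dashboard's actual DataSource code:

```python
class NodeCache:
    """Subscribe first, then fetch the snapshot: deltas that arrive while
    the snapshot query is in flight are buffered and replayed on top of
    it, so no update can fall into the gap between "get all" and
    "subscribe"."""

    def __init__(self):
        self.nodes = {}
        self._buffer = []
        self._snapshot_done = False

    def on_delta(self, node_id, info):       # subscription callback
        if not self._snapshot_done:
            self._buffer.append((node_id, info))
        else:
            self.nodes[node_id] = info

    def apply_snapshot(self, snapshot):      # result of the "get all" query
        self.nodes.update(snapshot)
        for node_id, info in self._buffer:   # replay buffered deltas last,
            self.nodes[node_id] = info       # so they win over stale rows
        self._buffer.clear()
        self._snapshot_done = True
```

With the opposite order (get all, then subscribe), a delta arriving between the two steps is silently lost, which is the TOCTOU bug the commit describes.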
[RLlib] Cleanup examples folder (vol 24): Mixed-precision training (and float16 inference) through new example script. (ray-project#47116)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: 8820918
Split python/ray/tests/test_actor_retry over two files (ray-project#47188)
The `test_actor_retry` tests are failing/flaky on windows. They pass locally. I have not been able to access the CI logs to see what is going wrong. In order to shrink the problem (is it an overall timeout? Is one of the tests failing?) we can start by splitting the tests into two files. Toward solving ray-project#43845. Signed-off-by: ujjawal-khare <[email protected]>
Commit: 9f76655
[RLlib; Offline RL] - Enable reading old-stack `SampleBatch` data in new stack Offline RL. (ray-project#47359)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: 05fad3f
[serve] redeploy in between each microbenchmark (ray-project#47404)
Redeploy in between each microbenchmark. --------- Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: fa17d58
Revert "[observability][export-api] Write node events" (ray-project#47405)
Reverts ray-project#47221. This broke ray-project#47395. Signed-off-by: ujjawal-khare <[email protected]>
Commit: 28d7347
[doc] Instruction for troubleshooting side nav when building incrementally (ray-project#47372)
Signed-off-by: khluu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: e6c08e1
[Doc] Run pre-commit on cluster docs (ray-project#47342)
Currently we have no linting on any part of the docs code. This PR runs pre-commit on the cluster docs. This PR fixes the following issues: ``` trim trailing whitespace.................................................Failed - hook id: trailing-whitespace - exit code: 1 - files were modified by this hook Fixing doc/source/cluster/kubernetes/user-guides/aws-eks-gpu-cluster.md Fixing doc/source/cluster/running-applications/job-submission/cli.rst Fixing doc/source/cluster/configure-manage-dashboard.md Fixing doc/source/cluster/kubernetes/user-guides/pod-security.md Fixing doc/source/cluster/vms/user-guides/launching-clusters/vsphere.md Fixing doc/source/cluster/kubernetes/user-guides/helm-chart-rbac.md Fixing doc/source/cluster/vms/references/ray-cluster-configuration.rst Fixing doc/source/cluster/running-applications/job-submission/quickstart.rst Fixing doc/source/cluster/kubernetes/examples/stable-diffusion-rayservice.md Fixing doc/source/cluster/kubernetes/getting-started/raycluster-quick-start.md Fixing doc/source/cluster/kubernetes/examples/rayjob-kueue-gang-scheduling.md Fixing doc/source/cluster/kubernetes/k8s-ecosystem/ingress.md Fixing doc/source/cluster/kubernetes/user-guides/kuberay-gcs-ft.md Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster-networkpolicy.yaml Fixing doc/source/cluster/kubernetes/k8s-ecosystem/pyspy.md Fixing doc/source/cluster/kubernetes/k8s-ecosystem/volcano.md Fixing doc/source/cluster/running-applications/job-submission/sdk.rst Fixing doc/source/cluster/running-applications/job-submission/ray-client.rst Fixing doc/source/cluster/kubernetes/troubleshooting/troubleshooting.md Fixing doc/source/cluster/kubernetes/getting-started/rayjob-quick-start.md Fixing doc/source/cluster/kubernetes/configs/ray-cluster.gpu.yaml Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster.with-fault-tolerance.yaml Fixing doc/source/cluster/kubernetes/examples/mnist-training-example.md Fixing 
doc/source/cluster/kubernetes/configs/static-ray-cluster.tls.yaml Fixing doc/source/cluster/kubernetes/user-guides/gcp-gke-gpu-cluster.md Fixing doc/source/cluster/kubernetes/examples/distributed-checkpointing-with-gcsfuse.md Fixing doc/source/cluster/kubernetes/user-guides/gke-gcs-bucket.md Fixing doc/source/cluster/kubernetes/user-guides/logging.md Fixing doc/source/cluster/kubernetes/examples/text-summarizer-rayservice.md Fixing doc/source/cluster/kubernetes/examples/rayjob-batch-inference-example.md Fixing doc/source/cluster/metrics.md Fixing doc/source/cluster/kubernetes/k8s-ecosystem/kubeflow.md Fixing doc/source/cluster/kubernetes/k8s-ecosystem/kueue.md Fixing doc/source/cluster/kubernetes/examples/rayjob-kueue-priority-scheduling.md Fixing doc/source/cluster/faq.rst Fixing doc/source/cluster/running-applications/job-submission/openapi.yml Fixing doc/source/cluster/kubernetes/user-guides/configuring-autoscaling.md Fixing doc/source/cluster/kubernetes/getting-started/rayservice-quick-start.md Fixing doc/source/cluster/kubernetes/user-guides/static-ray-cluster-without-kuberay.md Fixing doc/source/cluster/kubernetes/user-guides/config.md Fixing doc/source/cluster/kubernetes/user-guides/pod-command.md fix end of files.........................................................Failed - hook id: end-of-file-fixer - exit code: 1 - files were modified by this hook Fixing doc/source/cluster/kubernetes/images/rbac-clusterrole.svg Fixing doc/source/cluster/running-applications/job-submission/cli.rst Fixing doc/source/cluster/vms/user-guides/community/slurm.rst Fixing doc/source/cluster/kubernetes/benchmarks/memory-scalability-benchmark.md Fixing doc/source/cluster/images/ray-job-diagram.svg Fixing doc/source/cluster/kubernetes/user-guides/observability.md Fixing doc/source/cluster/kubernetes/examples/stable-diffusion-rayservice.md Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster-networkpolicy.yaml Fixing 
doc/source/cluster/kubernetes/images/rbac-role-one-namespace.svg Fixing doc/source/cluster/kubernetes/examples/mnist-training-example.md Fixing doc/source/cluster/cli.rst Fixing doc/source/cluster/kubernetes/configs/static-ray-cluster.tls.yaml Fixing doc/source/cluster/kubernetes/user-guides/gcp-gke-gpu-cluster.md Fixing doc/source/cluster/kubernetes/user-guides/logging.md Fixing doc/source/cluster/kubernetes/examples/text-summarizer-rayservice.md Fixing doc/source/cluster/kubernetes/images/rbac-role-multi-namespaces.svg Fixing doc/source/cluster/kubernetes/images/kubeflow-architecture.svg Fixing doc/source/cluster/faq.rst Fixing doc/source/cluster/running-applications/job-submission/openapi.yml Fixing doc/source/cluster/kubernetes/images/AutoscalerOperator.svg check for added large files..............................................Passed check python ast.........................................................Passed check json...........................................(no files to check)Skipped check toml...........................................(no files to check)Skipped black....................................................................Passed flake8...................................................................Passed prettier.............................................(no files to check)Skipped mypy.................................................(no files to check)Skipped isort (python)...........................................................Passed rst directives end with two colons.......................................Passed rst ``inline code`` next to normal text..................................Passed use logger.warning(......................................................Passed check for not-real mock methods..........................................Passed ShellCheck v0.9.0........................................................Passed clang-format.........................................(no files to check)Skipped Google Java 
Formatter................................(no files to check)Skipped Check for Ray docstyle violations........................................Passed Check for Ray import order violations....................................Passed ``` Signed-off-by: pdmurray <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 8438af2
[RLlib] Examples folder cleanup: ModelV2 -> RLModule wrapper for migrating to new API stack. (ray-project#47425)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: 51be505
[RLlib] Remove 2nd Learner ConnectorV2 pass from PPO (add new GAE Connector piece). Fix: "State-connector" would use `seq_len=20`. (ray-project#47401)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: f2c5415
[RLlib; Offline RL] CQL: Support multi-GPU/CPU setup and different learning rates for actor, critic, and alpha. (ray-project#47402)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: be99650
[aDAG] Support multi-read of the same shm channel (ray-project#47311)
If the same method of the same actor is bound to the same node (i.e., reads from the same shared memory channel), aDAG execution hangs. This PR adds support for this case by caching results read from the channel. Signed-off-by: ujjawal-khare <[email protected]>
Commit: ed38e38
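The caching idea above can be sketched minimally: the first consumer performs the real channel read and later consumers on the same actor are served from the cache. All names here are illustrative, not Ray's actual `CachedChannel` implementation:

```python
class CachedChannelReader:
    """Illustrative only: serves one underlying read to several consumers.

    `inner_read` is any zero-arg callable that consumes the value from the
    real channel exactly once (a second call would block or fail).
    """
    def __init__(self, inner_read, num_readers):
        self._inner_read = inner_read
        self._num_readers = num_readers
        self._cached = None
        self._reads_left = 0

    def read(self):
        if self._reads_left == 0:
            # First consumer pulls from the channel and caches the value.
            self._cached = self._inner_read()
            self._reads_left = self._num_readers
        self._reads_left -= 1
        value = self._cached
        if self._reads_left == 0:
            self._cached = None  # drop the reference after the last read
        return value
```

The key property is that the underlying channel is consumed once per round, regardless of how many task arguments on the actor refer to the same value.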
[RLlib; Offline RL] Add cloud filesystems to offline data input arguments. (ray-project#47384)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: c13190c
[serve] Fix broken microbenchmarks (ray-project#47430)
With serve shutdown in between every microbenchmark, serve needs to be started with grpc options every time for the grpc microbenchmarks. ## Related issue number closes ray-project#47424 --------- Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 8edb0e3
[ADAG] Support tasks with multiple return values in aDAG (ray-project#47024)
aDAG currently does not support multiple return values. We would like to add general support for multiple return values. This PR supports multiple returns by returning a separate `ClassMethodNode` for each return value of the tuple. It is an incremental change for `ClassMethodNode`, adding `_is_class_method_output`, `_class_method_call`, and `_output_idx`. `_output_idx` is used to guide channel allocation and output writes. The user needs to specify `num_returns > 1` to hint multiple return values. The upstream task allocates a separate output channel for each return value. A downstream task reads from one of the output channels. ## What is done? We modify `ClassMethodNode` to handle two logics: one is a class method call, which is the original semantics (`self.is_class_method_call == True`); another is a class method output, which is responsible for one of the multiple return values (`self.is_class_method_output == True`). We modify `WriterInterface` to support writes to multiple `output_channels` with `output_idxs`. If an output index is None, the complete return value is written to the output channel. Otherwise, the return value is a tuple and the index is used to extract the value to be written to the output channel. We allocate separate output channels to different readers. The downstream tasks of a `ClassMethodNode` with `self.is_class_method_output == True` are the readers of an output channel of its upstream `ClassMethodNode`. The example below demonstrates this. ``` upstream ClassMethodNode (self.is_class_method_call == True, self.output_channels = [c1, c2]) --> downstream ClassMethodNode (self.is_class_method_output == True, self.output_channels = [c1]) --> ... ``` Closes ray-project#45569 --------- Signed-off-by: Weixin Deng <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 150c8ba
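The `output_idxs` routing described above can be sketched as follows. `write_outputs` and the list-based channels are hypothetical stand-ins for `WriterInterface` and shared-memory channels:

```python
def write_outputs(return_value, output_channels, output_idxs):
    """Hypothetical sketch of the writer logic described above.

    For each output channel, `output_idxs` holds either None (write the
    whole return value) or an integer index into the returned tuple.
    """
    for channel, idx in zip(output_channels, output_idxs):
        if idx is None:
            channel.append(return_value)       # single-return case
        else:
            channel.append(return_value[idx])  # one element of the tuple

# num_returns=2: two channels, one per tuple element.
c0, c1 = [], []
write_outputs(("a", "b"), [c0, c1], [0, 1])
```

Each downstream `ClassMethodNode` with `is_class_method_output == True` then reads from exactly one of these channels.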
[RLlib] Add gradient checks to avoid `nan` gradients in `TorchLearner`. (ray-project#47452)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: 8c990d8
[RLlib] Add option to use `torch.lr_scheduler` classes for learning rate schedules. (ray-project#47453)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: 105a904
[observability][export-api] Write node events (ray-project#47422)
Same code changes as [observability][export-api] Write node events ray-project#47221 Move test into a separate file to create a separate bazel target that can be skipped on Windows Signed-off-by: ujjawal-khare <[email protected]>
Commit: 9e0a00d
[RLlib] - Add example for PyTorch lr schedulers. (ray-project#47454)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: 637c16c
[RLlib] Examples folder cleanup: ModelV2 -> RLModule wrapper for migrating to new API stack (by config). (ray-project#47427)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: 68c117a
[serve] add streaming to microbenchmarks (ray-project#47466)
Add streaming microbenchmark to release tests. Only HTTP, intermediate router, and handle for now (no grpc). --------- Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 716f314
feat: quickstart install button (ray-project#47479)
![CleanShot 2024-09-04 at 11 12 44@2x](https://github.com/user-attachments/assets/9c8dfd64-c565-4285-a1ce-774c6fce2997) Signed-off-by: Saihajpreet Singh <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 383b47a
Revert "[Doc] Add Algolia search to docs" (ray-project#47483)
Reverts ray-project#46477 Signed-off-by: ujjawal-khare <[email protected]>
Commit: dcb8d6d
[release] simplify the process of getting job logs (ray-project#47470)
The current logic to parse logs from an anyscale job is very complicated. It first downloads all the logs from the cluster, then tries to guess the main job logs and error job logs. The logic for getting the error job log is no longer necessary. The new API offers a much simpler way to get the log, so update to that API. Test: - CI - so much cleaner: https://buildkite.com/ray-project/release/builds/22057#0191ba75-2f0b-4a0b-9bad-8603003eba4c/741-742 --------- Signed-off-by: can <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 3125db2
[Core] Fix runtime env race condition when uploading the same package concurrently (ray-project#47482)
Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: b42f473
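A common way to fix this kind of race is to serialize uploads per package URI with a per-key lock, so concurrent callers of the same package upload it exactly once. The sketch below illustrates the pattern only; it is not Ray's actual fix:

```python
import threading

_upload_locks = {}
_locks_guard = threading.Lock()
_uploaded = set()

def upload_once(package_uri, do_upload):
    """Upload each package URI at most once, even under concurrency.

    `do_upload` is a caller-provided upload function; this wrapper is an
    illustration of per-key locking, not Ray's runtime env code.
    """
    # Short critical section: get (or create) the lock for this URI.
    with _locks_guard:
        lock = _upload_locks.setdefault(package_uri, threading.Lock())
    # Long critical section is per-URI, so different packages don't block
    # each other; duplicates of the same package serialize here.
    with lock:
        if package_uri not in _uploaded:
            do_upload(package_uri)
            _uploaded.add(package_uri)
```

The double-check inside the per-URI lock is what prevents the second concurrent caller from re-uploading after the first one finishes.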
[core][dashboard] Pass in cluster ID in hex for dashboard, dash agent, rt env agent. (ray-project#47490)
This saves 1 RPC for each GcsClient, which can be O(#nodes). Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: a4621ce
[core][experimental] Correct `num_input_consumers` for CachedChannel (ray-project#47489)
Without this PR, the num_input_consumers would be 1 because both inp[0] and inp[1] are only referred to in one task on the actor, so CachedChannel will not be created. The read will eventually time out because the mutable object is being read by the same actor twice. Signed-off-by: ujjawal-khare <[email protected]>
Commit: 8a530a2
Revert Revert "[Doc] Add Algolia search to docs" (ray-project#47487)
Redo https://github.com/ray-project/ray/pull/47483/files. The previous PR was based on too old a base, so it was merged successfully without re-compiling the dependencies. Also allow the dry-run of generating the build cache to run on premerge, to block changes that can break it. Test: - CI Signed-off-by: can <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 98f6186
[observability][export-api] Write actor events (ray-project#47303)
Write actor events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false. Event write is called whenever a value in the actor event data schema is modified. Typically this occurs before writing ActorTableData to the GCS table or publishing the data for the dashboard Signed-off-by: ujjawal-khare <[email protected]>
Commit: 0c75290
[ADAG] Log Executable Task Events (ray-project#47345)
Support logging events for execution task for better observability. Users can turn on event profiling by setting RAY_ADAG_ENABLE_PROFILING as True The event tracks the following metadata of a task: Signed-off-by: ujjawal-khare <[email protected]>
Commit: e14400f
[Core] Fix test_runtime_env_working_dir_4 for Windows (ray-project#47505)
Windows paths need to be escaped. Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 692f9df
[observability][export-api] Write task events (ray-project#47193)
Write task events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false. All tasks that are added to the task event buffer will be written to file. In addition, keep a dropped_status_events_for_export_ buffer which stores status events that were dropped from the buffer to send to GCS, and write these dropped events to file as well. The size of dropped_status_events_for_export_ is 10x larger than task_events_max_num_status_events_buffer_on_worker to prioritize recording data. The tradeoff here is memory on each worker, but this is a relatively small overhead, and it is unlikely the dropped events buffer will fill given the sink for export events (write to file) will succeed on each flush. Task events are converted to the export API proto and written to file in a separate thread, which runs this flush operation periodically (every second). Individual task events are aggregated by task attempt before being written. This is consistent with the final event sent to GCS, and also helps reduce the number of events written to file. Signed-off-by: ujjawal-khare <[email protected]>
Commit: eca534a
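The bounded dropped-events buffer can be sketched with a `deque`. The class and method names here are made up for illustration; only the 10x sizing rule comes from the commit message above:

```python
from collections import deque

class DroppedEventsBuffer:
    """Sketch of keeping GCS-dropped status events for the export sink."""

    def __init__(self, gcs_buffer_size, multiplier=10):
        # Bounded at a multiple of the GCS-bound buffer size; when full,
        # the oldest dropped events fall off rather than growing memory.
        self._events = deque(maxlen=gcs_buffer_size * multiplier)

    def record_dropped(self, event):
        self._events.append(event)

    def flush(self):
        # Periodic flush (e.g. every second): drain everything currently
        # buffered to the file sink.
        drained = list(self._events)
        self._events.clear()
        return drained
```

The `maxlen` bound is what keeps the per-worker memory overhead small while still prioritizing that dropped events get recorded.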
Revert "[observability][export-api] Write actor events" (ray-project#47516)
Reverts ray-project#47303 Signed-off-by: ujjawal-khare <[email protected]>
Commit: 184e293
Revert "[observability][export-api] Write task events" (ray-project#47536)
Reverts ray-project#47193 Signed-off-by: ujjawal-khare <[email protected]>
Commit: 950ad18
fix quickstart image path (ray-project#47535)
| Before | After | |--------|------| |![CleanShot 2024-09-06 at 10 33 56@2x](https://github.com/user-attachments/assets/0b8dff77-3a7f-4bc7-b117-39fcd4edd69f) | ![CleanShot 2024-09-06 at 10 33 18@2x](https://github.com/user-attachments/assets/ef4c67ba-df95-48c9-8c70-273b75ed5296) | Signed-off-by: Saihajpreet Singh <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 246e395
[RLlib; Off-policy] Add episode sampling to `EpisodeReplayBuffer`. (ray-project#47500)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: 1e4e4d0
[aDAG] Allow custom NCCL group for aDAG (ray-project#47141)
Allow custom NCCL group for aDAG so that we can reuse what the user already created. Marking NcclGroupInterface as DeveloperAPI for now. After validation by using it in vLLM we can change to alpha stability. vLLM prototype: vllm-project/vllm#7568 Signed-off-by: ujjawal-khare <[email protected]>
Commit: 4792e1d
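The idea of accepting a user-created communicator instead of building a new NCCL group can be sketched as a small interface. The method names below are assumptions for illustration, not the actual `NcclGroupInterface` signature:

```python
from abc import ABC, abstractmethod

class CommunicatorInterface(ABC):
    """Illustrative stand-in: the DAG only needs point-to-point transfer,
    so any user-supplied object implementing these hooks can be reused."""

    @abstractmethod
    def send(self, tensor, peer_rank):
        ...

    @abstractmethod
    def recv(self, shape, dtype, peer_rank):
        ...

class LoopbackGroup(CommunicatorInterface):
    """Trivial single-process implementation, useful for tests."""
    def __init__(self):
        self._inbox = []

    def send(self, tensor, peer_rank):
        self._inbox.append(tensor)

    def recv(self, shape, dtype, peer_rank):
        return self._inbox.pop(0)
```

This is the usual "developer API" pattern: frameworks like vLLM that already own a NCCL communicator can wrap it in the interface rather than paying for a second group.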
[aDAG] Fix test_accelerated_dag regression (ray-project#47543)
Fix CI regression: https://buildkite.com/ray-project/postmerge/builds/6157#0191c4aa-1897-4d42-93c7-5403b67bc5cc https://buildkite.com/ray-project/postmerge/builds/6165#0191c819-53f7-4605-805f-824e85951fde Signed-off-by: ujjawal-khare <[email protected]>
Commit: 8b89a9d
[Core] Remove ray._raylet.check_health (ray-project#47526)
Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: bb015e4
[observability][export-api] Write actor events (ray-project#47529)
- Add back code changes from [observability][export-api] Write actor events ray-project#47303 - Separate out actor manager export event test into a separate file so we can skip on windows. Update BUILD rule so all tests in src/ray/gcs/gcs_server/test/export_api are skipped on windows Signed-off-by: Nikita Vemuri <[email protected]> Co-authored-by: Nikita Vemuri <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 9cf02de
[observability][export-api] Write task events (ray-project#47538)
- Re add code changes from [observability][export-api] Write task events ray-project#47193, which was previous reverted due to CI test linux://:task_event_buffer_test is consistently_failing ray-project#47519, CI test windows://:task_event_buffer_test is consistently_failing ray-project#47523 and CI test darwin://:task_event_buffer_test is consistently_failing ray-project#47525 - Was able to reproduce the failures locally and fixed test in 07efa6f. Failure was due to logical merge conflict (previous PR wasn't re-based off latest master after other event PRs were merged). Signed-off-by: Nikita Vemuri <[email protected]> Co-authored-by: Nikita Vemuri <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 6f4aaf6
[RLlib; Offline RL] - Replace GAE in `MARWILOfflinePreLearner` with `GeneralAdvantageEstimation` connector in learner pipeline. (ray-project#47532)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: 8e61bab
[data] Change fixture from `shutdown_only` to `ray_start_regular_shared` for `test_csv_read_filter_non_csv_file` (ray-project#47513)
## Why are these changes needed? It seems that ray-project#47467 ended up breaking some niche setup for this test; by changing the fixture from `shutdown_only` to `ray_start_regular_shared` we are able to get the test passing again. Signed-off-by: Matthew Owen <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 9eef3b5
Add perf metrics for 2.35.0 (ray-project#47283)
``` REGRESSION 13.65%: client__get_calls (THROUGHPUT) regresses from 1119.7725751916082 to 966.9141307622872 in microbenchmark.json REGRESSION 9.23%: single_client_put_gigabytes (THROUGHPUT) regresses from 20.184014305625574 to 18.32083810818594 in microbenchmark.json REGRESSION 8.40%: multi_client_tasks_async (THROUGHPUT) regresses from 23311.858831941317 to 21353.682091539627 in microbenchmark.json REGRESSION 6.66%: 1_1_async_actor_calls_with_args_async (THROUGHPUT) regresses from 3038.941703794114 to 2836.601104413851 in microbenchmark.json REGRESSION 4.39%: 1_1_async_actor_calls_async (THROUGHPUT) regresses from 4456.606860484332 to 4261.050694056448 in microbenchmark.json REGRESSION 3.77%: actors_per_second (THROUGHPUT) regresses from 627.338335492887 to 603.6854672610009 in benchmarks/many_actors.json REGRESSION 3.47%: single_client_get_object_containing_10k_refs (THROUGHPUT) regresses from 13.679337230724197 to 13.204885454613315 in microbenchmark.json REGRESSION 3.38%: 1_1_actor_calls_sync (THROUGHPUT) regresses from 2055.7051275912527 to 1986.177233156469 in microbenchmark.json REGRESSION 2.44%: 1_1_actor_calls_concurrent (THROUGHPUT) regresses from 5167.9800954515 to 5041.760637338739 in microbenchmark.json REGRESSION 2.33%: placement_group_create/removal (THROUGHPUT) regresses from 824.4108502776797 to 805.1759941825478 in microbenchmark.json REGRESSION 1.64%: single_client_wait_1k_refs (THROUGHPUT) regresses from 5.485273551888224 to 5.39514490847805 in microbenchmark.json REGRESSION 1.28%: single_client_tasks_sync (THROUGHPUT) regresses from 986.5998779605792 to 973.959307673384 in microbenchmark.json REGRESSION 0.95%: pgs_per_second (THROUGHPUT) regresses from 22.249430148995714 to 22.037557767422825 in benchmarks/many_pgs.json REGRESSION 0.66%: n_n_actor_calls_async (THROUGHPUT) regresses from 26545.931713712664 to 26370.461840482538 in microbenchmark.json REGRESSION 0.53%: 1_1_actor_calls_async (THROUGHPUT) regresses from 9060.701663275304 to 
9012.880467992636 in microbenchmark.json REGRESSION 0.28%: single_client_tasks_async (THROUGHPUT) regresses from 8011.455682416454 to 7988.9069673790045 in microbenchmark.json REGRESSION 0.19%: 1_1_async_actor_calls_sync (THROUGHPUT) regresses from 1486.2327104183764 to 1483.4703793760418 in microbenchmark.json REGRESSION 107.66%: dashboard_p95_latency_ms (LATENCY) regresses from 34.039 to 70.687 in benchmarks/many_nodes.json REGRESSION 30.19%: stage_0_time (LATENCY) regresses from 8.773437261581421 to 11.421970844268799 in stress_tests/stress_test_many_tasks.json REGRESSION 27.05%: dashboard_p50_latency_ms (LATENCY) regresses from 3.87 to 4.917 in benchmarks/many_nodes.json REGRESSION 9.72%: dashboard_p99_latency_ms (LATENCY) regresses from 119.573 to 131.198 in benchmarks/many_nodes.json REGRESSION 9.58%: stage_1_avg_iteration_time (LATENCY) regresses from 23.938837790489195 to 26.23279986381531 in stress_tests/stress_test_many_tasks.json REGRESSION 9.41%: stage_3_time (LATENCY) regresses from 3035.906775712967 to 3321.615835428238 in stress_tests/stress_test_many_tasks.json REGRESSION 6.37%: dashboard_p95_latency_ms (LATENCY) regresses from 3542.989 to 3768.817 in benchmarks/many_actors.json REGRESSION 4.93%: dashboard_p99_latency_ms (LATENCY) regresses from 358.789 to 376.468 in benchmarks/many_pgs.json REGRESSION 3.70%: time_to_broadcast_1073741824_bytes_to_50_nodes (LATENCY) regresses from 18.28579454300001 to 18.961532712000007 in scalability/object_store.json REGRESSION 3.56%: avg_pg_create_time_ms (LATENCY) regresses from 0.9371462897900398 to 0.9705077387385862 in stress_tests/stress_test_placement_group.json REGRESSION 3.24%: stage_2_avg_iteration_time (LATENCY) regresses from 61.69442081451416 to 63.694758081436156 in stress_tests/stress_test_many_tasks.json REGRESSION 2.07%: 10000_get_time (LATENCY) regresses from 23.411743029999997 to 23.896780481999997 in scalability/single_node.json REGRESSION 1.74%: dashboard_p50_latency_ms (LATENCY) regresses from 
167.38 to 170.294 in benchmarks/many_tasks.json REGRESSION 1.51%: 1000000_queued_time (LATENCY) regresses from 186.319367591 to 189.12986922100004 in scalability/single_node.json REGRESSION 1.39%: avg_pg_remove_time_ms (LATENCY) regresses from 0.9081441951950084 to 0.9207600330309926 in stress_tests/stress_test_placement_group.json REGRESSION 0.59%: dashboard_p95_latency_ms (LATENCY) regresses from 12.055 to 12.126 in benchmarks/many_pgs.json ``` Signed-off-by: kevin <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: c475f45
[Core] Reconstruct actor to run lineage reconstruction triggered actor task (ray-project#47396)
Currently if we need to rerun an actor task to recover a lost object but the actor is dead, the actor task will fail immediately. This PR allows the actor to be restarted (if it doesn't violate max_restarts) so that the actor task can run to recover lost objects. In terms of the state machine, we add a state transition from DEAD to RESTARTING. Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Configuration menu - View commit details
-
Copy full SHA for 03e5832 - Browse repository at this point
Copy the full SHA 03e5832View commit details -
[aDAG] support buffered input (ray-project#47272)
Based on https://docs.google.com/document/d/1Ka_HFwPBNIY1u3kuroHOSZMEQ8AgwpYciZ4n08HJ0Xc/edit When there are many in-flight requests (pipelining inputs to the DAG), two problems occur. 1. Input submitter timeout: InputSubmitter.write() waits until the buffer is read by downstream tasks. Since the timeout clock starts as soon as InputSubmitter.write() is called, the later requests are likely to time out when there are many in-flight requests. 2. Pipeline bubble: the output fetcher doesn't read the channel until CompiledDagRef.get is called, which means the upstream task (actor 2) is blocked until .get is called from the driver even though it could keep executing tasks. This PR solves the problem by providing multiple buffers per shm channel. Note that buffering is not supported for NCCL yet (we can add it when we overlap compute/comm). Main changes: introduce BufferedSharedMemoryChannel, which creates multiple buffers (10 by default) and reads/writes them in a round-robin manner. When there are more in-flight requests than the buffer size, the DAG can still hit a timeout error. To make debugging easy and the behavior straightforward, we introduce a max_buffered_inputs_ argument: if more than max_buffered_inputs_ requests are submitted to the DAG without ray.get, an exception is raised immediately. Signed-off-by: ujjawal-khare <[email protected]>
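The round-robin multi-buffer scheme above can be sketched in plain Python. This is a toy model, not Ray's actual BufferedSharedMemoryChannel: in the real shared-memory channel a full buffer blocks the writer rather than raising, and the max-buffered-inputs check lives on the submission side.

```python
class BufferedChannel:
    """Toy sketch of a multi-buffer channel: writes and reads rotate over
    N buffers in round-robin order, and a cap mimics max_buffered_inputs_."""

    def __init__(self, num_buffers=10, max_buffered_inputs=10):
        self._buffers = [None] * num_buffers
        self._full = [False] * num_buffers
        self._write_idx = 0
        self._read_idx = 0
        self._in_flight = 0
        self._max = max_buffered_inputs

    def write(self, value):
        if self._in_flight >= self._max:
            # Mimics the immediate exception when too many inputs are
            # submitted without a corresponding ray.get.
            raise RuntimeError(
                f"More than {self._max} inputs submitted without a read"
            )
        if self._full[self._write_idx]:
            raise RuntimeError("buffer not yet consumed")  # real impl blocks
        self._buffers[self._write_idx] = value
        self._full[self._write_idx] = True
        self._write_idx = (self._write_idx + 1) % len(self._buffers)
        self._in_flight += 1

    def read(self):
        if not self._full[self._read_idx]:
            raise RuntimeError("no buffered value")  # real impl blocks
        value = self._buffers[self._read_idx]
        self._full[self._read_idx] = False
        self._read_idx = (self._read_idx + 1) % len(self._buffers)
        self._in_flight -= 1
        return value
```

Because reads and writes rotate independently, a writer can stay several buffers ahead of the reader, which is exactly what removes the pipeline bubble described above.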
Commit: 7625128
[aDAG] Clean up arg_to_consumers in _get_or_compile() (ray-project#47514)
Clean up the code. Signed-off-by: ujjawal-khare <[email protected]>
Commit: cbe6687
[RLlib; Offline RL] Store episodes in state form. (ray-project#47294)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: 53e641a
[Core][aDag] Support multi node multi reader (ray-project#47480)
This PR supports multiple readers on multiple nodes. It also adds tests that the feature works with large gRPC payloads and buffer resizing. Multiple readers on multiple nodes didn't work because the code only allowed registering one remote reader reference on one specific node. This PR fixes the issue by allowing remote reader references to be registered on multiple nodes. Signed-off-by: ujjawal-khare <[email protected]>
Commit: eb14e06
Allow control of some serve configuration via env vars (ray-project#47533)
When a serve app is launched, serve starts up automatically. In certain places like k8s, it can be difficult to preconfigure serve (e.g. in the [ray-cluster helm chart](https://github.com/ray-project/kuberay/blob/master/helm-chart/ray-cluster/values.yaml) there is no ability to set the default serve arguments). This means you need to either be explicit when you start serve, or, if it starts up automatically, you may need to shut it down and then restart it, which is inconvenient. Signed-off-by: Tim Paine <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
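The precedence pattern this kind of change relies on (explicit argument wins over environment variable, which wins over the built-in default) can be sketched as follows; the variable name `SERVE_EXAMPLE_HTTP_HOST` is made up for illustration and is not the actual knob added by the PR.

```python
import os


def resolve_option(explicit, env_var, default):
    """Resolve a config value: explicit argument > env var > default."""
    if explicit is not None:
        return explicit
    raw = os.environ.get(env_var)
    return raw if raw is not None else default


# Simulate an operator preconfiguring the cluster through the environment.
os.environ["SERVE_EXAMPLE_HTTP_HOST"] = "0.0.0.0"

# No explicit value passed, so the env var takes effect.
host = resolve_option(None, "SERVE_EXAMPLE_HTTP_HOST", "127.0.0.1")
```

This lets platforms like k8s set defaults via the pod spec without needing to restart serve with explicit arguments.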
Commit: 50bd27a
Update incremental build troubleshooting tip with style nits (ray-project#47592)
Style nits. ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: angelinalg <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 12afcd1
[observability][export-api] Write driver job events (ray-project#47418)
Write Driver Job events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false. Event write is called whenever a job table data value is modified. Typically this occurs before writing JobTableData to the GCS table Signed-off-by: ujjawal-khare <[email protected]>
Commit: a0d3355
[core][dashboard] push down job_or_submission_id to GCS. (ray-project#47492)
The GCS API GetAllJobInfo serves Dashboard APIs, even for only 1 job. This becomes slow when the number of jobs is high. This PR pushes the job filter down to GCS to save Dashboard workload. This API is kind of strange because the filter `job_or_submission_id` is actually either a Job ID or a job_submission_id. We don't have an index on the latter, and some jobs don't have one. So we still GetAll from Redis, then filter by both IDs before making more RPC calls. --------- Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: Jiajun Yao <[email protected]> Co-authored-by: Jiajun Yao <[email protected]> Co-authored-by: Alexey Kudinkin <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
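The pushed-down filter described above amounts to fetching all rows (since only the job ID is indexed) and then matching either identifier before issuing further RPCs. A minimal sketch, with hypothetical field names rather than the actual JobTableData proto:

```python
def filter_jobs(all_jobs, job_or_submission_id):
    """Keep jobs whose job_id OR submission_id matches the filter.

    Mirrors the 'GetAll then filter by both IDs' approach: submission_id
    is unindexed and may be absent, so a full scan is still required.
    """
    return [
        job
        for job in all_jobs
        if job.get("job_id") == job_or_submission_id
        or job.get("submission_id") == job_or_submission_id
    ]
```

The win is that the (cheap) scan-and-filter happens inside GCS, so only the matching job proceeds to the expensive per-job RPC calls and cross-process transfers.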
Commit: f541305
[Doc][KubeRay] Add description tables for RayCluster Status in the observability doc (ray-project#47462)
Signed-off-by: Rueian <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 54ce249
[aDAG] Fix ranks ordering for custom NCCL group (ray-project#47594)
The ranks should be in the order of the actors. Signed-off-by: ujjawal-khare <[email protected]>
Commit: a0fb580
[RLlib] RLModule: InferenceOnlyAPI. (ray-project#47572)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: 6e90110
[Data] Remove _default_metadata_providers (ray-project#47575)
_default_metadata_providers adds a layer of indirection. --------- Signed-off-by: Balaji Veeramani <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 69ca5c5
[Serve] Remove unused Serve constants (ray-project#47593)
Went through all the constants in the file and removed the ones that are no longer used. Signed-off-by: Gene Su <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 8f9236c
Fix windows://:task_event_buffer_test (ray-project#47577)
Move TestWriteTaskExportEvents to a separate file and skip it on Windows. This is OK for the export API feature because we currently aren't supporting it on Windows (tests for other resource events written from GCS are also skipped on Windows). This test is failing in postmerge (CI test windows://:task_event_buffer_test is consistently_failing ray-project#47523) on Windows due to `unknown file: error: C++ exception with description "remove_all: The process cannot access the file because it is being used by another process.: "event_123"" thrown in TearDown().` in the tear-down step. This is the same error raised by other tests that clean up created directories with remove_all() on Windows (e.g. //src/ray/util/tests:event_test). Those tests are also skipped on Windows. Signed-off-by: Nikita Vemuri <[email protected]> Co-authored-by: Nikita Vemuri <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: bd0d6eb
[RLlib] RLModule API: SelfSupervisedLossAPI for RLModules that bring their own loss (algo independent). (ray-project#47581)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: 6293a1f
[GCS] Optimize GetAllJobInfo API for performance (ray-project#47530)
Signed-off-by: liuxsh9 <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 649148c
[Serve] fix default serve logger behavior (ray-project#47600)
Re: ray-project#47229 The previous PR that set up the default serve logger had an unexpected consequence. Combined with Serve's stdout redirect feature (when `RAY_SERVE_LOG_TO_STDERR=0` is set in the env), it set up the default serve logger and redirected all stdout/stderr into serve's log files instead of the console. This left the Anyscale platform unable to detect that the ray start command was running successfully, so it could not start the cluster. This PR fixes the behavior by configuring Serve's default logger with only a stream handler and skipping the file handler altogether. Signed-off-by: Gene Su <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
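The fixed behavior can be sketched with the stdlib logging module: the default logger gets only a stream handler, and file handlers are left to explicit logging configuration, so console output keeps flowing. The function and logger names are illustrative, not Serve's actual internals.

```python
import logging
import sys


def configure_default_logger(name="serve_example"):
    """Attach only a StreamHandler to the default logger.

    Skipping the FileHandler here avoids hijacking stdout/stderr into
    log files before the user has opted into file logging.
    """
    logger = logging.getLogger(name)
    logger.handlers.clear()  # idempotent across repeated setup calls
    logger.addHandler(logging.StreamHandler(sys.stderr))
    return logger
```

With this shape, `ray start` style tooling that watches the console still sees startup output, while file logging remains an explicit opt-in.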
Commit: 92f0741
[core] Make is_gpu, is_actor, root_detached_id fields late bind to workers. (ray-project#47212)
Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: Jiajun Yao <[email protected]> Co-authored-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 7e0d054
[core][adag] Separate the outputs of execute and execute_async to multiple refs or futures to allow clients to retrieve them one at a time (ray-project#46908) (ray-project#47305)
## Why are these changes needed? Currently, if `MultiOutputNode` is used to wrap a DAG's output, you get back a single `CompiledDAGRef` or `CompiledDAGFuture`, depending on whether `execute` or `execute_async` is invoked, that points to a list of all of the outputs. To retrieve one of the outputs, you have to get and deserialize all of them at the same time. This PR separates the output of `execute` and `execute_async` to a list of `CompiledDAGRef` or `CompiledDAGFuture` when the output is wrapped by `MultiOutputNode`. This is particularly useful for vLLM tensor parallelism. Since all shards return the same results, we only need to fetch the result from one of the workers. Closes ray-project#46908. --------- Signed-off-by: jeffreyjeffreywang <[email protected]> Signed-off-by: Jeffrey Wang <[email protected]> Co-authored-by: jeffreyjeffreywang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
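Splitting one combined result into per-output handles can be illustrated with plain `concurrent.futures`. This is a sketch of the idea only, not Ray's `CompiledDAGRef`/`CompiledDAGFuture` implementation: each part can be awaited independently, so a client can fetch one shard's result without deserializing the rest.

```python
from concurrent.futures import Future


def split_outputs(combined: Future, num_outputs: int):
    """Turn one future resolving to a tuple of N outputs into N futures."""
    parts = [Future() for _ in range(num_outputs)]

    def _fan_out(done: Future):
        try:
            values = done.result()
            for part, value in zip(parts, values):
                part.set_result(value)
        except Exception as exc:
            # An error in the combined computation fails every output.
            for part in parts:
                part.set_exception(exc)

    combined.add_done_callback(_fan_out)
    return parts
```

Usage mirrors the new API shape: `first, second = split_outputs(combined, 2)` and then only `first.result()` needs to be awaited.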
Commit: 7ec6491
[serve] Faster detection of dead replicas (ray-project#47237)
## Why are these changes needed? Detect replica death earlier on handles/routers. Currently routers will process replica death if the actor death error is thrown during active probing or a system message. 1. Cover one more case: process replica death if the error is thrown _while_ the request was being processed on the replica. 2. Improved handling: if the error is detected on the system message, meaning the router found out the replica is dead after assigning a request to that replica, retry the request. ### Performance evaluation (master results pulled from https://buildkite.com/ray-project/release/builds/21404#01917375-2b1e-4cba-9380-24e557a42a42) Latency:

| metric | master | this PR | % change |
| -- | -- | -- | -- |
| http_p50_latency | 3.9672044999932154 | 3.9794859999986443 | 0.31 |
| http_1mb_p50_latency | 4.283115999996312 | 4.1375990000034335 | -3.4 |
| http_10mb_p50_latency | 8.212248500001351 | 8.056774499998198 | -1.89 |
| grpc_p50_latency | 2.889802499964844 | 2.845889500008525 | -1.52 |
| grpc_1mb_p50_latency | 6.320479999999407 | 9.85005449996379 | 55.84 |
| grpc_10mb_p50_latency | 92.12763850001693 | 106.14903449999247 | 15.22 |
| handle_p50_latency | 1.7775379999420693 | 1.6373455000575632 | -7.89 |
| handle_1mb_p50_latency | 2.797253500034458 | 2.7225929999303844 | -2.67 |
| handle_10mb_p50_latency | 11.619127000017215 | 11.39100950001648 | -1.96 |

Throughput:

| metric | master | this PR | % change |
| -- | -- | -- | -- |
| http_avg_rps | 359.14 | 357.81 | -0.37 |
| http_100_max_ongoing_requests_avg_rps | 507.21 | 515.71 | 1.68 |
| grpc_avg_rps | 506.16 | 485.92 | -4.0 |
| grpc_100_max_ongoing_requests_avg_rps | 506.13 | 486.47 | -3.88 |
| handle_avg_rps | 604.52 | 641.66 | 6.14 |
| handle_100_max_ongoing_requests_avg_rps | 1003.45 | 1039.15 | 3.56 |

Results: everything except the grpc results is within noise. As for the grpc results, they have always been relatively noisy (see below), so these results are also within the noise that we've been seeing. There is also no reason why latency for a request would only increase for grpc and not http or handle given the changes in this PR, so IMO this is safe. ![Screenshot 2024-08-21 at 11 54 55 AM](https://github.com/user-attachments/assets/6c7caa40-ae3c-417b-a5bf-332e2d6ca378) ## Related issue number closes ray-project#47219 --------- Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 2ba70de
[spark] Improve Ray-on-spark fault tolerance in case of Spark executor being down (e.g. spot instance termination) (ray-project#47493)
Signed-off-by: Weichen Xu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: f2ef047
[serve] skip failure test on windows (ray-project#47630)
Skip test_replica_actor_died on windows. Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 72b643b
[serve] reorganize replica scheduler classes (ray-project#47615)
## Why are these changes needed? Pull replica scheduler and replica wrapper out from `common.py` into their own files. Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 7b136f9
[Core] Remove code accidently got in (ray-project#47612)
It's unclear how this was generated. Signed-off-by: ujjawal-khare <[email protected]>
Commit: 309a86c
[Core][aDAG] support multi readers in multi node when dag is created from an actor (ray-project#47601)
Currently, when a DAG is created from an actor, we use a different mechanism than from a driver: in a driver we create a ProxyActor, whereas in an actor we just use the actor itself. This inconsistency is prone to error. As an example, I found that when we support multiple readers on multiple nodes, we get a deadlock because the driver actor needs to call ray.get(a.allocate_channel.remote()) for a downstream actor while the downstream actor calls ray.get(driver_actor.create_ref.remote()). This PR fixes the issue by making ProxyActor the default mechanism even when a DAG is created inside an actor. Signed-off-by: ujjawal-khare <[email protected]>
Commit: b7b5c51
[core] out of band serialization exception (ray-project#47544)
Introduce an env var to raise an exception when there's out-of-band serialization of an object ref. Improve the error message on out-of-band serialization issues. There are 2 types of issues: 1. cloudpickle.dumps(ref). 2. implicit capture. See below for more details. Update an anti-pattern doc. Signed-off-by: ujjawal-khare <[email protected]>
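The guard described above can be sketched with a plain class whose pickling hook raises when a strict-mode env var is set. Both the class and the variable name `EXAMPLE_STRICT_REF_SERIALIZATION` are hypothetical stand-ins, not Ray's actual ObjectRef or env var.

```python
import os
import pickle


class FakeObjectRef:
    """Stand-in for an object ref that refuses out-of-band pickling."""

    def __init__(self, ref_id):
        self.ref_id = ref_id

    def __reduce__(self):
        # In strict mode, plain pickling (out-of-band serialization)
        # fails loudly instead of silently producing a dangling ref.
        if os.environ.get("EXAMPLE_STRICT_REF_SERIALIZATION") == "1":
            raise TypeError(
                "Object refs may only be serialized through the framework's "
                "in-band path; out-of-band pickling is disallowed."
            )
        return (FakeObjectRef, (self.ref_id,))
```

This covers the first issue class (`pickle.dumps(ref)` directly); the implicit-capture case triggers the same hook because closures are pickled through the same machinery.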
Commit: a2b0cc3
[core][experimental] Allocate a channel for each InputAttributeNode (ray-project#47564)
Change 1: Remove class DAGInputAdapter. Without this PR, the entire input data will be written to the channel, even if a reader only wants to retrieve partial input data via InputAttributeNode. Then, the entire input data will be read by the READ operation, and the partial input will be retrieved during the COMPUTE operation (code). In this PR, each InputAttributeNode has its own channel, and only the corresponding input data will be written to the channel. Therefore, we no longer need to use DAGInputAdapter to retrieve the partial input data during the COMPUTE operation. Change 2: If the DAG contains any InputAttributeNode, create a channel for each InputAttributeNode. Then, write the partial input data to the corresponding channel (code). Change 3: There are some if/else statements to handle InputNode and InputAttributeNode for creating CachedChannel. This PR unifies the logic because InputNode and the different InputAttributeNodes are no longer considered consumers of only one input channel. Each InputAttributeNode has its own channel. Change 4: Move RayDAGArgs from compiled_dag_node.py to common.py to avoid importing it inside _adapt. Without this, this PR is about 5% slower than the baseline in the case "Benchmark: single actor, no InputAttributeNode". With this change, the performance is almost the same as, or slightly better than, the baseline. See "Benchmark: single actor, no InputAttributeNode" below for more details. Signed-off-by: ujjawal-khare <[email protected]>
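The per-attribute channel idea can be modeled with one queue per consumed input attribute: the writer fans the input out so each reader's channel holds only the slice it asked for. A toy in-process sketch (Ray's real channels are shared-memory, and the class name here is made up):

```python
from collections import deque


class PerAttributeChannels:
    """One queue per consumed input attribute."""

    def __init__(self, attrs):
        self._channels = {attr: deque() for attr in attrs}

    def write_input(self, **kwargs):
        # Only the attributes some reader consumes get written; a reader
        # never has to receive (or deserialize) the full input payload.
        for attr, queue in self._channels.items():
            queue.append(kwargs[attr])

    def read(self, attr):
        return self._channels[attr].popleft()
```

This is what makes the adapter class unnecessary: the COMPUTE step no longer needs to pluck a field out of the whole input, because the READ step already received only that field.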
Commit: ffa2d34
[Data] Add partitioning parameter to read_parquet (ray-project#47553)
To extract path partition information with `read_parquet`, you pass a PyArrow `partitioning` object to `dataset_kwargs`. For example: ``` schema = pa.schema([("one", pa.int32()), ("two", pa.string())]) partitioning = pa.dataset.partitioning(schema, flavor="hive") ds = ray.data.read_parquet(... dataset_kwargs=dict(partitioning=partitioning)) ``` This is problematic for two reasons: 1. It tightly couples the interface with the implementation; partitioning only works if we use `pyarrow.Dataset` in a specific way in the implementation. 2. It's inconsistent with all of the other file-based APIs. All other APIs expose a top-level `partitioning` parameter (rather than `dataset_kwargs`) where you pass a Ray Data `Partitioning` object (rather than a PyArrow partitioning object). --------- Signed-off-by: Balaji Veeramani <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
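For intuition, hive-flavor partitioning encodes columns as `key=value` path segments, and the `partitioning` parameter surfaces those as columns. A minimal parser for that path layout (illustrative only, not Ray Data's or PyArrow's implementation):

```python
def parse_hive_partitions(path):
    """Extract hive-style partition columns from a file path.

    'one=1/two=a/000.parquet' -> {'one': '1', 'two': 'a'}
    """
    partitions = {}
    for segment in path.split("/"):
        # Data files themselves are skipped; only directory segments
        # of the form key=value carry partition information.
        if "=" in segment and not segment.endswith(".parquet"):
            key, _, value = segment.partition("=")
            partitions[key] = value
    return partitions
```

A top-level parameter that produces exactly this mapping keeps `read_parquet` consistent with the other file-based readers, instead of routing through `dataset_kwargs`.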
Commit: 8839ad4
[spark] Refine comment in Starting ray worker spark task (ray-project#47670)
Signed-off-by: Weichen Xu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 2af394f
[Core][aDAG] Set buffer size to 1 for regression (ray-project#47639)
There's a regression with buffer size 10. I am going to investigate, but I will revert to buffer size 1 for now until further investigation. With buffer size 1, the regression seems to be gone: https://buildkite.com/ray-project/release/builds/22594#0191ed4b-5477-45ff-be9e-6e098b5fbb3c. It is probably some sort of contention or something similar. Signed-off-by: ujjawal-khare <[email protected]>
Commit: 27c71b6
Add perf metrics for 2.36.0 (ray-project#47574)
```
REGRESSION 12.66%: single_client_get_object_containing_10k_refs (THROUGHPUT) regresses from 13.204885454613315 to 11.533423619760748 in microbenchmark.json
REGRESSION 9.50%: client__1_1_actor_calls_sync (THROUGHPUT) regresses from 523.3469473257671 to 473.62862729568997 in microbenchmark.json
REGRESSION 6.76%: multi_client_put_gigabytes (THROUGHPUT) regresses from 45.440179854469804 to 42.368678421213005 in microbenchmark.json
REGRESSION 4.92%: 1_n_actor_calls_async (THROUGHPUT) regresses from 8803.178389859915 to 8370.014425096557 in microbenchmark.json
REGRESSION 3.89%: n_n_actor_calls_with_arg_async (THROUGHPUT) regresses from 2748.863962184806 to 2641.837605625889 in microbenchmark.json
REGRESSION 3.45%: client__1_1_actor_calls_async (THROUGHPUT) regresses from 1019.3028285821217 to 984.156036006501 in microbenchmark.json
REGRESSION 3.06%: client__1_1_actor_calls_concurrent (THROUGHPUT) regresses from 1007.6444648899972 to 976.8103650114274 in microbenchmark.json
REGRESSION 0.65%: placement_group_create/removal (THROUGHPUT) regresses from 805.1759941825478 to 799.9345402492929 in microbenchmark.json
REGRESSION 0.33%: single_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 5273.203424794718 to 5255.898134426729 in microbenchmark.json
REGRESSION 0.02%: 1_1_actor_calls_async (THROUGHPUT) regresses from 9012.880467992636 to 9011.034048587637 in microbenchmark.json
REGRESSION 0.01%: client__put_gigabytes (THROUGHPUT) regresses from 0.13947664668408546 to 0.13945791828216536 in microbenchmark.json
REGRESSION 0.00%: client__put_calls (THROUGHPUT) regresses from 806.1974515278531 to 806.172478450918 in microbenchmark.json
REGRESSION 70.55%: dashboard_p50_latency_ms (LATENCY) regresses from 104.211 to 177.731 in benchmarks/many_actors.json
REGRESSION 13.13%: time_to_broadcast_1073741824_bytes_to_50_nodes (LATENCY) regresses from 18.961532712000007 to 21.451945214000006 in scalability/object_store.json
REGRESSION 4.50%: 3000_returns_time (LATENCY) regresses from 5.680022101000006 to 5.935367576000004 in scalability/single_node.json
REGRESSION 3.96%: avg_iteration_time (LATENCY) regresses from 0.9740754842758179 to 1.012664566040039 in stress_tests/stress_test_dead_actors.json
REGRESSION 2.75%: stage_2_avg_iteration_time (LATENCY) regresses from 63.694758081436156 to 65.44879236221314 in stress_tests/stress_test_many_tasks.json
REGRESSION 1.66%: 10000_args_time (LATENCY) regresses from 17.328640389999997 to 17.61703060299999 in scalability/single_node.json
REGRESSION 1.40%: stage_4_spread (LATENCY) regresses from 0.45063567085147194 to 0.4569625792772166 in stress_tests/stress_test_many_tasks.json
REGRESSION 0.69%: dashboard_p50_latency_ms (LATENCY) regresses from 3.347 to 3.37 in benchmarks/many_pgs.json
REGRESSION 0.19%: 10000_get_time (LATENCY) regresses from 23.896780481999997 to 23.942006032999984 in scalability/single_node.json
```
Signed-off-by: kevin <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: b2ccedc
[RLlib] Add "shuffle batch per epoch" option. (ray-project#47458)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: 5c66fac
[RLlib; Offline RL] Enable buffering episodes. (ray-project#47501)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: 6c165c2
[Core] Make JobSupervisor logs structured (ray-project#47699)
Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 6439db3
[serve] wrap obj ref in result wrapper in deployment response (ray-project#47655)
## Why are these changes needed? Abstract `ray.ObjectRef` and `ray.ObjectRefGenerator` in a result wrapper that the deployment response can directly call into. --------- Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 3f63c45
[Core] Fix broken dashboard worker page (ray-project#47714)
Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 94b5e06
[core][experimental] Remove unused attr CompiledDAG._type_hints (ray-project#47706)
CompiledDAG._type_hints is not used. Signed-off-by: ujjawal-khare <[email protected]>
Commit: 47de542
[Data] Re-phrase the streaming executor current usage string (ray-project#47515)
## Why are these changes needed? The progress bar for Ray Data could still end up showing higher utilization than what the cluster currently has. ray-project#46729 was the first attempt to fix this; it addressed the issue for static clusters, but the issue remains for clusters that autoscale. This change simply rephrases the string so it is less confusing. Before <img width="1249" alt="image" src="https://github.com/user-attachments/assets/049ea096-a87f-4767-ba04-6d00d7c2755d"> After <img width="1248" alt="image" src="https://github.com/user-attachments/assets/cb74c0dc-1f33-4b22-b31c-e83df2a5d408"> This comes from the fact that operators don't track task state (and Ray Core does not currently even provide that API), which means Ray Data operators do not know whether a task has been assigned to a node, so once a task is submitted to Ray it is marked active even if it is pending node assignment. The dashboard does better here since it has extra information from the task. <img width="1493" alt="image" src="https://github.com/user-attachments/assets/9315b884-3e61-4b32-8400-7f76e15b6a4b"> In the future we can revisit adding a core API for remote state reporting, allowing operators to provide more detailed states (active, pending_scheduled, pending_node_assignment). --------- Signed-off-by: Sofian Hnaide <[email protected]> Co-authored-by: scottjlee <[email protected]> Co-authored-by: matthewdeng <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: e9e4d7e
[serve] improve tests (ray-project#47722)
## Why are these changes needed? - We can make some tests asynchronous instead of having to rely on `_to_object_ref`. - we can use `RayActorError` instead of `ActorDiedError` Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 27ee9f1
[Core] Add test case where there is dead node for /nodes?view=summary endpoint (ray-project#47727)
Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 50016c0
[Dashboard] Optimizing performance of Ray Dashboard (ray-project#47617)
Signed-off-by: Alexey Kudinkin <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 4012314
[core][aDAG] Fix a bug where multi arg + exception doesn't work (ray-project#47704)
Currently, when there's an exception, there's only 1 return value, but multi ref assumes that the number of return values matches the number of output channels. This PR fixes the issue by duplicating the exception to match the number of output channels. Signed-off-by: ujjawal-khare <[email protected]>
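The fix can be sketched as a small fan-out helper: a normal result must already match the channel count, while a single exception is replicated into every output channel so each per-output ref observes the failure. A sketch of the idea, not the actual code path:

```python
def fan_out_results(result, num_channels, is_exception):
    """Distribute a task's outcome across the DAG's output channels.

    Normal results arrive as a sequence with one entry per channel;
    an exception arrives as a single object and is duplicated so that
    every downstream reader sees the error.
    """
    if is_exception:
        return [result] * num_channels
    if len(result) != num_channels:
        raise ValueError("result count must match output channels")
    return list(result)
```

Without the duplication, readers of all but one output channel would block or see a missing value instead of the exception.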
Commit: 605221a
[fake autoscaler] use check_call in fake multi node test utils (ray-project#47772)
so that output is printed to logs; also use "sys.executable" rather than "python" Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 2e739d8
[RLlib] RLModule: Simplify defining custom distribution classes and add better defaults. (ray-project#47775)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: b413593
[fake autoscaler] remove the redundant mkdir (ray-project#47786)
- docker compose service volume short syntax uses bind (similar to `-v`) and will create the dir if it does not exist - the code was not mapping the dir to a host path, so it actually had no meaningful effect when running in a container, such as on CI Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: fee22c2
[Data] Simplify and consolidate progress bar outputs (ray-project#47692)
## Why are these changes needed? Currently, the progress bar is pretty verbose because it is very information dense. This PR: - Reorganizes progress output to group by relevant concepts and clarifies labels - Standardizes global and operator-level progress bar outputs - Removes the use of all emojis (poor rendering on some platforms / external logging systems) Progress bar before this PR: <img width="1403" alt="Screenshot at Sep 16 13-00-17" src="https://github.com/user-attachments/assets/4f459b77-06ba-4395-b883-e4c9ac8ca2ef"> Progress bar after this PR: <img width="1502" alt="Screenshot at Sep 23 13-48-32" src="https://github.com/user-attachments/assets/0c0f8c94-9439-4fd4-ae1a-2857b3a87b59"> Will follow up with a docs PR once we merge this change, so that I don't need to continuously modify the docs. In the future, we should restructure the way progress bars are grouped/tracked, so that we can tabulate the op-level progress bar outputs. ## Related issue number ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Scott Lee <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: c20e3b1
Add perf metrics for 2.37.0 (ray-project#47791)
for release perf checking. Signed-off-by: kevin <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: e438357
[docker] Update latest Docker dependencies for 2.36.0 release (ray-project#47748)
Created by release automation bot. Update with commit f298a75 Signed-off-by: kevin <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 8c73745
[docker] Update latest Docker dependencies for 2.36.1 release (ray-project#47801)
Created by release automation bot. Update with commit 18b2d94 Signed-off-by: kevin <[email protected]> Signed-off-by: Kevin H. Luu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 8b0a597
[observability][export-api] Write submission job events (ray-project#47468)
Add ExportEventLoggerAdapter, which will be used to write export events to file from python files. Only a single ExportEventLoggerAdapter instance will exist per source type, so callers can create or get this instance using get_export_event_logger, which is thread safe. Write Submission Job export events to file from JobInfoStorageClient.put_info, which is called to update the JobInfo data in the internal KV store. Signed-off-by: ujjawal-khare <[email protected]>
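The "one thread-safe logger instance per source type" pattern described above can be modeled like this (a toy sketch with assumed names; Ray's real adapter writes structured events to files rather than an in-memory list):

```python
import threading

_loggers = {}
_loggers_lock = threading.Lock()

class ExportEventLogger:
    # Toy stand-in for the adapter: records events tagged with its source type.
    def __init__(self, source_type: str):
        self.source_type = source_type
        self.events = []

    def write_event(self, event: dict) -> None:
        self.events.append({"source_type": self.source_type, **event})

def get_export_event_logger(source_type: str) -> ExportEventLogger:
    # The lock guards the registry so concurrent callers always share the
    # single per-source-type instance.
    with _loggers_lock:
        if source_type not in _loggers:
            _loggers[source_type] = ExportEventLogger(source_type)
        return _loggers[source_type]
```

Because the registry lookup and creation happen under one lock, two threads requesting the same source type can never race into creating two adapters.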
Commit: 5681a4a
Move export events to separate folder (ray-project#47747)
Move export events from session_latest/logs/events to session_latest/logs/export_events. Keeping both event types in the same folder doesn't cause any issue for Ray -- export event files are already filtered out for the /events API in ray/python/ray/dashboard/modules/event/event_utils.py (line 22 as of 1e48a03: `all_source_types = set(event_consts.EVENT_SOURCE_ALL)`). However, moving these to a separate folder is better for existing downstream consumers, which can then avoid handling export events in the events folder when the flag is turned on. Signed-off-by: ujjawal-khare <[email protected]>
Commit: 2b21a08
[release] stream the full anyscale log to buildkite (ray-project#47808)
Currently we only print the last 100 lines of the anyscale job log to buildkite. This PR removes that limit and prints everything instead. CC: @kouroshHakha Test: CI Signed-off-by: can <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: a665c67
[RLlib; Offline RL] Offline performance cleanup. (ray-project#47731)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: 68bd111
[docker] Update latest Docker dependencies for 2.37.0 release (ray-project#47812)
Created by release automation bot. Update with commit d2982b7 Signed-off-by: kevin <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: c432e5c
[RLlib] Fix action masking example. (ray-project#47817)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: e5c2bd4
[Core] Separate the attempt_number with the task_status in memory summary and object list (ray-project#47818)
Current status:
* When we retrieve the information from GCS, the task_status and the attempts are in 2 fields, and the task status is an enum.
* Later, during reconstruction, the 2 fields are combined into 1 and the number of attempts is added to the task_status field.
* That's why, when displaying the objects, the function isn't able to convert the string back to an enum.
Proposed solution:
* Instead of combining the 2 fields (task_status and attempt), keep the 2 fields and add an additional field (attempt_number) in the Object State.
* In this way, the task_status stays an enum and the attempt number information lives in a different field.
Changes in this PR:
* Added the `attempt_number` in `ObjectState` and `task_attempt_number_counts` in `ObjectSummaryPerKey`
* Added logic to populate the fields as proposed above
* Updated the logic for the memory summary function to display the attempt number in a new column
* Corresponding tests added as well
Signed-off-by: Mengjin Yan <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 75bc8fe
[RLlib; docs] New API stack migration guide. (ray-project#47779)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: 8b67dc6
[RLlib; new API stack by default] Switch on new API stack by default for SAC and DQN. (ray-project#47217) Signed-off-by: ujjawal-khare <[email protected]>
Commit: bcda013
[Core] Fix a Typo in dict_to_state function parameter name (ray-project#47822) Signed-off-by: Mengjin Yan <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: a335bdc
[core] Introducing InstrumentedIOContextWithThread. (ray-project#47831)
Previously we had several ad-hoc places with a "thread and io_context" pattern: create a thread dedicated to an asio io_context, then post async workloads onto it. This duplicated code: everywhere we created threads, we implemented stop and join. This introduces InstrumentedIOContextWithThread, which does exactly this, and replaces existing usages. Also fixes some absl::Time computations with best practice. This is refactoring and should have no runtime difference. Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
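The consolidated "thread + io_context" pattern is C++ (asio), but a Python analog conveys the shape (illustrative only, not Ray's code): one dedicated thread drains a task queue, with an explicit stop-and-join that every ad-hoc copy previously had to reimplement.

```python
import queue
import threading

class IOContextThread:
    # One dedicated thread drains a FIFO task queue; stop_and_join posts a
    # sentinel and joins the thread, so pending tasks finish before shutdown.
    def __init__(self):
        self._tasks = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            task = self._tasks.get()
            if task is None:  # sentinel -> stop
                return
            task()

    def post(self, fn):
        # Like io_context::post: schedule fn on the dedicated thread.
        self._tasks.put(fn)

    def stop_and_join(self):
        self._tasks.put(None)
        self._thread.join()
```

Centralizing this removes per-call-site bugs in the stop/join sequencing, which is the motivation the commit message describes.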
Commit: 62802bd
[RLlib] Discontinue support for "hybrid" API stack (using RLModule + Learner, but still on RolloutWorker and Policy) (ray-project#46085) Signed-off-by: ujjawal-khare <[email protected]>
Commit: 5e7601b
[Core] Fix object reconstruction hang on arguments pending creation (ray-project#47645) Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 4e54d89
[core][experimental] Fix test_execution_schedule_gpu (ray-project#47753)
Pass a GPU tensor to execute, but it gets converted into a CPU tensor. The issue may be related to ray-project#46440. Signed-off-by: ujjawal-khare <[email protected]>
Commit: 00cc0a5
[core] Change many Ray ID logs to WithField. (ray-project#47844)
Use structured logging by changing more `<< node_id` to use `.WithField(node_id)`. This is not intended to be complete work, but it should cover most of the cases. We did the work for NodeID, WorkerID, ActorID, JobID, TaskID, PlacementGroupID. Some logs have multiple IDs. To avoid confusion, for these we only use WithField(object_id) and don't use WithField on either of the Node IDs. This PR should have no change on Ray other than logs. Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: f3d8e46
[RLlib] Cleanup examples folder (vol 30): BC pretraining, then PPO finetuning (new API stack with RLModule checkpoints). (ray-project#47838) Signed-off-by: ujjawal-khare <[email protected]>
Commit: 80f4941
[RLlib] MultiAgentEnv API enhancements (related to defining obs-/action spaces for agents). (ray-project#47830) Signed-off-by: ujjawal-khare <[email protected]>
Commit: f4a1d5c
[RLlib] Add log-std clipping to 'MLPHead's. (ray-project#47827)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: 554195d
[RLlib] Update autoregressive actions example. (ray-project#47829)
Signed-off-by: ujjawal-khare <[email protected]>
Commit: 041874d
[kuberay] Update docs for KubeRay v1.2.2 (ray-project#47867)
Change kuberay helm and branch reference versions to v1.2.2. Signed-off-by: ujjawal-khare <[email protected]>
Commit: 9900778
[Arrow] Adding ArrowTensorTypeV2 to support tensors larger than 2Gb (ray-project#47832)
Currently, when using the tensor type in Ray Data, if a single tensor in a block grows above 2Gb (due to the use of signed `int32` offsets), this results in the following issue:
```
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
```
Consequently, this change adds support for tensors of > 4Gb in size, while maintaining compatibility with existing datasets already using tensors. This is done by forking `ArrowTensorType` in 2:
- `ArrowTensorType` (v1) remains intact
- `ArrowTensorTypeV2` is rebased on Arrow's `LargeListType` and now uses `int64` offsets
Signed-off-by: Peter Wang <[email protected]> Signed-off-by: Alexey Kudinkin <[email protected]> Co-authored-by: Peter Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
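A back-of-the-envelope check makes the limit above concrete (the helper name is illustrative): Arrow's plain list type stores element offsets as signed int32, so cumulative byte offsets past 2**31 - 1 overflow, while int64 offsets (LargeList, as in ArrowTensorTypeV2) lift that ceiling.

```python
INT32_MAX = 2**31 - 1  # largest value a signed int32 offset can hold

def needs_large_offsets(total_bytes: int) -> bool:
    # True when a variable-length column's cumulative offset would overflow
    # signed int32, i.e. when LargeList / int64 offsets are required.
    return total_bytes > INT32_MAX
```

Anything around 2 GiB of cumulative tensor data in one block crosses the boundary, which matches the "offset overflow while concatenating arrays" error quoted above.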
Commit: 4d35582
[Core] Fix check failure: sync_reactors_.find(reactor->GetRemoteNodeID()) == sync_reactors_.end() (ray-project#47861) Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 01e7634
[RLlib] New API stack: (Multi)RLModule overhaul vol 01 (some preparatory cleanups). (ray-project#47884) Signed-off-by: ujjawal-khare <[email protected]>
Commit: a7aba20
[RLlib] New API stack: (Multi)RLModule overhaul vol 02 (VPG RLModule, Algo, and Learner example classes). (ray-project#47885) Signed-off-by: ujjawal-khare <[email protected]>
Commit: d2f8737
[RLlib] New API stack: (Multi)RLModule overhaul vol 03 (Introduce generic `_forward` to further simplify the user experience). (ray-project#47889) Signed-off-by: ujjawal-khare <[email protected]>
Commit: b7e3789
[RLlib] Remove Tf support on new API stack for PPO/IMPALA/APPO (only DreamerV3 on new API stack remains with tf now). (ray-project#47892) Signed-off-by: ujjawal-khare <[email protected]>
Commit: bbb59bb
[core] Change debug_string from returning a string to streaming to an ostream. (ray-project#47893)
We have a convenience function `debug_string` used in Ray logs: it prints printables (operator<<), containers, and pairs. However, it returns a std::string which is fed into RAY_LOG(), making a copy. This changes the signature to return a `DebugStringWrapper` which holds a const reference to the argument and is printable for all already-supported types. Additionally supports std::tuple. This should only have marginal perf benefits, since we typically don't debug_string a very big data structure. Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 815a9e4
[Serve / Jobs] Check if conda env exists before removing (ray-project#47922)
Fixes some failing/flaky unit tests (TestBackwardsCompatibility.test_cli, test_failed_driver_exit_code), which fail with errors like:
```
EnvironmentLocationNotFound: Not a conda environment: /opt/miniconda/envs/jobs-backwards-compatibility-cc452d926b8748a1ab6b4fbf6a6dba2b
```
Previously failing test now passes with this PR applied: https://buildkite.com/ray-project/postmerge/builds/6479#0192693b-1b8f-4dbc-a497-26d163b52c70/181-934 Signed-off-by: Scott Lee <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
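The guard described above can be sketched like this (hedged: function names are hypothetical and the env-path list is passed in explicitly, as it would come from the "envs" key of `conda env list --json`, so the sketch stays self-contained):

```python
def conda_env_exists(env_name: str, env_paths: list) -> bool:
    # env_paths is the "envs" list reported by `conda env list --json`;
    # an env exists if some path's final component matches the name.
    return any(p.rstrip("/").split("/")[-1] == env_name for p in env_paths)

def maybe_remove_env(env_name: str, env_paths: list) -> bool:
    # Only attempt removal when the env is present, so a second cleanup pass
    # doesn't fail with EnvironmentLocationNotFound.
    if not conda_env_exists(env_name, env_paths):
        return False
    # `conda env remove -n <env_name>` would be invoked here.
    return True
```

The point of the fix is exactly this check-before-remove: deleting an already-deleted env is what produced the EnvironmentLocationNotFound flake.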
Commit: f07ef31
[job] don't continue on test setup (ray-project#47927)
When the conda env exists, just remove it and continue the testing. Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: ff382fa
[core][experimental] Avoid false positives in deadlock detection (ray-project#47912) Signed-off-by: Kai-Hsun Chen <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 80a7ef7
[serve] Stop scheduling task early when requests have been cancelled (ray-project#47847)
In `fulfill_pending_requests`, there are two nested loops:
- the outer loop greedily fulfills more requests, so that if backoff doesn't occur, it's not necessary for new asyncio tasks to be started to fulfill each request
- the inner loop handles backoff if replicas can't be found to fulfill the next request
The outer loop will be stopped if there are enough tasks to handle all pending requests. However, if all replicas are at max capacity, it's possible for the inner loop to continue to loop even when the task is no longer needed (e.g. when a request has been cancelled), because the inner loop simply continues to try to find an available replica without checking whether the current task is even necessary. This PR makes sure that at the end of each iteration of the inner loop, requests in `pending_requests_to_fulfill` that have been cancelled are cleared out, and the loop breaks out early if there are enough tasks to handle the remaining requests.
Tests:
- Added a test for the scenario where a request is cancelled while it's trying to find an available replica
- Modified the tests in `test_pow_2_scheduler.py` so that the backoff sequence uses small values (1ms) and the test timeouts are low (10ms), so that the unit tests run much faster (~5s now compared to ~30s before)
Related issue: ray-project#47585 Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
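The end-of-iteration step the fix adds can be modeled synchronously (a simplified sketch with illustrative names, not Serve's actual scheduler code): prune cancelled requests, then report whether the existing handler tasks already cover what remains, so the backoff loop can exit early.

```python
from collections import deque

def prune_and_check_done(pending: deque, num_handler_tasks: int) -> bool:
    # Drop cancelled requests so a backoff loop doesn't outlive its request,
    # then signal whether the remaining handler tasks cover the queue.
    alive = [r for r in pending if not r["cancelled"]]
    pending.clear()
    pending.extend(alive)
    return num_handler_tasks >= len(pending)
```

Without the prune, a task backing off for a request that was already cancelled would keep probing replicas that are at capacity, doing no useful work.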
Commit: 71d5ad4
[RLlib] New API stack: (Multi)RLModule overhaul vol 05 (deprecate Specs, SpecDict, TensorSpec). (ray-project#47915) Signed-off-by: ujjawal-khare <[email protected]>
Commit: c4d884b
[RLlib; fault-tolerance] Fix spot node preemption problem (RLlib does not catch correct `ObjectLostError`). (ray-project#47940) Signed-off-by: ujjawal-khare <[email protected]>
Commit: a994eec
[RLlib] New API stack: (Multi)RLModule overhaul vol 04 (deprecate RLModuleConfig; cleanups, DefaultModelConfig dataclass). (ray-project#47908) Signed-off-by: ujjawal-khare <[email protected]>
Commit: 08fc41e
[Core] Fix check failure RAY_CHECK(it != current_tasks_.end()); (ray-project#47659) Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 90160e6
[RLlib] Fix small bug in 'InfiniteLookBackBuffer.get_state/from_state'. (ray-project#47914) Signed-off-by: ujjawal-khare <[email protected]>
Commit: 269b9ad
[core] Add more debug string types (ray-project#47928)
Followup on ray-project#47893, add more "blessed container types" to debug string function. Signed-off-by: dentiny <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 7260fdf
[deps] add grpcio-tools into anyscale dependencies (ray-project#47955)
so that it participates in the dependency resolving process Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 20e1cad
[RLlib] Quick-fix for default RLModules in combination with a user-provided config-sub-dict (instead of a full `DefaultModelConfig`). (ray-project#47965) Signed-off-by: ujjawal-khare <[email protected]>
Commit: 611b645
[RLlib] Cleanup examples folder vol. 25: Remove some old API stack examples. (ray-project#47970) Signed-off-by: ujjawal-khare <[email protected]>
Commit: 0dcefc0
[RLlib] Add framework-check to MultiRLModule.add_module(). (ray-project#47973) Signed-off-by: ujjawal-khare <[email protected]>
Commit: f218402
[serve] Fix failing test pow 2 scheduler on windows (ray-project#47975)
Fix `test_pow_2_replica_scheduler.py` on windows. Best guess is that asyncio is slower on windows, so the shortened timeouts for some tests cause failures because tasks didn't get a chance to start/finish executing. Failing tests on windows:
- `test_multiple_queries_with_different_model_ids`
- `test_queue_len_cache_replica_at_capacity_is_probed`
- `test_queue_len_cache_background_probing`
Closes ray-project#47950 Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 7a6cfe0
[data] fix reading multiple parquet files with ragged ndarrays (ray-project#47961)
PyArrow infers the parquet schema based only on the first file. This causes errors when reading multiple files with ragged ndarrays. This PR fixes the issue by not using the inferred schema for reading. Fixes ray-project#47960 Signed-off-by: Hao Chen <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
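Why first-file inference breaks can be shown with a toy schema-reconciliation sketch (purely illustrative; types are strings, and this is not Ray Data's actual reader logic): instead of trusting whichever file happens to be read first, compare all per-file schemas and widen columns where files disagree, which is exactly the ragged-ndarray situation.

```python
def unify_schemas(per_file_schemas):
    # per_file_schemas: one {column_name: type_string} dict per parquet file.
    # Where files disagree on a column's type, widen to a generic "object"
    # type rather than keeping the first file's guess.
    unified = {}
    for schema in per_file_schemas:
        for name, typ in schema.items():
            if name in unified and unified[name] != typ:
                unified[name] = "object"
            else:
                unified.setdefault(name, typ)
    return unified
```

With ragged ndarrays, each file reports a different fixed shape for the same column, so a first-file schema can never describe the rest of the dataset.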
Commit: e2f7c91
[core] Decouple create worker vs pop worker request. (ray-project#47694)
Currently, when you call PopWorker(), it finds an idle worker or creates one. If a new worker is created, the worker is associated with the request and can only be used by it. This PR decouples worker creation from the worker-to-task assignment by adding an abstraction, PopWorkerRequest. Now, if a request triggers a worker creation, the request is put into a queue. When a worker becomes ready, that is, PushWorker is called either from a newly started worker or a released worker, Ray matches it to the first fitting request in the queue. This reduces latency. Later it can also be used to pre-start workers more meaningfully. Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: 3eff78e
[core] Add metrics for gcs jobs (ray-project#47793)
This PR adds metrics for job states within the job manager. In detail, a gauge stat is sent via the opencensus exporter, so running ray jobs can be tracked and alerts created later on. Fault tolerance is not considered; according to the [doc](https://docs.ray.io/en/latest/ray-core/fault_tolerance/gcs.html), state is re-constructed at restart. On testing, the best way is to observe via an opencensus backend (i.e. google monitoring dashboard), but that's not easy for open-source contributors; an alternative would be a mock / fake exporter implementation, which doesn't exist in the code base. Signed-off-by: dentiny <[email protected]> Co-authored-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
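A gauge-per-job-state metric can be modeled as a simple recount over the job table (a toy sketch with assumed state names, not Ray's exporter code): each scrape reports, per state, how many jobs are currently in it.

```python
from collections import Counter

def job_state_gauges(job_table):
    # job_table maps job_id -> state string (e.g. "RUNNING", "FINISHED").
    # The returned counter is what a gauge-per-state export would report.
    return Counter(job_table.values())
```

Because gauges are recomputed from current state rather than incremented, a restart that rebuilds the job table (as the GCS fault-tolerance doc describes) also restores correct metric values.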
Commit: 2597701
upgrade grpcio version (ray-project#47982)
Upgrade to at least 1.66.1; this is already being overwritten to 1.66.1+ during release tests. Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: b644b30
[Feat][Core] Implement single file module for runtime_env (ray-project#47807)
Supports single file modules in `py_module` runtime_env. Signed-off-by: Chi-Sheng Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: eac1cb6
[Chore][Core] Address PR 47807 comments (ray-project#48002)
PR 47807 was auto-merged without applying the doc reviews, so this commit addresses them. Signed-off-by: Chi-Sheng Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Commit: f84cd6b
[core] Add thread check to job mgr callback (ray-project#48005)
This PR follows up on ray-project#47793 (comment) and adds a thread check to the GCS job manager callback to make sure there is no concurrent access to data members. Signed-off-by: dentiny <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
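The thread-affinity check can be sketched as follows (a Python analog with hypothetical names; the real check is in C++): record the owning thread at construction, and assert every callback runs on that same thread so unsynchronized data members are never touched concurrently.

```python
import threading

class JobStateCounters:
    # Record the owning thread at construction; every mutation asserts it is
    # still on that thread, catching accidental cross-thread callbacks early.
    def __init__(self):
        self._owner_thread = threading.get_ident()
        self._counts = {}

    def on_job_state_change(self, state):
        assert threading.get_ident() == self._owner_thread, (
            "callback invoked off the owning thread"
        )
        self._counts[state] = self._counts.get(state, 0) + 1
```

This turns a silent data race into an immediate, debuggable failure at the point of misuse.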
Commit: 5d7ab4b
Signed-off-by: ujjawal-khare <[email protected]>
Commit: d5193c9
Merge branch 'fix/job-manager-logger' of github.com:ujjawal-khare-27/ray into fix/job-manager-logger
Commit: f68bfa7