
Fix/job manager logger #48003

Closed

Conversation

@ujjawal-khare-27 commented Oct 12, 2024

Why are these changes needed?

Related issue number

Closes #47389

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

sven1977 and others added 27 commits October 16, 2024 01:32
…eric `_forward` to further simplify the user experience). (ray-project#47889)

Signed-off-by: ujjawal-khare <[email protected]>
…DreamerV3 on new API stack remains with tf now). (ray-project#47892)

Signed-off-by: ujjawal-khare <[email protected]>
… ostream. (ray-project#47893)

We have a convenience function `debug_string` used in Ray logs: it
prints printables (operator<<), containers, and pairs. However, it returns a
std::string which is fed into RAY_LOG(). This makes a copy.

Changes the signature to return a `DebugStringWrapper`, which holds a const
reference to the argument and is printable for all already supported
types. Additionally supports std::tuple.

This should only have marginal perf benefits, since we typically don't
call debug_string on a very big data structure.

Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
…#47922)

## Why are these changes needed?
Fixes some failing/flaky unit tests, which fail with errors like:
```
EnvironmentLocationNotFound: Not a conda environment: /opt/miniconda/envs/jobs-backwards-compatibility-cc452d926b8748a1ab6b4fbf6a6dba2b
```
- TestBackwardsCompatibility.test_cli
- test_failed_driver_exit_code

Previously failing test now passes with this PR applied:
https://buildkite.com/ray-project/postmerge/builds/6479#0192693b-1b8f-4dbc-a497-26d163b52c70/181-934

## Related issue number

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a method in Tune, I've added it in `doc/source/tune/api/` under the
corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
When the conda env already exists, just remove it and continue
testing.
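
Purely as an illustration of that idea (not the actual change in this PR; the env name below is hypothetical), a test setup could clear a stale env before recreating it:

```python
import subprocess

# Hypothetical env name, for illustration only.
ENV_NAME = "jobs-backwards-compatibility-test"

def ensure_fresh_conda_env(env_name: str = ENV_NAME) -> None:
    """Remove a leftover conda env from a previous run, then recreate it."""
    # Removing a non-existent env is harmless, so the result isn't checked.
    subprocess.run(["conda", "env", "remove", "-n", env_name, "-y"], check=False)
    subprocess.run(["conda", "create", "-n", env_name, "python=3.9", "-y"], check=True)
```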

Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
…ay-project#47847)

In `fulfill_pending_requests`, there are two nested loops:
- the outer loop greedily fulfills more requests so that, if backoff
doesn't occur, new asyncio tasks don't need to be started to fulfill
each request
- the inner loop handles backoff if replicas can't be found to fulfill
the next request

The outer loop stops once there are enough tasks to handle all pending
requests. However, if all replicas are at max capacity, the inner loop
can keep looping even when the task is no longer needed (e.g. when a
request has been cancelled), because it simply keeps trying to find an
available replica without checking whether the current task is still
necessary.

This PR makes sure that at the end of each iteration of the inner loop,
it clears out requests in `pending_requests_to_fulfill` that have been
cancelled, and then breaks out of the loop if there are enough tasks to
handle the remaining requests.
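
A minimal sketch of that behavior (illustrative only, not Serve's actual implementation; only the name `pending_requests_to_fulfill` is taken from the description above):

```python
import asyncio
from collections import deque

async def fulfill_pending_requests(
    pending_requests_to_fulfill: deque,  # deque of asyncio.Future-like requests
    num_scheduling_tasks: int,
    choose_replica,  # async callable returning a replica or None
) -> None:
    while pending_requests_to_fulfill:
        backoff_s = 0.001
        # Inner loop: back off until a replica is found.
        while True:
            replica = await choose_replica()
            if replica is not None:
                break
            await asyncio.sleep(backoff_s)
            backoff_s = min(backoff_s * 2, 0.064)

            # The fix, conceptually: drop cancelled requests, then stop if the
            # existing tasks already cover everything that is left.
            alive = [r for r in pending_requests_to_fulfill if not r.cancelled()]
            pending_requests_to_fulfill.clear()
            pending_requests_to_fulfill.extend(alive)
            if num_scheduling_tasks >= len(pending_requests_to_fulfill):
                return

        request = pending_requests_to_fulfill.popleft()
        request.set_result(replica)
```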

Tests:
- Added a test for the scenario where a request is cancelled while it's
trying to find an available replica
- Also modified the tests in `test_pow_2_scheduler.py` so that the
backoff sequence uses small values (1ms) and the test timeouts are also
low (10ms), so the unit tests run much faster (~5s now compared to ~30s
before).

## Related issue number

related: ray-project#47585

---------

Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
… not catch correct `ObjectLostError`). (ray-project#47940)

Signed-off-by: ujjawal-khare <[email protected]>
…oduleConfig; cleanups, DefaultModelConfig dataclass). (ray-project#47908)

Signed-off-by: ujjawal-khare <[email protected]>
Follow-up on ray-project#47893: add more
"blessed container types" to the debug string function.

Signed-off-by: dentiny <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
so that it participates in the dependency resolution process

Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
…ovided config-sub-dict (instead of a full `DefaultModelConfig`). (ray-project#47965)

Signed-off-by: ujjawal-khare <[email protected]>
## Why are these changes needed?

Fix `test_pow_2_replica_scheduler.py` on Windows. The best guess is that
asyncio is slower on Windows, so the shortened timeouts for some tests
cause them to fail because tasks didn't get a chance to start/finish
executing.

Failing tests on Windows:
- `test_multiple_queries_with_different_model_ids`
- `test_queue_len_cache_replica_at_capacity_is_probed`
- `test_queue_len_cache_background_probing`

## Related issue number

Closes ray-project#47950

Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
…roject#47961)

## Why are these changes needed?

PyArrow infers the Parquet schema based only on the first file. This
causes errors when reading multiple files with ragged ndarrays.

This PR fixes the issue by not using the inferred schema for reading.
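
A rough sketch of the failure mode (paths and shapes are made up; this is illustrative rather than an exact reproduction of ray-project#47960):

```python
import numpy as np
import ray

# Two files whose ndarray columns have different ("ragged") shapes.
ds = ray.data.from_items(
    [{"data": np.zeros((16, 16))}, {"data": np.zeros((32, 32))}]
).repartition(2)
ds.write_parquet("/tmp/ragged_parquet")

# Before this fix, the read could fail because the schema inferred from the
# first file was applied to every file; now the inferred schema is not used
# to force reads of the remaining files.
ray.data.read_parquet("/tmp/ragged_parquet").materialize()
```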


## Related issue number
Fixes ray-project#47960

---------

Signed-off-by: Hao Chen <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
Currently, when you call PopWorker(), it finds an idle worker or creates a
new one. If a new worker is created, the worker is associated with the
request and can only be used by it.

This PR decouples worker creation from worker-to-task assignment by adding
a PopWorkerRequest abstraction. Now, if a request triggers worker creation,
the request is put into a queue. Whenever a worker becomes ready (that is,
PushWorker is called, either from a newly started worker or a released
worker), Ray matches it to the first fitting request in the queue. This
reduces latency.

Later, this can also be used to pre-start workers more meaningfully.
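
The actual change lives in the C++ worker pool, but the queue-and-match idea can be sketched in Python (every name below is illustrative, not the real raylet API):

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Worker:
    runtime_env: str

@dataclass
class PopWorkerRequest:
    """Stand-in for the new abstraction: a pending request for a worker."""
    runtime_env: str
    callback: Callable[[Worker], None]

@dataclass
class WorkerPool:
    idle_workers: List[Worker] = field(default_factory=list)
    pending_requests: deque = field(default_factory=deque)

    def pop_worker(self, request: PopWorkerRequest) -> None:
        # Hand out a fitting idle worker immediately if one exists.
        for worker in self.idle_workers:
            if worker.runtime_env == request.runtime_env:
                self.idle_workers.remove(worker)
                request.callback(worker)
                return
        # Otherwise queue the request; it is no longer bound to the specific
        # worker whose startup it triggers.
        self.pending_requests.append(request)
        self._start_new_worker(request.runtime_env)

    def push_worker(self, worker: Worker) -> None:
        # A newly started or released worker is matched to the first fitting
        # queued request instead of one pre-assigned request.
        for request in list(self.pending_requests):
            if request.runtime_env == worker.runtime_env:
                self.pending_requests.remove(request)
                request.callback(worker)
                return
        self.idle_workers.append(worker)

    def _start_new_worker(self, runtime_env: str) -> None:
        ...  # launch a worker process asynchronously (elided)
```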

Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
This PR adds metrics for job states within the job manager.

In detail, a gauge stat is sent via the OpenCensus exporter, so running Ray
jobs can be tracked and alerts can be created later on.

Fault tolerance is not considered; according to the
[doc](https://docs.ray.io/en/latest/ray-core/fault_tolerance/gcs.html),
state is reconstructed at restart.

For testing, the best option is to observe via an OpenCensus backend (e.g.
a Google monitoring dashboard), which is not easy for open-source
contributors, or to use a mock/fake exporter implementation, which I don't
find in the code base.
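
The change itself is in the GCS job manager and reports through the OpenCensus exporter; purely to illustrate the gauge-per-state idea, a sketch using Ray's user-facing metrics API could look like this (metric and tag names are made up):

```python
from ray.util.metrics import Gauge

# Hypothetical metric; requires Ray to be initialized in the process.
job_state_gauge = Gauge(
    "job_manager_num_jobs",
    description="Number of jobs per state, tracked by the job manager.",
    tag_keys=("state",),
)

def report_job_states(job_counts: dict) -> None:
    """job_counts maps a job state name (e.g. 'RUNNING') to a count."""
    for state, count in job_counts.items():
        job_state_gauge.set(count, tags={"state": state})
```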

Signed-off-by: dentiny <[email protected]>
Co-authored-by: Ruiyang Wang <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
to at least 1.66.1

this is already being overwritten to 1.66.1+ during release tests

Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
…t#47807)

Supports single-file modules in the `py_modules` runtime_env.
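
For illustration (the module path below is hypothetical), this allows a standalone `.py` file to be passed rather than only a package directory or wheel:

```python
import ray

# Hypothetical single-file module shipped to workers via the runtime_env.
ray.init(runtime_env={"py_modules": ["/path/to/my_module.py"]})

@ray.remote
def use_module():
    import my_module  # importable on the worker because of py_modules above
    return my_module.__name__

print(ray.get(use_module.remote()))
```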

Signed-off-by: Chi-Sheng Liu <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
PR 47807 was auto-merged without applying the doc reviews, so this
commit addresses them.

Signed-off-by: Chi-Sheng Liu <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
This PR follows up on a comment
(ray-project#47793 (comment))
and adds thread checking to the GCS job manager callback to make sure
there is no concurrent access to data members.

Signed-off-by: dentiny <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
@ujjawal-khare-27 (Author) commented Oct 15, 2024

@GeneDer I opened a new PR for a cleaner commit history: #48029

Can you please take a look at it?

Labels: core, triage
Development

Successfully merging this pull request may close these issues.

Core: File descriptor of ray job does not get closed