-
Notifications
You must be signed in to change notification settings - Fork 5.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix/job manager logger #48003
Closed
ujjawal-khare-27
wants to merge
1,569
commits into
ray-project:master
from
ujjawal-khare-27:fix/job-manager-logger
Closed
Fix/job manager logger #48003
ujjawal-khare-27
wants to merge
1,569
commits into
ray-project:master
from
ujjawal-khare-27:fix/job-manager-logger
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
ujjawal-khare-27
requested review from
richardliaw,
edoakes and
aslonnie
as code owners
October 12, 2024 06:31
ujjawal-khare
force-pushed
the
fix/job-manager-logger
branch
from
October 12, 2024 06:33
bebd681
to
0a9ac33
Compare
ujjawal-khare
requested review from
a team,
rkooo567,
jjyao,
shrekris-anyscale,
zcin,
GeneDer,
akshay-anyscale,
hongpeng-guo,
justinvyu,
matthewdeng,
raulchen,
woshiyyya,
maxpumperla,
pcmoritz,
kevin85421,
sven1977,
simonsays1980,
scottjlee,
bveeramani,
stephanie-wang,
omatthew98 and
hongchaodeng
as code owners
October 12, 2024 06:33
anyscalesam
added
the
triage
Needs triage (eg: priority, bug/not-bug, and owning component)
label
Oct 12, 2024
…eric `_forward` to further simplify the user experience). (ray-project#47889) Signed-off-by: ujjawal-khare <[email protected]>
…DreamerV3 on new API stack remains with tf now). (ray-project#47892) Signed-off-by: ujjawal-khare <[email protected]>
… ostream. (ray-project#47893) We have a convenience function `debug_string` used in Ray logs: it prints printables (operator<<), containers, pairs. However it returns a std::string which is feed into RAY_LOG(). This makes a copy. Changes the signature to return a `DebugStringWrapper` which holds const reference to the argument, and is printable for all already supported types. Additionally supports std::tuple. This should only have marginal perf benefits since we typically don't debug_string a very big data structure. Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
…#47922) ## Why are these changes needed? Fixes some failing/flaky unit tests tests, which fail with errors like: ``` EnvironmentLocationNotFound: Not a conda environment: /opt/miniconda/envs/jobs-backwards-compatibility-cc452d926b8748a1ab6b4fbf6a6dba2b ``` - TestBackwardsCompatibility.test_cli - test_failed_driver_exit_code Previously failing test now passes with this PR applied: https://buildkite.com/ray-project/postmerge/builds/6479#0192693b-1b8f-4dbc-a497-26d163b52c70/181-934 ## Related issue number ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: Scott Lee <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
when the conda env exists, should just remove it and continue the testing Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
…-project#47912) Signed-off-by: Kai-Hsun Chen <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
…ay-project#47847) In `fulfill_pending_requests`, there are two nested loops: - the outer loop greedily fulfills more requests so that if backoff doesn't occur, it's not necessary for new asyncio tasks to be started to fulfill each request - the inner loop handles backoff if replicas can't be found to fulfill the next request The outer loop will be stopped if there are enough tasks to handle all pending requests. However if all replicas are at max capacity, it's possible for the inner loop to continue to loop even when the task is no longer needed (e.g. when a request has been cancelled), because the inner loop simply continues to try to find an available replica without checking if the current task is even necessary. This PR makes sure that at the end of each iteration of the inner loop, it clears out requests in `pending_requests_to_fulfill` that have been cancelled, and then breaks out of the loop if there are enough tasks to handle the remaining requests. Tests: - Added a test that tests for the scenario where a request is cancelled while it's trying to find an available replica - Also modified the tests in `test_pow_2_scheduler.py` so that the backoff sequence is small values (1ms), and the timeouts in the tests are also low `10ms`, so that the unit tests run much faster (~5s now compared to ~30s before). ## Related issue number related: ray-project#47585 --------- Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
…cs, SpecDict, TensorSpec). (ray-project#47915) Signed-off-by: ujjawal-khare <[email protected]>
… not catch correct `ObjectLostError`). (ray-project#47940) Signed-off-by: ujjawal-khare <[email protected]>
…oduleConfig; cleanups, DefaultModelConfig dataclass). (ray-project#47908) Signed-off-by: ujjawal-khare <[email protected]>
…project#47659) Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
…'. (ray-project#47914) Signed-off-by: ujjawal-khare <[email protected]>
Followup on ray-project#47893, add more "blessed container types" to debug string function. Signed-off-by: dentiny <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
so that it participates in the dependency resolving process Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
…ovided config-sub-dict (instead of a full `DefaultModelConfig`). (ray-project#47965) Signed-off-by: ujjawal-khare <[email protected]>
…amples. (ray-project#47970) Signed-off-by: ujjawal-khare <[email protected]>
…ject#47973) Signed-off-by: ujjawal-khare <[email protected]>
## Why are these changes needed? Fix `test_pow_2_replica_scheduler.py` on windows. Best guess is asyncio is slower on windows, so the shortened timeouts for some tests cause the tests to fail because tasks didn't get a chance to start/finish executing. Failing tests on windows: - `test_multiple_queries_with_different_model_ids` - `test_queue_len_cache_replica_at_capacity_is_probed` - `test_queue_len_cache_background_probing` ## Related issue number Closes ray-project#47950 Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
…roject#47961) ## Why are these changes needed? PyArrow infers parquet schema only based on the first file. This will cause errors when reading multiple files with ragged ndarrays. This PR fixes this issue by not using the inferred schema for reading. <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number Fixes ray-project#47960 --------- Signed-off-by: Hao Chen <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Now, when you call PopWorker(), it finds an idle one or creates a worker. If a new worker is created, the worker is associated to the request and can only be used by it. This PR decouples the worker creation and the worker-to-task assignment, by adding an abstraction namely PopWorkerRequest. Now, if a req triggers a worker creation, the req is put into a queue. If there are workers ready, that is a PushWorker is called, either from a newly started worker or a released worker, Ray matches the first fitting request in the queue. This reduces latency. Later it can also be used to pre-start workers more meaningfully. Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
This PR adds metrics for job states within job manager. In detail, a gauge stats is sent via opencensus exporter, so running ray jobs could be tracked and alerts could be created later on. Fault tolerance is not considered, according to [doc](https://docs.ray.io/en/latest/ray-core/fault_tolerance/gcs.html), state is re-constructed at restart. On testing, the best way is to observe via opencensus backend (i.e. google monitoring dashboard), but not easy for open-source contributors; or to have a mock / fake exporter implementation, which I don't find in the code base. Signed-off-by: dentiny <[email protected]> Co-authored-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
to at least 1.66.1 this is already being overwritten to 1.66.1+ when during release tests Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
…t#47807) Supports single file modules in `py_module` runtime_env. Signed-off-by: Chi-Sheng Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
PR 47807 was auto-merged without applying the doc reviews, so this commit addresses them. Signed-off-by: Chi-Sheng Liu <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
This PR followup for comment ray-project#47793 (comment), and adds a thread checking to GCS job manager callback to make sure no concurrent access for data members. Signed-off-by: dentiny <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
…ray into fix/job-manager-logger
8 tasks
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Why are these changes needed?
Related issue number
Closes #47389
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.