
Fix/job manager logger #48003

Closed

Conversation

@ujjawal-khare-27 commented Oct 12, 2024

Why are these changes needed?

Related issue number

Closes #47389

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

sven1977 and others added 27 commits October 16, 2024 01:32
…eric `_forward` to further simplify the user experience). (ray-project#47889)

Signed-off-by: ujjawal-khare <[email protected]>
…DreamerV3 on new API stack remains with tf now). (ray-project#47892)

Signed-off-by: ujjawal-khare <[email protected]>
… ostream. (ray-project#47893)

We have a convenience function `debug_string` used in Ray logs: it
prints printables (operator<<), containers, and pairs. However, it returns a
std::string which is fed into RAY_LOG(). This makes a copy.

Changes the signature to return a `DebugStringWrapper`, which holds a const
reference to the argument and is printable for all already supported
types. Additionally supports std::tuple.

This should only have marginal perf benefits, since we typically don't
call debug_string on a very big data structure.

Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
…#47922)

## Why are these changes needed?
Fixes some failing/flaky unit tests, which fail with errors like:
```
EnvironmentLocationNotFound: Not a conda environment: /opt/miniconda/envs/jobs-backwards-compatibility-cc452d926b8748a1ab6b4fbf6a6dba2b
```
- TestBackwardsCompatibility.test_cli
- test_failed_driver_exit_code

Previously failing test now passes with this PR applied:
https://buildkite.com/ray-project/postmerge/builds/6479#0192693b-1b8f-4dbc-a497-26d163b52c70/181-934

## Related issue number

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a method in Tune, I've added it in `doc/source/tune/api/` under the
corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
When the conda env already exists, just remove it and continue
testing.
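
Purely as an illustration of that idea (not the actual change in this PR; the env name below is hypothetical), a test setup could clear a stale env before recreating it:

```python
import subprocess

# Hypothetical env name, for illustration only.
ENV_NAME = "jobs-backwards-compatibility-test"

def ensure_fresh_conda_env(env_name: str = ENV_NAME) -> None:
    """Remove a leftover conda env from a previous run, then recreate it."""
    # Removing a non-existent env is harmless, so the result isn't checked.
    subprocess.run(["conda", "env", "remove", "-n", env_name, "-y"], check=False)
    subprocess.run(["conda", "create", "-n", env_name, "python=3.9", "-y"], check=True)
```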

Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
…ay-project#47847)

In `fulfill_pending_requests`, there are two nested loops:
- the outer loop greedily fulfills more requests so that, if backoff
doesn't occur, new asyncio tasks don't need to be started to fulfill
each request
- the inner loop handles backoff if replicas can't be found to fulfill
the next request

The outer loop stops once there are enough tasks to handle all pending
requests. However, if all replicas are at max capacity, the inner loop
can keep looping even when the task is no longer needed (e.g. when a
request has been cancelled), because it simply keeps trying to find an
available replica without checking whether the current task is still
necessary.

This PR makes sure that at the end of each iteration of the inner loop,
it clears out requests in `pending_requests_to_fulfill` that have been
cancelled, and then breaks out of the loop if there are enough tasks to
handle the remaining requests.
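
A minimal sketch of that behavior (illustrative only, not Serve's actual implementation; only the name `pending_requests_to_fulfill` is taken from the description above):

```python
import asyncio
from collections import deque

async def fulfill_pending_requests(
    pending_requests_to_fulfill: deque,  # deque of asyncio.Future-like requests
    num_scheduling_tasks: int,
    choose_replica,  # async callable returning a replica or None
) -> None:
    while pending_requests_to_fulfill:
        backoff_s = 0.001
        # Inner loop: back off until a replica is found.
        while True:
            replica = await choose_replica()
            if replica is not None:
                break
            await asyncio.sleep(backoff_s)
            backoff_s = min(backoff_s * 2, 0.064)

            # The fix, conceptually: drop cancelled requests, then stop if the
            # existing tasks already cover everything that is left.
            alive = [r for r in pending_requests_to_fulfill if not r.cancelled()]
            pending_requests_to_fulfill.clear()
            pending_requests_to_fulfill.extend(alive)
            if num_scheduling_tasks >= len(pending_requests_to_fulfill):
                return

        request = pending_requests_to_fulfill.popleft()
        request.set_result(replica)
```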

Tests:
- Added a test for the scenario where a request is cancelled while it's
trying to find an available replica
- Also modified the tests in `test_pow_2_scheduler.py` so that the
backoff sequence uses small values (1ms) and the test timeouts are also
low (10ms), so the unit tests run much faster (~5s now compared to ~30s
before).

## Related issue number

related: ray-project#47585

---------

Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
… not catch correct `ObjectLostError`). (ray-project#47940)

Signed-off-by: ujjawal-khare <[email protected]>
…oduleConfig; cleanups, DefaultModelConfig dataclass). (ray-project#47908)

Signed-off-by: ujjawal-khare <[email protected]>
Follow-up on ray-project#47893: add more
"blessed container types" to the debug string function.

Signed-off-by: dentiny <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
so that it participates in the dependency resolution process

Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
…ovided config-sub-dict (instead of a full `DefaultModelConfig`). (ray-project#47965)

Signed-off-by: ujjawal-khare <[email protected]>
## Why are these changes needed?

Fix `test_pow_2_replica_scheduler.py` on Windows. The best guess is that
asyncio is slower on Windows, so the shortened timeouts for some tests
cause them to fail because tasks didn't get a chance to start/finish
executing.

Failing tests on Windows:
- `test_multiple_queries_with_different_model_ids`
- `test_queue_len_cache_replica_at_capacity_is_probed`
- `test_queue_len_cache_background_probing`

## Related issue number

Closes ray-project#47950

Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
…roject#47961)

## Why are these changes needed?

PyArrow infers the Parquet schema based only on the first file. This
causes errors when reading multiple files with ragged ndarrays.

This PR fixes the issue by not using the inferred schema for reading.
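
A rough sketch of the failure mode (paths and shapes are made up; this is illustrative rather than an exact reproduction of ray-project#47960):

```python
import numpy as np
import ray

# Two files whose ndarray columns have different ("ragged") shapes.
ds = ray.data.from_items(
    [{"data": np.zeros((16, 16))}, {"data": np.zeros((32, 32))}]
).repartition(2)
ds.write_parquet("/tmp/ragged_parquet")

# Before this fix, the read could fail because the schema inferred from the
# first file was applied to every file; now the inferred schema is not used
# to force reads of the remaining files.
ray.data.read_parquet("/tmp/ragged_parquet").materialize()
```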


## Related issue number
Fixes ray-project#47960

---------

Signed-off-by: Hao Chen <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
Currently, when you call PopWorker(), it finds an idle worker or creates a
new one. If a new worker is created, the worker is associated with the
request and can only be used by it.

This PR decouples worker creation from worker-to-task assignment by adding
a PopWorkerRequest abstraction. Now, if a request triggers worker creation,
the request is put into a queue. Whenever a worker becomes ready (that is,
PushWorker is called, either from a newly started worker or a released
worker), Ray matches it to the first fitting request in the queue. This
reduces latency.

Later, this can also be used to pre-start workers more meaningfully.
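
The actual change lives in the C++ worker pool, but the queue-and-match idea can be sketched in Python (every name below is illustrative, not the real raylet API):

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Worker:
    runtime_env: str

@dataclass
class PopWorkerRequest:
    """Stand-in for the new abstraction: a pending request for a worker."""
    runtime_env: str
    callback: Callable[[Worker], None]

@dataclass
class WorkerPool:
    idle_workers: List[Worker] = field(default_factory=list)
    pending_requests: deque = field(default_factory=deque)

    def pop_worker(self, request: PopWorkerRequest) -> None:
        # Hand out a fitting idle worker immediately if one exists.
        for worker in self.idle_workers:
            if worker.runtime_env == request.runtime_env:
                self.idle_workers.remove(worker)
                request.callback(worker)
                return
        # Otherwise queue the request; it is no longer bound to the specific
        # worker whose startup it triggers.
        self.pending_requests.append(request)
        self._start_new_worker(request.runtime_env)

    def push_worker(self, worker: Worker) -> None:
        # A newly started or released worker is matched to the first fitting
        # queued request instead of one pre-assigned request.
        for request in list(self.pending_requests):
            if request.runtime_env == worker.runtime_env:
                self.pending_requests.remove(request)
                request.callback(worker)
                return
        self.idle_workers.append(worker)

    def _start_new_worker(self, runtime_env: str) -> None:
        ...  # launch a worker process asynchronously (elided)
```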

Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
This PR adds metrics for job states within the job manager.

In detail, a gauge stat is sent via the OpenCensus exporter, so running Ray
jobs can be tracked and alerts can be created later on.

Fault tolerance is not considered; according to the
[doc](https://docs.ray.io/en/latest/ray-core/fault_tolerance/gcs.html),
state is reconstructed at restart.

For testing, the best option is to observe via an OpenCensus backend (e.g.
a Google monitoring dashboard), which is not easy for open-source
contributors, or to use a mock/fake exporter implementation, which I don't
find in the code base.
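
The change itself is in the GCS job manager and reports through the OpenCensus exporter; purely to illustrate the gauge-per-state idea, a sketch using Ray's user-facing metrics API could look like this (metric and tag names are made up):

```python
from ray.util.metrics import Gauge

# Hypothetical metric; requires Ray to be initialized in the process.
job_state_gauge = Gauge(
    "job_manager_num_jobs",
    description="Number of jobs per state, tracked by the job manager.",
    tag_keys=("state",),
)

def report_job_states(job_counts: dict) -> None:
    """job_counts maps a job state name (e.g. 'RUNNING') to a count."""
    for state, count in job_counts.items():
        job_state_gauge.set(count, tags={"state": state})
```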

Signed-off-by: dentiny <[email protected]>
Co-authored-by: Ruiyang Wang <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
to at least 1.66.1

this is already being overwritten to 1.66.1+ during release tests

Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
…t#47807)

Supports single-file modules in the `py_modules` runtime_env.
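
For illustration (the module path below is hypothetical), this allows a standalone `.py` file to be passed rather than only a package directory or wheel:

```python
import ray

# Hypothetical single-file module shipped to workers via the runtime_env.
ray.init(runtime_env={"py_modules": ["/path/to/my_module.py"]})

@ray.remote
def use_module():
    import my_module  # importable on the worker because of py_modules above
    return my_module.__name__

print(ray.get(use_module.remote()))
```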

Signed-off-by: Chi-Sheng Liu <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
PR 47807 was auto-merged without applying the doc reviews, so this
commit addresses them.

Signed-off-by: Chi-Sheng Liu <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
This PR follows up on a comment
(ray-project#47793 (comment))
and adds thread checking to the GCS job manager callback to make sure
there is no concurrent access to data members.

Signed-off-by: dentiny <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
@ujjawal-khare-27 (Author) commented Oct 15, 2024

@GeneDer I opened a new PR for a cleaner commit history: #48029

Can you please take a look at it?

Labels: core, triage
Development

Successfully merging this pull request may close these issues.

Core: File descriptor of ray job does not get closed