Fix/job manager logger #48003

Closed
1569 commits
42361bc
[aDAG] support buffered input (#47272)
rkooo567 Sep 10, 2024
f1e2704
[aDAG] Clean up arg_to_consumers in _get_or_compile() (#47514)
ruisearch42 Sep 10, 2024
8d20388
[RLlib; Offline RL] Store episodes in state form. (#47294)
simonsays1980 Sep 10, 2024
6625ee2
[Core][aDag] Support multi node multi reader (#47480)
rkooo567 Sep 10, 2024
290a14a
Allow control of some serve configuration via env vars (#47533)
timkpaine Sep 10, 2024
a0430bb
Update incremental build troubleshooting tip with style nits (#47592)
angelinalg Sep 10, 2024
a6a63e2
[observability][export-api] Write driver job events (#47418)
nikitavemuri Sep 10, 2024
6e790d9
[core][dashboard] push down job_or_submission_id to GCS. (#47492)
rynewang Sep 11, 2024
e591c40
[Doc][KubeRay] Add description tables for RayCluster Status in the ob…
rueian Sep 11, 2024
87519fa
[aDAG] Fix ranks ordering for custom NCCL group (#47594)
ruisearch42 Sep 11, 2024
47d9b0d
[RLlib] RLModule: `InferenceOnlyAPI`. (#47572)
sven1977 Sep 11, 2024
747b6f5
[Data] Remove `_default_metadata_providers` (#47575)
bveeramani Sep 11, 2024
15132d5
[Serve] Remove unused Serve constants (#47593)
GeneDer Sep 11, 2024
e038cb0
Fix windows://:task_event_buffer_test (#47577)
nikitavemuri Sep 11, 2024
644874d
[RLlib] RLModule API: `SelfSupervisedLossAPI` for RLModules that brin…
sven1977 Sep 11, 2024
b1c7caa
[GCS] Optimize `GetAllJobInfo` API for performance (#47530)
liuxsh9 Sep 11, 2024
4b38d57
[Serve] fix default serve logger behavior (#47600)
GeneDer Sep 11, 2024
102ec9d
[core] Make is_gpu, is_actor, root_detached_id fields late bind to wo…
rynewang Sep 11, 2024
c47c430
[core][adag] Separate the outputs of execute and execute_async to mul…
jeffreyjeffreywang Sep 11, 2024
21379f3
[serve] Faster detection of dead replicas (#47237)
zcin Sep 12, 2024
591a4d0
[spark] Improve Ray-on-spark fault tolerance in case of Spark executo…
WeichenXu123 Sep 12, 2024
60d78b1
[serve] skip failure test on windows (#47630)
zcin Sep 12, 2024
0f9fa48
[serve] reorganize replica scheduler classes (#47615)
zcin Sep 12, 2024
3c2b92c
[Core] Remove code accidently got in (#47612)
rkooo567 Sep 13, 2024
35fe4ba
[Core][aDAG] support multi readers in multi node when dag is created …
rkooo567 Sep 14, 2024
0af4ca7
[core] out of band serialization exception (#47544)
rkooo567 Sep 14, 2024
ebb984e
[core][experimental] Allocate a channel for each InputAttributeNode (…
kevin85421 Sep 15, 2024
804b4f3
[Data] Add `partitioning` parameter to `read_parquet` (#47553)
bveeramani Sep 16, 2024
96175fb
[spark] Refine comment in Starting ray worker spark task (#47670)
WeichenXu123 Sep 16, 2024
2a7679d
[Core][aDAG] Set buffer size to 1 for regression (#47639)
rkooo567 Sep 16, 2024
de9be8f
Add perf metrics for 2.36.0 (#47574)
khluu Sep 16, 2024
3af892c
[RLlib] Add "shuffle batch per epoch" option. (#47458)
sven1977 Sep 17, 2024
d738010
[RLlib; Offline RL] Enable buffering episodes. (#47501)
simonsays1980 Sep 17, 2024
ca4be70
[Core] Make JobSupervisor logs structured (#47699)
jjyao Sep 17, 2024
73b528b
[serve] wrap obj ref in result wrapper in deployment response (#47655)
zcin Sep 17, 2024
9dbbe38
[Core] Fix broken dashboard worker page (#47714)
jjyao Sep 17, 2024
c1bdd25
[core][experimental] Remove unused attr CompiledDAG._type_hints (#47706)
kevin85421 Sep 17, 2024
f966d2e
[Data] Re-phrase the streaming executor current usage string (#47515)
sofianhnaide Sep 18, 2024
3f439f8
[serve] improve tests (#47722)
zcin Sep 18, 2024
9e28fb7
[Core] Add test case where there is dead node for /nodes?view=summary…
jjyao Sep 18, 2024
0426ee4
[Dashboard] Optimizing performance of Ray Dashboard (#47617)
alexeykudinkin Sep 19, 2024
dd8ee01
[core][aDAG] Fix a bug where multi arg + exception doesn't work (#47704)
rkooo567 Sep 19, 2024
361c10e
[fake autoscaler] use check_call in fake multi node test utils (#47772)
aslonnie Sep 22, 2024
605c640
[RLlib] RLModule: Simplify defining custom distribution classes and a…
sven1977 Sep 23, 2024
aaa3d8d
[fake autoscaler] remove the redundant mkdir (#47786)
aslonnie Sep 23, 2024
c7ff9c8
[Data] Simplify and consolidate progress bar outputs (#47692)
scottjlee Sep 23, 2024
6aae543
Add perf metrics for 2.37.0 (#47791)
khluu Sep 23, 2024
bc7f7b0
[Serve] add dependencies on openssl (#47738)
GeneDer Sep 23, 2024
247be0b
[docker] Update latest Docker dependencies for 2.36.0 release (#47748)
khluu Sep 23, 2024
d4e7f7f
[docker] Update latest Docker dependencies for 2.36.1 release (#47801)
khluu Sep 23, 2024
f994475
[observability][export-api] Write submission job events (#47468)
nikitavemuri Sep 23, 2024
c8f16e3
Move export events to separate folder (#47747)
nikitavemuri Sep 24, 2024
1b8fcac
[release] stream the full anyscale log to buildkite (#47808)
can-anyscale Sep 25, 2024
575ee94
[RLlib; Offline RL] Offline performance cleanup. (#47731)
simonsays1980 Sep 25, 2024
1d25a39
[docker] Update latest Docker dependencies for 2.37.0 release (#47812)
khluu Sep 25, 2024
ce75400
[RLlib] Fix action masking example. (#47817)
simonsays1980 Sep 25, 2024
788db07
[Core] Separate the attempt_number with the task_status in memory sum…
MengjinYan Sep 25, 2024
55397ea
[RLlib; docs] New API stack migration guide. (#47779)
sven1977 Sep 26, 2024
27985d4
[RLlib; new API stack by default] Switch on new API stack by default …
sven1977 Sep 26, 2024
90742fb
[Core] Fix a Typo in dict_to_state function parameter name (#47822)
MengjinYan Sep 26, 2024
1383374
[core] Introducing InstrumentedIOContextWithThread. (#47831)
rynewang Sep 26, 2024
417cdd2
[RLlib] Discontinue support for "hybrid" API stack (using RLModule + …
sven1977 Sep 27, 2024
6c160b3
[Core] Fix object reconstruction hang on arguments pending creation (…
jjyao Sep 27, 2024
3714afd
[core][experimental] Fix test_execution_schedule_gpu (#47753)
kevin85421 Sep 28, 2024
256c177
[core] Change many Ray ID logs to WithField. (#47844)
rynewang Sep 28, 2024
a4f62b6
[RLlib] Cleanup examples folder (vol 30): BC pretraining, then PPO fi…
sven1977 Sep 28, 2024
1ac860f
[RLlib] MultiAgentEnv API enhancements (related to defining obs-/acti…
sven1977 Sep 28, 2024
43a8a1d
[RLlib] Add log-std clipping to 'MLPHead's. (#47827)
simonsays1980 Sep 30, 2024
cfbda91
[RLlib] Update autoregressive actions example. (#47829)
simonsays1980 Sep 30, 2024
71bb74b
[kuberay] Update docs for KubeRay v1.2.2 (#47867)
kevin85421 Sep 30, 2024
cd61cb3
[Arrow] Adding `ArrowTensorTypeV2` to support tensors larger than 2Gb…
alexeykudinkin Oct 1, 2024
21af246
[Core] Fix check failure: sync_reactors_.find(reactor->GetRemoteNodeI…
jjyao Oct 4, 2024
072d349
[RLlib] New API stack: (Multi)RLModule overhaul vol 01 (some preparat…
sven1977 Oct 4, 2024
383d7ff
[RLlib] New API stack: (Multi)RLModule overhaul vol 02 (VPG RLModule,…
sven1977 Oct 4, 2024
e4401e5
[RLlib] New API stack: (Multi)RLModule overhaul vol 03 (Introduce gen…
sven1977 Oct 5, 2024
759b0c8
[RLlib] Remove Tf support on new API stack for PPO/IMPALA/APPO (only …
sven1977 Oct 7, 2024
b50f7c1
[core] Change debug_string from returning a string to streaming to an…
rynewang Oct 8, 2024
8844a78
[Serve / Jobs] Check if conda env exists before removing (#47922)
scottjlee Oct 8, 2024
122b382
[job] don't continue on test setup (#47927)
aslonnie Oct 8, 2024
b16a782
[core][experimental] Avoid false positives in deadlock detection (#47…
kevin85421 Oct 8, 2024
d360d45
[serve] Stop scheduling task early when requests have been cancelled …
zcin Oct 8, 2024
b2a8acf
[RLlib] New API stack: (Multi)RLModule overhaul vol 05 (deprecate Spe…
sven1977 Oct 9, 2024
fe2aea0
[RLlib; fault-tolerance] Fix spot node preemption problem (RLlib does…
sven1977 Oct 9, 2024
80824d0
[RLlib] New API stack: (Multi)RLModule overhaul vol 04 (deprecate RLM…
sven1977 Oct 9, 2024
da339ad
[Core] Fix check failure RAY_CHECK(it != current_tasks_.end()); (#47659)
jjyao Oct 9, 2024
315bdf1
[RLlib] Fix small bug in 'InfiniteLookBackBuffer.get_state/from_state…
simonsays1980 Oct 10, 2024
c5bbfe8
[core] Add more debug string types (#47928)
dentiny Oct 10, 2024
364ee39
[deps] add grpcio-tools into anyscale dependencies (#47955)
aslonnie Oct 10, 2024
d04f8d3
[RLlib] Quick-fix for default RLModules in combination with a user-pr…
sven1977 Oct 10, 2024
ca5d29b
[RLlib] Cleanup examples folder vol. 25: Remove some old API stack ex…
sven1977 Oct 10, 2024
3c1aa3b
[RLlib] Add framework-check to `MultiRLModule.add_module()`. (#47973)
sven1977 Oct 10, 2024
7585842
[serve] Fix failing test pow 2 scheduler on windows (#47975)
zcin Oct 10, 2024
ca871bc
[data] fix reading multiple parquet files with ragged ndarrays (#47961)
raulchen Oct 10, 2024
6a38914
[core] Decouple create worker vs pop worker request. (#47694)
rynewang Oct 10, 2024
669d699
[core] Add metrics for gcs jobs (#47793)
dentiny Oct 11, 2024
f2b09d4
upgrade grpcio version (#47982)
aslonnie Oct 11, 2024
aed856b
[Chore][Core] Address PR 47807 comments (#48002)
MortalHappiness Oct 12, 2024
4cf016c
[core] Add thread check to job mgr callback (#48005)
dentiny Oct 14, 2024
155a415
unneccessary file removed
ujjawal-khare Oct 15, 2024
e7c94c5
unneccessary files removed
ujjawal-khare Oct 15, 2024
b83e7ad
lint error fixed
ujjawal-khare Oct 15, 2024
b77c5ad
lint error fixed
ujjawal-khare Oct 15, 2024
7ad9a3b
[Serve] fix grpc performance issue (#47338)
GeneDer Aug 28, 2024
cdc86fa
[observability][export-api] Write node events (#47221)
nikitavemuri Aug 28, 2024
1ea718b
[RLlib] Cleanup examples folder (vol 23): Float16 training support an…
sven1977 Aug 29, 2024
ddec4a5
[core][dashboard] Update nodes on delta. (#47367)
rynewang Aug 29, 2024
8820918
[RLlib] Cleanup examples folder (vol 24): Mixed-precision training (a…
sven1977 Aug 29, 2024
9f76655
Split python/ray/tests/test_actor_retry over two files (#47188)
mattip Aug 29, 2024
05fad3f
[RLlib; Offline RL] - Enable reading old-stack `SampleBatch` data in …
simonsays1980 Aug 29, 2024
fa17d58
[serve] redeploy in between each microbenchmark (#47404)
zcin Aug 29, 2024
28d7347
Revert "[observability][export-api] Write node events" (#47405)
can-anyscale Aug 29, 2024
e6c08e1
[doc] Instruction for troubleshooting side nav when building incremen…
khluu Aug 29, 2024
8438af2
[Doc] Run pre-commit on cluster docs (#47342)
peytondmurray Aug 29, 2024
51be505
[RLlib] Examples folder cleanup: ModelV2 -> RLModule wrapper for migr…
sven1977 Aug 30, 2024
f2c5415
[RLlib] Remove 2nd Learner ConnectorV2 pass from PPO (add new GAE Con…
sven1977 Aug 30, 2024
be99650
[RLlib; Offline RL] CQL: Support multi-GPU/CPU setup and different le…
simonsays1980 Aug 30, 2024
ed38e38
[aDAG] Support multi-read of the same shm channel (#47311)
ruisearch42 Aug 30, 2024
c13190c
[RLlib; Offline RL] Add cloud filesystems to offline data input argum…
simonsays1980 Aug 31, 2024
8edb0e3
[serve] Fix broken microbenchmarks (#47430)
zcin Aug 31, 2024
150c8ba
[ADAG] Support tasks with multiple return values in aDAG (#47024)
dengwxn Sep 2, 2024
8c990d8
[RLlib] Add gradient checks to avoid `nan` gradients in `TorchLearner…
simonsays1980 Sep 3, 2024
105a904
[RLlib] Add option to use `torch.lr_scheduler` classes for learning r…
simonsays1980 Sep 3, 2024
9e0a00d
[observability][export-api] Write node events (#47422)
nikitavemuri Sep 3, 2024
637c16c
[RLlib] - Add example for PyTorch lr schedulers. (#47454)
simonsays1980 Sep 4, 2024
68c117a
[RLlib] Examples folder cleanup: ModelV2 -> RLModule wrapper for migr…
sven1977 Sep 4, 2024
716f314
[serve] add streaming to microbenchmarks (#47466)
zcin Sep 4, 2024
383b47a
feat: quickstart install button (#47479)
saihaj Sep 4, 2024
dcb8d6d
Revert "[Doc] Add Algolia search to docs" (#47483)
can-anyscale Sep 4, 2024
3125db2
[release] simplify the process of getting job logs (#47470)
can-anyscale Sep 4, 2024
b42f473
[Core] Fix runtime env race condition when uploading the same package…
jjyao Sep 4, 2024
a4621ce
[core][dashboard] Pass in cluster ID in hex for dashboard, dash agent…
rynewang Sep 5, 2024
8a530a2
[core][experimental] Correct `num_input_consumers` for CachedChannel …
kevin85421 Sep 5, 2024
98f6186
Revert Revert "[Doc] Add Algolia search to docs" (#47487)
can-anyscale Sep 5, 2024
0c75290
[observability][export-api] Write actor events (#47303)
nikitavemuri Sep 5, 2024
e14400f
[ADAG] Log Executable Task Events (#47345)
woshiyyya Sep 5, 2024
692f9df
[Core] Fix test_runtime_env_working_dir_4 for Windows (#47505)
jjyao Sep 5, 2024
eca534a
[observability][export-api] Write task events (#47193)
nikitavemuri Sep 5, 2024
184e293
Revert "[observability][export-api] Write actor events" (#47516)
can-anyscale Sep 6, 2024
950ad18
Revert "[observability][export-api] Write task events" (#47536)
can-anyscale Sep 6, 2024
246e395
fix quickstart image path (#47535)
saihaj Sep 6, 2024
1e4e4d0
[RLlib; Off-policy] Add episode sampling to `EpisodeReplayBuffer`. (#…
simonsays1980 Sep 6, 2024
4792e1d
[aDAG] Allow custom NCCL group for aDAG (#47141)
ruisearch42 Sep 6, 2024
8b89a9d
[aDAG] Fix test_accelerated_dag regression (#47543)
ruisearch42 Sep 6, 2024
bb015e4
[Core] Remove ray._raylet.check_health (#47526)
jjyao Sep 9, 2024
9cf02de
[observability][export-api] Write actor events (#47529)
nikitavemuri Sep 9, 2024
6f4aaf6
[observability][export-api] Write task events (#47538)
nikitavemuri Sep 9, 2024
8e61bab
[RLlib; Offline RL] - Replace GAE in `MARWILOfflinePreLearner` with `…
simonsays1980 Sep 9, 2024
9eef3b5
[data] Change fixture from `shutdown_only` to `ray_start_regular_shar…
omatthew98 Sep 9, 2024
c475f45
Add perf metrics for 2.35.0 (#47283)
khluu Sep 9, 2024
03e5832
[Core] Reconstruct actor to run lineage reconstruction triggered acto…
jjyao Sep 10, 2024
7625128
[aDAG] support buffered input (#47272)
rkooo567 Sep 10, 2024
cbe6687
[aDAG] Clean up arg_to_consumers in _get_or_compile() (#47514)
ruisearch42 Sep 10, 2024
53e641a
[RLlib; Offline RL] Store episodes in state form. (#47294)
simonsays1980 Sep 10, 2024
eb14e06
[Core][aDag] Support multi node multi reader (#47480)
rkooo567 Sep 10, 2024
50bd27a
Allow control of some serve configuration via env vars (#47533)
timkpaine Sep 10, 2024
12afcd1
Update incremental build troubleshooting tip with style nits (#47592)
angelinalg Sep 10, 2024
a0d3355
[observability][export-api] Write driver job events (#47418)
nikitavemuri Sep 10, 2024
f541305
[core][dashboard] push down job_or_submission_id to GCS. (#47492)
rynewang Sep 11, 2024
54ce249
[Doc][KubeRay] Add description tables for RayCluster Status in the ob…
rueian Sep 11, 2024
a0fb580
[aDAG] Fix ranks ordering for custom NCCL group (#47594)
ruisearch42 Sep 11, 2024
6e90110
[RLlib] RLModule: `InferenceOnlyAPI`. (#47572)
sven1977 Sep 11, 2024
69ca5c5
[Data] Remove `_default_metadata_providers` (#47575)
bveeramani Sep 11, 2024
8f9236c
[Serve] Remove unused Serve constants (#47593)
GeneDer Sep 11, 2024
bd0d6eb
Fix windows://:task_event_buffer_test (#47577)
nikitavemuri Sep 11, 2024
6293a1f
[RLlib] RLModule API: `SelfSupervisedLossAPI` for RLModules that brin…
sven1977 Sep 11, 2024
649148c
[GCS] Optimize `GetAllJobInfo` API for performance (#47530)
liuxsh9 Sep 11, 2024
92f0741
[Serve] fix default serve logger behavior (#47600)
GeneDer Sep 11, 2024
7e0d054
[core] Make is_gpu, is_actor, root_detached_id fields late bind to wo…
rynewang Sep 11, 2024
7ec6491
[core][adag] Separate the outputs of execute and execute_async to mul…
jeffreyjeffreywang Sep 11, 2024
2ba70de
[serve] Faster detection of dead replicas (#47237)
zcin Sep 12, 2024
f2ef047
[spark] Improve Ray-on-spark fault tolerance in case of Spark executo…
WeichenXu123 Sep 12, 2024
72b643b
[serve] skip failure test on windows (#47630)
zcin Sep 12, 2024
7b136f9
[serve] reorganize replica scheduler classes (#47615)
zcin Sep 12, 2024
309a86c
[Core] Remove code accidently got in (#47612)
rkooo567 Sep 13, 2024
b7b5c51
[Core][aDAG] support multi readers in multi node when dag is created …
rkooo567 Sep 14, 2024
a2b0cc3
[core] out of band serialization exception (#47544)
rkooo567 Sep 14, 2024
ffa2d34
[core][experimental] Allocate a channel for each InputAttributeNode (…
kevin85421 Sep 15, 2024
8839ad4
[Data] Add `partitioning` parameter to `read_parquet` (#47553)
bveeramani Sep 16, 2024
2af394f
[spark] Refine comment in Starting ray worker spark task (#47670)
WeichenXu123 Sep 16, 2024
27c71b6
[Core][aDAG] Set buffer size to 1 for regression (#47639)
rkooo567 Sep 16, 2024
b2ccedc
Add perf metrics for 2.36.0 (#47574)
khluu Sep 16, 2024
5c66fac
[RLlib] Add "shuffle batch per epoch" option. (#47458)
sven1977 Sep 17, 2024
6c165c2
[RLlib; Offline RL] Enable buffering episodes. (#47501)
simonsays1980 Sep 17, 2024
6439db3
[Core] Make JobSupervisor logs structured (#47699)
jjyao Sep 17, 2024
3f63c45
[serve] wrap obj ref in result wrapper in deployment response (#47655)
zcin Sep 17, 2024
94b5e06
[Core] Fix broken dashboard worker page (#47714)
jjyao Sep 17, 2024
47de542
[core][experimental] Remove unused attr CompiledDAG._type_hints (#47706)
kevin85421 Sep 17, 2024
e9e4d7e
[Data] Re-phrase the streaming executor current usage string (#47515)
sofianhnaide Sep 18, 2024
27ee9f1
[serve] improve tests (#47722)
zcin Sep 18, 2024
50016c0
[Core] Add test case where there is dead node for /nodes?view=summary…
jjyao Sep 18, 2024
4012314
[Dashboard] Optimizing performance of Ray Dashboard (#47617)
alexeykudinkin Sep 19, 2024
605221a
[core][aDAG] Fix a bug where multi arg + exception doesn't work (#47704)
rkooo567 Sep 19, 2024
2e739d8
[fake autoscaler] use check_call in fake multi node test utils (#47772)
aslonnie Sep 22, 2024
b413593
[RLlib] RLModule: Simplify defining custom distribution classes and a…
sven1977 Sep 23, 2024
fee22c2
[fake autoscaler] remove the redundant mkdir (#47786)
aslonnie Sep 23, 2024
c20e3b1
[Data] Simplify and consolidate progress bar outputs (#47692)
scottjlee Sep 23, 2024
e438357
Add perf metrics for 2.37.0 (#47791)
khluu Sep 23, 2024
8c73745
[docker] Update latest Docker dependencies for 2.36.0 release (#47748)
khluu Sep 23, 2024
8b0a597
[docker] Update latest Docker dependencies for 2.36.1 release (#47801)
khluu Sep 23, 2024
5681a4a
[observability][export-api] Write submission job events (#47468)
nikitavemuri Sep 23, 2024
2b21a08
Move export events to separate folder (#47747)
nikitavemuri Sep 24, 2024
a665c67
[release] stream the full anyscale log to buildkite (#47808)
can-anyscale Sep 25, 2024
68bd111
[RLlib; Offline RL] Offline performance cleanup. (#47731)
simonsays1980 Sep 25, 2024
c432e5c
[docker] Update latest Docker dependencies for 2.37.0 release (#47812)
khluu Sep 25, 2024
e5c2bd4
[RLlib] Fix action masking example. (#47817)
simonsays1980 Sep 25, 2024
75bc8fe
[Core] Separate the attempt_number with the task_status in memory sum…
MengjinYan Sep 25, 2024
8b67dc6
[RLlib; docs] New API stack migration guide. (#47779)
sven1977 Sep 26, 2024
bcda013
[RLlib; new API stack by default] Switch on new API stack by default …
sven1977 Sep 26, 2024
a335bdc
[Core] Fix a Typo in dict_to_state function parameter name (#47822)
MengjinYan Sep 26, 2024
62802bd
[core] Introducing InstrumentedIOContextWithThread. (#47831)
rynewang Sep 26, 2024
5e7601b
[RLlib] Discontinue support for "hybrid" API stack (using RLModule + …
sven1977 Sep 27, 2024
4e54d89
[Core] Fix object reconstruction hang on arguments pending creation (…
jjyao Sep 27, 2024
00cc0a5
[core][experimental] Fix test_execution_schedule_gpu (#47753)
kevin85421 Sep 28, 2024
f3d8e46
[core] Change many Ray ID logs to WithField. (#47844)
rynewang Sep 28, 2024
80f4941
[RLlib] Cleanup examples folder (vol 30): BC pretraining, then PPO fi…
sven1977 Sep 28, 2024
f4a1d5c
[RLlib] MultiAgentEnv API enhancements (related to defining obs-/acti…
sven1977 Sep 28, 2024
554195d
[RLlib] Add log-std clipping to 'MLPHead's. (#47827)
simonsays1980 Sep 30, 2024
041874d
[RLlib] Update autoregressive actions example. (#47829)
simonsays1980 Sep 30, 2024
9900778
[kuberay] Update docs for KubeRay v1.2.2 (#47867)
kevin85421 Sep 30, 2024
4d35582
[Arrow] Adding `ArrowTensorTypeV2` to support tensors larger than 2Gb…
alexeykudinkin Oct 1, 2024
01e7634
[Core] Fix check failure: sync_reactors_.find(reactor->GetRemoteNodeI…
jjyao Oct 4, 2024
a7aba20
[RLlib] New API stack: (Multi)RLModule overhaul vol 01 (some preparat…
sven1977 Oct 4, 2024
d2f8737
[RLlib] New API stack: (Multi)RLModule overhaul vol 02 (VPG RLModule,…
sven1977 Oct 4, 2024
b7e3789
[RLlib] New API stack: (Multi)RLModule overhaul vol 03 (Introduce gen…
sven1977 Oct 5, 2024
bbb59bb
[RLlib] Remove Tf support on new API stack for PPO/IMPALA/APPO (only …
sven1977 Oct 7, 2024
815a9e4
[core] Change debug_string from returning a string to streaming to an…
rynewang Oct 8, 2024
f07ef31
[Serve / Jobs] Check if conda env exists before removing (#47922)
scottjlee Oct 8, 2024
ff382fa
[job] don't continue on test setup (#47927)
aslonnie Oct 8, 2024
80a7ef7
[core][experimental] Avoid false positives in deadlock detection (#47…
kevin85421 Oct 8, 2024
71d5ad4
[serve] Stop scheduling task early when requests have been cancelled …
zcin Oct 8, 2024
c4d884b
[RLlib] New API stack: (Multi)RLModule overhaul vol 05 (deprecate Spe…
sven1977 Oct 9, 2024
a994eec
[RLlib; fault-tolerance] Fix spot node preemption problem (RLlib does…
sven1977 Oct 9, 2024
08fc41e
[RLlib] New API stack: (Multi)RLModule overhaul vol 04 (deprecate RLM…
sven1977 Oct 9, 2024
90160e6
[Core] Fix check failure RAY_CHECK(it != current_tasks_.end()); (#47659)
jjyao Oct 9, 2024
269b9ad
[RLlib] Fix small bug in 'InfiniteLookBackBuffer.get_state/from_state…
simonsays1980 Oct 10, 2024
7260fdf
[core] Add more debug string types (#47928)
dentiny Oct 10, 2024
20e1cad
[deps] add grpcio-tools into anyscale dependencies (#47955)
aslonnie Oct 10, 2024
611b645
[RLlib] Quick-fix for default RLModules in combination with a user-pr…
sven1977 Oct 10, 2024
0dcefc0
[RLlib] Cleanup examples folder vol. 25: Remove some old API stack ex…
sven1977 Oct 10, 2024
f218402
[RLlib] Add framework-check to `MultiRLModule.add_module()`. (#47973)
sven1977 Oct 10, 2024
7a6cfe0
[serve] Fix failing test pow 2 scheduler on windows (#47975)
zcin Oct 10, 2024
e2f7c91
[data] fix reading multiple parquet files with ragged ndarrays (#47961)
raulchen Oct 10, 2024
3eff78e
[core] Decouple create worker vs pop worker request. (#47694)
rynewang Oct 10, 2024
2597701
[core] Add metrics for gcs jobs (#47793)
dentiny Oct 11, 2024
b644b30
upgrade grpcio version (#47982)
aslonnie Oct 11, 2024
eac1cb6
[Feat][Core] Implement single file module for runtime_env (#47807)
MortalHappiness Oct 11, 2024
f84cd6b
[Chore][Core] Address PR 47807 comments (#48002)
MortalHappiness Oct 12, 2024
5d7ab4b
[core] Add thread check to job mgr callback (#48005)
dentiny Oct 14, 2024
d5193c9
remove handler released
ujjawal-khare Oct 15, 2024
f68bfa7
Merge branch 'fix/job-manager-logger' of github.com:ujjawal-khare-27/…
ujjawal-khare Oct 15, 2024
10 changes: 6 additions & 4 deletions python/ray/dashboard/modules/job/job_manager.py
@@ -30,6 +30,7 @@
 from ray.dashboard.modules.job.job_log_storage_client import JobLogStorageClient
 from ray.dashboard.modules.job.job_supervisor import JobSupervisor
 from ray.dashboard.modules.job.utils import get_head_node_id
+from ray.dashboard.utils import close_logger_file_descriptor
 from ray.exceptions import ActorUnschedulableError, RuntimeEnvSetupError
 from ray.job_submission import JobStatus
 from ray.runtime_env import RuntimeEnvConfig
@@ -505,6 +506,8 @@ async def submit_job(
                     "Please use a different submission_id."
                 )

+        driver_logger = self._get_job_driver_logger(submission_id)
+        driver_logger.info("Runtime env is setting up.")
         # Wait for the actor to start up asynchronously so this call always
         # returns immediately and we can catch errors with the actor starting
         # up.
@@ -525,8 +528,6 @@ async def submit_job(
                     f"Started a ray job {submission_id}.", submission_id=submission_id
                 )

-            driver_logger = self._get_job_driver_logger(submission_id)
-            driver_logger.info("Runtime env is setting up.")
             supervisor = self._supervisor_actor_cls.options(
                 lifetime="detached",
                 name=JOB_ACTOR_NAME_TEMPLATE.format(job_id=submission_id),
@@ -559,8 +560,7 @@ async def submit_job(
             )
         except Exception as e:
             tb_str = traceback.format_exc()
-
-            logger.warning(
+            driver_logger.warning(
                 f"Failed to start supervisor actor for job {submission_id}: '{e}'"
                 f". Full traceback:\n{tb_str}"
             )
@@ -572,6 +572,8 @@ async def submit_job(
                     f". Full traceback:\n{tb_str}"
                 ),
             )
+        finally:
+            close_logger_file_descriptor(driver_logger)

         return submission_id

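The hunks above move the driver logger creation ahead of the `try` so that both the `except` and the new `finally` branch can refer to it. A minimal, self-contained sketch of that control flow follows. It is an illustration only: `submit_job_sketch`, `failing_supervisor`, `run_demo`, and the logger name are hypothetical stand-ins, not Ray's `JobManager` API, and the handlers are closed inline rather than through `close_logger_file_descriptor`.

```python
import asyncio
import logging
import os
import tempfile


async def submit_job_sketch(start_supervisor, log_path: str) -> str:
    # Hypothetical stand-in for the submit_job flow; not Ray's API.
    # The logger is created before the try block so that except/finally
    # can always reference it, even if startup fails immediately.
    driver_logger = logging.getLogger("sketch_submit_job")
    driver_logger.propagate = False
    driver_logger.setLevel(logging.INFO)
    driver_logger.addHandler(logging.FileHandler(log_path))
    driver_logger.info("Runtime env is setting up.")
    try:
        await start_supervisor()
    except Exception as e:
        # The failure goes to the per-job driver log, as in the diff.
        driver_logger.warning(f"Failed to start supervisor actor: '{e}'")
    finally:
        # Mirrors the PR's cleanup step: close every handler so the
        # underlying file descriptor is released on success and failure.
        for handler in driver_logger.handlers:
            handler.close()
    with open(log_path) as f:
        return f.read()


async def failing_supervisor():
    # Simulated startup failure, used only for this sketch.
    raise RuntimeError("actor unschedulable")


def run_demo() -> str:
    with tempfile.TemporaryDirectory() as tmp_dir:
        log_path = os.path.join(tmp_dir, "driver.log")
        return asyncio.run(submit_job_sketch(failing_supervisor, log_path))
```

Calling `run_demo()` once returns the contents of the temporary driver log, which should include both the setup message and the failure message written before the handler was closed.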
1 change: 0 additions & 1 deletion python/ray/dashboard/modules/job/utils.py
@@ -27,7 +27,6 @@
     Request = None
     Response = None

-
Review comment (Contributor): Let's not include unwanted changes.
Reply (Author): Reverted
 logger = logging.getLogger(__name__)

 MAX_CHUNK_LINE_LENGTH = 10
25 changes: 25 additions & 0 deletions python/ray/dashboard/tests/test_utils.py
@@ -0,0 +1,25 @@
+import logging
+import sys
+
+import pytest
+
+from ray.dashboard.utils import close_logger_file_descriptor
+
+
+def test_close_logger_file_descriptor():
+    logger_format = "%(message)s"
+    logger = logging.getLogger("test_job_id")
+
+    job_driver_log_path = "/tmp/ray.log"
+    job_driver_handler = logging.FileHandler(job_driver_log_path)
+    job_driver_formatter = logging.Formatter(logger_format)
+    job_driver_handler.setFormatter(job_driver_formatter)
+    logger.addHandler(job_driver_handler)
+
+    assert job_driver_handler._closed is False
+    close_logger_file_descriptor(logger)
+    assert job_driver_handler._closed is True
+
+
+if __name__ == "__main__":
+    sys.exit(pytest.main(["-v", __file__]))
5 changes: 5 additions & 0 deletions python/ray/dashboard/utils.py
@@ -692,3 +692,8 @@ def compose_state_message(
         else:
             state_message = death_reason_message
     return state_message
+
+
+def close_logger_file_descriptor(logger_instance):
+    for handler in logger_instance.handlers:
+        handler.close()
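
One possible follow-up, sketched here as a suggestion rather than as part of this PR: `handler.close()` leaves the handler attached to the logger, so a variant could also detach each handler after closing it. The names `close_and_detach_handlers` and `demo` below are hypothetical and do not exist in `ray.dashboard.utils`. The check uses the public `stream` attribute of `logging.FileHandler`, which is set to `None` on close, instead of the private `_closed` flag.

```python
import logging
import os
import tempfile


def close_and_detach_handlers(logger_instance: logging.Logger) -> None:
    # Hypothetical variant of close_logger_file_descriptor (not in Ray).
    # Iterate over a copy because removeHandler mutates logger.handlers.
    for handler in list(logger_instance.handlers):
        handler.close()
        logger_instance.removeHandler(handler)


def demo() -> bool:
    with tempfile.TemporaryDirectory() as tmp_dir:
        logger = logging.getLogger("sketch_job_id")
        logger.propagate = False  # keep the demo message off stderr
        handler = logging.FileHandler(os.path.join(tmp_dir, "driver.log"))
        logger.addHandler(handler)
        logger.warning("Runtime env is setting up.")
        close_and_detach_handlers(logger)
        # After cleanup the file stream is released and no handler remains.
        return handler.stream is None and not logger.handlers
```

Writing the log to a temporary directory also avoids the fixed `/tmp/ray.log` path, which may matter on platforms without `/tmp`.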