Mark restarting actors as pending actors #47946
Conversation
    in self._actor_pool._restarting_actors
)
# Move the actor from restarting to running state.
self._actor_pool.restarting_to_running(actor_to_return)
We need to return the actor in this case as well; otherwise, the actor will no longer be usable.
BTW, let's add a unit test in test_actor_pool_map_operator.py to cover this case.
@@ -221,7 +231,7 @@ def _task_done_callback(actor_to_return):
    self._submit_data_task(
        gen,
        bundle,
-       lambda: _task_done_callback(actor_to_return),
+       lambda: _task_done_callback(actor_to_return),  # noqa: B023
this change seems unrelated?
Yes, this was suppressing a warning flagged by ./scripts/format.sh.
actors = list(self._actor_pool._num_tasks_in_flight.keys())
for actor in actors:
    actor_state = actor._get_local_state()
    if actor_state != gcs_pb2.ActorTableData.ActorState.ALIVE:
Let's add an assertion here to make sure we're only handling the RESTARTING state.
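For concreteness, the assertion might look like this, mirroring the snippet above (how other non-ALIVE states should be handled is an assumption):

if actor_state != gcs_pb2.ActorTableData.ActorState.ALIVE:
    # We only expect RESTARTING here; fail loudly on anything unexpected
    # (e.g. DEAD) so surprising transitions are surfaced early.
    assert (
        actor_state == gcs_pb2.ActorTableData.ActorState.RESTARTING
    ), actor_state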
if actor_state != gcs_pb2.ActorTableData.ActorState.ALIVE:
    # If an actor is not ALIVE, it's a candidate to be marked as a
    # restarting actor.
    self._actor_pool.running_to_restarting(actor, actor.get_location)
-    self._actor_pool.running_to_restarting(actor, actor.get_location)
+    if self._actor_pool.is_actor_running(actor):
+        self._actor_pool.running_to_restarting(actor, actor.get_location.remote())
- Moving the running check here would be cleaner. And more importantly, we should only send get_location when the actor switches from running to restarting.
- .remote() was missing after get_location. It probably didn't error out because actor locality is disabled by default right now.
Ah ok, get_location is valid only when the state is ALIVE.
else:
    # If an actor is ALIVE, it's a candidate to be marked as a
    # running actor, if not already the case.
    self._actor_pool.restarting_to_running(actor.get_location)
We should use the actor handle as the key; actor.get_location is a method, not an object ref.
Similar to the above comment, let's add some unit tests to cover the state transitions.
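As a rough illustration of what such a test could assert, here is a self-contained toy; ToyPool and its methods are hypothetical stand-ins for _ActorPool's bookkeeping, not Ray's actual API:

class ToyPool:
    """Toy stand-in for the running/restarting bookkeeping under test."""

    def __init__(self):
        self.running = {}      # actor -> num_tasks_in_flight
        self.restarting = {}   # actor -> saved num_tasks_in_flight

    def running_to_restarting(self, actor):
        if actor in self.running:
            self.restarting[actor] = self.running.pop(actor)

    def restarting_to_running(self, actor):
        if actor in self.restarting:
            self.running[actor] = self.restarting.pop(actor)


def test_state_transitions():
    pool = ToyPool()
    pool.running["a"] = 2
    pool.running_to_restarting("a")
    assert "a" in pool.restarting and "a" not in pool.running
    pool.restarting_to_running("a")
    # The in-flight count survives the round trip (see the
    # _num_tasks_in_flight discussion further down).
    assert pool.running["a"] == 2


test_state_transitions()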
from ray.tests.conftest import *  # noqa


def test_removed_nodes_and_added_back(ray_start_cluster):
- Let's also test that pending_processor_usage reports the correct usage during the different stages.
- Maybe just move this test to test_actor_pool_map_operator.py.
    # The actor has been removed from the pool before becoming running.
    return False
actor = self._restarting_actors.pop(ready_ref)
self._num_tasks_in_flight[actor] = 0
We need to keep the old _num_tasks_in_flight value and restore it here.
Ah I see, makes sense
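A rough sketch of that fix, reusing the names from the snippet above (the exact shape of the _restarting_actors entries is an assumption):

# On running -> restarting: remember the actor's in-flight count instead of
# discarding it.
self._restarting_actors[ready_ref] = (actor, self._num_tasks_in_flight.pop(actor, 0))

# On restarting -> running: restore the saved count rather than resetting to 0.
actor, saved_in_flight = self._restarting_actors.pop(ready_ref)
self._num_tasks_in_flight[actor] = saved_in_flight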
return True

# Next, prioritize killing a restarting actor.
killed = self._maybe_kill_restarting_actor()
Let's only keep restarting actors with in_flight_tasks == 0 here.
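Something along these lines, assuming the pool still tracks in-flight counts for restarting actors (attribute names taken from the snippets above):

# Only restarting actors with no tasks in flight are safe kill candidates;
# killing one mid-task would discard work that may still complete.
kill_candidates = [
    actor
    for actor in self._restarting_actors
    if self._num_tasks_in_flight.get(actor, 0) == 0
]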
High-level structure looks good. Left some comments.
else:
    # If an actor is ALIVE, it's a candidate to be marked as a
    # running actor, if not already the case.
    self._actor_pool.clear_restarting_from_running_actor(actor)
Nit: the method names sound a bit too verbose. Maybe just mark_actor_as_alive/restarting?
@@ -309,6 +309,7 @@ def _scheduling_loop_step(self, topology: Topology) -> bool:
    i += 1
    if i % PROGRESS_BAR_UPDATE_INTERVAL == 0:
        self._refresh_progress_bars(topology)
+   topology[op].update_resource_usage()
Just op.update_resource_usage(), so we don't need the extra indirection in OpState.
self._num_tasks_in_flight[actor] -= 1
if self._should_kill_idle_actors and self._num_tasks_in_flight[actor] == 0:
    # Mark restarting as false, now that the actor is running.
    self._running_actors[actor]._is_restarting = False
On second thought, I think it'd be slightly clearer to remove this and let the next update_resource_usage handle the state transition.
def num_alive_actors(self) -> int:
    return sum(
        1
        if (
            running_actor_state.num_tasks_in_flight > 0
            and running_actor_state.is_restarting is False
        )
        else 0
        for running_actor_state in self._running_actors.values()
    )
Does this mean we don't count an actor as alive if it doesn't have any tasks in flight? If so, what are the implications of that (if any)?
For scheduling tasks, we invoke pick_actors(). Earlier it did not cover the restarting case, but now it excludes restarting actors even when they have few in-flight tasks.
Also, the resource accounting APIs current_processor_usage() and pending_processor_usage() now account for restarting actors.
actor_state = actor._get_local_state()
if actor_state != gcs_pb2.ActorTableData.ActorState.ALIVE:
What happens if _get_local_state returns None? Looks like we assume that the actor is restarting -- do we need to worry about this edge case?
If raylet.pyx:4371 get_local_actor_state() returns None, then actor.py:1561 _get_local_state() can return None.
I think it's defensive to check for None here, given that I'm not sure about the interface guarantee of get_local_actor_state().
Good catch! Let me fix the code.
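A defensive sketch of the fix, keeping the snippet's shape (whether skipping is the right fallback for None is an assumption):

actor_state = actor._get_local_state()
if actor_state is None:
    # No local view of the actor's state yet; don't assume it's restarting.
    continue
if actor_state != gcs_pb2.ActorTableData.ActorState.ALIVE:
    ...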
def update_resource_usage(self) -> None:
    """Updates resource usage."""
    for actor in self._actor_pool._running_actors.keys():
Nit: should we add a method to _ActorPool that provides a list of actor handles, so that we don't access the internal _running_actors attribute?
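For example, a small accessor on _ActorPool could look like this (the name get_running_actor_refs is hypothetical, and typing.List is assumed to be imported):

def get_running_actor_refs(self) -> List["ray.actor.ActorHandle"]:
    """Return the handles of all running actors, so callers don't need
    to reach into the private _running_actors dict."""
    return list(self._running_actors.keys())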
def update_resource_usage(self) -> None:
    """Updates resource usage."""
    self.op.update_resource_usage()
Where is OpState.update_resource_usage called?
This is invoked by _scheduling_loop_step in streaming_executor.py. Will add a comment here.
I thought that calls PhysicalOperator.update_resource_usage and not OpState.update_resource_usage? AFAIK, OpState.update_resource_usage doesn't have any references.
Ah, never mind, I saw you removed OpState.update_resource_usage.
    self._running_actors[actor].num_tasks_in_flight >= self._max_tasks_in_flight
    or self._running_actors[actor].is_restarting
):
    # All actors are at capacity or restarting.
Nit (not a new issue in this PR): I think it'd be clearer if we filtered the running actors by validity and then found the min.
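A sketch of that refactor, reusing the field names from the snippet above:

# Filter to schedulable actors first, then pick the least-loaded one.
valid_actors = [
    actor
    for actor, state in self._running_actors.items()
    if state.num_tasks_in_flight < self._max_tasks_in_flight
    and not state.is_restarting
]
if not valid_actors:
    return None  # All actors are at capacity or restarting.
return min(
    valid_actors,
    key=lambda a: self._running_actors[a].num_tasks_in_flight,
)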
def test_actor_pool_fault_tolerance_e2e(ray_start_cluster):
    """Test that a dataset with actor pools can finish, when
    all nodes in the cluster are removed and added back."""
    ray.shutdown()
this shutdown shouldn't be needed
Without this shutdown, the ray.init() call at line 608 throws:
E RuntimeError: Maybe you called ray.init twice by accident? This error can be suppressed by passing in 'ignore_reinit_error=True' or by calling 'ray.shutdown()' prior to 'ray.init()'.
I think it's because the previous test didn't shut down the cluster. We can change ray_start_regular_shared to ray_start_regular.
LGTM. A few final small comments.
actor_str += f", (pending: {pending})"
desc += actor_str
# Actors info
desc += self.actor_info_progress_str()
Oh, sorry, I meant just adding this actor_info_progress_str in PhysicalOperator and getting rid of the num_xxx_actors methods, because it seems a bit overkill to have so many methods and indirections.
@@ -309,6 +309,7 @@ def _scheduling_loop_step(self, topology: Topology) -> bool:
    i += 1
    if i % PROGRESS_BAR_UPDATE_INTERVAL == 0:
        self._refresh_progress_bars(topology)
+   op.update_resource_usage()
Let's move this call inside ResourceManager.update_usages(), because the updated info will be used in that function.
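Roughly like this; the body of ResourceManager.update_usages() is sketched from context, not the actual source, and _update_usage_for_op is a hypothetical placeholder for the existing per-op logic:

def update_usages(self):
    for op in self._topology:
        # Refresh actor states first so the usage numbers computed below
        # reflect up-to-date running/restarting counts.
        op.update_resource_usage()
        self._update_usage_for_op(op)  # hypothetical existing per-op update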
Why are these changes needed?
In ActorPoolMapOperator, which executes tasks on an actor pool, pick_actor is invoked to schedule each incoming task. pick_actor is a simple bin-packing algorithm that picks the running actor with the fewest in-flight tasks. When an actor is restarting, though, pick_actor needs to exclude it from task scheduling.
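To illustrate the scheduling rule, here is a runnable toy sketch (not Ray's actual implementation; all names are stand-ins):

from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ActorSlot:
    num_tasks_in_flight: int
    is_restarting: bool


def pick_actor(pool: Dict[str, ActorSlot]) -> Optional[str]:
    # Bin packing: choose the non-restarting actor with the fewest
    # in-flight tasks; restarting actors are excluded entirely.
    candidates = [name for name, slot in pool.items() if not slot.is_restarting]
    if not candidates:
        return None  # Nothing schedulable; the task stays queued.
    return min(candidates, key=lambda name: pool[name].num_tasks_in_flight)


pool = {
    "a": ActorSlot(num_tasks_in_flight=2, is_restarting=False),
    "b": ActorSlot(num_tasks_in_flight=0, is_restarting=True),  # excluded
    "c": ActorSlot(num_tasks_in_flight=1, is_restarting=False),
}
assert pick_actor(pool) == "c"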
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.