-
Notifications
You must be signed in to change notification settings - Fork 227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes Action.*_async futures never complete #1308
base: rolling
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would add @aditya2592 as co-author, since this borrows the code from #1125.
Done |
Throwing this back to draft because it still doesn't fix all conditions where this breaks. If others have ideas on how to more reliably reproduce this that would be appreciated. |
What kinds of failures are you seeing still? This is the "fix" I've been using in production, and at least with our use cases of services we haven't seen failures since: class PatchRclpyIssue1123(RobustActionClient):
_lock: RLock = None # type: ignore
@property
def _cpp_client_handle_lock(self) -> RLock:
if self._lock is None:
self._lock = RLock()
return self._lock
async def execute(self, *args: Any, **kwargs: Any) -> None:
# This is ugly- holding on to a lock in an async environment feels gross
with self._cpp_client_handle_lock:
return await super().execute(*args, **kwargs) # type: ignore
def send_goal_async(self, *args: Any, **kwargs: Any) -> Future:
with self._cpp_client_handle_lock:
return super().send_goal_async(*args, **kwargs)
def _cancel_goal_async(self, *args: Any, **kwargs: Any) -> Future:
with self._cpp_client_handle_lock:
return super()._cancel_goal_async(*args, **kwargs)
def _get_result_async(self, *args: Any, **kwargs: Any) -> Future:
with self._cpp_client_handle_lock:
return super()._get_result_async(*args, **kwargs) I'm not sure if these matter, but here are the differences I can note between this above fix and this PR:
|
I was defining a python action client server pair in two seperate terminals, and the custom server also contained a python client to another cpp nav2 action. After calling the custom python action server once it would work correctly the first time, but after wasn't even accepting the goal request to any subsequent calls. If I removed the nav2 action client from my custom action server it wouldn't deadlock. I suspect a lock wasn't correctly being released with the call to another action client inside the action server. I'd like to look more into try to see if the issue persists with either locking on calls to entire execute funciton as you've suggested or using Rlocks to see if the behavior persists. |
Ahh understood. I would hazard a guess that an RLock would work in this situation. |
I tried to recreate an example outside of my environment, but have been unsuccessful |
I was able to recreate this issue outside my environment now #1313, and looks like there is an issue with rclcpp server, and a nested rclpy action client nested inside an rcply action server, so that seems unrelated, so I'll pull this back out of draft |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm with green CI
@apockill if you are willing to do 2nd review on this, that would be really appreciated. |
Who ever wants to test the patch on iron. I cherry picked it on iron in this fork: Btw. from our tests today it seems to work fine on iron with our setup. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like it would fix the issue. My only commentary is that this feels brittle, and future changes might break it easily.
Even now I have a hard time knowing which dicts need strict thread safety. For example, _remove_pending_request
, _remove_pending_goal_request
, _remove_pending_result_request
all use dictionaries that are holding state that can be edited in other callbacks.
I wonder if a future refactor could make this harder to mess up. Off the cuff, we could wrap up all the state that is ties the C++ self._client
and it's python stateful dictionaries into a single object, and protect that object. I'm just brainstorming here, but something a la:
class _ProtectedClientState:
"""Protect state between the C++ client and python that must be thread-safe"""
def __init__(self):
self.lock = RLock()
# key: UUID in bytes, value: weak reference to ClientGoalHandle
self.goal_handles = {}
# key: goal request sequence_number, value: Future for goal response
self.pending_goal_requests = {}
# key: goal request sequence_number, value: UUID
self.goal_sequence_number_to_goal_id = {}
# key: cancel request sequence number, value: Future for cancel response
self.pending_cancel_requests = {}
# key: result request sequence number, value: Future for result response
self.pending_result_requests = {}
# key: result request sequence_number, value: UUID
self.result_sequence_number_to_goal_id = {}
# key: UUID in bytes, value: callback function
self.feedback_callbacks = {}
def __getattribute__(...):
if not self.lock.acquired():
raise HeyThatsBad("This state is meant to be protected while using these!"):
return the requested attribute
Just bumping this, to see if other maintainers can review/merge this |
@jmblixt3 can you rebase this on rolling? and i will start the CI. |
@fujitatomoya Rebased |
Anything else needed from me? |
@sloretz @clalancette @adityapande-1995 can either of you take a look at this? |
I'm experiencing the same issue on ROS 2 Humble. Could you backport this fix to humble? Thank you! |
+1 |
I am having also this issue with Humble. Is there a timeframe on when the fix will be merged? Thanks! |
Per rclpy:1123 If two seperate client server actions are running in seperate executors the future given to the ActionClient will never complete due to a race condition This fixes the calls to rcl handles potentially leading to deadlock scenarios by adding locks to there references Co-authored-by: Aditya Agarwal <[email protected]> Co-authored-by: Jonathan Blixt <[email protected]> Signed-off-by: Jonathan Blixt <[email protected]>
Just rebased this again, are there any issues with that still need to be resolved |
@jmblixt3 no, we are just waiting for the 2nd review. @clalancette @ahcorde @sloretz friendly ping. |
Replaces #1125 since original contributor went inactive
Closes #1123
If two separate client server actions are running in separate executors the future given to the ActionClient will never complete due to a race condition on the rcl handles
Tested using this from @apockill which was adapted to match rolling then test with and without locks
To reproduce initial issue remove the lock/sleep in the RaceyAction client
Adapted example Code here
client.py
server.py
Before and after log results
Before
client output
server output
After
client output
I also initially tried to add locks to just the action client, but I was getting test failures in test_action_graph on the get_names_and_types function, so I added for action server as well.