-
Notifications
You must be signed in to change notification settings - Fork 421
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Fixed race condition in action server between is_ready and take #2250
Conversation
7efa763
to
91fb1dd
Compare
I'm still a bit puzzled about the is_ready() function. When will it be called in relation to the take_data function ? |
Looks like is_ready may be called multiple times before take_data is called.Therefore this patch needs a rewrite. |
1a7cf77
to
302771d
Compare
@clalancette can you do a review of this ? The exact description of the issue this is fixing can be found here #2242 (comment) |
@ros2/team this patch is hanging here since July, is there anything that I can do to speed this up ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First of all, I'm sorry that this took a long time to get to.
Second, thank you for the excellent explanation of the issue in #2242 , along with the reproducer. That helped me to understand the problem.
I've left a few questions inline, please take a look.
rclcpp_action/src/client.cpp
Outdated
ret = rcl_action_take_feedback( | ||
pimpl_->client_handle.get(), feedback_message.get()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not really sure that this is the direction we want to go in. This change makes ClientBase::is_ready
no longer idempotent, and that seems like a huge change in semantics. Further, while the documentation for the is_ready
virtual method doesn't explicitly say so, it implies that this should be idempotent.
Can you explain further why you moved all of this here, rather than just adding the lock? We have to take the lock both here and in ClientBase::take_data
anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for this is that I saw cases, were two threads would call is_ready, and then go on to take_data, but the second thread would trigger the exception, as there would only be one data item to take.
I am still puzzled which code path triggers this, as the multithreaded executor protects the sequence of both calls by holding a mutex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been working on reviewing this pr and I think I agree with @clalancette at the very least in that this changes the meaning/expectation for the is_ready
function.
I do think that your approach is probably on the right track (I'm also working on a test to narrow down the core issue, like you I'm a bit puzzled by the root cause), but I may propose a patch that changes how the code is laid out.
execute_goal_request_received( | ||
rcl_ret_t ret, rcl_action_goal_info_t goal_info, rmw_request_id_t request_header, | ||
std::shared_ptr<void> message); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need these server changes to fix the crash on the client? Put another way, could we split these server fixes to another PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory they are independent, and we could split them, would there be any advantage of doing this ?
Hi, I've run into #2242 on my test setup, as well. I have a (intentionally) not too capable PC running a robot driver plus MoveIt and have it execute random collision-free motions forever (or as long as possible). Basically, I want to try (end demonstrate) whether a ROS 2 based application can reliably work in a 24/7 application. We use a low-latency kernel and have our robot driver using FIFO-scheduling with a very high priority, so this could definitively be one of the reasons why we stumbled upon this. Without the fix from this PR the driver crashed after about 1-2 hours, with this fix applied (merged into current |
@clalancette @wjwwood I'm back from vacation, lets try to get this MR sorted next ;-) |
Is there any review going on? |
@tommy-erko fixed |
Putting a queue there could work, but would result in a weird mix of behaviors of the events. As far as I understand the code, (@wjwwood Correct me if this assumption is wrong !) for everything Therefor I would suggest, to change the behavior of the conditionals in this ways:
If we would decrease the count of an associated conditional every time take_data is called, this should solve all of our problems. |
@jmachowinski I just tested the latest patch version with our system and I got an error in the first run, instead of the usual "Taking data from action client...", this time it is the "Taking data from action server but nothing is ready" version. I see it happens if |
I messed up the error message in client.cpp:654 Interesting that you still get this error though, this means there is another way to trigger it, that I am not aware of yet. |
@tommy-erko I just did a test with the current version ontop of Iron, and it works without any issues for me. Can you try this version ? https://github.com/cellumation/rclcpp/tree/action_fix_final Note, this version of the patch will expose errors in waitables, and may introduce problems with interprocess communication. |
@Barry-Xu-2018 FYI I had a longer Q&A session with @wjwwood on discord. The outcome of it is, that by design it is the responsibility of a waitable, to trigger wakeups as long as it has data for processing. Therefore the patch from #2109 seems to be the wrong way to go, as it shifts the responsibility to the executioner. |
Thank you for sharing the results of the discussion.
While fixing this issue, I also had doubts about which module rclcpp or rmw should be responsible for addressing it. In RMW, the criteria for determining whether guard_conditions/events and subscriptions/clients/services have trigger condition are currently different. For subscriptions/clients/services, having data available in buffer is considered a trigger condition. But guard_conditions/events have no 'data'. For guard_conditions, trigger condition will be cleared before leaving __rmw_wait(). The issue is that currently, the RMW layer cannot know whether the trigger has been handled by the upper layers. If we fix this issue in the RMW layer, we would need to design a new mechanism that allows the upper layers to inform RMW that a guard condition has been handled and can be cleared. |
@Barry-Xu-2018 Is there another use case except the waitables that I am not aware of ? |
3912404
to
9f2ec0c
Compare
@clalancette @fujitatomoya @wjwwood I now pushed the version, that I would consider the correct fix for this issue. If you are not in favor of the revert of " Avoid losing waitable handles while using MultiThreadedExecut…" before a fix for the issue #2012 is in place, I would remove it from this MR and just merge the first commit. I'm going to be in vacation now and might not be responsive for the rest of the year. |
Apologies for the delayed response. |
@wjwwood @mjcarroll @fujitatomoya Ping, I would like to see this merged for jazzy |
Is there a reason for this PR not getting merged? If there is anything a user from the community can do to help getting this merged. Let me know! |
@firesurfer This is now in the pipeline for merge. The plan is to merge #2420 and afterwards this PR. If you want to help out, you could retry this patch as soon as 2420 was merged, to verify, that it is still working, and we got no regressions. |
9b0d6de
to
34ec478
Compare
@firesurfer this version should be ready to merge now. Can you run it on your system an verify it with the latest version of rolling ? |
As our stack is iron based and has never been built with rolling this might proove a bit tricky. I would be happy to test any backport of this PR as soon as it is available (I am not sure about ABI compatiblity between rolling and iron at the moment - if its compatible I could just recompile rclcpp from rolling) |
@firesurfer our stack is also iron based, but compiles without problems ontop of rolling. Note, if you still use gazebo11 (classic) it gets tricky. |
Uh,I'll try to setup my demonstrator with that version as well. Can't promise though. |
I have no idea what happened, but I tried pushing one commit to this branch and it reset |
…data and execute
Some background information: is_ready and take_data are guaranteed to be
called in sequence without interruption from another thread.
while execute is running, another thread may also call is_ready.
The problem was, that goal_request_ready_, cancel_request_ready_, result_request_ready_
and goal_expired_ were accessed and written from is_ready and execute.
This commit fixed this by only using the mentioned variables in is_ready and take_data.
execute now only accesses the given pointer and works on this.
Signed-off-by: Janosch Machowinski [email protected]
Note, this fixes #2242