
Initialize participant on first use. Destroy participant after last node is destroyed. #176

Merged: 7 commits, Apr 30, 2020

Conversation

ivanpauno
Member

@ivanpauno ivanpauno commented Apr 30, 2020

The same approach was used in rmw_fastrtps.
Fixes #174.
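
To make the approach concrete, here is a minimal sketch of lazy, reference-counted participant handling. This is my own illustration of the idea in the PR title, not the actual rmw_cyclonedds_cpp code; `ContextImpl`, `ref_ppant`, and `unref_ppant` are hypothetical names.

```c++
#include <cstddef>
#include <mutex>
#include "dds/dds.h"

// Hypothetical per-context state; the real rmw_cyclonedds_cpp struct differs.
struct ContextImpl
{
  std::mutex lock;
  size_t node_count{0};
  dds_entity_t ppant{0};  // 0 = participant not yet created

  // Called from rmw_create_node: create the participant on first use.
  dds_entity_t ref_ppant(dds_domainid_t domain_id)
  {
    std::lock_guard<std::mutex> guard(lock);
    if (node_count == 0) {
      dds_entity_t pp = dds_create_participant(domain_id, nullptr, nullptr);
      if (pp < 0) {
        return pp;  // creation failed; count stays 0, nothing to clean up
      }
      ppant = pp;
    }
    ++node_count;
    return ppant;
  }

  // Called from rmw_destroy_node: tear down after the last node is destroyed.
  void unref_ppant()
  {
    std::lock_guard<std::mutex> guard(lock);
    if (--node_count == 0) {
      dds_delete(ppant);  // also deletes everything parented to it
      ppant = 0;
    }
  }
};
```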

Initialize participant on first use. Destroy participant after last node is destroyed

Signed-off-by: Ivan Santiago Paunovic <[email protected]>
@ivanpauno ivanpauno requested a review from eboasson April 30, 2020 14:45
@ivanpauno ivanpauno self-assigned this Apr 30, 2020
Signed-off-by: Ivan Santiago Paunovic <[email protected]>
@eboasson
Collaborator

Wow, that's super quick!

Of course I had to try, and I get:

6: libc++abi.dylib: terminating with uncaught exception of type rclcpp::exceptions::RCLError: failed to create interrupt guard condition: failed to create guardcondition, at /Users/erik/ros2_ws/src/ros2/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:2618, at /Users/erik/ros2_ws/src/ros2/rcl/rcl/src/rcl/guard_condition.c:90

which corresponds to this stop in the debugger:

There is a running process, kill it and restart?: [Y/n] 
Process 37486 exited with status = 9 (0x00000009) 
Process 37490 launched: '/Users/erik/ros2_ws/build/test_communication/test_publisher_subscriber_cpp' (x86_64)
libddsc.0.5.0.dylib was compiled with optimization - stepping may behave oddly; variables may not be available.
Process 37490 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
    frame #0: 0x00000001019ea88e libddsc.0.5.0.dylib`dds_create_guardcondition(owner=0) at dds_guardcond.c:42:13 [opt]
   39  	     init here is cheap.  If it is DDS_CYCLONEDDS_HANDLE, we may have to initialise the library, so
   40  	     have to call it.  If it is some bogus value and the library is not initialised yet ... so be
   41  	     it.  Naturally, this requires us to call delete on DDS_CYCLONEDDS_HANDLE afterward. */
-> 42  	  if ((rc = dds_init ()) < 0)
   43  	    return rc;
   44  	
   45  	  if ((rc = dds_entity_lock (owner, DDS_KIND_DONTCARE, &e)) != DDS_RETCODE_OK)
Target 0: (test_publisher_subscriber_cpp) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
  * frame #0: 0x00000001019ea88e libddsc.0.5.0.dylib`dds_create_guardcondition(owner=0) at dds_guardcond.c:42:13 [opt]
    frame #1: 0x000000010190d77a librmw_cyclonedds_cpp.dylib`::rmw_create_guard_condition(rmw_context_t *) [inlined] create_guard_condition(impl=0x0000000100f04d30) at rmw_node.cpp:2617:29 [opt]
    frame #2: 0x000000010190d75b librmw_cyclonedds_cpp.dylib`::rmw_create_guard_condition(context=<unavailable>) at rmw_node.cpp:2633 [opt]
    frame #3: 0x0000000100b1f215 librcl.dylib`__rcl_guard_condition_init_from_rmw_impl(guard_condition=0x0000000100f05160, rmw_guard_condition=0x0000000000000000, context=0x0000000100f04940, options=rcl_guard_condition_options_t @ 0x00007ffeefbf7e50) at guard_condition.c:86:41 [opt]
    frame #4: 0x000000010071d54a librclcpp.dylib`rclcpp::node_interfaces::NodeBase::NodeBase(this=0x0000000100f050c0, node_name="test_publisher_subscriber_Arrays", namespace_="", context=<unavailable>, rcl_node_options=0x0000000100f05180, use_intra_process_default=<unavailable>, enable_topic_statistics_default=<unavailable>) at node_base.cpp:49:19 [opt]

I may be able to help you here: the DDS spec says waitsets and guard conditions are things that you can create and delete at any time and that have no relationship with anything else. Cyclone DDS takes a slightly different tack and considers them regular entities that have a parent.

Because the participant used to be created immediately during context initialisation, the RMW layer could use it as the owner of all guard conditions, which also meant they were guaranteed to be cleaned up when the participant entity was deleted. Now the participant is created later, and rclcpp tries to create a guard condition before the first node, and hence the participant, exists.

In Cyclone you don't have to use a participant as the parent (once upon a time, yes, but that hasn't been the case for quite some time). Above the participant sits a domain entity (created at the same time as the first node), and above that sits the Cyclone library itself, which is a singleton object with a constant, pre-defined handle: DDS_CYCLONEDDS_HANDLE (this is already used for waitsets: see rmw_create_wait_set).
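
As an illustration of that hierarchy, here is a small sketch (mine, not code from this PR; it assumes dds_get_parent reports the implicit domain entity and, above it, the library singleton):

```c++
#include <cassert>
#include "dds/dds.h"

int main()
{
  // Create a participant in the default domain; Cyclone implicitly
  // creates the domain entity above it on first use.
  dds_entity_t ppant = dds_create_participant(DDS_DOMAIN_DEFAULT, nullptr, nullptr);
  assert(ppant > 0);

  dds_entity_t domain = dds_get_parent(ppant);  // the implicit domain entity
  dds_entity_t lib = dds_get_parent(domain);    // the library singleton
  assert(lib == DDS_CYCLONEDDS_HANDLE);

  // Deleting the library handle tears down everything underneath it.
  dds_delete(DDS_CYCLONEDDS_HANDLE);
  return 0;
}
```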

So if you use DDS_CYCLONEDDS_HANDLE as the owner when calling dds_create_guardcondition, this problem should go away; a sketch follows the list below. There are two things to keep in mind here:

  • That means it is no longer deleted automatically by the call to dds_delete on the participant handle. I believe all guard conditions used are RMW ones, and those are all cleaned up properly anyway, so it is probably not an issue. Otherwise there'll be a memory leak if you create and delete a node in a loop.
  • The waitset will only accept entities in the same scope — if the waitset is owned by a participant, it will only accept entities in that participant; and if it is owned by a domain entity, then only in that domain. Here the RMW waitsets should be ok because they have DDS_CYCLONEDDS_HANDLE as parent, but it will probably cause trouble with the guard condition in the waitset used in the discovery_thread. Changing the owner of that waitset should help.
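
A minimal sketch of the suggested change, assuming the surrounding create/destroy plumbing stays as it is (the two helper names here are illustrative, not the actual rmw_node.cpp functions):

```c++
#include "dds/dds.h"

// Before: the guard condition was parented to the participant, so it was
// deleted automatically together with it.
//   dds_entity_t gc = dds_create_guardcondition(participant_handle);

// After: parent it to the library singleton, which always exists, so it
// can be created before the first node. The trade-off from the list
// above: it must now be deleted explicitly.
dds_entity_t create_rmw_guard_condition()
{
  return dds_create_guardcondition(DDS_CYCLONEDDS_HANDLE);
}

void destroy_rmw_guard_condition(dds_entity_t gc)
{
  dds_delete(gc);  // no longer reclaimed by deleting the participant
}
```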

Signed-off-by: Ivan Santiago Paunovic <[email protected]>
@ivanpauno
Member Author

@eboasson I was going to ask how to handle that situation, thanks for answering it beforehand.

> That means it is no longer deleted automatically by the call to dds_delete on the participant handle. I believe all guard conditions used are RMW ones, and those are all cleaned up properly anyway, so it is probably not an issue. Otherwise there'll be a memory leak if you create and delete a node in a loop.

Yes, all create_guard_condition calls seem to have a matching destroy_guard_condition call.

> The waitset will only accept entities in the same scope — if the waitset is owned by a participant, it will only accept entities in that participant; and if it is owned by a domain entity, then only in that domain. Here the RMW waitsets should be ok because they have DDS_CYCLONEDDS_HANDLE as parent, but it will probably cause trouble with the guard condition in the waitset used in the discovery_thread. Changing the owner of that waitset should help.

Considering that wait sets can be created in rmw without passing a context or a node, I think this approach is the more appropriate one.
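
To make the scoping rule concrete, a short sketch (my own illustration, not code from this PR; it relies on the behaviour described above, that attaching an out-of-scope entity is rejected):

```c++
#include "dds/dds.h"

// Both the waitset and the guard condition live at library scope, so the
// attach is in scope and works even before any participant exists.
dds_return_t scope_example()
{
  dds_entity_t ws = dds_create_waitset(DDS_CYCLONEDDS_HANDLE);
  dds_entity_t gc = dds_create_guardcondition(DDS_CYCLONEDDS_HANDLE);
  dds_return_t rc = dds_waitset_attach(ws, gc, 0);
  // Had ws been created with a participant as its owner instead,
  // attaching gc (owned by DDS_CYCLONEDDS_HANDLE) would be rejected:
  // a waitset only accepts entities from its own scope.
  dds_delete(gc);
  dds_delete(ws);
  return rc;
}
```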

@ivanpauno
Member Author

ivanpauno commented Apr 30, 2020

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status

@ivanpauno ivanpauno marked this pull request as ready for review April 30, 2020 16:16
Collaborator

@eboasson eboasson left a comment


Thanks @ivanpauno! It basically looks good to me, except for the handling of failure during init and what looks like a use-after-free in rmw_destroy_node. I think both have trivial fixes.
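
For context, a use-after-free of the kind mentioned typically looks like the hypothetical sketch below; `CddsNode` and `destroy_node` here are stand-ins, not the actual rmw_node.cpp code. The usual trivial fix is to copy the fields out before freeing:

```c++
#include "dds/dds.h"

// Hypothetical node struct for illustration only.
struct CddsNode
{
  dds_entity_t pub;
  dds_entity_t sub;
};

// Buggy shape: the node is freed, then dereferenced.
//   delete node;
//   dds_delete(node->pub);  // use-after-free
//
// Fixed shape: read the handles out before freeing.
void destroy_node(CddsNode * node)
{
  dds_entity_t pub = node->pub;
  dds_entity_t sub = node->sub;
  delete node;
  dds_delete(pub);
  dds_delete(sub);
}
```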

Signed-off-by: Ivan Santiago Paunovic <[email protected]>
@ivanpauno
Member Author

@eboasson PTAL, I will re-run CI after approval.

@ivanpauno ivanpauno requested a review from eboasson April 30, 2020 17:18
Contributor

@sloretz sloretz left a comment


Partial review; I'm not too familiar with this code base. Hope it's useful.

Signed-off-by: Ivan Santiago Paunovic <[email protected]>
@ivanpauno ivanpauno requested a review from sloretz April 30, 2020 17:38
Collaborator

@eboasson eboasson left a comment


Looks good to me!

@ivanpauno
Member Author

ivanpauno commented Apr 30, 2020

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@ivanpauno
Member Author

While investigating another issue, I ran into a problem similar to the one discussed here: #176 (comment).
Solved in a8e5131.

New CI, this time limited to test_communication, test_rclcpp, and rmw_cyclonedds_cpp to get results faster:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@ivanpauno
Member Author

@eboasson @sloretz PTAL

@ivanpauno
Member Author

Ok, this solves most of the failures, but we have a new one, https://ci.ros2.org/job/ci_windows/10483/testReport/junit/test_rclcpp/test_executor__rmw_cyclonedds_cpp/notify/, which shows up across all implementations.

Signed-off-by: Ivan Santiago Paunovic <[email protected]>
@ivanpauno
Member Author

3d807fe 🤦‍♂️.
I think that one was the last bizarre error.

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@sloretz
Contributor

sloretz commented Apr 30, 2020

CI summary: Looks good to merge to me

  • Linux: no new failures
  • Linux aarch64: green
  • osx: Possibly one new failure test_rclcpp.test_parameter_server_cpp__rmw_cyclonedds_cpp.xunit.missing_result
  • windows: no new failures

While there appears to be a new failure of test_parameter_server with CycloneDDS, there is a longstanding issue (ros2/system_tests#420) for the same behavior with Fast-RTPS. The console output shows the new failure happening in the same place as with Fast-RTPS, so I'm inclined to think it's a test issue rather than something caused by this PR:

64: Test that the second executable terminates after a finite amount of time. ... [INFO] [test_parameters_server_cpp-1]: process started with pid [19605]
64: [INFO] [test_remote_parameters_cpp-2]: process started with pid [19606]
64: [test_remote_parameters_cpp-2] Running main() from /Users/osrf/jenkins-agent/workspace/ci_osx/ws/install/gtest_vendor/src/gtest_vendor/src/gtest_main.cc
64: [test_remote_parameters_cpp-2] [==========] Running 6 tests from 2 test cases.
64: [test_remote_parameters_cpp-2] [----------] Global test environment set-up.
64: [test_remote_parameters_cpp-2] [----------] 3 tests from parameters__rmw_implementation
64: [test_remote_parameters_cpp-2] [ RUN      ] parameters__rmw_implementation.test_remote_parameters_async
64: [test_remote_parameters_cpp-2] Setting parameters
64: [test_remote_parameters_cpp-2] Got set_parameters result
64: [test_remote_parameters_cpp-2] Listing parameters with recursive depth
64: [test_remote_parameters_cpp-2] Listing parameters with depth 1
64: [test_remote_parameters_cpp-2] Listing parameters with depth 2
64: [test_remote_parameters_cpp-2] Getting parameters
64: [test_remote_parameters_cpp-2] Getting nonexistent parameters
64: [test_remote_parameters_cpp-2] Listing parameters with recursive depth
64: [test_remote_parameters_cpp-2] [       OK ] parameters__rmw_implementation.test_remote_parameters_async (195 ms)
64: [test_remote_parameters_cpp-2] [ RUN      ] parameters__rmw_implementation.test_remote_parameters_sync
64/77 Test #64: test_parameter_server_cpp__rmw_fastrtps_cpp .....................***Timeout  60.23 sec
40: [INFO] [test_remote_parameters_cpp-2]: process started with pid [19388]
40: [test_remote_parameters_cpp-2] Running main() from /Users/osrf/jenkins-agent/workspace/ci_osx/ws/install/gtest_vendor/src/gtest_vendor/src/gtest_main.cc
40: [test_remote_parameters_cpp-2] [==========] Running 6 tests from 2 test cases.
40: [test_remote_parameters_cpp-2] [----------] Global test environment set-up.
40: [test_remote_parameters_cpp-2] [----------] 3 tests from parameters__rmw_implementation
40: [test_remote_parameters_cpp-2] [ RUN      ] parameters__rmw_implementation.test_remote_parameters_async
40: [test_parameters_server_cpp-1] [ERROR] [1588287331.386203070] [rclcpp]: executor taking a service server request from service '/test_parameters_server_allow_undeclared/list_parameters' unexpectedly failed: error not set
40: [test_parameters_server_cpp-1] [ERROR] [1588287331.390162050] [rclcpp]: executor taking a service server request from service '/test_parameters_server_allow_undeclared/set_parameters' unexpectedly failed: error not set
40: [test_parameters_server_cpp-1] [ERROR] [1588287331.396424426] [rclcpp]: executor taking a service server request from service '/test_parameters_server_allow_undeclared/get_parameters' unexpectedly failed: error not set
40: [test_remote_parameters_cpp-2] Setting parameters
40: [test_remote_parameters_cpp-2] Got set_parameters result
40: [test_remote_parameters_cpp-2] Listing parameters with recursive depth
40: [test_remote_parameters_cpp-2] Listing parameters with depth 1
40: [test_remote_parameters_cpp-2] Listing parameters with depth 2
40: [test_remote_parameters_cpp-2] Getting parameters
40: [test_remote_parameters_cpp-2] Getting nonexistent parameters
40: [test_remote_parameters_cpp-2] Listing parameters with recursive depth
40: [test_remote_parameters_cpp-2] [       OK ] parameters__rmw_implementation.test_remote_parameters_async (441 ms)
40: [test_remote_parameters_cpp-2] [ RUN      ] parameters__rmw_implementation.test_remote_parameters_sync
40/77 Test #40: test_parameter_server_cpp__rmw_cyclonedds_cpp ...................***Timeout  60.23 sec

@sloretz sloretz merged commit f820994 into master Apr 30, 2020