
Initialize participant on first use. Destroy participant after last node is destroyed. #176

Merged: 7 commits, Apr 30, 2020

Conversation

ivanpauno
Member

@ivanpauno ivanpauno commented Apr 30, 2020

The same approach was used in rmw_fastrtps.
Fixes #174.
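
To make the approach concrete, here is a minimal sketch of lazy, reference-counted participant handling. This is my own illustration of the idea in the PR title, not the actual rmw_cyclonedds_cpp code; `ContextImpl`, `ref_ppant`, and `unref_ppant` are hypothetical names.

```c++
#include <cstddef>
#include <mutex>
#include "dds/dds.h"

// Hypothetical per-context state; the real rmw_cyclonedds_cpp struct differs.
struct ContextImpl
{
  std::mutex lock;
  size_t node_count{0};
  dds_entity_t ppant{0};  // 0 = participant not yet created

  // Called from rmw_create_node: create the participant on first use.
  dds_entity_t ref_ppant(dds_domainid_t domain_id)
  {
    std::lock_guard<std::mutex> guard(lock);
    if (node_count == 0) {
      dds_entity_t pp = dds_create_participant(domain_id, nullptr, nullptr);
      if (pp < 0) {
        return pp;  // creation failed; count stays 0, nothing to clean up
      }
      ppant = pp;
    }
    ++node_count;
    return ppant;
  }

  // Called from rmw_destroy_node: tear down after the last node is destroyed.
  void unref_ppant()
  {
    std::lock_guard<std::mutex> guard(lock);
    if (--node_count == 0) {
      dds_delete(ppant);  // also deletes everything parented to it
      ppant = 0;
    }
  }
};
```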

Initialize participant on first use. Destroy participant after last node is destroyed

Signed-off-by: Ivan Santiago Paunovic <[email protected]>
@ivanpauno ivanpauno requested a review from eboasson April 30, 2020 14:45
@ivanpauno ivanpauno self-assigned this Apr 30, 2020
Signed-off-by: Ivan Santiago Paunovic <[email protected]>
@eboasson
Collaborator

Wow, that's super quick!

Of course I had to try, and I get:

6: libc++abi.dylib: terminating with uncaught exception of type rclcpp::exceptions::RCLError: failed to create interrupt guard condition: failed to create guardcondition, at /Users/erik/ros2_ws/src/ros2/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:2618, at /Users/erik/ros2_ws/src/ros2/rcl/rcl/src/rcl/guard_condition.c:90

which corresponds to this stop in the debugger:

There is a running process, kill it and restart?: [Y/n] 
Process 37486 exited with status = 9 (0x00000009) 
Process 37490 launched: '/Users/erik/ros2_ws/build/test_communication/test_publisher_subscriber_cpp' (x86_64)
libddsc.0.5.0.dylib was compiled with optimization - stepping may behave oddly; variables may not be available.
Process 37490 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
    frame #0: 0x00000001019ea88e libddsc.0.5.0.dylib`dds_create_guardcondition(owner=0) at dds_guardcond.c:42:13 [opt]
   39  	     init here is cheap.  If it is DDS_CYCLONEDDS_HANDLE, we may have to initialise the library, so
   40  	     have to call it.  If it is some bogus value and the library is not initialised yet ... so be
   41  	     it.  Naturally, this requires us to call delete on DDS_CYCLONEDDS_HANDLE afterward. */
-> 42  	  if ((rc = dds_init ()) < 0)
   43  	    return rc;
   44  	
   45  	  if ((rc = dds_entity_lock (owner, DDS_KIND_DONTCARE, &e)) != DDS_RETCODE_OK)
Target 0: (test_publisher_subscriber_cpp) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
  * frame #0: 0x00000001019ea88e libddsc.0.5.0.dylib`dds_create_guardcondition(owner=0) at dds_guardcond.c:42:13 [opt]
    frame #1: 0x000000010190d77a librmw_cyclonedds_cpp.dylib`::rmw_create_guard_condition(rmw_context_t *) [inlined] create_guard_condition(impl=0x0000000100f04d30) at rmw_node.cpp:2617:29 [opt]
    frame #2: 0x000000010190d75b librmw_cyclonedds_cpp.dylib`::rmw_create_guard_condition(context=<unavailable>) at rmw_node.cpp:2633 [opt]
    frame #3: 0x0000000100b1f215 librcl.dylib`__rcl_guard_condition_init_from_rmw_impl(guard_condition=0x0000000100f05160, rmw_guard_condition=0x0000000000000000, context=0x0000000100f04940, options=rcl_guard_condition_options_t @ 0x00007ffeefbf7e50) at guard_condition.c:86:41 [opt]
    frame #4: 0x000000010071d54a librclcpp.dylib`rclcpp::node_interfaces::NodeBase::NodeBase(this=0x0000000100f050c0, node_name="test_publisher_subscriber_Arrays", namespace_="", context=<unavailable>, rcl_node_options=0x0000000100f05180, use_intra_process_default=<unavailable>, enable_topic_statistics_default=<unavailable>) at node_base.cpp:49:19 [opt]

I may be able to help you here: the DDS spec says waitsets and guard conditions are things that you can create and delete at any time and that have no relationship with anything else. Cyclone DDS takes a slightly different tack and considers them regular entities that have a parent.

Because the participant used to be created immediately during context initialisation, the RMW layer could use it as the owner of all guard conditions, which also meant they were guaranteed to be cleaned up when the participant entity was deleted. Now the participant is created later, and rclcpp tries to create a guard condition before the first node, and hence the participant, exists.

In Cyclone you don't have to use a participant as the parent (once upon a time, yes, but that hasn't been the case for quite some time). Above the participant sits a domain entity (created at the same time as the first node), and above that sits the Cyclone library itself, which is a singleton object with a constant, pre-defined handle: DDS_CYCLONEDDS_HANDLE (this is already used for waitsets: see rmw_create_wait_set).
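
As an illustration of that hierarchy, here is a small sketch (mine, not code from this PR; it assumes dds_get_parent reports the implicit domain entity and, above it, the library singleton):

```c++
#include <cassert>
#include "dds/dds.h"

int main()
{
  // Create a participant in the default domain; Cyclone implicitly
  // creates the domain entity above it on first use.
  dds_entity_t ppant = dds_create_participant(DDS_DOMAIN_DEFAULT, nullptr, nullptr);
  assert(ppant > 0);

  dds_entity_t domain = dds_get_parent(ppant);  // the implicit domain entity
  dds_entity_t lib = dds_get_parent(domain);    // the library singleton
  assert(lib == DDS_CYCLONEDDS_HANDLE);

  // Deleting the library handle tears down everything underneath it.
  dds_delete(DDS_CYCLONEDDS_HANDLE);
  return 0;
}
```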

So if you use DDS_CYCLONEDDS_HANDLE as the owner when calling dds_create_guardcondition, this problem should go away; a sketch follows the list below. There are two things to keep in mind here:

  • That means it is no longer deleted automatically by the call to dds_delete on the participant handle. I believe all guard conditions used are RMW ones, and those are all cleaned up properly anyway, so it is probably not an issue. Otherwise there'll be a memory leak if you create and delete a node in a loop.
  • The waitset will only accept entities in the same scope — if the waitset is owned by a participant, it will only accept entities in that participant; and if it is owned by a domain entity, then only in that domain. Here the RMW waitsets should be ok because they have DDS_CYCLONEDDS_HANDLE as parent, but it will probably cause trouble with the guard condition in the waitset used in the discovery_thread. Changing the owner of that waitset should help.
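
A minimal sketch of the suggested change, assuming the surrounding create/destroy plumbing stays as it is (the two helper names here are illustrative, not the actual rmw_node.cpp functions):

```c++
#include "dds/dds.h"

// Before: the guard condition was parented to the participant, so it was
// deleted automatically together with it.
//   dds_entity_t gc = dds_create_guardcondition(participant_handle);

// After: parent it to the library singleton, which always exists, so it
// can be created before the first node. The trade-off from the list
// above: it must now be deleted explicitly.
dds_entity_t create_rmw_guard_condition()
{
  return dds_create_guardcondition(DDS_CYCLONEDDS_HANDLE);
}

void destroy_rmw_guard_condition(dds_entity_t gc)
{
  dds_delete(gc);  // no longer reclaimed by deleting the participant
}
```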

Signed-off-by: Ivan Santiago Paunovic <[email protected]>
@ivanpauno
Member Author

@eboasson I was going to ask how to handle that situation, thanks for answering it beforehand.

> That means it is no longer deleted automatically by the call to dds_delete on the participant handle. I believe all guard conditions used are RMW ones, and those are all cleaned up properly anyway, so it is probably not an issue. Otherwise there'll be a memory leak if you create and delete a node in a loop.

Yes, all create_guard_condition calls seem to have a matching destroy_guard_condition call.

> The waitset will only accept entities in the same scope — if the waitset is owned by a participant, it will only accept entities in that participant; and if it is owned by a domain entity, then only in that domain. Here the RMW waitsets should be ok because they have DDS_CYCLONEDDS_HANDLE as parent, but it will probably cause trouble with the guard condition in the waitset used in the discovery_thread. Changing the owner of that waitset should help.

Considering that wait sets can be created in rmw without passing a context or a node, I think this approach is the more appropriate one.
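
To make the scoping rule concrete, a short sketch (my own illustration, not code from this PR; it relies on the behaviour described above, that attaching an out-of-scope entity is rejected):

```c++
#include "dds/dds.h"

// Both the waitset and the guard condition live at library scope, so the
// attach is in scope and works even before any participant exists.
dds_return_t scope_example()
{
  dds_entity_t ws = dds_create_waitset(DDS_CYCLONEDDS_HANDLE);
  dds_entity_t gc = dds_create_guardcondition(DDS_CYCLONEDDS_HANDLE);
  dds_return_t rc = dds_waitset_attach(ws, gc, 0);
  // Had ws been created with a participant as its owner instead,
  // attaching gc (owned by DDS_CYCLONEDDS_HANDLE) would be rejected:
  // a waitset only accepts entities from its own scope.
  dds_delete(gc);
  dds_delete(ws);
  return rc;
}
```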

@ivanpauno
Member Author

ivanpauno commented Apr 30, 2020

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status

@ivanpauno ivanpauno marked this pull request as ready for review April 30, 2020 16:16
Collaborator

@eboasson eboasson left a comment


Thanks @ivanpauno! It basically looks good to me, except for the handling of failure during init and what looks like a use-after-free in rmw_destroy_node. I think both have trivial fixes.
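
For context, a use-after-free of the kind mentioned typically looks like the hypothetical sketch below; `CddsNode` and `destroy_node` here are stand-ins, not the actual rmw_node.cpp code. The usual trivial fix is to copy the fields out before freeing:

```c++
#include "dds/dds.h"

// Hypothetical node struct for illustration only.
struct CddsNode
{
  dds_entity_t pub;
  dds_entity_t sub;
};

// Buggy shape: the node is freed, then dereferenced.
//   delete node;
//   dds_delete(node->pub);  // use-after-free
//
// Fixed shape: read the handles out before freeing.
void destroy_node(CddsNode * node)
{
  dds_entity_t pub = node->pub;
  dds_entity_t sub = node->sub;
  delete node;
  dds_delete(pub);
  dds_delete(sub);
}
```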

Signed-off-by: Ivan Santiago Paunovic <[email protected]>
@ivanpauno
Member Author

@eboasson PTAL, I will re-run CI after approval.

@ivanpauno ivanpauno requested a review from eboasson April 30, 2020 17:18
Contributor

@sloretz sloretz left a comment


Partial review; I'm not too familiar with this code base. Hope it's useful.

Signed-off-by: Ivan Santiago Paunovic <[email protected]>
@ivanpauno ivanpauno requested a review from sloretz April 30, 2020 17:38
Collaborator

@eboasson eboasson left a comment


Looks good to me!

@ivanpauno
Member Author

ivanpauno commented Apr 30, 2020

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@ivanpauno
Member Author

While investigating another issue, I ran into a problem similar to the one discussed here: #176 (comment).
Solved in a8e5131.

New CI, this time limited to test_communication, test_rclcpp, and rmw_cyclonedds_cpp to get results faster:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@ivanpauno
Member Author

@eboasson @sloretz PTAL

@ivanpauno
Member Author

Ok, this solves most of the failures, but we have a new one, https://ci.ros2.org/job/ci_windows/10483/testReport/junit/test_rclcpp/test_executor__rmw_cyclonedds_cpp/notify/, which shows up across all implementations.

Signed-off-by: Ivan Santiago Paunovic <[email protected]>
@ivanpauno
Member Author

3d807fe 🤦‍♂️.
I think that one was the last bizarre error.

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@sloretz
Contributor

sloretz commented Apr 30, 2020

CI summary: Looks good to merge to me

  • Linux: no new failures
  • Linux aarch64: green
  • osx: Possibly one new failure test_rclcpp.test_parameter_server_cpp__rmw_cyclonedds_cpp.xunit.missing_result
  • windows: no new failures

While there appears to be a new failure of test_parameter_server with CycloneDDS, there is a longstanding issue (ros2/system_tests#420) for the same behavior with Fast-RTPS. The console output shows the new failure happening in the same place as with Fast-RTPS, so I'm inclined to think it's a test issue rather than something caused by this PR:

64: Test that the second executable terminates after a finite amount of time. ... [INFO] [test_parameters_server_cpp-1]: process started with pid [19605]
64: [INFO] [test_remote_parameters_cpp-2]: process started with pid [19606]
64: [test_remote_parameters_cpp-2] Running main() from /Users/osrf/jenkins-agent/workspace/ci_osx/ws/install/gtest_vendor/src/gtest_vendor/src/gtest_main.cc
64: [test_remote_parameters_cpp-2] [==========] Running 6 tests from 2 test cases.
64: [test_remote_parameters_cpp-2] [----------] Global test environment set-up.
64: [test_remote_parameters_cpp-2] [----------] 3 tests from parameters__rmw_implementation
64: [test_remote_parameters_cpp-2] [ RUN      ] parameters__rmw_implementation.test_remote_parameters_async
64: [test_remote_parameters_cpp-2] Setting parameters
64: [test_remote_parameters_cpp-2] Got set_parameters result
64: [test_remote_parameters_cpp-2] Listing parameters with recursive depth
64: [test_remote_parameters_cpp-2] Listing parameters with depth 1
64: [test_remote_parameters_cpp-2] Listing parameters with depth 2
64: [test_remote_parameters_cpp-2] Getting parameters
64: [test_remote_parameters_cpp-2] Getting nonexistent parameters
64: [test_remote_parameters_cpp-2] Listing parameters with recursive depth
64: [test_remote_parameters_cpp-2] [       OK ] parameters__rmw_implementation.test_remote_parameters_async (195 ms)
64: [test_remote_parameters_cpp-2] [ RUN      ] parameters__rmw_implementation.test_remote_parameters_sync
64/77 Test #64: test_parameter_server_cpp__rmw_fastrtps_cpp .....................***Timeout  60.23 sec
40: [INFO] [test_remote_parameters_cpp-2]: process started with pid [19388]
40: [test_remote_parameters_cpp-2] Running main() from /Users/osrf/jenkins-agent/workspace/ci_osx/ws/install/gtest_vendor/src/gtest_vendor/src/gtest_main.cc
40: [test_remote_parameters_cpp-2] [==========] Running 6 tests from 2 test cases.
40: [test_remote_parameters_cpp-2] [----------] Global test environment set-up.
40: [test_remote_parameters_cpp-2] [----------] 3 tests from parameters__rmw_implementation
40: [test_remote_parameters_cpp-2] [ RUN      ] parameters__rmw_implementation.test_remote_parameters_async
40: [test_parameters_server_cpp-1] [ERROR] [1588287331.386203070] [rclcpp]: executor taking a service server request from service '/test_parameters_server_allow_undeclared/list_parameters' unexpectedly failed: error not set
40: [test_parameters_server_cpp-1] [ERROR] [1588287331.390162050] [rclcpp]: executor taking a service server request from service '/test_parameters_server_allow_undeclared/set_parameters' unexpectedly failed: error not set
40: [test_parameters_server_cpp-1] [ERROR] [1588287331.396424426] [rclcpp]: executor taking a service server request from service '/test_parameters_server_allow_undeclared/get_parameters' unexpectedly failed: error not set
40: [test_remote_parameters_cpp-2] Setting parameters
40: [test_remote_parameters_cpp-2] Got set_parameters result
40: [test_remote_parameters_cpp-2] Listing parameters with recursive depth
40: [test_remote_parameters_cpp-2] Listing parameters with depth 1
40: [test_remote_parameters_cpp-2] Listing parameters with depth 2
40: [test_remote_parameters_cpp-2] Getting parameters
40: [test_remote_parameters_cpp-2] Getting nonexistent parameters
40: [test_remote_parameters_cpp-2] Listing parameters with recursive depth
40: [test_remote_parameters_cpp-2] [       OK ] parameters__rmw_implementation.test_remote_parameters_async (441 ms)
40: [test_remote_parameters_cpp-2] [ RUN      ] parameters__rmw_implementation.test_remote_parameters_sync
40/77 Test #40: test_parameter_server_cpp__rmw_cyclonedds_cpp ...................***Timeout  60.23 sec

@sloretz sloretz merged commit f820994 into master Apr 30, 2020