Segfault in MultiThreadedExecutor using `rmw_fastrtps_cpp` #728

sea-bass · 2023-10-25T14:47:05Z

Bug report

Required Info:

Operating System:
- Ubuntu 22.04 (inside a Docker container)
Installation type:
- Binaries
Version or commit hash:
- The latest available binaries on Humble
DDS implementation:
- rmw_fastrtps_cpp
Client library (if applicable):
- rclcpp

Steps to reproduce issue

We unfortunately don't have a full reproduction case, but we are creating a node with a multi-threaded executor as follows:

rclcpp::executors::MultiThreadedExecutor exec(rclcpp::ExecutorOptions(), 0, false, std::chrono::milliseconds(250));

The segfault we see is:

[objective_server_node_main-8] [ERROR] [1684327668.068245959] [objective_server_node]: Execution of behavior tree for objective `3 Waypoints Pick and Place` did not succeed.
[objective_server_node_main-8] Stack trace (most recent call last) in thread 833:
[objective_server_node_main-8] #8    Object "/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1", at 0xffffffffffffffff, in 
[objective_server_node_main-8] #7    Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0xffff9e5a5d1b, in 
[objective_server_node_main-8] #6    Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0xffff9e53d5c7, in 
[objective_server_node_main-8] #5    Object "/usr/lib/aarch64-linux-gnu/libstdc++.so.6.0.30", at 0xffff9e7731fb, in 
[objective_server_node_main-8] [1684327680.555]: Sequence                  IDLE -> RUNNING
[objective_server_node_main-8] [1684327680.555]: ClearSnapshot             IDLE -> RUNNING
[objective_server_node_main-8] #4    Source "./src/rclcpp/executors/multi_threaded_executor.cpp", line 85, in run [0xffff9edbffc7]
[objective_server_node_main-8] #3    Source "./src/rclcpp/executor.cpp", line 912, in get_next_executable [0xffff9edb8b5b]
[objective_server_node_main-8] #2    Source "./src/rclcpp/executor.cpp", line 756, in wait_for_work [0xffff9edb5bdf]
[objective_server_node_main-8] #1    Object "/opt/ros/humble/lib/librcl.so", at 0xffff9e185b97, in rcl_wait
[objective_server_node_main-8] #0    Object "/opt/ros/humble/lib/librmw_fastrtps_shared_cpp.so", at 0xffff9d5f5718, in rmw_fastrtps_shared_cpp::__rmw_wait(char const*, rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, rmw_wait_set_s*, rmw_time_s const*)
[objective_server_node_main-8] Segmentation fault (Address not mapped to object [0xaaa1b04a5f08])
[ERROR] [objective_server_node_main-8]: process has died [pid 666, exit code -11, cmd '/opt/overlay_ws/install/moveit_studio_agent/lib/moveit_studio_agent/objective_server_node_main --ros-args --log-level error --ros-args -r __node:=objective_server_node --params-file /tmp/launch_params_j4cwbvx_ --params-file /tmp/launch_params_7oyj5lbr --params-file /tmp/launch_params_2aqbg8_q --params-file /tmp/launch_params_12q1oav_ --params-file /tmp/launch_params_h4pxzy2y --params-file /tmp/launch_params_8vn2ch2o --params-file /tmp/launch_params_1c4b41ac --params-file /tmp/launch_params_blie9quv'].

One of our engineers looked into this with a debugger and saw:

Thread 25 "objective_serve" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xffffa535e660 (LWP 74792)]
0x0000fffff6417588 in rmw_fastrtps_shared_cpp::__rmw_wait (identifier=, subscriptions=, guard_conditions=0xaaaaab4e0b30, services=, clients=, events=, wait_set=, wait_timeout=) at /opt/underlay_ws/src/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:220
220 if (!condition->get_trigger_value()) {
(gdb) bt
#0 0x0000fffff6417588 in rmw_fastrtps_shared_cpp::__rmw_wait (
identifier=, subscriptions=,
guard_conditions=0xaaaaab4e0b30, services=,
clients=, events=, wait_set=,
wait_timeout=)
at /opt/underlay_ws/src/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:220
#1 0x0000fffff6fc5b98 in rcl_wait (wait_set=0xffffffffa110, timeout=250000000)
at ./src/rcl/wait.c:595
#2 0x0000fffff7bf5be0 in rclcpp::Executor::wait_for_work (
this=0xffffffffa060, timeout=...) at /usr/include/c++/11/chrono:521
#3 0x0000fffff7bf8b5c in rclcpp::Executor::get_next_executable (
this=0xffffffffa060, any_executable=..., timeout=...)
at ./src/rclcpp/executor.cpp:912
#4 0x0000fffff7bfffc8 in rclcpp::executors::MultiThreadedExecutor::run (
this=0xffffffffa060, this_thread_number=)
at ./src/rclcpp/executors/multi_threaded_executor.cpp:85
#5 0x0000fffff75b31fc in ?? () from /lib/aarch64-linux-gnu/libstdc++.so.6
#6 0x0000fffff737d5c8 in ?? () from /lib/aarch64-linux-gnu/libc.so.6
#7 0x0000fffff73e5d1c in ?? () from /lib/aarch64-linux-gnu/libc.so.6`

Additional information

This segfault does not occur using Cyclone DDS.

Implementation considerations

The text was updated successfully, but these errors were encountered:

fujitatomoya · 2023-10-25T16:12:31Z

Segmentation fault (Address not mapped to object [0xaaa1b04a5f08])
0x0000fffff6417588 in rmw_fastrtps_shared_cpp::__rmw_wait (identifier=, subscriptions=, guard_conditions=0xaaaaab4e0b30, services=, clients=, events=, wait_set=, wait_timeout=) at /opt/underlay_ws/src/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:220
220 if (!condition->get_trigger_value()) {

rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp

Line 220 in 247e491

if (!condition->get_trigger_value()) {

static_cast<eprosima::fastdds::dds::GuardCondition *>(data) is NULL, that led to the segmentation fault.

rclcpp::executors::MultiThreadedExecutor exec(rclcpp::ExecutorOptions(), 0, false, std::chrono::milliseconds(250));

could you share more conditions when this problem happens so that we can try to reproduce the issue? e.g) do you manage your own rclcpp::GuardCondition?

fujitatomoya · 2023-10-25T16:13:03Z

CC: @Barry-Xu-2018 @iuhilnehc-ynos

sea-bass · 2023-10-25T16:23:33Z

could you share more conditions when this problem happens so that we can try to reproduce the issue? e.g) do you manage your own rclcpp::GuardCondition?

Unfortunately the code is part of a private code base and we haven't been able to come up with a minimal repro case. But our main is fairly simple:

int main(int argc, char* argv[])
{
  rclcpp::init(argc, argv);
  rclcpp::NodeOptions opt;
  opt.automatically_declare_parameters_from_overrides(true);
  auto node = std::make_shared<OurNodeType>(opt);

  // This's a workaround a bug in MultiThreadedExecutor deadlock when spinning without a timeout
  rclcpp::executors::MultiThreadedExecutor exec(rclcpp::ExecutorOptions(), 0, false, std::chrono::milliseconds(250));
  exec.add_node(node->get_node_base_interface());
  exec.spin();
  exec.remove_node(node->get_node_base_interface());

  rclcpp::shutdown();
}

The node itself has an action server in it, and we notice the segfault occurring -- not consistently, but frequently -- when one action is cancelled and we try start a new one.

Another thing to note is that these actions that execute can (and often do) create their own subscribers, publishers, service/action clients, etc. So maybe that also plays into the problem?

Barry-Xu-2018 · 2023-10-28T02:43:19Z

@fujitatomoya

According to current information, the problem is related to
https://github.com/ros2/rclcpp/blob/8709146df87ab860c88f03a83f6df7d0347d0085/rclcpp/include/rclcpp/strategies/allocator_memory_strategy.hpp#L494
Here, shared pointer wasn't used.
It is used in Executor::wait_for_work() without lock.

It can be removed under the below function

Executor::wait_for_work()
Executor::remove_callback_group_from_map()
called by Executor::remove_callback_group() or Executor::remove_node()
Executor::~Executor()

So, while GuardCondition is used in Executor::wait_for_work(), it is possible that it is removed by Executor::remove_callback_group_from_map().
I don't know why current implementation don't use shared pointer to save GuardCondition.
Besides, remove action should be in Executor::wait_for_work() and Executor::~Executor().

sea-bass · 2024-01-14T23:30:04Z

I'm just coming back to this issue and I think adding locks to the functions in AllocatorMemoryStrategy that touch the guard_conditions_ member variable seems promising. I might have time this week to take a look and land some fixes.

If this works, seems like this file hasn't been modified in a while so I'm hopeful we can backport the fix to humble.

sea-bass · 2024-01-15T04:06:51Z

Update: I tried putting locks in both allocator_memory_strategy.cpp and executor.cpp and the chain of FastDDS issues keeps elevating up the chain with more segfaults.

I think the problem is more pervasive than just stuff at the rclcpp layer, sadly.

[planning_context_manager_node-2] terminate called after throwing an instance of 'std::out_of_range'
[planning_context_manager_node-2]   what():  vector::_M_range_check: __n (which is 2674159792) >= this->size() (which is 142)
[planning_context_manager_node-2] Stack trace (most recent call last) in thread 10468:
[move_group-1] double free or corruption (out)
[move_group-1] Stack trace (most recent call last) in thread 10409:
[waypoint_manager_node-4] corrupted size vs. prev_size
[planning_context_manager_node-2] #22   Object "", at 0xffffffffffffffff, in 
[planning_context_manager_node-2] #21   Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc537bdd84f, in 
[planning_context_manager_node-2] #20   Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc537b4bac2, in 
[planning_context_manager_node-2] #19   Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc537ddc252, in 
[component_container_mt-8] double free or corruption (out)
[planning_context_manager_node-2] #18   Source "/opt/underlay_ws/src/rclcpp/rclcpp/src/rclcpp/signal_handler.cpp", line 271, in deferred_signal_handler [0x7fc53868087a]
[planning_context_manager_node-2]         268:             "deferred_signal_handler(): "
[planning_context_manager_node-2]         269:             "shutting down rclcpp::Context @ %p, because it had shutdown_on_signal == true",
[planning_context_manager_node-2]         270:             static_cast<void *>(context_ptr.get()));
[planning_context_manager_node-2]       > 271:           context_ptr->shutdown("signal handler");
[planning_context_manager_node-2]         272:         }
[planning_context_manager_node-2]         273:       }
[planning_context_manager_node-2]         274:     }
[planning_context_manager_node-2] #17   Source "/opt/underlay_ws/src/rclcpp/rclcpp/src/rclcpp/context.cpp", line 343, in shutdown [0x7fc5385b3ae9]
[planning_context_manager_node-2]         340:     std::lock_guard<std::recursive_mutex> guard(*logging_mutex_);
[planning_context_manager_node-2]         341:     size_t & count = get_logging_reference_count();
[planning_context_manager_node-2]         342:     if (0u == --count) {
[planning_context_manager_node-2]       > 343:       rcl_ret_t rcl_ret = rcl_logging_fini();
[planning_context_manager_node-2]         344:       if (RCL_RET_OK != rcl_ret) {
[planning_context_manager_node-2]         345:         RCUTILS_SAFE_FWRITE_TO_STDERR(
[planning_context_manager_node-2]         346:           RCUTILS_STRINGIFY(__file__) ":"
[planning_context_manager_node-2] #16   Object "/opt/ros/humble/lib/librcl.so", at 0x7fc537f4a1b6, in rcl_logging_fini
[planning_context_manager_node-2] #15   Object "/opt/ros/humble/lib/librcl.so", at 0x7fc537f49dcc, in 
[planning_context_manager_node-2] #14   Object "/opt/ros/humble/lib/librcl.so", at 0x7fc537f49521, in 
[planning_context_manager_node-2] #13   Object "/opt/ros/humble/lib/librmw_fastrtps_shared_cpp.so", at 0x7fc52220f13b, in rmw_fastrtps_shared_cpp::__rmw_destroy_publisher(char const*, rmw_node_s const*, rmw_publisher_s*)
[planning_context_manager_node-2] #12   Object "/opt/ros/humble/lib/librmw_fastrtps_shared_cpp.so", at 0x7fc52220eccb, in rmw_fastrtps_shared_cpp::destroy_publisher(char const*, CustomParticipantInfo*, rmw_publisher_s*)
[planning_context_manager_node-2] #11   Object "/opt/ros/humble/lib/libfastrtps.so.2.6.6", at 0x7fc521ce29fe, in eprosima::fastdds::dds::PublisherImpl::delete_datawriter(eprosima::fastdds::dds::DataWriter const*)
[planning_context_manager_node-2] #10   Object "/opt/ros/humble/lib/libfastrtps.so.2.6.6", at 0x7fc521cc9298, in eprosima::fastdds::dds::DataWriterImpl::~DataWriterImpl()
[planning_context_manager_node-2] #9    Object "/opt/ros/humble/lib/libfastrtps.so.2.6.6", at 0x7fc521b5de42, in 
[planning_context_manager_node-2] #8    Object "/usr/lib/x86_64-linux-gnu/libgcc_s.so.1", at 0x7fc537cf72dc, in _Unwind_Resume
[planning_context_manager_node-2] #7    Object "/usr/lib/x86_64-linux-gnu/libgcc_s.so.1", at 0x7fc537cf6883, in 
[planning_context_manager_node-2] #6    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc537dad958, in __gxx_personality_v0
[planning_context_manager_node-2] #5    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc537dad1e8, in 
[planning_context_manager_node-2] #4    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc537dae20b, in 
[planning_context_manager_node-2] #3    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc537da2b9d, in 
[planning_context_manager_node-2] #2    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc537adf7f2, in abort
[planning_context_manager_node-2] #1    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc537af9475, in raise
[planning_context_manager_node-2] #0    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc537b4d9fc, in pthread_kill
[planning_context_manager_node-2] Aborted (Signal sent by tkill() 10370 1000)

roncapat · 2024-03-20T16:08:24Z

Hi, I'm hitting I think the same bug.

Here's the backtrace:

#5  0x00007ffff7c998fd in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libjemalloc.so.2
#6  0x00007ffff6c7c753 in __gnu_cxx::new_allocator<eprosima::fastdds::dds::Condition*>::allocate (this=0x7fffec8fa6e0, __n=<optimized out>) at /usr/include/c++/11/ext/new_allocator.h:103
#7  std::allocator_traits<std::allocator<eprosima::fastdds::dds::Condition*> >::allocate (__n=<optimized out>, __a=...) at /usr/include/c++/11/bits/alloc_traits.h:464
#8  std::_Vector_base<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::_M_allocate (__n=<optimized out>, this=<optimized out>)
    at /usr/include/c++/11/bits/stl_vector.h:346
#9  std::vector<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::_M_realloc_insert<eprosima::fastdds::dds::Condition*> (this=this@entry=0x7fffec8fa6e0, 
    __position=0x0) at /usr/include/c++/11/bits/vector.tcc:440
#10 0x00007ffff6c7c256 in std::vector<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::emplace_back<eprosima::fastdds::dds::Condition*> (this=0x7fffec8fa6e0)
    at /usr/include/c++/11/bits/vector.tcc:121
#11 std::vector<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::push_back (__x=@0x7fffec8fa700: 0x7fffed1286c0, this=0x7fffec8fa6e0)
    at /usr/include/c++/11/bits/stl_vector.h:1204
#12 rmw_fastrtps_shared_cpp::__rmw_wait (identifier=<optimized out>, subscriptions=0x7ffff6945e08, guard_conditions=<optimized out>, services=0x7ffff6945e50, clients=0x7ffff6945e38, 
    events=0x7ffff6945e68, wait_set=0x7ffff69ef6a0, wait_timeout=0x7fffec8fa840) at /home/user/ros2_iron/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:161
#13 0x00007ffff6ce1ffa in rmw_wait (subscriptions=<optimized out>, guard_conditions=<optimized out>, services=<optimized out>, clients=<optimized out>, events=<optimized out>, wait_set=<optimized out>, 
    wait_timeout=0x7fffec8fa840) at /home/user/ros2_iron/src/ros2/rmw_fastrtps/rmw_fastrtps_cpp/src/rmw_wait.cpp:33
#14 0x00007ffff79701a8 in rcl_wait (wait_set=0x7ffff04d0cc0, timeout=-1) at /home/user/ros2_iron/src/ros2/rcl/rcl/src/rcl/wait.c:595
#15 0x00007ffff7789a0b in rclcpp::Executor::wait_for_work (this=0x7ffff04d0c90, timeout=...) at /usr/include/c++/11/chrono:521
#16 0x00007ffff778a073 in rclcpp::Executor::get_next_executable (this=0x7ffff04d0c90, any_executable=..., timeout=std::chrono::duration = { -1ns })
    at /home/user/ros2_iron/src/ros2/rclcpp/rclcpp/src/rclcpp/executor.cpp:965
#17 0x00007ffff779a372 in rclcpp::executors::MultiThreadedExecutor::run (this=0x7ffff04d0c90, this_thread_number=<optimized out>)
    at /home/user/ros2_iron/src/ros2/rclcpp/rclcpp/src/rclcpp/executors/multi_threaded_executor.cpp:92
#18 0x00007ffff72e6793 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#19 0x00007ffff6e94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#20 0x00007ffff6f26660 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Any updates since January? I can trigger this bug just by spinning at something like 200 Hz an rclcpp::Timer inside a node loaded in a MultiThreadedExecutor.

Update: as per last @Barry-Xu-2018 post in this thread, I believe it's more a problem at rclcpp level because I just tried with CycloneDDS and I hit the same issue.

ros2/rmw_fastrtps#728 Signed-off-by: Tomoya Fujita <[email protected]>

fujitatomoya · 2024-03-27T23:15:01Z

@roncapat

I can trigger this bug just by spinning at something like 200 Hz an rclcpp::Timer inside a node loaded in a MultiThreadedExecutor.

can you provide the reproducible example? i cannot reproduce the issue with fujitatomoya/ros2_test_prover@db4a5ab

i think we need add and remove nodes to&from the executor to make it happen?

I believe it's more a problem at rclcpp level because I just tried with CycloneDDS and I hit the same issue.

thanks for checking, once it is confirmed, i guess we can move this issue to rclcpp.

I don't know why current implementation don't use shared pointer to save GuardCondition.

me neither, this is introduced on ros2/rclcpp#1612. (in particular https://github.com/ros2/rclcpp/compare/b9bec69377dd3850c6db519a7af80a2eb1a9be31..2fc5501a09a17d286f9a68fa1e7bff4f8703cf1e)
before addressing this, i want to confirm the problem 1st with reproducible environment if possible.

roncapat · 2024-03-28T04:11:31Z

@fujitatomoya see linked rclcpp issue: ros2/rclcpp#2455

It all started from there for me. There is attached a minimum example.

fujitatomoya · 2024-04-30T02:56:07Z

@sea-bass i believe that would be worth to give it a shot with current rolling branch, that includes ros2/rclcpp#2142.

CC: @mjcarroll

sea-bass · 2024-04-30T10:36:44Z

@sea-bass i believe that would be worth to give it a shot with current rolling branch, that includes ros2/rclcpp#2142.

CC: @mjcarroll

Thank you for the update, and for the actual fix! This was a big PR.

@EzraBrooks @MikeWrock Putting this on your radar if you guys are at some point updating MoveIt Pro to 24.04 at some point and would like to try FastDDS.

fujitatomoya added the bug Something isn't working label Oct 25, 2023

audrow added the backlog label Jan 25, 2024

roncapat mentioned this issue Mar 20, 2024

Timer in MultiThreadedExecutor crashes ros2/rclcpp#2455

Open

fujitatomoya added a commit to fujitatomoya/ros2_test_prover that referenced this issue Mar 27, 2024

Segfault in MultiThreadedExecutor using rmw_fastrtps_cpp

db4a5ab

ros2/rmw_fastrtps#728 Signed-off-by: Tomoya Fujita <[email protected]>

maxime-clem mentioned this issue Aug 2, 2024

Crash during planning if goal is set for the second time autowarefoundation/autoware#4138

Closed

3 tasks

TSNoble mentioned this issue Oct 25, 2024

Segmentation fault when adding a short-lived node to an executor #777

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Segfault in MultiThreadedExecutor using `rmw_fastrtps_cpp` #728

Segfault in MultiThreadedExecutor using `rmw_fastrtps_cpp` #728

sea-bass commented Oct 25, 2023 •

edited

Loading

fujitatomoya commented Oct 25, 2023

fujitatomoya commented Oct 25, 2023

sea-bass commented Oct 25, 2023 •

edited

Loading

Barry-Xu-2018 commented Oct 28, 2023

sea-bass commented Jan 14, 2024

sea-bass commented Jan 15, 2024 •

edited

Loading

roncapat commented Mar 20, 2024 •

edited

Loading

fujitatomoya commented Mar 27, 2024

roncapat commented Mar 28, 2024 •

edited

Loading

fujitatomoya commented Apr 30, 2024

sea-bass commented Apr 30, 2024

Segfault in MultiThreadedExecutor using rmw_fastrtps_cpp #728

Segfault in MultiThreadedExecutor using rmw_fastrtps_cpp #728

Comments

sea-bass commented Oct 25, 2023 • edited Loading

Bug report

Steps to reproduce issue

Additional information

Implementation considerations

fujitatomoya commented Oct 25, 2023

fujitatomoya commented Oct 25, 2023

sea-bass commented Oct 25, 2023 • edited Loading

Barry-Xu-2018 commented Oct 28, 2023

sea-bass commented Jan 14, 2024

sea-bass commented Jan 15, 2024 • edited Loading

roncapat commented Mar 20, 2024 • edited Loading

fujitatomoya commented Mar 27, 2024

roncapat commented Mar 28, 2024 • edited Loading

fujitatomoya commented Apr 30, 2024

sea-bass commented Apr 30, 2024

Segfault in MultiThreadedExecutor using `rmw_fastrtps_cpp` #728

Segfault in MultiThreadedExecutor using `rmw_fastrtps_cpp` #728

sea-bass commented Oct 25, 2023 •

edited

Loading

sea-bass commented Oct 25, 2023 •

edited

Loading

sea-bass commented Jan 15, 2024 •

edited

Loading

roncapat commented Mar 20, 2024 •

edited

Loading

roncapat commented Mar 28, 2024 •

edited

Loading