Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nightly failure on aarch64 rcl.TestTwoExecutablesAfterShutdown.test_services (from rcl.test_services__rmw_cyclonedds_cpp_Release.launch_tests) #611

Closed
tfoote opened this issue Apr 9, 2020 · 4 comments
Assignees
Labels
bug Something isn't working

Comments

@tfoote
Copy link
Contributor

tfoote commented Apr 9, 2020

This has failed two nightlys in a row

https://ci.ros2.org/view/nightly/job/nightly_linux-aarch64_release/1126/testReport/rcl/TestTwoExecutablesAfterShutdown/test_services/

I've been unable to reproduce this on an aws aarch64 bionic machine but have been able to reproduce it in CI on focal

https://ci.ros2.org/job/ci_linux-aarch64/5630

Using this gist: https://gist.github.com/tfoote/dc6c33491124e863f554531b87b0492e

And reducing the build and test to rcl.

13:08:22 30: Test command: /home/jenkins-agent/workspace/ci_linux-aarch64/venv/bin/python3 "-u" "/home/jenkins-agent/workspace/ci_linux-aarch64/ws/install/ament_cmake_test/share/ament_cmake_test/cmake/run_test.py" "/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test_results/rcl/test_services__rmw_cyclonedds_cpp.xunit.xml" "--package-name" "rcl" "--output-file" "/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/launch_test/CHANGEME.txt" "--env" "RMW_IMPLEMENTATION=rmw_cyclonedds_cpp" "--append-env" "LD_LIBRARY_PATH=/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl" "--command" "/home/jenkins-agent/workspace/ci_linux-aarch64/venv/bin/python3" "-m" "launch_testing.launch_test" "/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test/test/test_services__rmw_cyclonedds_cpp_Release.py" "--junit-xml=/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test_results/rcl/test_services__rmw_cyclonedds_cpp.xunit.xml" "--package-name=rcl"
13:08:22 30: Test timeout computed to be: 15
13:08:22 30: -- run_test.py: extra environment variables:
13:08:22 30:  - RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
13:08:22 30: -- run_test.py: extra environment variables to append:
13:08:22 30:  - LD_LIBRARY_PATH+=/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl
13:08:22 30: -- run_test.py: invoking following command in '/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test':
13:08:22 30:  - /home/jenkins-agent/workspace/ci_linux-aarch64/venv/bin/python3 -m launch_testing.launch_test /home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test/test/test_services__rmw_cyclonedds_cpp_Release.py --junit-xml=/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test_results/rcl/test_services__rmw_cyclonedds_cpp.xunit.xml --package-name=rcl
13:08:23 30: [INFO] [launch]: All log files can be found below /home/rosbuild/.ros/log/2020-04-09-20-08-23-239587-e422d2c021a8-32656
13:08:23 30: [INFO] [launch]: Default logging verbosity is set to INFO
13:08:23 30: test_services (rcl.TestTwoExecutables)
13:08:23 30: Test that the executable under test terminates after a finite amount of time. ... [INFO] [service_fixture-1]: process started with pid [32659]
13:08:23 30: [INFO] [client_fixture-2]: process started with pid [32661]
13:08:33 30: [INFO] [client_fixture-2]: sending signal 'SIGINT' to process[client_fixture-2]
13:08:33 30: [INFO] [service_fixture-1]: sending signal 'SIGINT' to process[service_fixture-1]
13:08:33 30: [ERROR] [client_fixture-2]: process has died [pid 32661, exit code -2, cmd '/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test/client_fixture__rmw_cyclonedds_cpp service_fixture'].
13:08:33 30: [ERROR] [service_fixture-1]: process has died [pid 32659, exit code -2, cmd '/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test/service_fixture__rmw_cyclonedds_cpp'].
13:08:33 30: FAIL
13:08:33 30: 
13:08:33 30: ======================================================================
13:08:33 30: FAIL: test_services (rcl.TestTwoExecutables)
13:08:33 30: Test that the executable under test terminates after a finite amount of time.
13:08:33 30: ----------------------------------------------------------------------
13:08:33 30: Traceback (most recent call last):
13:08:33 30:   File "/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test/test/test_services__rmw_cyclonedds_cpp_Release.py", line 42, in test_services
13:08:33 30:     proc_info.assertWaitForShutdown(process=executable_under_test, timeout=10)
13:08:33 30:   File "/home/jenkins-agent/workspace/ci_linux-aarch64/ws/install/launch_testing/lib/python3.8/site-packages/launch_testing/proc_info_handler.py", line 144, in assertWaitForShutdown
13:08:33 30:     assert success, "Timed out waiting for process '{}' to finish".format(process)
13:08:33 30: AssertionError: Timed out waiting for process '<launch.actions.execute_process.ExecuteProcess object at 0xffffb39921f0>' to finish
13:08:33 30: 
13:08:33 30: ----------------------------------------------------------------------
13:08:33 30: Ran 1 test in 10.001s
13:08:33 30: 
13:08:33 30: FAILED (failures=1)

And I've validated that it also fails on Bionic on CI: https://ci.ros2.org/job/ci_linux-aarch64/5636/console

``` 14:37:13 30: Test command: /home/jenkins-agent/workspace/ci_linux-aarch64/venv/bin/python3 "-u" "/home/jenkins-agent/workspace/ci_linux-aarch64/ws/install/ament_cmake_test/share/ament_cmake_test/cmake/run_test.py" "/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test_results/rcl/test_services__rmw_cyclonedds_cpp.xunit.xml" "--package-name" "rcl" "--output-file" "/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/launch_test/CHANGEME.txt" "--env" "RMW_IMPLEMENTATION=rmw_cyclonedds_cpp" "--append-env" "LD_LIBRARY_PATH=/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl" "--command" "/home/jenkins-agent/workspace/ci_linux-aarch64/venv/bin/python3" "-m" "launch_testing.launch_test" "/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test/test/test_services__rmw_cyclonedds_cpp_Release.py" "--junit-xml=/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test_results/rcl/test_services__rmw_cyclonedds_cpp.xunit.xml" "--package-name=rcl" 14:37:13 30: Test timeout computed to be: 15 14:37:13 30: -- run_test.py: extra environment variables: 14:37:13 30: - RMW_IMPLEMENTATION=rmw_cyclonedds_cpp 14:37:13 30: -- run_test.py: extra environment variables to append: 14:37:13 30: - LD_LIBRARY_PATH+=/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl 14:37:13 30: -- run_test.py: invoking following command in '/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test': 14:37:13 30: - /home/jenkins-agent/workspace/ci_linux-aarch64/venv/bin/python3 -m launch_testing.launch_test /home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test/test/test_services__rmw_cyclonedds_cpp_Release.py --junit-xml=/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test_results/rcl/test_services__rmw_cyclonedds_cpp.xunit.xml --package-name=rcl 14:37:14 30: [INFO] [launch]: All log files can be found below /home/rosbuild/.ros/log/2020-04-09-21-37-14-680777-837c7344267a-7399 14:37:14 30: [INFO] [launch]: Default logging verbosity is set to INFO 14:37:14 30: test_services (rcl.TestTwoExecutables) 14:37:14 30: Test that the executable under test terminates after a finite amount of time. ... [INFO] [service_fixture-1]: process started with pid [7403] 14:37:14 30: [INFO] [client_fixture-2]: process started with pid [7404] 14:37:24 30: [INFO] [client_fixture-2]: sending signal 'SIGINT' to process[client_fixture-2] 14:37:24 30: [INFO] [service_fixture-1]: sending signal 'SIGINT' to process[service_fixture-1] 14:37:24 30: [ERROR] [service_fixture-1]: process has died [pid 7403, exit code -2, cmd '/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test/service_fixture__rmw_cyclonedds_cpp']. 14:37:24 30: [ERROR] [client_fixture-2]: process has died [pid 7404, exit code -2, cmd '/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test/client_fixture__rmw_cyclonedds_cpp service_fixture']. 14:37:24 30: FAIL 14:37:24 30: 14:37:24 30: ====================================================================== 14:37:24 30: FAIL: test_services (rcl.TestTwoExecutables) 14:37:24 30: Test that the executable under test terminates after a finite amount of time. 14:37:24 30: ---------------------------------------------------------------------- 14:37:24 30: Traceback (most recent call last): 14:37:24 30: File "/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test/test/test_services__rmw_cyclonedds_cpp_Release.py", line 42, in test_services 14:37:24 30: proc_info.assertWaitForShutdown(process=executable_under_test, timeout=10) 14:37:24 30: File "/home/jenkins-agent/workspace/ci_linux-aarch64/ws/install/launch_testing/lib/python3.6/site-packages/launch_testing/proc_info_handler.py", line 144, in assertWaitForShutdown 14:37:24 30: assert success, "Timed out waiting for process '{}' to finish".format(process) 14:37:24 30: AssertionError: Timed out waiting for process '' to finish 14:37:24 30: 14:37:24 30: ---------------------------------------------------------------------- 14:37:24 30: Ran 1 test in 10.002s 14:37:24 30: 14:37:24 30: FAILED (failures=1) 14:37:24 30: test_services (rcl.TestTwoExecutablesAfterShutdown) 14:37:24 30: Test that the executable under test finished cleanly. ... FAIL ```
@tfoote
Copy link
Contributor Author

tfoote commented Apr 9, 2020

I now have found a repeatable failure, but at the one in 20 level.

With docker installed there's extra console output that does not make it into our builds console logs nor the test output. However this also appears in the passing tests.

[service_fixture-1] 1586469829.637656 [0] service_fi: using network interface ens5 (udp/172.30.0.181) selected arbitrarily from: ens5, docker0
[client_fixture-2] 1586469829.641924 [0] client_fix: using network interface ens5 (udp/172.30.0.181) selected arbitrarily from: ens5, docker0
Run 22
-- run_test.py: extra environment variables:
 - RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
-- run_test.py: extra environment variables to append:
 - LD_LIBRARY_PATH+=/home/ubuntu/ros2/build/rcl
-- run_test.py: invoking following command in '/home/ubuntu/ros2':
 - python3 -m launch_testing.launch_test build/rcl/test/test/test_services__rmw_cyclonedds_cpp_Release.py --junit-xml=build/rcl/test_results/rcl/test_services__rmw_cyclonedds_cpp.xunit.xml --package-name=rcl
[INFO] [launch]: All log files can be found below /home/ubuntu/.ros/log/2020-04-09-22-03-49-602357-ip-172-30-0-181-17686
[INFO] [launch]: Default logging verbosity is set to INFO
test_services (rcl.TestTwoExecutables)
Test that the executable under test terminates after a finite amount of time. ... [INFO] [service_fixture-1]: process started with pid [17711]
[INFO] [client_fixture-2]: process started with pid [17712]
[service_fixture-1] 1586469829.637656 [0] service_fi: using network interface ens5 (udp/172.30.0.181) selected arbitrarily from: ens5, docker0
[client_fixture-2] 1586469829.641924 [0] client_fix: using network interface ens5 (udp/172.30.0.181) selected arbitrarily from: ens5, docker0
[INFO] [client_fixture-2]: sending signal 'SIGINT' to process[client_fixture-2]
[INFO] [service_fixture-1]: sending signal 'SIGINT' to process[service_fixture-1]
[ERROR] [client_fixture-2]: process has died [pid 17712, exit code -2, cmd '/home/ubuntu/ros2/build/rcl/test/client_fixture__rmw_cyclonedds_cpp service_fixture'].
[ERROR] [service_fixture-1]: process has died [pid 17711, exit code -2, cmd '/home/ubuntu/ros2/build/rcl/test/service_fixture__rmw_cyclonedds_cpp'].
FAIL

======================================================================
FAIL: test_services (rcl.TestTwoExecutables)
Test that the executable under test terminates after a finite amount of time.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/ros2/build/rcl/test/test/test_services__rmw_cyclonedds_cpp_Release.py", line 42, in test_services
    proc_info.assertWaitForShutdown(process=executable_under_test, timeout=10)
  File "/home/ubuntu/ros2/install/launch_testing/lib/python3.6/site-packages/launch_testing/proc_info_handler.py", line 144, in assertWaitForShutdown
    assert success, "Timed out waiting for process '{}' to finish".format(process)
AssertionError: Timed out waiting for process '<launch.actions.execute_process.ExecuteProcess object at 0xffffb6705ba8>' to finish

----------------------------------------------------------------------
Ran 1 test in 10.001s

FAILED (failures=1)
test_services (rcl.TestTwoExecutablesAfterShutdown)
Test that the executable under test finished cleanly. ... FAIL

======================================================================
FAIL: test_services (rcl.TestTwoExecutablesAfterShutdown)
Test that the executable under test finished cleanly.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/ros2/build/rcl/test/test/test_services__rmw_cyclonedds_cpp_Release.py", line 53, in test_services
    executable_under_test
  File "/home/ubuntu/ros2/install/launch_testing/lib/python3.6/site-packages/launch_testing/asserts/assert_exit_codes.py", line 64, in assertExitCodes
    info.returncode
AssertionError: Proc client_fixture-2 exited with code -2

----------------------------------------------------------------------
Ran 1 test in 0.001s

FAILED (failures=1)
-- run_test.py: return code 1
-- run_test.py: verify result file 'build/rcl/test_results/rcl/test_services__rmw_cyclonedds_cpp.xunit.xml'

Looping this script it failed on the 22nd run the first time the 56th time the second loop the 43rd iteration: 22, 56, 43, 38, 31

stress: info: [30154] dispatching hogs: 4 cpu, 4 io, 4 vm, 4 hdd
WIth high stress it was 32, 24, 16, 17, 254

Which shows moderately high correlation to stress but not always. We appear to be deailing with a race condition of some sort. Looking at the logs with timestamps it appears neither the client nor the server finish.

tfoote added a commit that referenced this issue Apr 9, 2020
Debugging #611 these timeouts were so long that the overall test timeout was being triggered before these internal timeouts were triggered.

1000 retries at 100ms -> 10 seconds each. I cut it down to 1 second for each to establish the connection.

Signed-off-by: Tully Foote <[email protected]>
@tfoote
Copy link
Contributor Author

tfoote commented Apr 9, 2020

With the stricter limits in #613 I can now focus the issue down to the client becoming ready:

[client_fixture-2] [ERROR] [1586472469.895886231] [rcl]: Client never became ready

Which means that the call to this function is timing out:

wait_for_client_to_be_ready(

It's reproducible in CI https://ci.ros2.org/job/ci_linux-aarch64/5645/consoleFull

00:22:52.638 30: Test command: /home/jenkins-agent/workspace/ci_linux-aarch64/venv/bin/python3 "-u" "/home/jenkins-agent/workspace/ci_linux-aarch64/ws/install/ament_cmake_test/share/ament_cmake_test/cmake/run_test.py" "/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test_results/rcl/test_services__rmw_cyclonedds_cpp.xunit.xml" "--package-name" "rcl" "--output-file" "/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/launch_test/CHANGEME.txt" "--env" "RMW_IMPLEMENTATION=rmw_cyclonedds_cpp" "--append-env" "LD_LIBRARY_PATH=/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl" "--command" "/home/jenkins-agent/workspace/ci_linux-aarch64/venv/bin/python3" "-m" "launch_testing.launch_test" "/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test/test/test_services__rmw_cyclonedds_cpp_Release.py" "--junit-xml=/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test_results/rcl/test_services__rmw_cyclonedds_cpp.xunit.xml" "--package-name=rcl"
00:22:52.638 30: Test timeout computed to be: 15
00:22:52.907 30: -- run_test.py: extra environment variables:
00:22:52.908 30:  - RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
00:22:52.908 30: -- run_test.py: extra environment variables to append:
00:22:52.909 30:  - LD_LIBRARY_PATH+=/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl
00:22:52.909 30: -- run_test.py: invoking following command in '/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test':
00:22:52.910 30:  - /home/jenkins-agent/workspace/ci_linux-aarch64/venv/bin/python3 -m launch_testing.launch_test /home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test/test/test_services__rmw_cyclonedds_cpp_Release.py --junit-xml=/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test_results/rcl/test_services__rmw_cyclonedds_cpp.xunit.xml --package-name=rcl
00:22:53.874 30: [INFO] [launch]: All log files can be found below /home/rosbuild/.ros/log/2020-04-09-23-28-52-110718-8a972606fd7e-9918
00:22:53.875 30: [INFO] [launch]: Default logging verbosity is set to INFO
00:22:53.879 30: test_services (rcl.TestTwoExecutables)
00:22:53.889 30: Test that the executable under test terminates after a finite amount of time. ... [INFO] [service_fixture-1]: process started with pid [9922]
00:22:53.890 30: [INFO] [client_fixture-2]: process started with pid [9923]
00:22:54.909 30: [client_fixture-2] [ERROR] [1586474933.166207415] [rcl]: Client never became ready
00:22:54.915 30: [ERROR] [client_fixture-2]: process has died [pid 9923, exit code 255, cmd '/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test/client_fixture__rmw_cyclonedds_cpp service_fixture'].
00:22:54.922 30: [INFO] [service_fixture-1]: sending signal 'SIGINT' to process[service_fixture-1]
00:22:54.924 30: [ERROR] [service_fixture-1]: process has died [pid 9922, exit code -2, cmd '/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test/service_fixture__rmw_cyclonedds_cpp'].
00:22:54.926 30: ok
00:22:54.926 30: 
00:22:54.927 30: ----------------------------------------------------------------------
00:22:54.927 30: Ran 1 test in 1.038s
00:22:54.928 30: 
00:22:54.928 30: OK
00:22:54.929 30: test_services (rcl.TestTwoExecutablesAfterShutdown)
00:22:54.930 30: Test that the executable under test finished cleanly. ... FAIL
00:22:54.930 30: 
00:22:54.931 30: ======================================================================
00:22:54.931 30: FAIL: test_services (rcl.TestTwoExecutablesAfterShutdown)
00:22:54.931 30: Test that the executable under test finished cleanly.
00:22:54.932 30: ----------------------------------------------------------------------
00:22:54.932 30: Traceback (most recent call last):
00:22:54.933 30:   File "/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test/test/test_services__rmw_cyclonedds_cpp_Release.py", line 53, in test_services
00:22:54.934 30:     executable_under_test
00:22:54.934 30:   File "/home/jenkins-agent/workspace/ci_linux-aarch64/ws/install/launch_testing/lib/python3.6/site-packages/launch_testing/asserts/assert_exit_codes.py", line 64, in assertExitCodes
00:22:54.934 30:     info.returncode
00:22:54.935 30: AssertionError: Proc client_fixture-2 exited with code 255
00:22:54.935 30: 
00:22:54.936 30: ----------------------------------------------------------------------
00:22:54.937 30: Ran 1 test in 0.001s
00:22:54.937 30: 
00:22:54.937 30: FAILED (failures=1)
00:22:55.004 30: -- run_test.py: return code 1
00:22:55.005 30: -- run_test.py: verify result file '/home/jenkins-agent/workspace/ci_linux-aarch64/ws/build/rcl/test_results/rcl/test_services__rmw_cyclonedds_cpp.xunit.xml'
00:22:55.026 30/60 Test #30: test_services__rmw_cyclonedds_cpp ............***Failed    2.40 sec

@jacobperron
Copy link
Member

Possibly related: ros2/system_tests#420

I've seen issues in several places related to services, in particular clients talking to servers.

tfoote added a commit that referenced this issue Apr 14, 2020
Debugging #611 these timeouts were so long that the overall test timeout was being triggered before these internal timeouts were triggered.

1000 retries at 100ms -> 10 seconds each. I cut it down to 1 second for each to establish the connection.

Signed-off-by: Tully Foote <[email protected]>
tfoote added a commit that referenced this issue May 13, 2020
Debugging #611 these timeouts were so long that the overall test timeout was being triggered before these internal timeouts were triggered.

1000 retries at 100ms -> 10 seconds each. I cut it down to 1 second for each to establish the connection.

Signed-off-by: Tully Foote <[email protected]>
@ivanpauno ivanpauno added the bug Something isn't working label Oct 28, 2020
@jacobperron
Copy link
Member

It doesn't look like the test has failed since ros2/system_tests#420 was resolved. Closing this out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants