Add scan option to build packages in subdirectories #139

Open · wants to merge 5 commits into main

Conversation

@rkent rkent commented Aug 7, 2024

Replaces #81

This PR implements a new top-level option, 'scan', which scans subdirectories and builds any packages it finds. It uses the same options as build, with a couple of additions: a per-package timeout and a maximum number of packages.

I've been running variations of this for a few weeks now in a nightly Jenkins build of all ros2:iron packages, so the available output has been tweaked to focus on that environment.

It captures output from each package build in a file stdout.txt in the docs_build folder. Summary information is printed in the more typical console locations. Looking at the notes from #81, I believe this implements most of those requests, with the exception of the request to fail on first error. Plenty of existing packages have errors, so I don't really see the value of that if you are running a full distro. If you are only running a repository, it runs quickly enough that I don't see the value of that option either. I do, though, include a maximum-number-of-packages option, which is useful to limit time when testing. If you really want the fail-on-first-error feature, let's do that as a followup; it would not be very hard. But please state the use case.

On my system, if I set a -t 1 timeout, then about a third of the test packages time out, which gives some idea of what error output looks like. So I can run rosdoc2 scan -p test -t 1 from the rosdoc2 directory as a simple test. Or you could do that on a repository with multiple packages instead.

I used colcon as requested, but I have some comments on that. First, all the colcon documentation I could find is written from the perspective of using the commands that colcon generates; I could find no hints on how to use it as a Python library. Second, the code has multiple levels of abstraction, with key features added through extensions. Examining the code for, say, colcon_core.package_discovery gives no clue as to which options are supported, as all of that is added in extensions. Which extensions? I have no idea. Basically, through trial and error and guesswork, I discovered that colcon-common-extensions provides the functionality I need, with --base-paths as the option I needed to set. Third, using colcon as a package-discovery library requires 7 lines of obscure setup code (see the sketch below). Yes, this was partly written as a demo for me, but this is not at all the expectation one has when using library code. In summary, I would not recommend that anyone else attempt to use colcon as library code like this.
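
For reference, a minimal sketch of the shape of that setup, reconstructed from colcon_core's discovery API as I understand it; the exact imports and signatures here are assumptions, not verified against a specific colcon release:

    import argparse

    from colcon_core.package_discovery import (
        add_package_discovery_arguments, discover_packages)
    from colcon_core.package_identification import (
        get_package_identification_extensions)

    # Identification extensions are supplied by plugins such as
    # colcon-common-extensions; without them nothing is found.
    extensions = get_package_identification_extensions()

    # Discovery options (including --base-paths) are registered by
    # extensions onto an argparse parser, then parsed back out.
    parser = argparse.ArgumentParser()
    add_package_discovery_arguments(parser)
    args = parser.parse_args(['--base-paths', 'test'])

    # Returns descriptors for packages found under the given base paths.
    descriptors = discover_packages(args, extensions)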

@rkent rkent commented Aug 7, 2024

Here is the list of failed packages from a recent ros2:iron run of this with the default 15-minute timeout (this is a 10-year-old i7 machine using a physical hard disk):

37 packages failed:
rviz_common: retval=1: RuntimeError Sphinx-build exited with return code '2'
azure_iot_sdks: retval=1: RuntimeError Failed to find a ROS package at given path 'rosindex/_plugins_data/checkout/azure-iot-sdk-c/github-Azure-azure-iot-sdk-c/azure-iot-sdk-c'
ecl_geometry: retval=1: RuntimeError Sphinx-build exited with return code '2'
rotconv: retval=1: RuntimeError Failed to find a ROS package at given path 'rosindex/_plugins_data/checkout/rot_conv_lib/github-AIS-Bonn-rot_conv_lib/rot_conv_lib'
proxsuite: retval=2: TimeoutError runtime 900.351 seconds
gazebo_plugins: retval=1: RuntimeError Sphinx-build exited with return code '2'
magic_enum: retval=1: RuntimeError Sphinx-build exited with return code '2'
avt_vimba_camera: retval=1: RuntimeError Sphinx-build exited with return code '1'
fastcdr: retval=1: RuntimeError Failed to find a ROS package at given path 'rosindex/_plugins_data/checkout/fastcdr/github-eProsima-Fast-CDR/fastcdr'
octomap-distribution: retval=1: RuntimeError Failed to find a ROS package at given path 'rosindex/_plugins_data/checkout/octomap/github-octomap-octomap/octomap'
mrpt2: retval=1: RuntimeError Sphinx-build exited with return code '2'
libcaer: retval=2: TimeoutError runtime 900.348 seconds
mrpt_path_planning: retval=1: RuntimeError Sphinx-build exited with return code '2'
vitis_common: retval=2: TimeoutError runtime 900.349 seconds
sm_multi_stage_1: retval=2: TimeoutError runtime 900.349 seconds
sm_pack_ml: retval=2: TimeoutError runtime 900.349 seconds
VRPN: retval=1: RuntimeError Failed to find a ROS package at given path 'rosindex/_plugins_data/checkout/vrpn/github-vrpn-vrpn/vrpn'
VISP: retval=1: RuntimeError Failed to find a ROS package at given path 'rosindex/_plugins_data/checkout/visp/github-lagadic-visp/visp'
urdfdom_headers: retval=1: RuntimeError Failed to find a ROS package at given path 'rosindex/_plugins_data/checkout/urdfdom_headers/github-ros-urdfdom_headers/urdfdom_headers'
nav2_costmap_2d: retval=1: RuntimeError Sphinx-build exited with return code '2'
ecl_containers: retval=1: RuntimeError Sphinx-build exited with return code '2'
sick_scan_xd: retval=2: TimeoutError runtime 900.347 seconds
robot_localization: retval=1: RuntimeError Sphinx-build exited with return code '2'
naoqi_driver: retval=1: RuntimeError Sphinx-build exited with return code '2'
ecl_manipulators: retval=1: RuntimeError Sphinx-build exited with return code '2'
pinocchio: retval=2: TimeoutError runtime 900.351 seconds
ecl_converters: retval=1: RuntimeError Sphinx-build exited with return code '2'
fastrtps: retval=2: TimeoutError runtime 900.357 seconds
geodesy: retval=1: RuntimeError Sphinx-build exited with return code '2'
ign_rviz: retval=1: RuntimeError Sphinx-build exited with return code '2'
cv_bridge: retval=1: RuntimeError Sphinx-build exited with return code '2'
camera_calibration: retval=1: RuntimeError Sphinx-build exited with return code '2'
transmission_interface: retval=1: RuntimeError Sphinx-build exited with return code '1'
ur_client_library: retval=1: RuntimeError Sphinx-build exited with return code '2'
IFM3D: retval=1: RuntimeError Failed to find a ROS package at given path 'rosindex/_plugins_data/checkout/ifm3d_core/github-ifm-ifm3d/ifm3d_core'
ign_rviz_plugins: retval=1: RuntimeError Sphinx-build exited with return code '2'
rmf_visualization_schedule: retval=1: RuntimeError Sphinx-build exited with return code '2'

I can look at the logs in docs_build, for example for rviz_common, to see that the error was:

Exception occurred:
  File "/rosdoc2/venv/lib/python3.10/site-packages/breathe/renderer/sphinxrenderer.py", line 334, in create
    cls, name = DomainDirectiveFactory.cpp_classes[args[0]]  # type: ignore
KeyError: 'property'

During the active run, the console (Jenkins) output looks like this excerpt:

21:53:08 Begin processing geometry2
21:53:13 Package joint_state_publisher_gui succeeded in 10.981 seconds
[rosdoc2_scan] [INFO] joint_state_publisher_gui successful (1412/1440)
21:53:13 Begin processing system_modes_msgs
21:53:17 Package geometry2 succeeded in 8.858 seconds
[rosdoc2_scan] [INFO] geometry2 successful (1413/1440)
21:53:17 Begin processing mola_test_datasets
21:53:31 Package system_modes_msgs succeeded in 17.519 seconds
[rosdoc2_scan] [INFO] system_modes_msgs successful (1414/1440)
21:53:31 Begin processing ament_cmake_test
21:53:32 Package mola_test_datasets succeeded in 14.247 seconds
[rosdoc2_scan] [INFO] mola_test_datasets successful (1415/1440)

@rkent rkent commented Aug 7, 2024

A few notes on the code:

  • I had to abandon running sphinx-build directly as a Python module rather than in a subprocess, because exhale, when it dies, does an abrupt system exit without cleanup, which tends to hang the process pool since I cannot catch it (see the sketch after this list).
  • I replaced the pipx tests with a venv test. There does not seem to be any interest in promoting pipx in the ros2 community, while venv is important for development, and the install requirements can differ when not starting from a ros2-enabled system.
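
A minimal sketch of the subprocess approach referenced in the first note; the paths and timeout value are illustrative, not the PR's actual code:

    import subprocess

    # Running sphinx-build as a child process isolates any abrupt SystemExit
    # inside an extension (e.g. exhale) from the worker process itself, and
    # lets us enforce a per-package timeout.
    try:
        completed = subprocess.run(
            ['sphinx-build', '-b', 'html', 'doc_source', 'doc_build'],
            timeout=900)  # seconds; illustrative default
        print('sphinx-build returned', completed.returncode)
    except subprocess.TimeoutExpired:
        print('sphinx-build timed out')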

@rkent rkent commented Aug 20, 2024

I've done additional work on the whole area of locating packages to document over the last couple of weeks. I am now firmly convinced that colcon is the wrong tool for this use in rosdoc2, and I have replaced it with catkin's find_packages_allowing_duplicates in my fork. I'll eventually update this PR with that change, but for now please hold this PR.

Here's some background. colcon seems to be designed to find packages for building. Fine, but that is not what we are doing. Specifically (in addition to the complications mentioned above), it also scans CMakeLists.txt to find what it thinks is a package. That is fine for building, but not good for us, since we need a real package.xml to process. So colcon finds packages that we can't actually process, which just causes errors (like the urdfdom_headers failure above: RuntimeError Failed to find a ROS package at given path 'rosindex/_plugins_data/checkout/urdfdom_headers/github-ros-urdfdom_headers/urdfdom_headers').

I've fixed that in my fork, but that fix (which would partly be at the buildfarm level) needs more tweaking. In any case, you have to replace colcon with catkin to eliminate nuisance errors.
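
For comparison, discovery via catkin_pkg is essentially a single call; a minimal sketch, with an illustrative search path:

    from catkin_pkg.packages import find_packages_allowing_duplicates

    # Returns a dict mapping each package's relative path to its parsed
    # package.xml, so only directories with a real package.xml are found.
    packages = find_packages_allowing_duplicates('.')
    for path, package in packages.items():
        print(path, package.name)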

@rkent rkent commented Aug 23, 2024

OK, this is the updated version, replacing colcon with catkin. Reasons are above.

Yes, this seems like a big PR, but a lot of it is just rearranging the tests so I can reuse code.

@tfoote tfoote left a comment


Overall, adding the new verb is a good way to isolate the search capability. And it's calling through to the underlying build main, which is good.

I have a few smaller cleanups that would be great to resolve. I think some are development/debugging legacy that can probably just be cleaned out.

@@ -243,7 +244,8 @@ def build(self, *, doc_build_folder, output_staging_directory):
         logger.info(
             f"Running Doxygen: '{' '.join(cmd)}' in '{working_directory}'"
         )
-        completed_process = subprocess.run(cmd, cwd=working_directory)
+        completed_process = subprocess.run(
+            cmd, cwd=working_directory, stdout=sys.stdout, stderr=sys.stderr)
@tfoote (Member):

Isn't this doing the default behavior with more complexity? By default, subprocess.run doesn't capture the output, which means it goes to the console.

>>> import subprocess
>>> import sys
>>> subprocess.run('echo hi', shell=True, stdout=sys.stdout, stderr=sys.stderr)
hi
CompletedProcess(args='echo hi', returncode=0)
>>> subprocess.run('echo hi', shell=True)
hi
CompletedProcess(args='echo hi', returncode=0)

@rkent (Contributor Author):

The reason that I did this is so that I can remap sys.stdout and sys.stderr to a file, and capture the output of individual package runs. IIRC from my tests, without this I could not remap to capture the output.

I was definitely following my "Programming is an Experimental Science"™ philosophy here though, trying things until something worked. But I'll look this over again, and confirm that I really need to do this. It seems strange to me as well that this is needed, but it seems to work.

@tfoote (Member):

If you do find that this helps with that extra bit of debuggability, it's fine to keep it. It would be great if you could add a comment to explain it so others can find this benefit too.

@rkent (Contributor Author):

I tried this again, and I can confirm that leaving completed_process = subprocess.run(cmd, cwd=working_directory) without adding , stdout=sys.stdout, stderr=sys.stderr does not work. The output goes to the console, which is not what we want; we want to capture it to a file.

I tried looking at pytest to see how they do it, but got lost in the levels of abstraction. Pytest has a lot of code around capturing, but I am sure they deal with a lot of edge cases that we don't care about.

If I look at subprocess, the code says:

def run(*popenargs,
        input=None, capture_output=False, timeout=None, check=False, **kwargs):
    """Run command with arguments and return a CompletedProcess instance.

    The returned instance will have attributes args, returncode, stdout and
    stderr. By default, stdout and stderr are not captured, and those attributes
    will be None. Pass stdout=PIPE and/or stderr=PIPE in order to capture them,
    or pass capture_output=True to capture both.

So in fact the default for stderr is not sys.stderr but None. We want to capture the output to our own pipe, which we store in sys.stderr.

We would not have to store that pipe in sys.stderr for the subprocess; we could define a variable for it. But by storing it in sys.stderr, we also capture output generated by, for example, print(f"[rosdoc2] adding markdown parser", file=sys.stderr).
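
A minimal sketch of this pattern, with an illustrative file name. The key point: subprocess.run's default of None makes the child inherit the parent's real file descriptors, so the rebound sys.stdout/sys.stderr must be passed explicitly:

    import subprocess
    import sys

    with open('stdout.txt', 'w') as outfile:
        saved_out, saved_err = sys.stdout, sys.stderr
        # Rebinding captures Python-level writes such as
        # print(..., file=sys.stderr) from our own code.
        sys.stdout = sys.stderr = outfile
        try:
            # stdout=None (the default) would still write to the console,
            # because the child inherits the parent's file descriptors,
            # not the Python-level sys.stdout object.
            subprocess.run(['echo', 'hi'], stdout=sys.stdout, stderr=sys.stderr)
        finally:
            sys.stdout, sys.stderr = saved_out, saved_err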

@tfoote (Member):

Sorry, I missed that you're changing sys.stdout/err in the per-package scan code here: https://github.com/ros-infrastructure/rosdoc2/pull/139/files#diff-4d67c84fa514f2538371532c7f2d3979b42da8852649e61ed9cac143e83f1c55R169-R180

That captures the output differently than capturing it through subprocess's native mechanisms. Instead of changing sys.stderr at the global level, I think it would make sense to extend the main_impl to include the ability to pass an alternative output file. And if that output file is set, pass it through to the subprocess invocation. This will avoid worries about race conditions from globally redirecting, for potential other implementations of parallel threading.
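
A hypothetical sketch of that suggestion; main_impl's real signature is not shown here, and the output_file parameter is invented for illustration:

    import subprocess
    import sys

    def main_impl(cmd, cwd, output_file=None):
        # output_file is an illustrative parameter: when set, the
        # subprocess writes there; otherwise behavior is unchanged
        # and no global sys.stdout/sys.stderr rebinding is needed.
        out = output_file if output_file is not None else sys.stdout
        err = output_file if output_file is not None else sys.stderr
        return subprocess.run(cmd, cwd=cwd, stdout=out, stderr=err)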

@rkent (Contributor Author):

I'm having a hard time understanding and implementing what you want here.

First, I don't see any mechanism within Pool to control capture of stdout and stderr, unlike the basic subprocess.run(). Do you know how to do that?

Once I get into package_impl, output goes to the normal sys.stdout and sys.stderr, and we have to fix that (there is no subprocess.run involved with build_main_impl). I fixed it via sys.stdout = outfile. I don't see what else to do.

I was uncertain whether that was the correct approach, but then I ran across contextlib.redirect_stdout in the standard library, and it does exactly what process_impl does: save the old value of sys.stdout, change sys.stdout to the new value, then restore it when done.
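
For reference, the standard-library pattern looks like this; note that, like the approach in this PR, it only affects Python-level writes, not child-process file descriptors:

    import contextlib
    import io

    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        print('captured')  # sys.stdout is rebound for the duration of the block
    print('captured text:', buffer.getvalue())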

So I need more clarity about what you want here.

@@ -523,7 +523,8 @@ def build(self, *, doc_build_folder, output_staging_directory):
         logger.info(
             f"Running sphinx-apidoc: '{' '.join(cmd)}' in '{wrapped_sphinx_directory}'"
         )
-        completed_process = subprocess.run(cmd, cwd=wrapped_sphinx_directory)
+        completed_process = subprocess.run(cmd, cwd=wrapped_sphinx_directory,
+                                           stdout=sys.stdout, stderr=sys.stderr)
@tfoote (Member):

The behavior here also looks the same as the default.

@rkent (Contributor Author):

ditto

    msg = f"sphinx_build exited with return code '{returncode}'"
    if returncode == 0:
    completed_process = subprocess.run(cmd, cwd=wrapped_sphinx_directory,
                                       stdout=sys.stdout, stderr=sys.stderr)
@tfoote (Member):

uncaptured here too

        self.__dict__.update(entries)


def prepare_arguments(parser):
@tfoote (Member):

It would be good to expose BATCH_SIZE here too; otherwise it's a magic number.

@rkent (Contributor Author):

As discussed below, I don't think this is useful to the typical user, so I'll just remove batching instead.

    logger_scan.info(f'Begin batch # {batch_no}')
    for package in batch_packages:
        logger_scan.info(f'Adding {package.name} for processing')
    pool = mp.Pool(maxtasksperchild=1)
@tfoote (Member):

We should definitely allow the ability to control the pool size as an argument. Otherwise some machines could easily be overrun if their memory-to-CPU-core ratios are different than what we usually use.
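
A minimal sketch of exposing the pool size; the --pool-size flag name and its default are assumptions for illustration, not the PR's actual interface:

    import argparse
    import multiprocessing as mp

    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--pool-size', type=int, default=mp.cpu_count(),
        help='number of packages to document in parallel')
    args = parser.parse_args(['--pool-size', '4'])

    # maxtasksperchild=1 gives each package a fresh worker process.
    pool = mp.Pool(processes=args.pool_size, maxtasksperchild=1)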

@rkent (Contributor Author):

Wilco.

    packages = list(packages.values())
    for i in range(len(packages)):
        batch_packages.append(packages.pop())
        if len(batch_packages) >= BATCH_SIZE:
@tfoote (Member):

At a broader view, what is the value of the batching for running this? It basically forces the mp.Pool to be reset periodically, and with the current default it's not expected to ever be hit. As such, might it make sense to reduce the complexity and not batch things before the pool? If someone wants to debug a build/scan, they can change the argument to a specific subfolder where the problem is being encountered, which is more intuitive than trying to tune the batch size.

@rkent (Contributor Author):

The batching originally got added during development, when things were dying and it was really hard to figure out where and why. With small batch sizes, the problem could be isolated to a small number of candidate packages. As a debugging tool, it was easy enough to just change the value in the code to enable batching. I don't really see its value except in debugging, hence it was not exposed to the user.

I have not had issues lately with it dying, though, so I think I've caught all of the problems with other tools (adding a timeout, running Sphinx as a subprocess again, better logging of individual package outputs). But I left it in since it was useful during initial debugging. I don't feel strongly either way, as --max-packages and --timeout mostly serve that need. I would rather remove it than expose it to the user, though, as it really is of no value except to a rosdoc2 developer. So unless you have another viewpoint, I'll remove the batching.

@tfoote (Member):

Yeah, I think now that it's working well, it would be better to remove it and keep the much simpler flow of building the list and executing it in a pool.

test/utils.py (outdated)

@@ -0,0 +1,187 @@
+# Copyright 2024 R. Kent James <[email protected]>
@tfoote (Member):

This file looks like the content is primarily moved rather than newly written, so it should retain the original copyright owner.

@rkent (Contributor Author):

OK

@rkent rkent commented Sep 4, 2024

The commit I just did addresses all review comments, AFAICT.

@rkent rkent commented Sep 4, 2024

Re "If you do find that this helps with that extra bit of debuggability, it's fine to keep it"

If this comment is about the stdout=sys.stdout, stderr=sys.stderr, then no, this is not about debugging; it is essential. As for a comment, if I found this strange I would look at the history, which will point here with a lot more detail than is likely to fit in a comment. I can add a comment though, if you like (or you could just add one when you land this).
