Add scan option to build packages in subdirectories #139
base: main
Conversation
Here is the list of failed packages from a recent ros2:iron run of this with the default 15 minute timeout (this is a 10 year old i7 machine using a physical hard disk):
I can look at the logs on docs_build, for example, for rviz_common to see that the error was:
During the active run, the console (Jenkins) output looks like this excerpt:
A few notes on the code:
I've done additional work on the whole area of locating packages to document in the last couple of weeks. I am now firmly convinced that colcon is the wrong tool for this use in rosdoc2, and I have replaced it with catkin's.

Here's some background. colcon seems to be designed to find packages for building. Fine, but that is not what we are doing. Specifically (in addition to the complications mentioned above), it also scans CMakeLists.txt to try to find what it thinks is a package. That is fine for building, but not good for us, since we need a real package.xml to process. So colcon finds packages that we can't actually process, which just ends up causing errors. I've fixed that in my fork, but that fix (which would partially be at the buildfarm level) needs more tweaking. In any case, you have to replace colcon with catkin to eliminate nuisance errors.
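For illustration, the package.xml-based discovery described above can be sketched with a few lines of stdlib code. This helper is hypothetical (the real code uses catkin_pkg, not this function); the point it demonstrates is that only a real package.xml marks a package, unlike colcon's heuristics that also consider CMakeLists.txt:

```python
import os

def find_package_dirs(base_path):
    """Return directories under base_path that contain a package.xml.

    Illustrative sketch only: rosdoc2 would use catkin_pkg for this.
    """
    found = []
    for root, dirs, files in os.walk(base_path):
        if 'package.xml' in files:
            found.append(root)
            dirs.clear()  # a package does not contain nested packages
    return found
```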
OK, this is the updated version, reverting from colcon back to catkin; reasons are above. It may seem like a big PR, but a lot of this is just rearranging the tests so I can reuse code.
Overall adding the new verb is a good way to isolate the search capability. And it's calling through to the underlying build main which is good.
I have a few smaller cleanups that would be great to resolve. I think some are development/debugging legacy that can probably be just cleaned out.
```diff
@@ -243,7 +244,8 @@ def build(self, *, doc_build_folder, output_staging_directory):
     logger.info(
         f"Running Doxygen: '{' '.join(cmd)}' in '{working_directory}'"
     )
-    completed_process = subprocess.run(cmd, cwd=working_directory)
+    completed_process = subprocess.run(
+        cmd, cwd=working_directory, stdout=sys.stdout, stderr=sys.stderr)
```
Isn't this doing the default behavior with more complexity? The default is that subprocess.run doesn't capture the output, which means it goes to the console.
```python
>>> import subprocess
>>> import sys
>>> subprocess.run('echo hi', shell=True, stdout=sys.stdout, stderr=sys.stderr)
hi
CompletedProcess(args='echo hi', returncode=0)
>>> subprocess.run('echo hi', shell=True)
hi
CompletedProcess(args='echo hi', returncode=0)
```
The reason that I did this is so that I can remap sys.stdout and sys.stderr to a file, and capture the output of individual package runs. IIRC from my tests, without this I could not remap to capture the output.
I was definitely following my "Programming is an Experimental Science"™ philosophy here though, trying things until something worked. But I'll look this over again, and confirm that I really need to do this. It seems strange to me as well that this is needed, but it seems to work.
If you do find that this helps with that extra bit of debuggability, it's fine to keep it. It would be great if you could add a comment to explain it, so others can find this benefit too.
I tried this again, and I can confirm that leaving `completed_process = subprocess.run(cmd, cwd=working_directory)` without adding `, stdout=sys.stdout, stderr=sys.stderr` does not work: the outputs go to the console, but we want to capture them to a file.
I tried looking at pytest to see how they do it, but got lost in the levels of abstraction. Pytest has a lot of code around capturing, but I am sure they deal with a lot of edge cases that we don't care about.
If I look at subprocess, the code says:
```python
def run(*popenargs,
        input=None, capture_output=False, timeout=None, check=False, **kwargs):
    """Run command with arguments and return a CompletedProcess instance.

    The returned instance will have attributes args, returncode, stdout and
    stderr. By default, stdout and stderr are not captured, and those attributes
    will be None. Pass stdout=PIPE and/or stderr=PIPE in order to capture them,
    or pass capture_output=True to capture both.
```
So in fact the default for stderr is not sys.stderr but None. We want to capture the output to our own PIPE, which we store in sys.stderr. We would not have to store that pipe in sys.stderr for the subprocess; we could define a separate variable for it, but by storing it there we also capture output generated by, for example, `print(f"[rosdoc2] adding markdown parser", file=sys.stderr)`.
Sorry, I missed that you're changing sys.stdout/err in the per-package scan code here: https://github.com/ros-infrastructure/rosdoc2/pull/139/files#diff-4d67c84fa514f2538371532c7f2d3979b42da8852649e61ed9cac143e83f1c55R169-R180
That captures the output differently than capturing it through the subprocess native mechanisms. Instead of changing sys.stderr at the global level, I think it would make sense to extend main_impl to include the ability to pass an alternative output file. If that output file is set, then pass it through to the subprocess invocation. This avoids worries about race conditions from globally redirecting, for potential other implementations of parallel threading.
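A sketch of this suggestion, with hypothetical names (main_impl here stands in for rosdoc2's build entry point, and the output_file parameter name is an assumption, not the project's actual API): the output destination travels as an explicit argument and is handed to subprocess natively, so no global state is touched.

```python
import subprocess

def main_impl(cmd, cwd, output_file=None):
    """Run a build command, optionally routing all output to a file.

    Hypothetical signature: when output_file is None the subprocess
    inherits the console streams (the default behavior); when set,
    both stdout and stderr go to that file, per package.
    """
    return subprocess.run(cmd, cwd=cwd,
                          stdout=output_file, stderr=output_file)
```

Each parallel worker can then open its own stdout.txt and pass it down, with no cross-thread interference.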
I'm having a hard time understanding and implementing what you want here.
First, I don't see any mechanism within Pool to control capture of stdout and stderr, unlike the basic subprocess.run(). Do you know how to do that?
Once I get into package_impl, output is going to the normal sys.stdout and sys.stderr, and we have to fix that (there is no subprocess.run involved with build_main_impl). I fixed it via `sys.stdout = outfile`. I don't see what else to do.
I was uncertain whether that was the correct approach, but then I ran across contextlib.redirect_stdout in the standard library, and it does exactly what process_impl does: save the old value of sys.stdout, change sys.stdout to the new value, then restore it when done.
So I need more clarity about what you want here.
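For reference, contextlib.redirect_stdout performs exactly the save/replace/restore dance described above. Note, though, that it only redirects Python-level writes in the current process; it does not affect subprocess output (an io.StringIO has no file descriptor to inherit):

```python
import contextlib
import io

buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    print('captured')  # written to buf via the swapped sys.stdout
print('back on the console')  # sys.stdout is restored on exit
```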
```diff
@@ -523,7 +523,8 @@ def build(self, *, doc_build_folder, output_staging_directory):
     logger.info(
         f"Running sphinx-apidoc: '{' '.join(cmd)}' in '{wrapped_sphinx_directory}'"
     )
-    completed_process = subprocess.run(cmd, cwd=wrapped_sphinx_directory)
+    completed_process = subprocess.run(cmd, cwd=wrapped_sphinx_directory,
+                                       stdout=sys.stdout, stderr=sys.stderr)
```
Same here: this looks the same as the default behavior.
ditto
```python
msg = f"sphinx_build exited with return code '{returncode}'"
if returncode == 0:
    completed_process = subprocess.run(cmd, cwd=wrapped_sphinx_directory,
                                       stdout=sys.stdout, stderr=sys.stderr)
```
uncaptured here too
```python
self.__dict__.update(entries)
```

```python
def prepare_arguments(parser):
```
It would be good to expose BATCH_SIZE here too; otherwise it's a magic number.
As discussed below, I don't think this is useful to the typical user, so I'll just remove batching instead.
rosdoc2/verbs/scan/impl.py
Outdated
```python
logger_scan.info(f'Begin batch # {batch_no}')
for package in batch_packages:
    logger_scan.info(f'Adding {package.name} for processing')
pool = mp.Pool(maxtasksperchild=1)
```
We should definitely allow the pool size to be controlled as an argument. Otherwise some machines could easily be overrun if their memory-to-CPU-core ratios differ from what we usually use.
Wilco.
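A minimal sketch of what exposing the pool size could look like. The flag name `--jobs` and the `cpu_count()` default are assumptions for illustration, not rosdoc2's actual interface:

```python
import argparse
import multiprocessing as mp

parser = argparse.ArgumentParser()
parser.add_argument(
    '--jobs', type=int, default=mp.cpu_count(),
    help='number of worker processes used to build packages in parallel')
args = parser.parse_args(['--jobs', '2'])  # e.g. a user passing --jobs 2

# maxtasksperchild=1 as in the diff above: each worker is recycled
# after one package, limiting memory growth.
pool = mp.Pool(processes=args.jobs, maxtasksperchild=1)
pool.close()
pool.join()
```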
rosdoc2/verbs/scan/impl.py
Outdated
```python
packages = list(packages.values())
for i in range(len(packages)):
    batch_packages.append(packages.pop())
    if len(batch_packages) >= BATCH_SIZE:
```
At a broader view, what is the value of the batching here? It basically forces the mp.Pool to be reset periodically, and with the current default it's not expected to ever be hit. As such, might it make sense to reduce the complexity and not batch things before the pool? If someone wants to debug a build/scan, they can point the argument at the specific subfolder where the problem is being encountered, which is more intuitive than trying to tune the batch size.
The batching originally got added during development, when things were dying and it was really hard to figure out where and why. With small batch sizes, the problem could be isolated to a small number of packages. As a debugging tool, it was easy enough to just change the value in code to enable batching, hence it was not exposed to the user.
I have not had issues lately with it dying, though, so I think I've caught all of the problems with other tools (adding a timeout, running Sphinx as a subprocess again, better logging of individual package outputs). I left it in since it was useful during initial debugging, but I don't feel strongly about it, as --max-packages and --timeout mostly serve that need. I would rather remove it than expose it to the user, as it really is of no value except to the rosdoc2 developer. So unless you have another viewpoint, I'll remove the batching.
Yeah, I think now that it's working well, it would be better to remove it and keep the much simpler flow of build and list and execute it in a pool.
test/utils.py
Outdated
```diff
@@ -0,0 +1,187 @@
+# Copyright 2024 R. Kent James <[email protected]>
```
This file looks like the content is primarily moved content and not newly written so it should retain the original copyright owner.
OK
The commit I just did addresses all review comments, AFAICT.
Re "If you do find that this helps with that extra bit of debuggability, it's fine to keep it": if this comment is about the
Replaces #81
This PR implements a new top-level option, 'scan', which scans subdirectories and builds any found packages. It uses the same options as `build`, with a couple of additions: a per-package timeout and a maximum number of packages. I've been running variations of this for a few weeks now in a nightly Jenkins build of all ros2:iron packages, so the available output has been tweaked to focus on that environment.
It captures output from each package build in a file stdout.txt in the docs_build folder. Summary information is printed in the more typical console locations. Looking at the notes from #81, I believe this implements most of those requests, with the exception of the request to fail on the first error. Plenty of existing packages have errors, so I don't really see the value of that if you are running a full distro. If you are only running a repository, it runs quickly enough that I don't see the value of that option either. I do include a maximum-number-of-packages option, which is useful to limit time when testing. If you really want the fail-on-first-error feature, let's do that as a followup; it would not be very hard. But please state the use case.
On my system, if I set a `-t 1` timeout then about a third of the test packages time out, which can give some idea of what error output looks like. So I can run `rosdoc2 scan -p test -t 1` from the rosdoc2 directory as a simple test. Or you could do that on a repository with multiple packages instead.
I used colcon as requested, but I have some comments on that. First, the documentation on colcon that I could find is all from the perspective of using the commands that colcon generates; I could find no hints of how to use it as a Python library. Second, the code has multiple levels of abstraction, with key features added through extensions. Examining the code for, say, `colcon_core.package_discovery` gives no clue about which options are supported, as all of that is added in extensions. Which extensions? I have no idea. Basically through trial and error and guesswork, I discovered that colcon-common-extensions provides the functionality I need, with `--base-paths` as the option I needed to set. Third, using colcon as a package-discovery library requires 7 lines of obscure setup code. Yes, this was partly written as a demo for me, but that is not at all the expectation one has when using library code. In summary, I would not recommend that anyone else attempt to use colcon as library code like this.