Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pip metadata refactoring #680

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

slimreaper35
Copy link
Collaborator

@slimreaper35 slimreaper35 commented Oct 9, 2024

My local approximate results:

(venv) ~/cachi2 (main) $ time tox -e py312

real    0m24.600s
user    0m23.449s
sys     0m1.419s

(venv) ~/cachi2 (pip-refactoring) $ time tox -e py312

real    0m13.625s
user    0m12.713s
sys     0m0.997s

Maintainers will complete the following section

  • Commit messages are descriptive enough
  • Code coverage from testing does not decrease and new code is covered
  • Docs updated (if applicable)
  • Docs links in the code are still valid (if docs were updated)

Note: if the contribution is external (not from an organization member), the CI
pipeline will not run automatically. After verifying that the CI is safe to run:

Copy link
Collaborator

@a-ovchinnikov a-ovchinnikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with some minor nitpicks.

cachi2/core/package_managers/pip.py Outdated Show resolved Hide resolved
cachi2/core/package_managers/pip.py Outdated Show resolved Hide resolved
@eskultety
Copy link
Member

There's too much going on in this single commit, so it's difficult to follow all the changes in the diff, please introduce them gradually.

Copy link
Member

@eskultety eskultety left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very much in favour of this work, needs some polishing though.

),
docs=PIP_METADATA_DOC,
)
return None, None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This refactor isn't a 1:1 replacement - previously if a name was found it was carried over to other source checks. Here, if you extract name but can't find version your code will resort to returning (None,None) which means we'll infer the name of the project from the URL instead, that doesn't sound right.

cachi2/core/package_managers/pip.py Show resolved Hide resolved
cachi2/core/package_managers/pip.py Outdated Show resolved Hide resolved
Comment on lines 280 to 291
First, try to parse the setup.py script (if present) and extract name and version
from keyword arguments to the setuptools.setup() call. If either name or version
could not be resolved and there is a setup.cfg file, try to fill in the missing
values from metadata.name and metadata.version in the .cfg file.
repo_name = Path(repo_id.parsed_origin_url.path.removesuffix(".git")).name
subpath = package_dir.subpath_from_root
resolved_name = Path(repo_name).joinpath(subpath)
return canonicalize_name(str(resolved_name).replace("/", "-")).strip("-.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This extraction should be a separate commit. Actually 2:

  • one to perform the extraction
  • one to rearrange the logic

nitpick: While at it, you can get rid of removesuffix and use stem from Path instead of name.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to not rearrange the commits in this PR at all + don't waste too much time here. Each commit would require modification in unit tests, and that is not trivial based on the extremely complicated logic in there. I chose a much easier path, so, delete everything and write this from scratch.

I know, in GitHub the PR is basically unreviewable. Sorry for that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know, in GitHub the PR is basically unreviewable. Sorry for that.

First things first, GH has nothing to do with this, I review stuff locally with 50 lines of context usually and even with flags to git to detect moved code doesn't change the situation in any way.

I would like to not rearrange the commits in this PR at all + don't waste too much time here. Each commit would require modification in unit tests, and that is not trivial based on the extremely complicated logic in there. I chose a much easier path, so, delete everything and write this from scratch.

The only blocker for splitting here is the unit tests here, I agree. But given that this refactor actually changes the behavioral logic such that we don't return the same results as previously from the pip_metadata helper function. Additionally, the tests that you introduce actually don't test the compound cases which is the whole point of the pip metadata querying logic (e.g. name comes from one project file, version from another one). Since it looks like you'll have to rework the unit tests quite a bit that gives you the opportunity to split the changes into commits - I have actually tried myself and was pretty straightforward except for the tests, but like I said, looks like you'll have to rework that.

cachi2/core/package_managers/pip.py Outdated Show resolved Hide resolved
cachi2/core/package_managers/pip.py Outdated Show resolved Hide resolved
Copy link
Member

@eskultety eskultety left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After having carefully gone through the unit tests which I didn't do in my first round of reviews I think we're actually opening us up for potential issues with pyproject.toml setup.py etc. mixed metadata.
I think that while we may cosmetically change the code and break the logic into smaller helper functions, we'll have to test the metadata querying in the compound way we're doing now.

Comment on lines 316 to 331
# setup.py
if setup_py.exists():
log.debug("Checking setup.py for metadata")
name = setup_py.get_name()
version = setup_py.get_version()

if name and version:
return name, version

# setup.cfg
if setup_cfg.exists():
log.debug("Checking setup.cfg for metadata")
name = setup_cfg.get_name()
version = setup_cfg.get_version()

return name, version
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still not a 1:1 refactor and I'd argue this is a breaking change in behaviour (although in rare cases, but it is) - if you can't infer version from e.g. pyproject.toml you try setup.py, but if setup.py only defines version, but not name, you'll overwrite the name to None and eventually fall back to infering the name from the repo URL identifier which is not what the previous behaviour did and I'm not sure we want this to be changed.

Comment on lines 57 to 58
@pytest.mark.parametrize(
"dynamic_version",
[True, False],
)
@mock.patch("cachi2.core.package_managers.pip.PyProjectTOML")
@mock.patch("cachi2.core.package_managers.pip.get_repo_id")
def test_get_pip_metadata(
mock_get_repo_id: mock.Mock,
def test_get_pip_metadata_from_pyproject_toml(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You decreased the value of the unit test by adding per-project file test case because you never test the combinations by making the main test split into multiple ones which is surely more readable but only part of the original logic is being tested, that was the point of the complex test.

Comment on lines 280 to 291
First, try to parse the setup.py script (if present) and extract name and version
from keyword arguments to the setuptools.setup() call. If either name or version
could not be resolved and there is a setup.cfg file, try to fill in the missing
values from metadata.name and metadata.version in the .cfg file.
repo_name = Path(repo_id.parsed_origin_url.path.removesuffix(".git")).name
subpath = package_dir.subpath_from_root
resolved_name = Path(repo_name).joinpath(subpath)
return canonicalize_name(str(resolved_name).replace("/", "-")).strip("-.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know, in GitHub the PR is basically unreviewable. Sorry for that.

First things first, GH has nothing to do with this, I review stuff locally with 50 lines of context usually and even with flags to git to detect moved code doesn't change the situation in any way.

I would like to not rearrange the commits in this PR at all + don't waste too much time here. Each commit would require modification in unit tests, and that is not trivial based on the extremely complicated logic in there. I chose a much easier path, so, delete everything and write this from scratch.

The only blocker for splitting here is the unit tests here, I agree. But given that this refactor actually changes the behavioral logic such that we don't return the same results as previously from the pip_metadata helper function. Additionally, the tests that you introduce actually don't test the compound cases which is the whole point of the pip metadata querying logic (e.g. name comes from one project file, version from another one). Since it looks like you'll have to rework the unit tests quite a bit that gives you the opportunity to split the changes into commits - I have actually tried myself and was pretty straightforward except for the tests, but like I said, looks like you'll have to rework that.

@slimreaper35
Copy link
Collaborator Author

After having carefully gone through the unit tests which I didn't do in my first round of reviews I think we're actually opening us up for potential issues with pyproject.toml setup.py etc. mixed metadata.

That's a good point. The fact that we were mixing metadata from multiple project configuration files was there reason why we ended up with extremely complicated and long unit tests. Splitting name and version into multiple configuration files makes no sense on its own. In the end, we only need the name, one string, for an SBOM component.

- consolidate metadata extraction logic from files
  (pyproject.toml, setup.py, and setup.cfg) into a single function
- consolidate extraction logic from the origin remote to a single function
- clean up too many logging statements during execution
- drastically simplify unit tests and speed up overall time
- preserve the same coverage
- get rid of pyproject.toml dynamic version warnings

Signed-off-by: Michal Šoltis <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants