Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding --json to lint #654

Merged
merged 60 commits into from
Jun 1, 2023
Merged

Adding --json to lint #654

merged 60 commits into from
Jun 1, 2023

Conversation

linozen
Copy link
Member

@linozen linozen commented Jan 24, 2023

This should hopefully close #183.

More tests need to be added and more feedback gathered. This is just a first stab at adding this functionality. Maybe we get it done pre-FOSDEM!

Many thanks!

@linozen linozen requested review from mxmehl and carmenbianca and removed request for mxmehl January 24, 2023 16:35
@mxmehl
Copy link
Member

mxmehl commented Jan 25, 2023

Thanks a bunch for implementing this! I'd also like to make @alpianon aware of this PR as he asked for the feature IIRC.

Then, let me post some remarks and questions:

  1. My idea behind the source key in licenses and copyrights was that it should show where the information came from: the file directly, an accompanying .license file, or dep5? Currently, this always is file.spdxfile.name
  2. I tested it with a repo containing a lot of issues, and it threw a traceback with --json added:
$ reuse lint --json
reuse.project - WARNING - Could not resolve SPDX License Identifier of LICENSES/InVaLiD.txt, resolving to InVaLiD. Make sure the license is in the license list found at <https://spdx.org/licenses/> or that it starts with 'LicenseRef-', and that it has a file extension.
reuse.report - ERROR - Could not read 'README.md'
PermissionError: [Errno 13] Permission denied: 'README.md'
Traceback (most recent call last):
  File "/home/max/.cache/pypoetry/virtualenvs/reuse-RJlAEOXQ-py3.10/bin/reuse", line 6, in <module>
    sys.exit(main())
  File "/home/max/Git/github/reuse-tool/src/reuse/_main.py", line 283, in main
    return parsed_args.func(parsed_args, project, out)
  File "/home/max/Git/github/reuse-tool/src/reuse/lint.py", line 331, in run
    lint(data, formatter=formatter, out=out)
  File "/home/max/Git/github/reuse-tool/src/reuse/lint.py", line 311, in lint
    out.write(formatter(data))
  File "/home/max/Git/github/reuse-tool/src/reuse/lint.py", line 295, in format_json
    return json.dumps(
  File "/usr/lib/python3.10/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/usr/lib/python3.10/json/encoder.py", line 201, in encode
    chunks = list(chunks)
  File "/usr/lib/python3.10/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 439, in _iterencode
    yield from _iterencode(o, _current_indent_level)
  File "/usr/lib/python3.10/json/encoder.py", line 429, in _iterencode
    yield from _iterencode_list(o, _current_indent_level)
  File "/usr/lib/python3.10/json/encoder.py", line 325, in _iterencode_list
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 439, in _iterencode
    yield from _iterencode(o, _current_indent_level)
  File "/usr/lib/python3.10/json/encoder.py", line 436, in _iterencode
    raise ValueError("Circular reference detected")
ValueError: Circular reference detected

To reproduce:

git clone https://github.com/mxmehl/reuse-example.git test && cd test
git checkout so-many-errors
chmod 000 README.md   # also provoke a read error
reuse lint --json

The same error also occurs when linting the Linux kernel for example.

  1. Then, copy from Add --json flag to lint #183 (comment): The files key also shows the SPDX-FileCopyrightText tag in the value key. As per reuse spdx should exclude the tag from FileCopyrightText #536 we consider to remove that from the spdx subcommand, so we may also want to do it here.

Otherwise, it looks good to me so far. It would be interesting to use it against a few concrete test cases.

@linozen
Copy link
Member Author

linozen commented Apr 5, 2023

Thanks a bunch for implementing this! I'd also like to make @alpianon aware of this PR as he asked for the feature IIRC.

Then, let me post some remarks and questions:

  1. My idea behind the source key in licenses and copyrights was that it should show where the information came from: the file directly, an accompanying .license file, or dep5? Currently, this always is file.spdxfile.name

This is now fixed by slightly adapting the SpdxInfo namedtuple. We decided that the move from a namedtuple to a dataclass is best handled in #669. We left a note in #669 about where SpdxInfo needs to be adapted.

  1. I tested it with a repo containing a lot of issues, and it threw a traceback with --json added:
$ reuse lint --json
reuse.project - WARNING - Could not resolve SPDX License Identifier of LICENSES/InVaLiD.txt, resolving to InVaLiD. Make sure the license is in the license list found at <https://spdx.org/licenses/> or that it starts with 'LicenseRef-', and that it has a file extension.
reuse.report - ERROR - Could not read 'README.md'
PermissionError: [Errno 13] Permission denied: 'README.md'
Traceback (most recent call last):
  File "/home/max/.cache/pypoetry/virtualenvs/reuse-RJlAEOXQ-py3.10/bin/reuse", line 6, in <module>
    sys.exit(main())
  File "/home/max/Git/github/reuse-tool/src/reuse/_main.py", line 283, in main
    return parsed_args.func(parsed_args, project, out)
  File "/home/max/Git/github/reuse-tool/src/reuse/lint.py", line 331, in run
    lint(data, formatter=formatter, out=out)
  File "/home/max/Git/github/reuse-tool/src/reuse/lint.py", line 311, in lint
    out.write(formatter(data))
  File "/home/max/Git/github/reuse-tool/src/reuse/lint.py", line 295, in format_json
    return json.dumps(
  File "/usr/lib/python3.10/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/usr/lib/python3.10/json/encoder.py", line 201, in encode
    chunks = list(chunks)
  File "/usr/lib/python3.10/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 439, in _iterencode
    yield from _iterencode(o, _current_indent_level)
  File "/usr/lib/python3.10/json/encoder.py", line 429, in _iterencode
    yield from _iterencode_list(o, _current_indent_level)
  File "/usr/lib/python3.10/json/encoder.py", line 325, in _iterencode_list
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 439, in _iterencode
    yield from _iterencode(o, _current_indent_level)
  File "/usr/lib/python3.10/json/encoder.py", line 436, in _iterencode
    raise ValueError("Circular reference detected")
ValueError: Circular reference detected

To reproduce:

git clone https://github.com/mxmehl/reuse-example.git test && cd test
git checkout so-many-errors
chmod 000 README.md   # also provoke a read error
reuse lint --json

The same error also occurs when linting the Linux kernel for example.

This should now be fixed but using a custom JSON serializer.

  1. Then, copy from Add --json flag to lint #183 (comment): The files key also shows the SPDX-FileCopyrightText tag in the value key. As per reuse spdx should exclude the tag from FileCopyrightText #536 we consider to remove that from the spdx subcommand, so we may also want to do it here.

IMHO, this is probably best treated in a separate PR as it does not only relate to the JSON output of lint. What do you think?

@mxmehl
Copy link
Member

mxmehl commented Apr 5, 2023

Thank you! I can a few first tests and noticed some issues. But first of all, I took the liberty to merge latest master in the branch.

  1. Then, copy from Add --json flag to lint #183 (comment): The files key also shows the SPDX-FileCopyrightText tag in the value key. As per reuse spdx should exclude the tag from FileCopyrightText #536 we consider to remove that from the spdx subcommand, so we may also want to do it here.

IMHO, this is probably best treated in a separate PR as it does not only relate to the JSON output of lint. What do you think?

Yes, that makes sense.

Now, to the issues I found:

  1. I noticed that many outputs are appearing in a random order, e.g. the list of used licenses, both in the JSON and plain text output. Also, the files in the files key would be great to have sorted. It's not critical but a UX issue. Two runs of the program should result in the very same output.

  2. I see some issues with the detection of where a file is described. Perhaps the core problem lies further down than what you touched in this PR, but it's clearly wrong. I made some test cases in my fork:

git clone https://github.com/mxmehl/reuse-example.git test && cd test
git checkout json-tests

cat.jpg

With my edits, cat.jpg is described in both the cat.jpg.license as well as .reuse/dep5. The tool detects the different source, but does not mention them correctly here. Both the "Tester Testerson" copyright and the "Apache-2.0" license are in the dep5 file.

    "img/cat.jpg": {
      "copyrights": [
        {
          "value": "SPDX-FileCopyrightText: 2017 Peter Janzen",
          "source": "img/cat.jpg.license"
        },
        {
          "value": "Tester Testerson",
          "source": "img/cat.jpg.license"
        }
      ],
      "licenses": [
        {
          "value": "CC-BY-4.0",
          "source": "img/cat.jpg.license"
        },
        {
          "value": "Apache-2.0",
          "source": "img/cat.jpg.license"
        }
      ]
    }

However, for dog.jpg, it correctly displays the origin as dep5, but for this image there is no .license file.

.gitignore

That's even more complicated. The file itself contains only copyright info, the .license file a license and copyright, and dep5 a copyright and some invalid licensing info. This is the result:

    ".gitignore": {
      "copyrights": [
        {
          "value": "SPDX-FileCopyrightText: 2019 Jane Doe <[email protected]>",
          "source": ".gitignore.license"
        }
      ],
      "licenses": [
        {
          "value": "CC0-1.0",
          "source": ".gitignore.license"
        }
      ]
    }

As you see, the copyright in the file itself is ignored. I think this is intentional as we once decided that such a file can override all contents from the file itself to circumvent false-positive info. We may want to rethink that.

However, for some reason, the dep5 info is completely ignored here. I cannot explain why this is different from the cat.jpg case.

@linozen
Copy link
Member Author

linozen commented Apr 6, 2023

Thanks for your feedback @mxmehl!

  • I noticed that many outputs are appearing in a random order, e.g. the list of used licenses, both in the JSON and plain text output. Also, the files in the files key would be great to have sorted. It's not critical but a UX issue. Two runs of the program should result in the very same output.

Yes, this was absolutely an issue. The last two commits should have fixed that for JSON and plaintext output respectively.

  • I see some issues with the detection of where a file is described. Perhaps the core problem lies further down than what you touched in this PR, but it's clearly wrong. I made some test cases in my fork.

Yes, I also expect the problem to have deeper roots then this PR. In a sense, we're now exposing REUSE internals (e.g. where a license/copyright information was gathered from) that were not exposed previously. Before JSON output, the tool only really needed to know whether a file had license and copyright information attached to it and whether is info was valid. It didn't really matter where it came from. Now with the much more verbose JSON output we're facing the following issue:

It is unclear what to do when license and copyright information is described in two sources, i.e. a .license file and a dep5 file and which one takes precedence. This is related to this docs issue about precedence and also this issue about the undesired merging of copyright information from different sources.

My approach to this would be:

  1. Clearly define precedence in the the spec. Option 3 as outlined by @mxmehl in the issue seems to be most popular. It is this:

file.license
explicit in adjacent yaml (so file directly)
explicit in faraway
file header
glob in adjacent yaml (file covered e.g. as part of *.png)
glob in faraway yaml

IMHO, we would need to include DEP5 here as well as we're not hard deprecating its use.

  1. Implement precedence in this PR (keeping in mind that we don't have REUSE.yaml yet. The code where that happens is here and it should not be too hard to (a) disable the merging behaviour described in this issue and (b) implement the precedence logic to be implemented until we have REUSE.yaml.

What do you think?

Copy link
Member

@carmenbianca carmenbianca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of remarks. I like the bones of this PR. There's just a lot of nitpicking 😅

src/reuse/report.py Outdated Show resolved Hide resolved
src/reuse/report.py Outdated Show resolved Hide resolved
src/reuse/report.py Show resolved Hide resolved
src/reuse/report.py Outdated Show resolved Hide resolved
tests/test_main.py Show resolved Hide resolved
src/reuse/lint.py Outdated Show resolved Hide resolved
src/reuse/lint.py Outdated Show resolved Hide resolved
src/reuse/lint.py Outdated Show resolved Hide resolved
src/reuse/lint.py Outdated Show resolved Hide resolved
src/reuse/lint.py Outdated Show resolved Hide resolved
src/reuse/lint.py Show resolved Hide resolved
src/reuse/project.py Show resolved Hide resolved
src/reuse/lint.py Outdated Show resolved Hide resolved
Copy link
Contributor

@floriansnow floriansnow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Would you please also check if the copyright information is updated in all files you contributed to? It looked to me like some may not have changed, but perhaps those were renamed files.

src/reuse/__init__.py Show resolved Hide resolved
src/reuse/lint.py Show resolved Hide resolved
src/reuse/__init__.py Outdated Show resolved Hide resolved
This commit adds a new enum `SourceType` with three possible values to indicate if the source type is a `.license file`, `file header` or `DEP5 file`. It then updates the usage of `source_type` by replacing the string type with the new `SourceType` enum type. This improves readability and makes the code more maintainable.
@linozen
Copy link
Member Author

linozen commented Jun 1, 2023

LGTM! Would you please also check if the copyright information is updated in all files you contributed to? It looked to me like some may not have changed, but perhaps those were renamed files.

Since I worked during my work hours, I'm fine with FSFE holding the copyright.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add --json flag to lint
4 participants