Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sort Turtle output #1978

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from
Draft

Conversation

ajnelson-nist
Copy link
Contributor

Summary of changes

This patch series starts an API for sorting graph serializations, beginning with Turtle. The main objective is to produce consistent Turtle output, no matter the order of RDF triples being added to a Graph.

This PR will close Issue 1890.

The initial pair of patches only starts the PR. Some discussion will be needed to design the remaining patches.

The PR's total effects are expected to be additive and preserve backward compatibility.

Checklist

  • Checked that there aren't other open pull requests for the same change.
  • Added tests for any changes that have a runtime impact.
  • Checked that all tests pass. (Test added, known to not yet pass, but should pass by end of PR development.)
  • Checked that all type checking passes. (Type signatures are being added to touched code.)
  • For changes that have a potential impact on users of this project:
    • Updated relevant documentation to avoid inaccuracies. (Not yet; unknown what to update.)
    • Considered adding additional documentation. (Not yet; unknown what to update.)
    • Considered adding an example in ./examples for new features. (Not yet; unknown what to update. Hard-coded sorted graph in new unit test can be copied to desired location.)
    • Considered updating our changelog (CHANGELOG.md).
  • Considered granting push permissions to the PR branch, so maintainers can fix minor issues and keep your PR up to date.

This patch adds a test to start specifying what sorting Turtle output
would look like.  This is intended to start discussion about
expectations of blank node sorting, and to set an initial interface for
triggering sorted output with a propagated keyword argument in
`Graph.serialize()`.

This patch will fail CI, but should not fail for code-style reasons.
The new test script was reviewed with black, flake8, isort, and
mypy (--strict).

References:
* RDFLib#1890

Signed-off-by: Alex Nelson <[email protected]>
@ajnelson-nist ajnelson-nist marked this pull request as draft June 1, 2022 22:21
@aucampia
Copy link
Member

aucampia commented Jun 4, 2022

Something that may be of interest to you:

https://github.com/aucampia/rdflib/blob/38e854abc845a13dcaa61a66d5fcdba69bcf3ba1/test/utils/earl.py#L438-L446

class OrderedMemory(Memory):
    def __init__(self, configuration=None, identifier=None):
        super().__init__(configuration, identifier)
        self.__spo = OrderedDict()
        self.__pos = OrderedDict()
        self.__osp = OrderedDict()
        self.__namespace = OrderedDict()
        self.__prefix = OrderedDict()
        self.__context_obj_map = OrderedDict()

I wanted stable output for test reports, so it is easier to diff them and so they are not just noisy spam when added to git, so I made that. Seems to work, but it depends on order that things are added, and I'm naming blank nodes to make it work. Doing something similar with SortedDict may also work.

@ajnelson-nist
Copy link
Contributor Author

@aucampia - thanks for the strategy suggestion. I just gave SortedDict a shot in the same manner as you used in the earl test. It wasn't that simple a solution, I'm afraid.

After reviewing the code, there's a chance SortedDict won't be needed. I think where most of the sorting would need to have an impact would be in the Turtle (not longturtle) RecursiveSerializer class, particularly its objectList() method and the nearby _object_comparator sort-key function. The random-order Git noise is coming from BNodes being cast with str() in _object_comparator. If instead the BNodes could be compared in a consistently-sorting form, the random-noise issue should go away naturally, because sortProperties() unconditionally sorts object-lists.

I think one end-result of this analysis is that making consistently-sorting BNodes may have a significant impact on the serializing algorithm flow. BNodes would need to be serialized into some sortable form first (a flat string?), so they may need to be crawled/stringified before doing anything else. There may be some optimizations that can be done based on number of triples where the BNode is the object.

Does this way of thinking sound like it's on the right track?

Unrelatedly, I was using mypy --strict on the new test. Once I brought Memory into the test file, mypy --strict crawls rdflib/plugins/stores/memory.py and comes back with ~1,100 wide-spreading issues, mostly that type annotations haven't been done yet except in a few spots. So, that'll be a decent leg of type review to bring to fruition, but likely doesn't need to be a blocker on this PR.

@nicholascar
Copy link
Member

@ajnelson-nist do you want to pick this up again, now that we have R's being merged again?

Also, is this in line with recent W3C work on canonical hashing of RDF graphs:

@nicholascar nicholascar added the awaiting feedback More feedback is needed from the author of the PR or Issue. label Jul 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting feedback More feedback is needed from the author of the PR or Issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sorting Turtle output?
3 participants