Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ocrd-tool schema: be less restrictive on input/ouptut_filegrp #168

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

kba
Copy link
Member

@kba kba commented Jul 25, 2020

Designing the ocrd-sanitize processor makes it obvious that the input_file_grp/output_file_grp are not only obsolete but an anti-pattern:

  • The semantics are unintuitive or rather nonsensical, examples of how file groups MIGHT be named (but never are in practice) do not help
  • There are use cases where the input is not strictly mets:files from a mets:fileGrp. Since the core implementation depends on a single input file group (Processor.input_files property) we cannot just drop it though (hence the ["PLACEHOLDER"] default)
  • The naming restriction on top of input_file_grp/output_file_grp makes it worse

Copy link
Collaborator

@bertsky bertsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fully agree on the general direction, but...

* Since the core implementation depends on a single input file group (`Processor.input_files` property) we cannot just drop it though (hence the `["PLACEHOLDER"]` default)

Are you referring to OCR-D/core#274 by any chance? It looks like this PR is trying to conflate a) providing a default for the fileGrps in the tool json (and clarifying their semantics) with b) how core handles these when the processor is called without fileGrp parameters. (IMO we could live with INPUT and OUTPUT as default examples, but should remove them as defaults in the Processor runtime.)

Also related: OCR-D/core#364

ocrd_tool.schema.yml Outdated Show resolved Hide resolved
ocrd_tool.schema.yml Outdated Show resolved Hide resolved
ocrd_tool.schema.yml Outdated Show resolved Hide resolved
ocrd_tool.schema.yml Outdated Show resolved Hide resolved
kba and others added 4 commits July 27, 2020 12:30
Co-authored-by: Robert Sachunsky <[email protected]>
Co-authored-by: Robert Sachunsky <[email protected]>
Co-authored-by: Robert Sachunsky <[email protected]>
Co-authored-by: Robert Sachunsky <[email protected]>
@kba
Copy link
Member Author

kba commented Jul 27, 2020

Are you referring to OCR-D/core#274 by any chance?

That issue is obsolete, based on a confused understanding of the semantics of output_file_grp

I was mostly thinking of

    @property
    def input_files(self):
        """
        List the input files
        """
        return self.workspace.mets.find_files(fileGrp=self.input_file_grp, pageId=self.page_id)

in BaseProcessor. But thinking about it some more, if a processor does not set input_file_grp, it just should not use self.input_files. And the placeholder value in the ocrd-tool.json is unrelated to self.input_file_grp (the same mistaken interpretation as OCR-D/core#274). So even the placeholders can go IMHO.

@cneud
Copy link
Member

cneud commented Feb 4, 2021

It seems this went circle, but towards the general question of relaxing input/output fileGrp conventions: I would be fine with any form of relaxation that will still allows us to maintain direct ability to do a round trip DFG-Viewer-METS --> OCR-D -->DFG-Viewer-METS. E.g. this could be just one possible input setting/template for fileGrps?

@bertsky bertsky self-requested a review February 5, 2021 13:02
@bertsky
Copy link
Collaborator

bertsky commented Feb 5, 2021

It seems this went circle, but towards the general question of relaxing input/output fileGrp conventions: I would be fine with any form of relaxation that will still allows us to maintain direct ability to do a round trip DFG-Viewer-METS --> OCR-D -->DFG-Viewer-METS. E.g. this could be just one possible input setting/template for fileGrps?

This PR is not about the relaxation of our fileGrp (METS) conventions, though. It is about the interpretation of the fileGrp section in the ocrd-tool.json, namely becoming a mere example of how the user could wire that processor in a workflow, and their defaults in case the developer did not specify any.

We've discussed templating fileGrp names for likely or simple workflows for a while, but agreed against this IIRC because this would almost always be impractical anyway, quickly clash, and raise more misconception and irritation on the user side than provide actual value.

But maybe I misunderstood what you meant by getting that roundtrip.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants