
workspace bagger: allow selecting pages for download/inclusion #1215

Open · bertsky opened this issue Apr 24, 2024 · 1 comment

bertsky (Collaborator) commented Apr 24, 2024

It would be nice if ocrd zip bag supported creating partial clones, keeping some FLocats as mere URLs instead of local paths in the payload.

Possible use cases:

  • gt-repo-template on an existing METS with annotations on only some pages: the BagIt should not be bloated by images alone
  • long-term archiving ingest with a partial update (some pages/fileGrps)
  • data transfer for processing with page range split across nodes
  • sharing workspaces for debugging purposes: only those fileGrps/pages relevant to the issue (but keeping the others for reproducibility)

On the CLI, it would just be another option, but I am not sure it's even allowed in the BagIt data format.
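
For illustration, a hypothetical sketch of what such a selection could look like on the Python API; the page_id parameter here does not exist on WorkspaceBagger.bag() and merely stands in for the proposed option:

    # Hypothetical sketch: page_id is not an existing parameter of
    # WorkspaceBagger.bag(); files outside the selection would keep their
    # FLocat as a URL instead of a local path in the payload.
    from ocrd.resolver import Resolver
    from ocrd.workspace import Workspace
    from ocrd.workspace_bagger import WorkspaceBagger

    resolver = Resolver()
    workspace = Workspace(resolver, directory='/path/to/workspace')
    WorkspaceBagger(resolver).bag(
        workspace,
        ocrd_identifier='https://example.org/my-workspace',  # placeholder
        dest='partial.ocrd.zip',
        page_id='PHYS_0001..PHYS_0010',  # hypothetical page range selector
    )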

@MehmedGIT (Contributor) commented
Here is the request we talked about during our meeting today. Please take a look at the following block of code:

    # ocrd core imports (module paths as of ocrd v2)
    from ocrd.resolver import Resolver
    from ocrd.workspace import Workspace
    from ocrd.workspace_bagger import WorkspaceBagger

    resolver = Resolver()
    # workspace_dir, mets_basename, ocrd_identifier and bag_dest are placeholders
    workspace = Workspace(resolver, directory=workspace_dir, mets_basename=mets_basename)
    WorkspaceBagger(resolver).bag(
        workspace,
        ocrd_identifier=ocrd_identifier,
        dest=bag_dest,
        ocrd_mets=mets_basename,
        processes=1
    )

It would be great if the WorkspaceBagger.bag() method also took an extra flag skip_download to avoid downloading file groups that do not exist on local storage. There are, of course, the whitelist and blacklist options include_fileGrp and exclude_fileGrp, which can achieve the same effect by simply ignoring some file groups, but that requires extra steps plus knowledge of which file groups are locally available and which are not (a sketch of that workaround follows below). I am mainly interested in doing this programmatically; how the bagger CLI should handle skip_download does not matter much, so no extra requirements there.
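
For reference, here is a minimal sketch of that workaround, assuming include_fileGrp takes a list of file group USE attributes as a whitelist (workspace_dir, mets_basename, ocrd_identifier and bag_dest are placeholders as above):

    from ocrd.resolver import Resolver
    from ocrd.workspace import Workspace
    from ocrd.workspace_bagger import WorkspaceBagger

    resolver = Resolver()
    workspace = Workspace(resolver, directory=workspace_dir, mets_basename=mets_basename)
    # Keep only those file groups whose files all have a local_filename,
    # i.e. are already present on local storage rather than just a remote URL
    local_grps = [grp for grp in workspace.mets.file_groups
                  if all(f.local_filename for f in workspace.mets.find_files(fileGrp=grp))]
    WorkspaceBagger(resolver).bag(
        workspace,
        ocrd_identifier=ocrd_identifier,
        dest=bag_dest,
        ocrd_mets=mets_basename,
        include_fileGrp=local_grps,
        processes=1
    )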
