workspace validator: non-URI path #1179

Open
bertsky opened this issue Jan 30, 2024 · 7 comments

@bertsky
Collaborator

bertsky commented Jan 30, 2024

Running ocrd workspace validate, I got:

<report valid="false">
  <error>METS has no unique identifier</error>
  <error>Validation aborted with exception: Traceback (most recent call last):
  File "/data/ocr-d/ocrd_all/venv38/lib/python3.8/site-packages/ocrd_validators/workspace_validator.py", line 149, in _validate
    self._validate_mets_files()
  File "/data/ocr-d/ocrd_all/venv38/lib/python3.8/site-packages/ocrd_validators/workspace_validator.py", line 302, in _validate_mets_files
    scheme = f.url[0:f.url.index(':')]
ValueError: substring not found
</error>
</report>

The url in question simply was a relative file name, which obviously makes the URI validator crash.
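
To make the failure mode concrete, here is a minimal sketch that reproduces it outside the validator. `scheme_by_index` mirrors the expression from the traceback; `scheme_or_none` is a hypothetical helper (not part of ocrd) that uses the standard library's `urlparse` and so does not raise on a relative path.

```python
from urllib.parse import urlparse


def scheme_by_index(url):
    # Same expression as in the traceback: str.index raises ValueError
    # when the string contains no colon at all.
    return url[0:url.index(':')]


def scheme_or_none(url):
    # Hypothetical alternative: urlparse returns an empty scheme for a
    # relative path instead of raising, so we can map that to None.
    return urlparse(url).scheme or None


relative = "OCR-D-IMG/2812988X_1862-09-02_001.jpg"

try:
    scheme_by_index(relative)
except ValueError as exc:
    print("crash:", exc)

print(scheme_or_none(relative))
print(scheme_or_none("https://example.org/img.jpg"))
```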

This is very problematic for two reasons:

  1. Before ocrd differentiated between LOCTYPE=URL and OTHER, we created lots of data (including GT) with LOCTYPE=URL despite the hrefs being local paths; all of this is now broken.
  2. In this case, the data had just been created by current ocrd itself via ocrd workspace add, because that implementation sets both local_filename and url to the local path.

@kba added the bug label on Jan 30, 2024
@kba self-assigned this on Jan 30, 2024

@mikegerber
Contributor

mikegerber commented Feb 28, 2024

Yeah, this happens with a workspace built with ocrd workspace itself...

ocrd workspace add seems to produce, for example, this:

      <mets:file ID="XXX" MIMETYPE="image/jpeg">
        <mets:FLocat xlink:href="OCR-D-IMG/2812988X_1862-09-02_001.jpg" LOCTYPE="OTHER" OTHERLOCTYPE="FILE"/>
        <mets:FLocat xlink:href="OCR-D-IMG/2812988X_1862-09-02_001.jpg" LOCTYPE="URL"/>
      </mets:file>

@mikegerber
Contributor

mikegerber commented Feb 28, 2024

  1. Removing the FLocats with LOCTYPE="URL"
  2. and making the image filename referenced in the PAGE XML consistent (I imported the XML)

fixes the validation at least (in the sense that it doesn't choke on exceptions itself).
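
For anyone who wants to script step 1, here is a rough sketch using only the standard library. It is an assumption-laden variation, not an ocrd API: `drop_url_flocats` is a hypothetical helper, and it only removes LOCTYPE="URL" entries whose href contains no colon (i.e. cannot carry a scheme), so genuine remote URLs are left alone.

```python
import xml.etree.ElementTree as ET

METS_NS = 'http://www.loc.gov/METS/'
XLINK_NS = 'http://www.w3.org/1999/xlink'
NS = {'mets': METS_NS, 'xlink': XLINK_NS}


def drop_url_flocats(mets_xml):
    """Remove FLocat elements with LOCTYPE="URL" whose href has no scheme.

    Hypothetical helper: takes METS as a string, returns the parsed root
    element and the number of removed FLocat elements.
    """
    root = ET.fromstring(mets_xml)
    removed = 0
    # Materialize the file list first so removals cannot disturb iteration.
    for file_el in list(root.iter('{%s}file' % METS_NS)):
        for flocat in file_el.findall('mets:FLocat', NS):
            href = flocat.get('{%s}href' % XLINK_NS, '')
            if flocat.get('LOCTYPE') == 'URL' and ':' not in href:
                file_el.remove(flocat)
                removed += 1
    return root, removed
```

Step 2 (making the image filename in the PAGE XML consistent) is not covered here, and the result would still need to be serialized back to the METS file.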

@bertsky
Collaborator Author

bertsky commented Feb 29, 2024

add also created this structMap:

That's also what I witnessed as the prime problem in OCR-D/ocrd_tesserocr#201. We need more diagnostics on why and exactly when this is happening.

But this is a separate issue (has nothing to do with the validator).

@mikegerber
Contributor

But this is a separate issue (has nothing to do with the validator).

True, I'll open another GitHub issue for that, if you didn't already.

@bertsky
Collaborator Author

bertsky commented Mar 1, 2024

I'll open another GitHub issue for that, if you didn't already.

No, please do!

@mikegerber
Contributor

No, please do!

Just for the sake of completeness: #1195

@bertsky
Collaborator Author

bertsky commented Aug 1, 2024

With 2. solved in #1199, only 1. is left.

The URI validator should not crash – it should first check if the href is in fact a URI and then simply add a specific error to the report.
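
A minimal sketch of that behaviour, under stated assumptions: `check_hrefs` and the error wording are hypothetical and do not reflect the actual ocrd_validators code. The idea is just to test for an RFC 3986 scheme prefix first and collect an error message instead of letting an exception escape.

```python
import re

# RFC 3986: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), followed by ":"
SCHEME_RE = re.compile(r'^[A-Za-z][A-Za-z0-9+.-]*:')


def check_hrefs(files):
    """Return error messages for hrefs that are declared as URL but lack a scheme.

    `files` is an iterable of (file_id, href) pairs; this signature is an
    assumption made for the sketch.
    """
    errors = []
    for file_id, href in files:
        if not SCHEME_RE.match(href):
            # Report the problem instead of raising, so validation continues.
            errors.append("File '%s' has non-URI href '%s'" % (file_id, href))
    return errors


errors = check_hrefs([
    ("XXX", "OCR-D-IMG/2812988X_1862-09-02_001.jpg"),
    ("YYY", "https://example.org/img.jpg"),
])
for message in errors:
    print(message)
```

With a check like this, the relative path from the report above would show up as one more error entry rather than aborting the whole validation run.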
