Ocrd processors #9

Merged
merged 15 commits into from
Nov 9, 2022
kba
Collaborator

@kba kba commented Feb 21, 2022

Implement page2tsv and tsv2page as OCR-D processors, to be included in ocrd_all and then the OCR-D Butler.

All is working fine except the IIIF URL. I consistently fail to produce the right previews in the neat HTML.

There are two variants:

  • Unscaled, i.e. image dimensions == PAGE-XML dimensions, i.e. scale_factor == 1.0. In this case, assuming a width of 800, the IIIF URL looks like this:
    • https://<server>/<prefix>/<identifier>/left,top,width,height/800,/0/default.jpg
  • Scaled, i.e. the OCR was done on a downscaled version of the full scans, i.e. scale_factor != 1.0, determined by comparing with the images in another fileGrp such as MAX. The URL looks like this:
    • https://<server>/<prefix>/<identifier>/left,top,width,height/full/0/default.jpg
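
Both URL shapes above follow the IIIF Image API path pattern `{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. As a minimal sketch of how they could be assembled (the function name `iiif_region_url`, its parameters, and the example server are hypothetical and not taken from this PR's code):

```python
from typing import Optional


def iiif_region_url(base_url: str, identifier: str,
                    left: int, top: int, width: int, height: int,
                    size_width: Optional[int] = None) -> str:
    """Build an IIIF Image API URL for a rectangular region.

    base_url is assumed to be "https://<server>/<prefix>".
    If size_width is given, the size segment is "<size_width>,"
    (scale to that width, keep the aspect ratio); otherwise "full".
    """
    region = f"{left},{top},{width},{height}"
    size = f"{size_width}," if size_width is not None else "full"
    # Rotation 0, quality "default", format "jpg", as in the URLs above.
    return f"{base_url.rstrip('/')}/{identifier}/{region}/{size}/0/default.jpg"


# First URL shape from the list above (size segment "800,")
print(iiif_region_url("https://example.org/iiif", "some-id", 10, 20, 300, 40, size_width=800))
# Second URL shape from the list above (size segment "full")
print(iiif_region_url("https://example.org/iiif", "some-id", 10, 20, 300, 40))
```

Keeping the size segment as a single optional parameter makes it easy to try either variant against a server while debugging the previews.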

@labusch If you have any idea what I am doing wrong, I'd appreciate any hints.

Member

labusch commented Feb 22, 2022

Isn't that the wrong way round?
If image dimensions == PAGE-XML dimensions, the scale factor is == 1.0; otherwise it differs from 1.

  • https://<server>/<prefix>/<identifier>/left,top,width,height/800,/0/default.jpg

Why is there an 800? How is the scale factor computed?

Collaborator Author

kba commented Feb 22, 2022

Isn't that the wrong way round? If image dimensions == PAGE-XML dimensions, the scale factor is == 1.0; otherwise it differs from 1.

Yes, that is how it is meant, and how it is implemented.

  • https://<server>/<prefix>/<identifier>/left,top,width,height/800,/0/default.jpg

Why is there an 800? How is the scale factor computed?

In that case there is no scaling; instead, the IIIF URL should use the width of the PAGE-XML/image, assuming the aspect ratio matches.
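
The thread does not spell out how the scale factor is derived for the scaled variant. One plausible reading, shown here purely as an assumption, is the ratio between the width of the full-resolution image (e.g. from a fileGrp like MAX) and the width recorded in the PAGE-XML. The helper `scale_box` and the direction of the ratio are hypothetical, not the PR's implementation:

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # left, top, width, height


def scale_box(box: Box, page_width: int, full_width: int) -> Tuple[float, Box]:
    """Map a PAGE-XML bounding box to full-resolution pixel coordinates.

    Assumption: scale_factor = full_width / page_width, and the aspect
    ratio of both images is the same, so one factor serves both axes.
    """
    if page_width <= 0:
        raise ValueError("page_width must be positive")
    scale_factor = full_width / page_width
    left, top, width, height = box
    scaled = (round(left * scale_factor), round(top * scale_factor),
              round(width * scale_factor), round(height * scale_factor))
    return scale_factor, scaled


# OCR ran on a 1000 px wide image, the full scan is 2000 px wide.
factor, region = scale_box((10, 20, 300, 40), page_width=1000, full_width=2000)
print(factor, region)
```

When both widths are equal, the factor is 1.0 and the box is returned unchanged, which corresponds to the unscaled case.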

Collaborator Author

kba commented May 30, 2022

The outstanding issues are fixed, and I added a test to verify the behavior. So this can be merged AFAICT.

However, we should discuss (tomorrow) whether there is a more lightweight way to add the processors to ocrd_all without pulling in all the dependencies unrelated to the functionality here.

@kba kba marked this pull request as draft May 30, 2022 15:11
@bertsky bertsky left a comment

Don't know about the dependencies, but looking forward to seeing this in ocrd_all. (Or ocrd_fileformat / ocr-fileformat?)

tsvtools/ocrd_processors.py (outdated)
pcgts = page_from_file(self.workspace.download_file(input_file))
page = pcgts.get_Page()

iiif_url = iiif_url_template\

Does that work universally? If so, we should probably write an IIIF image importer for OCR-D from scratch (instead of extending and using https://github.com/karkraeg/iiimets).

Collaborator Author

To an extent. It is geared towards the ID conventions of @StaatsbibliothekBerlin ({{ PPN }}-{{ page_no }}), but apart from that, it is applicable to any IIIF URL scheme.
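
To illustrate the `{{ PPN }}-{{ page_no }}` convention, here is a rough sketch of filling such placeholders. The helper `fill_template` and the sample values are hypothetical; the processor in this PR may substitute its template differently:

```python
import re


def fill_template(template: str, **values: str) -> str:
    """Replace `{{ name }}` placeholders with the given keyword values.

    Raises KeyError if the template names a placeholder that was not supplied.
    """
    def substitute(match: "re.Match[str]") -> str:
        return str(values[match.group(1)])

    # Matches "{{ name }}" with optional whitespace inside the braces.
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


# Sample values are made up for illustration only.
identifier = fill_template("{{ PPN }}-{{ page_no }}", PPN="PPN123456789", page_no="00000001")
print(identifier)
```

A different ID convention would only need a different template string, which matches the point that the rest of the URL scheme is generic.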

@kba kba marked this pull request as ready for review November 8, 2022 15:25
@labusch labusch merged commit 3b10dcb into qurator-spk:master Nov 9, 2022
@kba kba deleted the ocrd-processors branch November 9, 2022 14:09