recognize: use primaryScript or TextStyle to load model #69

Closed
bertsky opened this issue Aug 5, 2019 · 2 comments
Labels
enhancement New feature or request

Comments

@bertsky
Collaborator

bertsky commented Aug 5, 2019

Currently, the OCR model has to be selected once in the parameter JSON and is then fixed for the whole pipeline (all pages, all regions, all lines). We should at least offer a setting like dynamic that instead looks at ...

  • mods:language of the workspace's METS file
  • @primaryScript and @secondaryScript of the elements to be processed (or their parents), depending on textequiv_level
  • TextStyle/@fontFamily of the elements to be processed (or their parents), depending on textequiv_level – as described by the spec

...and combines this information somehow to select one of the predefined models. (Predefined could include custom built models, though. So maybe this must be more than a single new value in the parameter file.)
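To make the "(or their parents)" part concrete, here is a minimal sketch of how an annotation could be inherited down the hierarchy. The `Segment` class and the `inherited` helper are hypothetical stand-ins for illustration only, not the PAGE API, and the attribute values are just examples.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    """Hypothetical stand-in for a PAGE element (page, region, line, ...)."""
    primary_script: Optional[str] = None
    font_family: Optional[str] = None
    parent: Optional["Segment"] = None


def inherited(segment: Segment, attribute: str) -> Optional[str]:
    """Return the nearest non-empty value of `attribute`, walking up the parents."""
    node = segment
    while node is not None:
        value = getattr(node, attribute)
        if value:
            return value
        node = node.parent
    return None


# Example hierarchy: the line itself carries no annotation at all.
page = Segment(primary_script="Latn")
region = Segment(font_family="Fraktur", parent=page)
line = Segment(parent=region)

script = inherited(line, "primary_script")  # taken from the page
font = inherited(line, "font_family")       # taken from the region
```

With this kind of lookup, annotating only pages or regions would still give every line something to select a model from.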

@bertsky
Collaborator Author

bertsky commented Jan 22, 2021

> ...and combines this information somehow to select one of the predefined models. (Predefined could include custom built models, though. So maybe this must be more than a single new value in the parameter file.)

One idea to get this configured would be a (partial) mapping from ISO 15924 (script) / ISO 639 (language) / OCR-D (font) identifiers to Tesseract models (in the usual notation), passed as a parameter. A mapping for the empty string could become the manual fall-back. Matches in multiple categories (language/script/font), or across multiple levels (page/region/line), could be combined via + in the result.

(Mappings of type: object are allowed syntactically in OCR-D's parameter JSON now.)

So for example, I could first run ocrd-typegroups-classifier (for font detection) and/or ocrd-tesserocr-deskew (for script detection), and then call ocrd-tesserocr-recognize -P model_map '{ "German": "deu+Latin", "deu": "deu+Latin", "Latin": "lat+Latin", "Latn": "Latin", "Latf": "GT4HistOCR+ONB+Fraktur+frk", "Greek": "grc+ell+Greek", "Grek": "Greek", "Hebr": "Hebrew", "": "eng" }'.
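A rough sketch of how such a lookup could behave, using the mapping from the example above. The function name `select_model` is hypothetical, and dropping duplicate model names while keeping the order of first occurrence is an assumption of this sketch, not something settled in this issue.

```python
from typing import Dict, Iterable

# The mapping from the example call above.
MODEL_MAP = {
    "German": "deu+Latin",
    "deu": "deu+Latin",
    "Latin": "lat+Latin",
    "Latn": "Latin",
    "Latf": "GT4HistOCR+ONB+Fraktur+frk",
    "Greek": "grc+ell+Greek",
    "Grek": "Greek",
    "Hebr": "Hebrew",
    "": "eng",
}


def select_model(model_map: Dict[str, str], annotations: Iterable[str]) -> str:
    """Combine the models mapped from all matching annotations with '+'.

    Falls back to the entry for the empty string if nothing matches.
    """
    models = []
    for key in annotations:
        mapped = model_map.get(key)
        if not mapped:
            continue  # partial mapping: unknown annotations are ignored
        for name in mapped.split("+"):
            if name not in models:  # assumption: drop duplicates, keep order
                models.append(name)
    if not models:
        return model_map.get("", "")
    return "+".join(models)


# Language from METS plus script from the region:
combined = select_model(MODEL_MAP, ["deu", "Latf"])
# No usable annotation at all:
fallback = select_model(MODEL_MAP, [])
```

The order of the annotations passed in determines the order of the models in the result, so a real implementation would have to decide which category or level takes precedence.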

Probably we also should introduce some model_conf threshold here.

Unfortunately, due to Tesseract's API, the implementation would need to re-initialize Tesseract each time a segment has a different script/language/font annotation than the previous one. But one could control this performance/quality trade-off by running detection on regions or pages only.
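The cost could be kept down by re-initializing only when the selected model actually changes between consecutive segments. A minimal sketch, where `ModelSwitcher` and the `initialize` callback are hypothetical placeholders for whatever actually re-initializes Tesseract:

```python
from typing import Callable, Optional


class ModelSwitcher:
    """Call `initialize` only when the requested model differs from the current one."""

    def __init__(self, initialize: Callable[[str], None]) -> None:
        self._initialize = initialize  # hypothetical re-initialization callback
        self._current: Optional[str] = None
        self.init_count = 0

    def ensure(self, model: str) -> None:
        if model != self._current:
            self._initialize(model)
            self._current = model
            self.init_count += 1


loaded = []
switcher = ModelSwitcher(loaded.append)

# Models selected for five consecutive lines:
for model in ["deu", "deu", "Greek", "Greek", "deu"]:
    switcher.ensure(model)
```

Here five lines cause three initializations. With annotations on regions or pages only, consecutive lines would share a model more often and the count would drop further.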

(A problem that first needs to be addressed though is the formalization of script and language identifications in PAGE.)

@bertsky
Collaborator Author

bertsky commented Jun 30, 2021

Fixed by #175 (completely)

@bertsky bertsky closed this as completed Jun 30, 2021