recognize: use primaryScript or TextStyle to load model #69

bertsky · 2019-08-05T13:26:05Z

In the current state, the OCR model has to be selected in the fixed parameter JSON for the whole pipeline (all pages, all regions, all lines). We should at least offer a setting like `dynamic` that instead looks into ...

- `mods:language` of the workspace's METS file
- `@primaryScript` and `@secondaryScript` of the elements to be processed (or their parents), depending on `textequiv_level`
- `TextStyle/@fontFamily` of the elements to be processed (or their parents), depending on `textequiv_level` (as described by the spec)

...and combines this information somehow to select one of the predefined models. (Predefined could include custom built models, though. So maybe this must be more than a single new value in the parameter file.)
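The lookup "of the elements to be processed (or their parents)" could be sketched roughly as follows. This is only an illustration using the standard library's `xml.etree.ElementTree` on a tiny, not schema-valid PAGE fragment (no `Coords`); the function name `inherited_hints` and the sample document are made up for this example and are not part of ocrd_tesserocr.

```python
import xml.etree.ElementTree as ET

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hypothetical, heavily abbreviated PAGE fragment (assumption, for illustration only)
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="p.png" imageWidth="100" imageHeight="100" primaryScript="Latn">
    <TextRegion id="r1" primaryScript="Latf">
      <TextLine id="l1"/>
      <TextLine id="l2" primaryScript="Grek"/>
      <TextStyle fontFamily="Fraktur"/>
    </TextRegion>
  </Page>
</PcGts>"""

ATTRS = ("primaryScript", "secondaryScript", "primaryLanguage")


def inherited_hints(root, element_id):
    """Collect script/language/font hints for a segment, falling back to its ancestors."""
    # ElementTree has no parent pointers, so build a child -> parent map first
    parents = {child: parent for parent in root.iter() for child in parent}
    node = next(e for e in root.iter() if e.get("id") == element_id)
    hints = {}
    while node is not None:
        # the closest annotation wins: only fill in what is still missing
        for attr in ATTRS:
            if attr not in hints and node.get(attr):
                hints[attr] = node.get(attr)
        style = node.find("pc:TextStyle", NS)  # direct child only
        if "fontFamily" not in hints and style is not None and style.get("fontFamily"):
            hints["fontFamily"] = style.get("fontFamily")
        node = parents.get(node)
    return hints


root = ET.fromstring(SAMPLE)
print(inherited_hints(root, "l1"))
print(inherited_hints(root, "l2"))
```

Here the line `l1` has no annotation of its own and inherits both script and font from its region, whereas `l2` keeps its own `primaryScript` and only inherits the font.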


bertsky · 2021-01-22T15:38:51Z

> ...and combines this information somehow to select one of the predefined models. (Predefined could include custom built models, though. So maybe this must be more than a single new value in the parameter file.)

One idea to get this configured would be a (partial) mapping between ISO 15924 (script) / ISO 639 (language) / OCR-D (font) to Tesseract models (in the usual notation) as parameter. A mapping for the empty string could become the manual fall-back. Matches in multiple categories (language/script/font), or across multiple levels (page/region/line), could be mixed via + in the result.

(Mappings of type: object are allowed syntactically in OCR-D's parameter JSON now.)

So for example, I could first run `ocrd-typegroups-classifier` (for font detection) and/or `ocrd-tesserocr-deskew` (for script detection), and then call `ocrd-tesserocr-recognize -P model_map '{ "German": "deu+Latin", "deu": "deu+Latin", "Latin": "lat+Latin", "Latn": "Latin", "Latf": "GT4HistOCR+ONB+Fraktur+frk", "Greek": "grc+ell+Greek", "Grek": "Greek", "Hebr": "Hebrew", "": "eng" }'`.
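To make the proposed semantics concrete, here is a minimal sketch of how such a mapping might be resolved. The function `select_models` is hypothetical (it is not the ocrd_tesserocr API); it assumes that annotations from all categories and levels arrive as a flat list, that matches are joined via `+` without duplicates, and that the empty-string key serves as the fall-back.

```python
MODEL_MAP = {
    "German": "deu+Latin", "deu": "deu+Latin",
    "Latin": "lat+Latin", "Latn": "Latin",
    "Latf": "GT4HistOCR+ONB+Fraktur+frk",
    "Greek": "grc+ell+Greek", "Grek": "Greek",
    "Hebr": "Hebrew",
    "": "eng",
}


def select_models(model_map, annotations):
    """Combine the models mapped for all annotations into one '+'-joined string."""
    parts = []
    for key in annotations:
        if not key:
            continue  # empty annotations must not trigger the fall-back entry early
        mapped = model_map.get(key)
        if not mapped:
            continue  # the mapping is partial: unknown annotations are ignored
        for name in mapped.split("+"):
            if name not in parts:  # keep first occurrence, drop duplicates
                parts.append(name)
    if not parts:
        # nothing matched: use the manual fall-back under the empty string
        return model_map.get("", "")
    return "+".join(parts)


print(select_models(MODEL_MAP, ["deu", "Latf"]))  # language + script match
print(select_models(MODEL_MAP, ["Arab"]))         # no match, falls back
```

With the map above, a segment annotated with language `deu` and script `Latf` would get `deu+Latin+GT4HistOCR+ONB+Fraktur+frk`, and a segment without any mapped annotation would get `eng`.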

Probably we also should introduce some model_conf threshold here.

Unfortunately, due to Tesseract's API, the implementation would need to re-initialize Tesseract each time a segment has a different script/language/font annotation than the previous. But one could control this performance/quality trade-off by running detection on regions or pages only.
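The cost of that could be limited by re-initializing only when the selected model string actually changes between consecutive segments. The following sketch shows the idea with a hypothetical wrapper class `ModelSwitcher`; the `init_engine` callable is a stand-in for whatever (re-)initializes Tesseract and is an assumption of this example, not an existing API.

```python
class ModelSwitcher:
    """Re-initialize the engine only when the requested model differs from the current one."""

    def __init__(self, init_engine):
        self._init_engine = init_engine  # hypothetical callable taking a model string
        self._current = None
        self.reinit_count = 0

    def ensure(self, model):
        if model != self._current:
            self._init_engine(model)
            self._current = model
            self.reinit_count += 1
        return self._current


loaded = []
switcher = ModelSwitcher(loaded.append)  # record initializations instead of calling Tesseract

# models selected for five consecutive lines
for model in ["deu+Latin", "deu+Latin", "Fraktur+frk", "Fraktur+frk", "deu+Latin"]:
    switcher.ensure(model)

print(switcher.reinit_count, loaded)
```

For the five lines in this example, the engine would be initialized three times instead of five. If annotations exist only on regions or pages, the model changes correspondingly less often.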

(A problem that first needs to be addressed though is the formalization of script and language identifications in PAGE.)

bertsky · 2021-06-30T19:12:53Z

Fixed by #175 (completely)

bertsky added the enhancement New feature or request label Aug 5, 2019

bertsky mentioned this issue Jan 22, 2021

Make it clearer which Tesseract engine is being used #168

Closed

bertsky mentioned this issue Mar 16, 2021

OSD on line level, recognition by loading script or lang from PAGE #175

Merged

bertsky closed this as completed Jun 30, 2021
