unclear/ambiguous language code notation #68

michaelkubina · 2022-01-31T11:10:26Z

While working on a mapping of bibliographic language codes (ISO 639-2/B due to the RDA application guidelines of the German National Library https://wiki.dnb.de/download/attachments/127172808/Kapitel_6.pdf?version=2&modificationDate=1505213938000&api=v2) to the corresponding (presumably ISO 639-2/T coded?) language models, I came across three language codes for which I could not find a match:

tesseract_best/frk.traineddata
tesseract_best/kmr.traineddata
tesseract_best/osd.traineddata

I therefore suspected that the encoding of the language models was done according to ISO 639-3 (https://iso639-3.sil.org/code_tables/639/data) and found matches for frk(=Frankish) and kmr(=Northern Kurdish), but still none for osd. Finally, I consulted the documentation (https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html) again and could see that kmr(=Northern Kurdish) is a correct suffix, while frk actually means "Frakturschrift" (is that correct?) and osd stands for the "Orientation & Script Detection Module".

Since these two are not languages in the actual sense, it would be appropriate in my opinion to designate these two training models in such a way that they can be clearly distinguished from the language codes consisting of three letters. Furthermore it would be surely good to put these models into another directory and not parallel to the languages, so that no further misunderstandings arise here. This already happens with the font types and should therefore be done here in the same way.

I am aware that there are some languages with multiple models that need to be in parallel because:

_vert (vertical text flow)
_latn / _cyrl (other type system)
_old (old)
_sim (simplified writing system)
_tra (traditional notation)

In such cases it is also clear that here a notation reduced to three letters is not possible and all those models are necessary as they are. Nevertheless, at least the stem of the language code should be clearly assignable. A clear indication whether for this purpose ISO 639-2/T or ISO 639-3 is coded would be helpful.

Since kmr(=Northern Kurdish) does not emerge from ISO 639-2, but only kur(=Kurdish) is used there, one would conclude, as I do, that ISO 639-3 is authoritative. In this case, however, one could possibly also erroneously expect the typification according to macrolanguage construct used in ISO 639-3. ara(=Arabic) could then be understood as "Includes Standard Arabic and Egyptian Arabic), or nor(=Norwegian) "Includes Nynorsk and Bokmal). However, there are no language models available for this, as far as I can see.

For macro languages, however, a complete examination of the current language codes would actually have to take place and would also require the creation of "language sets", which do not currently exist in the form...I think this could be a stimulating discussion point and possibly an interesting feature in the future...but goes too far at this point. Also ISO 639-3 with currently 7910 language-codes is far bigger, than ISO 639-2 with about 460 language codes.

Thank you,
Michael Kubina

michaelkubina · 2022-02-01T12:52:40Z

Sorry to bother again...i have now looked into this document (https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc#LANGUAGES) as well but here it states, that frk does actually stand for Frankish, which would also be an ISO 639-3 code just like kmr(kurmandschi). This would leave the osd as the non-language, and the rest as proper ISO 639-2/T languagecodes.

This on the other hand would mean, that the documenation at https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html would describe frk wrongfully as "German Fraktur", whilst Fraktur in general has been moved to the scripts subfolder.

So its not clear, if frk stands for frankish or german fraktur...

stweil · 2022-02-09T20:04:39Z

Related issue: tesseract-ocr/tessdata_fast#28

stweil · 2022-02-09T20:11:24Z

frk was for a long time described as being "Frankish", but the included dictionary, the character set and the fonts used for training indicate that it is de facto a model for German Fraktur.

script/Fraktur is similar, but covers all Western European languages (so also includes characters with French accents for example).

stweil · 2022-02-09T20:22:13Z

The current developers (including myself) are not related to Google where the current OCR models where trained, and details of the training process and the reasons for the existing names are therefore unknown.

Meanwhile there also exist models which were trained outside of Google, for example better models for Fraktur and other historic scripts which I trained at UB Mannheim.

Do you have suggestions how the current situation could be improved? Personally I'd like to have a JSON file which describes all OCR models, their names, their meaning (maybe with translations), comments, download link and other metadata).

Blightbuster · 2022-02-10T00:29:36Z

Such a json file would be a nice addition. Perhaps it could even include fields for different ISO notations like ISO 639-2/B and ISO 639-3.

M3ssman · 2022-02-10T19:53:07Z

Another guess regarding the data included in the standard frk model: since the notably large word list (+450.000 entries, which is really hard to gain with real historical print data) contains very unlikely terms (DSL-HARDWARE, AZ-Web.de or IT-Fortbildung, to mention a few) plus the list of used fonts assume that is has been trained like the standard en model from synthetic image-line pairs and text tokens from the web. (Default punctuation and number files fit in with their poor quality.)

stweil · 2022-02-10T20:07:06Z

That's right, all standard models where trained with synthetic line images. And as you noted, the text look like a collection of various texts from the web, maybe using language indicators from the HTML code which also results in dictionaries which are not necessarily representative for the time and language. And sometimes even important characters were missing in that texts, so they never were trained.

michaelkubina · 2022-02-11T08:33:58Z

Meanwhile there also exist models which were trained outside of Google, for example better models for Fraktur and other historic scripts which I trained at UB Mannheim.

I already use GT4HistOCR and i really appreciate your hard work. it works very well...even outside of the whole OCR-D space.

Do you have suggestions how the current situation could be improved? Personally I'd like to have a JSON file which describes all OCR models, their names, their meaning (maybe with translations), comments, download link and other metadata).

I agree with you, that we need some kind of description for the different models and a JSON would suit this very well, since it can be easily read by humans and machines. You have already mentioned a important set of information and with your expertise you will most certainly come up with a plethora of other important metadata. I think apart from the name, the "trainer" and his/hers contact, download links or links to the training data that you have mentioned, @M3ssman stated the important fact of wether its from real image data or from a synthetic image set - so a clear description on how the model was actually trained and in which depth needs to find its way in there as well.

...and i agree with @Blightbuster that there need to be fields for the models target languag(es). I would prefer not only to state which ISO 639-2B/T and ISO 639-3 it refers to, but also a fallback to ISO 639-1. It would take away some struggle of doing a mapping for projects, where the database holds the language code in ISO 639-1.

IMO we would additionally need an infomation about which writing system it was trained on/for. Here we could use the ISO 15924 standard (https://en.wikipedia.org/wiki/ISO_15924), which is partially used in naming the models for the scripts as well. IETF Language Tags (https://en.wikipedia.org/wiki/IETF_language_tag) are an interesting concept, but i am not quite happy, that the languagecode is preferably in ISO 639-1.

But this could inspiring for a more descriptive filename-convention for the models...especially when it comes to languages, that use different writing systems, e.g. Azerbaijan with Latin script since the independence from the soviet union (prior cyrillic) and persian script in its southern region ("always"). Similiar for German with Fraktur or Latin script, Serbian (Cyrillic & Latin) etc. So a mixture of both the ISO 639-2/T and the ISO 15924 could work here in our favor naming at least the standard models and giving some insights on whats currently hidden or just implied:

aze_Latn, aze_Cyrl, aze_Arab
deu_Latf, deu_Latn or deu_Latf_Latn
srp_Cyrl, srp_Latn or srp_Cyrl_Latn

So, i believe that a clear naming convention for the models and a descriptive JSON for the metadata would help a lot. And it could be used to automatically aggregate such informations for the documentation pages.

As for my initial question, i will keep in mind, that frk does actually stand for Fraktur.

Although frk is Frankish in ISO 693-3, the data is actually for German Fractur. See: tesseract-ocr/tessdata_best#68 tesseract-ocr/tessdata#49 tesseract-ocr/langdata#61

unhammer · 2022-06-02T11:05:49Z

the rest as proper ISO 639-2/T languagecodes

Just to clarify, does that mean that nor is trained on both nno and nob text?

tfmorris · 2023-10-15T22:45:16Z

IETF Language Tags (https://en.wikipedia.org/wiki/IETF_language_tag) are an interesting concept, but i am not quite happy, that the languagecode is preferably in ISO 639-1.

Although the IETF BCP 47 rule of using the shortest available code, making for variable length codes, is a little awkward, BCP 47 codes are common on the web and are also used by things like the fastText language identification module (even though they're mistakenly called ISO codes on the model page), so I think they're important to include.

Wikidata is a good way to be able to easily crosswalk the different codes, as well as access other useful information, like autonym, translated versions of the name, etc., and might be a good link to include in the metadata to allow easy access to additional information.

danpla added a commit to danpla/dpscreenocr that referenced this issue May 15, 2022

languages: Fix frk description

e6025f7

Although frk is Frankish in ISO 693-3, the data is actually for German Fractur. See: tesseract-ocr/tessdata_best#68 tesseract-ocr/tessdata#49 tesseract-ocr/langdata#61

stweil added the enhancement label May 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

unclear/ambiguous language code notation #68

unclear/ambiguous language code notation #68

michaelkubina commented Jan 31, 2022

michaelkubina commented Feb 1, 2022

stweil commented Feb 9, 2022

stweil commented Feb 9, 2022 •

edited

Loading

stweil commented Feb 9, 2022

Blightbuster commented Feb 10, 2022

M3ssman commented Feb 10, 2022

stweil commented Feb 10, 2022

michaelkubina commented Feb 11, 2022 •

edited

Loading

unhammer commented Jun 2, 2022

tfmorris commented Oct 15, 2023

unclear/ambiguous language code notation #68

unclear/ambiguous language code notation #68

Comments

michaelkubina commented Jan 31, 2022

michaelkubina commented Feb 1, 2022

stweil commented Feb 9, 2022

stweil commented Feb 9, 2022 • edited Loading

stweil commented Feb 9, 2022

Blightbuster commented Feb 10, 2022

M3ssman commented Feb 10, 2022

stweil commented Feb 10, 2022

michaelkubina commented Feb 11, 2022 • edited Loading

unhammer commented Jun 2, 2022

tfmorris commented Oct 15, 2023

stweil commented Feb 9, 2022 •

edited

Loading

michaelkubina commented Feb 11, 2022 •

edited

Loading