add chi_tra_vert and chi_sim_vert to supported languages #1

Open
wants to merge 1 commit into
base: master

@Gowee Gowee commented Mar 1, 2023

Hi!

I am looking for a way to have the OCR recognize vertically written Chinese before I upload a large number of old books in Traditional Chinese to the IA, as I mentioned in my email.

This PR simply adds chi_tra_vert/HanT_vert and chi_sim_vert/HanS_vert for this purpose. It does not change the existing language code mappings for Chinese, Chinese (Simplified), etc., so there won't be any breaking changes.

I plan to set language = Chinese (Traditional) and ocr_default_parameters = ocr_additional_languages:chi_tra_vert on my books so that -l chi_tra+chi_tra_vert is passed to Tesseract. So far this appears to be the only feasible way to recognize Traditional Chinese in both vertical and horizontal writing, even though it is less accurate than using -l chi_tra or -l chi_tra_vert alone.
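To illustrate what I mean by combining the two models, here is a minimal sketch of how a primary language code and additional codes could be joined into a single Tesseract `-l` argument. The helper names `build_lang_arg` and `tesseract_command` are hypothetical and are not taken from this repository; only the `+` separator and the `tesseract <image> <outputbase> -l <langs>` invocation are standard Tesseract behaviour.

```python
from typing import Iterable, List


def build_lang_arg(primary: str, additional: Iterable[str] = ()) -> str:
    """Join language codes with '+', keeping order and dropping duplicates.

    Hypothetical helper for illustration; not part of this repository.
    """
    codes: List[str] = []
    for code in (primary, *additional):
        if code and code not in codes:
            codes.append(code)
    return "+".join(codes)


def tesseract_command(image: str, output_base: str,
                      primary: str, additional: Iterable[str] = ()) -> List[str]:
    """Build an argv list for the tesseract CLI (assumes it is on PATH)."""
    return ["tesseract", image, output_base,
            "-l", build_lang_arg(primary, additional)]


if __name__ == "__main__":
    # Horizontal model first, vertical model as the additional language.
    print(" ".join(tesseract_command("page.png", "page",
                                     "chi_tra", ["chi_tra_vert"])))
```

The primary code is placed first on purpose: Tesseract treats the first language in the list as the main one, so the horizontal model stays the default and the vertical model is loaded alongside it.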

This might not be the right way to solve the problem, but I think it is an acceptable workaround. I have also opened an upstream issue: tesseract-ocr/tessdata_best#72.

@MerlijnWajer
Collaborator

MerlijnWajer commented Mar 1, 2023

Thank you for the request. I think this looks good as is. I would just like to clarify that ocr_default_parameters is not read from archive.org item metadata; it is read either from a task argument or from a collection that the items belong to.

If you let me know whether you have a collection for your books, I can help you (create and) upload to a collection that has the right metadata set.

Alternatively, in addition to this pull request, you could add additional languages to your items, which will then automatically trigger the use of the right language packs.

@MerlijnWajer
Collaborator

Sorry that it's taking a while to get back to you on this -- I'll get it done this week!

@Gowee
Author

Gowee commented Mar 16, 2023

No problem at all. Actually, I have been busy with other things lately, so I have not had time to finish my upload scripts.

@Gowee
Author

Gowee commented Apr 25, 2023

I just wanted to follow up on this. Is there any progress? :-)

@MerlijnWajer
Collaborator

Sorry, I was mostly gone last week. Really trying to fit this in ASAP. :)
