Hi!
I am uploading a large number of old books in Traditional Chinese to the Internet Archive, and I am trying to find a set of CLI options so that these books are OCR-ed properly and become searchable. Some of them use vertical text and others use horizontal text. Rarely, a book contains both vertical and horizontal text on a single page.
According to https://github.com/tesseract-ocr/tessdata_fast/#example---jpn-and--japanese and #22 (comment), `chi_tra` loads `chi_tra_vert` "as a secondary language so it can try it in case the text is rendered vertically". So I suppose `chi_tra` should recognize both vertical and horizontal Chinese text, as documented.

But in my tests with different images, `chi_tra` never seems to recognize vertical Chinese text.

For example:
A scanned book page on Wikimedia Commons (medium quality):
- With `tesseract 5.0.0` + tessdata-best + `-l chi_tra_vert`, the result is 60%+ accuracy
- With `tesseract 5.0.0` + tessdata-best + `-l chi_tra`, it is 0% accuracy, pure garbage text
A digital book screenshot from the Internet (high quality):
- With `tesseract 5.0.0` + tessdata-best + `-l chi_tra_vert`: 80%+ accuracy
- With `tesseract 5.0.0` + tessdata-best + `-l chi_tra`: 0% accuracy
The issue is not limited to specific images. It can be reproduced on many (or all?) other book pages as well.
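For anyone who wants to reproduce the comparison, here is a minimal sketch of the two invocations described above. The image name `page.png` and the helper names `build_cmd` and `run_ocr` are placeholders of mine, not part of tesseract. The sketch only uses the `-l` and `--tessdata-dir` options and the `stdout` output base. It assumes the tessdata-best models are installed where tesseract can find them, or that you pass their directory explicitly.

```python
import os
import shlex
import shutil
import subprocess


def build_cmd(image, lang, tessdata_dir=None):
    """Build a tesseract command that prints the recognized text to stdout."""
    cmd = ["tesseract", image, "stdout", "-l", lang]
    if tessdata_dir:
        # Optional: point tesseract at a specific model directory,
        # e.g. a local checkout of tessdata_best (path is an assumption).
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def run_ocr(image, lang, tessdata_dir=None):
    """Run tesseract and return the recognized text."""
    result = subprocess.run(
        build_cmd(image, lang, tessdata_dir),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    image = "page.png"  # hypothetical test image with vertical text
    for lang in ("chi_tra", "chi_tra_vert"):
        print(shlex.join(build_cmd(image, lang)))
        # Only run OCR when tesseract and the image are actually available.
        if shutil.which("tesseract") and os.path.exists(image):
            print(run_ocr(image, lang))
```

Running both languages on the same page and comparing the two outputs side by side is enough to see the difference reported here.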