Rotated text lines in hOCR output #148
Comments
Let us be more precise here: the lines are rotated by 90 degrees clockwise.
Well, but that should be improved first. I think that this rotation should be indicated by In the Japanese text the lines are not rotated, but the text direction is top-to-bottom.
That spec says "angle in degrees by which textual content has been rotate[d] relative to the rest of the page". I think this is neither precise nor helpful: on both pages in question every line has the same rotation as "the rest of the page", so both pages would get the default value (0°).
Tesseract 3.05 used to add I agree that the specs are not that clear about the details, see also kba/hocr-spec#101.
That's an important hint. You are right, the old hOCR for the same image includes the
This image contains a full page of vertical text lines. The hOCR output created by Tesseract 4.0 has no direct indicator of which text lines are horizontal and which are vertical.
It might be interesting to have a filter program which detects the line orientation from the hOCR data by interpreting the coordinates of the bounding boxes.
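A minimal sketch of such a filter, under stated assumptions: it looks only at elements with class `ocr_line`, reads the `bbox x0 y0 x1 y1` property from their `title` attribute, and calls a line vertical when its box is taller than it is wide. The names `LineBoxCollector`, `guess_orientation` and `line_orientations` are made up for this example. The heuristic is crude: very short lines can be misclassified, and it cannot tell lines rotated by 90 degrees from unrotated top-to-bottom text such as the Japanese example.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" property inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class LineBoxCollector(HTMLParser):
    """Collect the bounding boxes of all elements with class ocr_line."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if "ocr_line" not in classes:
            return
        match = BBOX_RE.search(attributes.get("title") or "")
        if match:
            self.boxes.append(tuple(int(value) for value in match.groups()))


def guess_orientation(bbox):
    """Heuristic (an assumption): taller than wide means a vertical line."""
    x0, y0, x1, y1 = bbox
    width, height = x1 - x0, y1 - y0
    return "vertical" if height > width else "horizontal"


def line_orientations(hocr_text):
    """Return a list of (bbox, orientation) pairs, one per ocr_line element."""
    collector = LineBoxCollector()
    collector.feed(hocr_text)
    return [(box, guess_orientation(box)) for box in collector.boxes]
```

Calling `line_orientations()` on the contents of an hOCR file gives one pair per line, which a filter could then use to annotate or sort the lines.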
A similar algorithm would be needed for rendering of the OCR results, for example in PDF output created by hocr-pdf or by Tesseract or in hocrjs.
See also issue #54.