Rotated text lines in hOCR output #148

stweil · 2019-03-08T12:38:34Z

This image contains a full page of vertical text lines. The hOCR output which was created by Tesseract 4.0 has no direct indicator which text lines are horizontal or vertical.

It might be interesting to have a filter program which detects the line orientation from the hOCR data by interpreting the coordinates of the bounding boxes.
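A minimal sketch of such a filter, under stated assumptions: lines are marked with the class `ocr_line`, each carries a `bbox x0 y0 x1 y1` entry in its `title` attribute, and a line whose box is taller than it is wide is treated as vertical. The names `LineBoxParser`, `guess_orientation` and `line_orientations` are hypothetical and not part of hocr-tools. The heuristic cannot tell a line rotated by 90 degrees from top-to-bottom text, and it may misjudge very short lines.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" entry inside an hOCR title attribute.
BBOX_RE = re.compile(r"\bbbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class LineBoxParser(HTMLParser):
    """Collect the bounding box of every element with class ocr_line."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if "ocr_line" in classes and match:
            self.boxes.append(tuple(int(v) for v in match.groups()))


def guess_orientation(bbox):
    """Assumed heuristic: a box taller than it is wide is a vertical line."""
    x0, y0, x1, y1 = bbox
    return "vertical" if (y1 - y0) > (x1 - x0) else "horizontal"


def line_orientations(hocr_text):
    """Return one orientation guess per ocr_line, in document order."""
    parser = LineBoxParser()
    parser.feed(hocr_text)
    return [guess_orientation(box) for box in parser.boxes]
```

Calling `line_orientations(open("page.hocr", encoding="utf-8").read())` would give one guess per line; `page.hocr` is a placeholder file name.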

A similar algorithm would be needed for rendering of the OCR results, for example in PDF output created by hocr-pdf or by Tesseract or in hocrjs.

See also issue #54.

zuphilip · 2019-03-08T19:07:56Z

> This image contains a full page of vertical text lines.

Let us be more precise here: The lines are rotated by 90 degrees clockwise.

> The hOCR output which was created by Tesseract 4.0 has no direct indicator which text lines are horizontal or vertical.

Well, but that should be improved first. I think that this rotation should be indicated by the textangle property, see http://kba.cloud/hocr-spec/1.2/#textangle, but @kba might know better than I do.

In the Japanese text the lines are not rotated, but the text direction is from top to bottom.

stweil · 2019-03-08T19:16:07Z

That spec says "angle in degrees by which textual content has been rotate[d] relative to the rest of the page". I think this is neither precise nor helpful, because both pages in question would have the default value (0°), as each line has the same rotation as "the rest of the page".

zuphilip · 2019-03-08T19:30:05Z

Tesseract 3.05 used to add the textangle property, see e.g. https://raw.githubusercontent.com/zuphilip/ocr-fileformat-samples/3590006039022801e3847f67feb085b3872585be/samples/hocr/1.1/452114306.hocr . What happened to that?

I agree that the specs are not that clear about the details, see also kba/hocr-spec#101.

stweil · 2019-03-08T19:39:26Z

That's an important hint. You are right, the old hOCR for the same image includes the textangle property. I'll open an issue for Tesseract.

stweil · 2023-06-28T12:53:34Z

hocr-extract-images currently ignores the textangle property, so line images with rotated text don't get rotated into a horizontal line (which is required for training).
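As a rough sketch of the first step a fix would need, the snippet below reads the textangle value from an hOCR `title` attribute and falls back to 0 when the property is absent. The function name `textangle_from_title` is hypothetical and this is not the code of hocr-extract-images. Rotating the cropped line image by that angle is left out, and the rotation direction should be checked against the hOCR spec.

```python
import re

# Matches "textangle <number>" inside an hOCR title attribute.
TEXTANGLE_RE = re.compile(r"\btextangle\s+(-?\d+(?:\.\d+)?)")


def textangle_from_title(title):
    """Return the textangle in degrees, or 0.0 if the property is absent."""
    match = TEXTANGLE_RE.search(title)
    return float(match.group(1)) if match else 0.0
```

A caller could skip the rotation step whenever the returned value is 0.0 and crop the line as before.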

stweil changed the title ~~Non horizontal text lines in hOCR output~~ Rotated text lines in hOCR output Mar 8, 2019

stweil mentioned this issue Mar 8, 2019

Missing information on text orientation (hOCR property textangle) with LSTM tesseract-ocr/tesseract#2303

Closed
