Direction, orientation, and reading order (text direction elements) #74

Merged
merged 14 commits into from
Feb 11, 2022

Conversation

mittagessen
Contributor

@mittagessen mittagessen commented Jul 29, 2021

This pull request bundles multiple backward-compatible changes to the schema that resolve issues related to line orientation, direction, and reading order. While I would usually split them into separate PRs, they've been discussed jointly in the past (see #12 (comment)) and the addressed deficiencies are somewhat complementary.

Principal inline text direction

The first part of the proposal takes up #12 and #73. It adds an attribute BASEDIRECTION on the *Block and TextLine elements which indicates the base text direction of the lines/text contained therein (ltr|rtl|ttb|btt). This is helpful not only for rendering many East Asian scripts that can be written both vertically and horizontally but also for correctly setting the base text direction of the BiDi algorithm during processing.

Example of the use of this new attribute:

...
<TextBlock ID="block_0" ... BASEDIRECTION='ltr'>
<TextLine ID="line_0" ....>....</TextLine>
<TextLine ID="line_1"....>...</TextLine>
<TextLine ID="line_2" BASEDIRECTION="rtl">...</TextLine>
</TextBlock>

Different settings on lower levels of the hierarchy override those inherited from higher levels.
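
As a rough illustration of this override rule, the following Python sketch resolves the effective direction of each line. It assumes a namespace-free fragment (real ALTO files declare a namespace) and the helper name effective_directions is hypothetical, not part of any ALTO tooling.

```python
import xml.etree.ElementTree as ET

# Namespace-free sample mirroring the example above (an assumption for brevity).
SAMPLE = """
<TextBlock ID="block_0" BASEDIRECTION="ltr">
  <TextLine ID="line_0"/>
  <TextLine ID="line_1"/>
  <TextLine ID="line_2" BASEDIRECTION="rtl"/>
</TextBlock>
"""


def effective_directions(block, inherited=None):
    """Map each TextLine ID to its effective base direction.

    A value set on the line wins; otherwise the block's value (or a value
    inherited from further up) applies.
    """
    block_dir = block.get("BASEDIRECTION", inherited)
    return {
        line.get("ID"): line.get("BASEDIRECTION", block_dir)
        for line in block.iter("TextLine")
    }


directions = effective_directions(ET.fromstring(SAMPLE))
print(directions)
```

Here line_2 keeps its own rtl setting while the other two lines fall back to the ltr value of the enclosing block.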

Reading Order

This part is a fairly faithful adaptation of the example in #18, which in turn derives from PageXML. Some changes are made to allow the encoding of more complex historical documents and the serialization of multiple reading orders. The two principal changes are:

  • Elements below the TextBlock, such as TextLine, String, and Glyph, can be referenced.
  • Elements can occur more than once in the reading order(s).

In addition, elements in the reading order can be tagged with TAGREFS to indicate the role of a particular element in a reading order (such as an addition, a correction, or a particle that is the continuation of text on another line).

A single reading order with roles:

...
<RoleTag ID="type_0" LABEL="correction"/>
</Tags>
<ReadingOrder>
   <OrderedGroup ID="main_0">
        <ElementRef ID="o_0" REF="block_0"/>
        <ElementRef ID="o_1" REF="block_10"/>
        <ElementRef ID="o_2" REF="string_25"/>
        <ElementRef ID="o_3" REF="line_10" TAGREFS="type_0"/>
        <ElementRef ID="o_4" REF="string_26"/>
        <ElementRef ID="o_5" REF="block_2"/>
   </OrderedGroup>
</ReadingOrder>
<Layout>
...
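
To make the referencing mechanism concrete, here is a minimal Python sketch of how a consumer might resolve such a reading order. It assumes a namespace-free document, the sample IDs and the function name resolve_reading_order are hypothetical, and it handles only a flat OrderedGroup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free document: a marginal line (line_10) is read
# between two strings of a line in the main block.
DOC = """
<alto>
  <Tags><RoleTag ID="type_0" LABEL="correction"/></Tags>
  <ReadingOrder>
    <OrderedGroup ID="main_0">
      <ElementRef ID="o_0" REF="block_0"/>
      <ElementRef ID="o_1" REF="string_25"/>
      <ElementRef ID="o_2" REF="line_10" TAGREFS="type_0"/>
      <ElementRef ID="o_3" REF="string_26"/>
    </OrderedGroup>
  </ReadingOrder>
  <Layout>
    <TextBlock ID="block_0"/>
    <TextBlock ID="block_1">
      <TextLine ID="line_3">
        <String ID="string_25" CONTENT="foo"/>
        <String ID="string_26" CONTENT="bar"/>
      </TextLine>
    </TextBlock>
    <TextBlock ID="margin_0"><TextLine ID="line_10"/></TextBlock>
  </Layout>
</alto>
"""


def resolve_reading_order(root, group_id):
    """Return (tag, ID, role labels) for each ElementRef of a flat group."""
    by_id = {el.get("ID"): el for el in root.iter() if el.get("ID")}
    labels = {tag.get("ID"): tag.get("LABEL") for tag in root.find("Tags")}
    resolved = []
    for ref in by_id[group_id].findall("ElementRef"):
        target = by_id[ref.get("REF")]
        # TAGREFS is a whitespace-separated list of tag IDs; it may be absent.
        roles = [labels[t] for t in ref.get("TAGREFS", "").split()]
        resolved.append((target.tag, target.get("ID"), roles))
    return resolved


sequence = resolve_reading_order(ET.fromstring(DOC), "main_0")
for entry in sequence:
    print(entry)
```

The point of the sketch is that the reading order mixes granularities: a block, two strings, and a line from a different block all appear in one sequence.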

Multiple reading orders can be encoded through the nesting of unordered and ordered groups:

...
<OtherTag ID="type_1" LABEL="A valid, complete reading order"/>
</Tags>
<ReadingOrder>
   <UnorderedGroup ID="valid_orders">
       <OrderedGroup ID="order_0" TAGREFS="type_1">
           <ElementRef .../>
           <UnorderedGroup ...>
           ....
           </UnorderedGroup>
           <ElementRef .../>
       </OrderedGroup>
       <OrderedGroup ID="order_1" TAGREFS="type_1">
       ...
       </OrderedGroup>
       ....
   </UnorderedGroup>
</ReadingOrder>
...

While this can result in ambiguity in the absence of a taxonomy of well-defined roles for elements in the reading order, I'd like to avoid specifying one until later, as it requires substantial input from users of the standard, especially those with more esoteric material. As the last example shows, groups can also be arbitrarily nested.
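
A small Python sketch of how nested groups might be traversed, assuming a namespace-free fragment with hypothetical IDs. Emitting the members of an UnorderedGroup in document order is an arbitrary choice made here, not something the proposal prescribes.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: two alternative orders, one with a nested group.
RO = """
<ReadingOrder>
  <UnorderedGroup ID="valid_orders">
    <OrderedGroup ID="order_0">
      <ElementRef ID="a_0" REF="block_0"/>
      <UnorderedGroup ID="notes">
        <ElementRef ID="a_1" REF="note_0"/>
        <ElementRef ID="a_2" REF="note_1"/>
      </UnorderedGroup>
      <ElementRef ID="a_3" REF="block_1"/>
    </OrderedGroup>
    <OrderedGroup ID="order_1">
      <ElementRef ID="b_0" REF="block_1"/>
      <ElementRef ID="b_1" REF="block_0"/>
    </OrderedGroup>
  </UnorderedGroup>
</ReadingOrder>
"""


def flatten(group):
    """Collect REF values depth-first, descending into nested groups."""
    refs = []
    for child in group:
        if child.tag == "ElementRef":
            refs.append(child.get("REF"))
        elif child.tag in ("OrderedGroup", "UnorderedGroup"):
            refs.extend(flatten(child))
    return refs


def alternative_orders(reading_order):
    """Treat each OrderedGroup in the top-level UnorderedGroup as one order."""
    top = reading_order.find("UnorderedGroup")
    return {g.get("ID"): flatten(g) for g in top.findall("OrderedGroup")}


orders = alternative_orders(ET.fromstring(RO))
print(orders)
```

Note that block_0 and block_1 each occur in both orders, which the proposal explicitly allows.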


EDIT: Removed explicit indices as attributes and the corresponding indexed types.

@cneud
Member

cneud commented Jul 29, 2021

Awesome contribution, many thanks! Would you perhaps be interested to present and discuss these at the next ALTO board meeting?

@mittagessen
Contributor Author

Sure. I assume it's sometime after the summer holidays? We can probably also find some examples of the more unusual pages we'd like to be able to encode.

@cneud
Member

cneud commented Jul 29, 2021

I assume it's sometime after the summer holidays?

I don't think we have fixed the date yet but it will probably be 1st or 2nd week of September, will let you know!

Example pages are certainly very welcome too.

@artunit
Member

artunit commented Aug 3, 2021

This looks fantastic. We have a general ALTO Board meeting this week but this seems worthy of a single-topic gathering. Maybe the 2nd week of September? We tend to gravitate towards Thursdays, so tentatively 2021-09-09 (9-10:30 am EST) but we can be flexible on this. Some examples of unusual pages would be great as well!

@bertsky
Contributor

bertsky commented Oct 4, 2021

Interesting. This introduces PAGE-XML concepts in a radical way (along with their semantic problems). It would be great to have that kind of flexibility in ALTO (multiple RO, labels, independence of semantic and element ordering) IMO. Just a few comments/questions:

  • Why reference elements below block level at all in RO? Since they also get an ordering attribute of their own here (becoming independent of element ordering), would that not be redundant? That is: what if @BASEDIRECTION and ReadingOrder (and element ordering) clash?
  • Since in ALTO all structural elements below the block level have merely optional @ID: Wouldn't it be preferable to start requiring them everywhere? (Or, conversely, why strictly require group elements to have an @ID of their own?)
  • I never understood why PAGE-XML decided to need @index for ordered groups, and cast the ordered/unordered distinction into a 2x3 matrix of un/ordered.../indexed types. Why not use the element ordering of the refs itself for ordered groups, and represent the difference between ordered and unordered by a simple boolean @ordered?
  • Another issue I have is the relationship of this new RO mechanism to the existing @IDNEXT mechanism, which are now redundant to some degree: What if they both are used, and what if they clash?
  • Regarding @BASEDIRECTION, am I right in assuming it would practically only make sense to have block and line level use orthogonal values, i.e. rtl/ltr in a ttb/btt or ttb/btt in a rtl/ltr? How is this enforced by the schema?
  • Also, since these are absolute notions, what is the relationship to @ROTATION (which could apply to each block differently)? Should we read this as applying before or after deskewing? What is your point of reference for absolute terms like top and bottom, left and right when you have non-orthogonal @ROTATION – does the interpretation of "left" snap from one side to the other as the angle crosses 45°?

@mittagessen
Contributor Author

I'll try to answer one by one.

  • Ordering below block level is somewhat crucial for many complex texts that have elements which cannot reasonably belong to the same topological text block 'inserted' into the reading order. There are marginal insertions, notes, apparatus criticus, etc. which are located outside of the current text block but are read between elements inside the block. @BASEDIRECTION and ReadingOrder are completely independent, in fact I believe the purpose of @BASEDIRECTION (and readingDirection in PageXML but the docs are rather mute on this point) is/should solely be to indicate the base direction parameter of the Unicode BiDi algorithm and potentially rotation for display purposes (rotating into the horizontal/vertical depending on ltr/rtl and ttb/btt). Somewhat related to this is Clarify implicit reading order #68; by enforcing the (implicit) order of elements below a TextLine to be the logical order (in the sense of the BiDi algorithm) we can still extract the text in correct order for simple visual display or computation while at the same time inserting non-line elements into the reading order for more advanced viewers. This document makes the different purposes of these ordering elements clearer.
  • We could make them mandatory but this would break forward compatibility of existing documents. While schema versioning in theory should prevent this in practice people write ad-hoc parsers, so I'm a bit wary of introducing changes like this.
  • That's a personal preference. The indices are admittedly only in there to not deviate too much from Page.
  • The easiest way would be to disallow one when the other is present. I'm not proficient enough with XSD to know how one would encode this.
  • It should be allowed in any combination which hopefully makes sense given the BiDi comment above. In any case, I'm loath to prohibit redundant encodings. They are often easier to serialize/deserialize than more compact encodings while not allowing these doesn't offer any benefits.
  • Top to bottom, bottom to top, left to right, and right to left are line-relative and abstract notions and not absolute with regard to page orientation. Rotation is mostly independent of that. Every single line of this manuscript page would be rtl (or ltr if the text transcription was produced in an environment with the BiDi algorithm base direction set to ltr) but the @ROTATION could be anything.
    [Image: manuscript page Add MS 23494_0035]
    @ROTATION only comes into play when deciding how to extract the line image for visual display (ttb/btt lines should be rectified/rotated to be vertical, ltr/rtl lines to be horizontal) but it isn't well enough specified to be useful for that as it isn't clear relative to which axis the rotational angle is. In any case, @BASELINE is much more powerful as it allows rectifying arbitrarily curved and rotated lines; at least for manuscripts where line angle and curvature tends to change inside a text block @ROTATION is woefully inadequate.
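
One way to hand the base direction to a BiDi-aware renderer, sketched in Python under assumptions: the text is stored in logical order, and the Unicode directional isolates (LRI U+2066, RLI U+2067, PDI U+2069) are used to fix the base direction of the run. The function name isolate is hypothetical, and leaving ttb/btt text unwrapped is a simplification made here, since vertical layout is a rendering concern outside the BiDi algorithm.

```python
# Unicode directional isolate characters (UAX #9).
LRI, RLI, PDI = "\u2066", "\u2067", "\u2069"


def isolate(text, base_direction):
    """Wrap logical-order text so a renderer uses the given base direction."""
    if base_direction == "ltr":
        return LRI + text + PDI
    if base_direction == "rtl":
        return RLI + text + PDI
    # ttb/btt (or no value): left unchanged in this sketch.
    return text


wrapped = isolate("abc \u05d0\u05d1\u05d2", "rtl")
print(repr(wrapped))
```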

@bertsky
Contributor

bertsky commented Oct 5, 2021

Thanks for elaborating, just a few follow-ups:

  • Why reference elements below block level at all in RO? Since they also get an ordering attribute of their own here (becoming independent of element ordering), would that not be redundant? That is: what if @BASEDIRECTION and ReadingOrder (and element ordering) clash?
  • Ordering below block level is somewhat crucial for many complex texts that have elements which cannot reasonably belong to the same topological text block 'inserted' into the reading order. There are marginal insertions, notes, apparatus criticus, etc. which are located outside of the current text block but are read between elements inside the block.

I see. Indeed, for that purpose PAGE-XML's "flat" ReadingOrder + @textLineOrder is not enough, you do need a general "onto" mapping. (On the other hand, nothing syntactically prevents you from using TextLine/@ID or Word/@ID for @regionRef in PAGE-XML already – they are mere xs:IDREF, only documentation currently says they are meant for regions alone.)

But couldn't we in theory always decompose blocks/regions recursively (into single-line regions if necessary) to achieve the same thing without sub-block refs? (Just wondering how to best read PAGE-XML's intended representation.)

@BASEDIRECTION and ReadingOrder are completely independent, in fact I believe the purpose of @BASEDIRECTION (and readingDirection in PageXML but the docs are rather mute on this point) is/should solely be to indicate the base direction parameter of the Unicode BiDi algorithm and potentially rotation for display purposes (rotating into the horizontal/vertical depending on ltr/rtl and ttb/btt). Somewhat related to this is Clarify implicit reading order #68; by enforcing the (implicit) order of elements below a TextLine to be the logical order (in the sense of the BiDi algorithm) we can still extract the text in correct order for simple visual display or computation while at the same time inserting non-line elements into the reading order for more advanced viewers. This document makes the different purposes of these ordering elements clearer.

Sorry, I misunderstood InlineDirType to denote something like @textLineOrder on the line block level. But your documentation already states the two levels are merely for inheritance. (Also, I had not given display/digital rendering much thought.)

  • Since in ALTO all structural elements below the block level have merely optional @ID: Wouldn't it be preferable to start requiring them everywhere?
  • We could make them mandatory but this would break forward compatibility of existing documents. While schema versioning in theory should prevent this in practice people write ad-hoc parsers, so I'm a bit wary of introducing changes like this.

Agreed.

(Or, conversely, why strictly require group elements to have an @ID of their own?)

  • I never understood why PAGE-XML decided to need @index for ordered groups, and cast the ordered/unordered distinction into a 2x3 matrix of un/ordered.../indexed types. Why not use the element ordering of the refs itself for ordered groups, and represent the difference between ordered and unordered by a simple boolean @ordered?
  • That's a personal preference. The indices are admittedly only in there to not deviate too much from Page.

The deviation would merely be syntactical though. (And the syntactic candy here does weigh heavy.) The actual semantic deviation is regarding sub-region refs (but see above).

BTW, PRImA's own implementation so far does not even respect the indices (but uses implicit ordering solely).

  • Another issue I have is the relationship of this new RO mechanism to the existing @IDNEXT mechanism, which are now redundant to some degree: What if they both are used, and what if they clash?
  • The easiest way would be to disallow one when the other is present. I'm not proficient enough with XSD to know how one would encode this.

It's not possible by schema AFAIK, but one could add documentation stating that any @IDNEXT is to be ignored if ReadingOrder is present…
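
A consumer following such a documented rule could look roughly like the Python sketch below. It assumes namespace-free input and block-level references only; the function name block_order is hypothetical, and the precedence of ReadingOrder over @IDNEXT is the suggestion made here rather than anything the schema enforces.

```python
import xml.etree.ElementTree as ET


def block_order(root):
    """Return block IDs in reading order, preferring ReadingOrder over IDNEXT."""
    reading_order = root.find("ReadingOrder")
    if reading_order is not None:
        # ReadingOrder present: any IDNEXT attributes are ignored.
        return [ref.get("REF") for ref in reading_order.iter("ElementRef")]

    blocks = {b.get("ID"): b for b in root.iter("TextBlock")}
    targets = {b.get("IDNEXT") for b in blocks.values() if b.get("IDNEXT")}
    order = []
    for start in blocks:
        if start in targets:
            continue  # not the head of a chain
        current = start
        # Follow the chain; stop at its end or when revisiting a block.
        while current in blocks and current not in order:
            order.append(current)
            current = blocks[current].get("IDNEXT")
    return order


WITH_RO = ET.fromstring(
    '<alto><ReadingOrder><OrderedGroup ID="g">'
    '<ElementRef ID="r0" REF="b2"/><ElementRef ID="r1" REF="b1"/>'
    '</OrderedGroup></ReadingOrder>'
    '<Layout><TextBlock ID="b1" IDNEXT="b2"/><TextBlock ID="b2"/></Layout></alto>'
)
WITHOUT_RO = ET.fromstring(
    '<alto><Layout><TextBlock ID="b2"/><TextBlock ID="b1" IDNEXT="b2"/></Layout></alto>'
)
print(block_order(WITH_RO), block_order(WITHOUT_RO))
```

Blocks that only take part in an IDNEXT cycle have no chain head and are omitted by this sketch.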

  • Regarding @BASEDIRECTION, am I right in assuming it would practically only make sense to have block and line level use orthogonal values, i.e. rtl/ltr in a ttb/btt or ttb/btt in a rtl/ltr? How is this enforced by the schema?
  • It should be allowed in any combination which hopefully makes sense given the BiDi comment above. In any case, I'm loath to prohibit redundant encodings. They are often easier to serialize/deserialize than more compact encodings while not allowing these doesn't offer any benefits.

Yes. (The question followed from my misunderstanding. I'm not worried about the cost of redundancy here. And functionally, in a DOM you can always fully expand the inheritance.)

  • Also, since these are absolute notions, what is the relationship to @ROTATION (which could apply to each block differently)? Should we read this as applying before or after deskewing? What is your point of reference for absolute terms like top and bottom, left and right when you have non-orthogonal @ROTATION – does the interpretation of "left" snap from one side to the other as the angle crosses 45°?
  • Top to bottom, bottom to top, left to right, and right to left are line-relative and abstract notions and not absolute with regard to page orientation. Rotation is mostly independent of that. Every single line of this manuscript page would be rtl (or ltr if the text transcription was produced in an environment with the BiDi algorithm base direction set to ltr) but the @ROTATION could be anything. @ROTATION only comes into play when deciding how to extract the line image for visual display (ttb/btt lines should be rectified/rotated to be vertical, ltr/rtl lines to be horizontal) but it isn't well enough specified to be useful for that as it isn't clear relative to which axis the rotational angle is. In any case, @BASELINE is much more powerful as it allows rectifying arbitrarily curved and rotated lines; at least for manuscripts where line angle and curvature tends to change inside a text block @ROTATION is woefully inadequate.

Thanks for clarifying! (So perhaps I also misunderstood these in PAGE-XML, where they might also be meant relative to the baseline?)

Regarding rotational axis, I do think this is specified clearly in ALTO-XML (see discussion here): axis runs through the center of the block (HPOS+0.5*WIDTH, VPOS+0.5*HEIGHT).
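
For illustration, the centre formula and a rotation about it can be written as the following Python sketch. The function names are hypothetical, and the sign convention of the angle (which visual direction counts as positive) is an assumption of the sketch, not a statement about ALTO.

```python
import math


def block_center(hpos, vpos, width, height):
    """Rotation axis as described: the centre of the block's bounding box."""
    return (hpos + 0.5 * width, vpos + 0.5 * height)


def rotate_about_center(point, center, degrees):
    """Rotate a point about the centre using the standard 2D rotation matrix.

    Which visual direction a positive angle produces depends on the
    orientation of the y axis; that convention is assumed, not specified.
    """
    theta = math.radians(degrees)
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (
        center[0] + dx * math.cos(theta) - dy * math.sin(theta),
        center[1] + dx * math.sin(theta) + dy * math.cos(theta),
    )


center = block_center(hpos=100, vpos=50, width=200, height=100)
corner = rotate_about_center((300, 100), center, 90)
print(center, corner)
```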

Indeed, @BASELINE is much more precise, but is itself not rich enough to automatically extract masked line images, for which in my understanding only the polygonal hull of the glyphs (i.e. TextLine/Shape/Polygon) would be adequate.

@mittagessen
Contributor Author

But couldn't we in theory always decompose blocks/regions recursively (into single-line regions if necessary) to achieve the same thing without sub-block refs? (Just wondering how to best read PAGE-XML's intended representation.)

Yes you could decompose it like this but you're losing some of the semantics of TextLine or lower level elements.

It's not possible by schema AFAIK, but one could add documentation stating that any @IDNEXT is to be ignored if ReadingOrder is present…

Yeah, I'm not sure how to do this well. AFAIK there's no good document introducing the standard, and the schema comments are often lacking. We should probably get around to writing down the semantics of most constructs a bit more explicitly.

Thanks for clarifying! (So perhaps I also misunderstood these in PAGE-XML, where they might also be meant relative to the baseline?)

Almost certainly. It doesn't really make sense otherwise.

Regarding rotational axis, I do think this is specified clearly in ALTO-XML (see discussion here): axis runs through the center of the block (HPOS+0.5*WIDTH, VPOS+0.5*HEIGHT).

I'm mostly talking about the 'target' rotation. Does a perfectly vertical ttb/btt line have a rotation of 90°/270° or 0°?

Indeed, @BASELINE is much more precise, but is itself not rich enough to automatically extract masked line images, for which in my understanding only the polygonal hull of the glyphs (i.e. TextLine/Shape/Polygon) would be adequate.

Of course. You actually need both to rotate a line correctly into the plane as the polygonal boundary can be deceiving when curvature and messy or differently sized letters come in combination.

@mittagessen
Contributor Author

@cneud @artunit Can we get this discussed at the next board meeting? I missed the one in September but can definitely prepare something for the next one.

@bertsky
Contributor

bertsky commented Oct 8, 2021

I'm mostly talking about the 'target' rotation. Does a perfectly vertical ttb/btt line have a rotation of 90°/270° or 0°?

I would argue for the latter, because 90/270/left/right is different from vertical writing. So @ROTATION and @orientation are catch-alls for skew and 90° multiples, while the other attributes are truly ordering relations. (The fact that vertical script is trained horizontally and thus glyphs are not upwards when they enter the OCR engine should not be relevant here.)

@artunit
Member

artunit commented Oct 8, 2021

I have been terribly disconnected lately but am happy to try to align this discussion with a Board meeting. @mittagessen, @bertsky, @cneud - would Thursday, Nov. 18 (9am - 10:30am EST) be a possible meeting date/time for you? @cipriandinu - would that work for you as well? We could consider an earlier date if it works for @cipriandinu, I just can't guarantee a network connection until that point.

@mittagessen
Contributor Author

mittagessen commented Oct 9, 2021 via email

@cipriandinu
Member

cipriandinu commented Oct 10, 2021 via email

@artunit
Member

artunit commented Oct 11, 2021

Apologies, I will follow up by email instead of overloading the issue thread.

@bertsky
Contributor

bertsky commented Nov 19, 2021

Sry, did not see this earlier: I wonder if OrderedGroupType and UnorderedGroupType should also get a @REF (as they do in PAGE-XML). Without this, you'd need to add one additional ElementRefType into each group – but you would need to construct the order hierarchy differently than in PAGE-XML (i.e. graphs would have to be transformed when converting).

Also, IMHO the formulations for @BASEDIRECTION ("Indicates the inline base direction") and InlineDirType ("Describes the base direction of text inside a line or of all lines inside a text block.") can still be improved.

Co-authored-by: Robert Sachunsky <[email protected]>
@mittagessen
Contributor Author

I've created some examples on how to use these extensions: alto_ro_examples.

@cipriandinu cipriandinu merged commit f893b8c into altoxml:master Feb 11, 2022
@cneud cneud mentioned this pull request Feb 11, 2022