Direction, orientation, and reading order (text direction elements) #74
Awesome contribution, many thanks! Would you perhaps be interested to present and discuss these at the next ALTO board meeting?
Sure. I assume it's sometime after the summer holidays? We can probably also find some examples of the more unusual pages we'd like to be able to encode.
I don't think we have fixed the date yet but it will probably be 1st or 2nd week of September, will let you know! Example pages are certainly very welcome too.
This looks fantastic. We have a general ALTO Board meeting this week but this seems worthy of a single-topic gathering. Maybe the 2nd week of September? We tend to gravitate towards Thursdays, so tentatively 2021-09-09 (9-10:30 am EST) but we can be flexible on this. Some examples of unusual pages would be great as well!
Interesting. This introduces PAGE-XML concepts in a radical way (along with their semantic problems). It would be great to have that kind of flexibility in ALTO (multiple RO, labels, independence of semantic and element ordering) IMO. Just a few comments/questions:
I'll try to answer one by one.
Thanks for elaborating, just a few follow-ups:
I see. Indeed, for that purpose PAGE-XML's "flat" But couldn't we in theory always decompose blocks/regions recursively (into single-line regions if necessary) to achieve the same thing without sub-block refs? (Just wondering how to best read PAGE-XML's intended representation.)
Sorry, I misunderstood
Agreed.
The deviation would merely be syntactical though. (And the syntactic candy here does weigh heavy.) The actual semantic deviation is regarding sub-region refs (but see above). BTW, PRImA's own implementation so far does not even respect the indices (but uses implicit ordering solely).
It's not possible by schema AFAIK, but one could add documentation stating that any
Yes. (The question followed from my misunderstanding. I'm not worried about the cost of redundancy here. And functionally, in a DOM you can always fully expand the inheritance.)
Thanks for clarifying! (So perhaps I also misunderstood these in PAGE-XML, where they might also be meant relative to the baseline?) Regarding rotational axis, I do think this is specified clearly in ALTO-XML (see discussion here): axis runs through the center of the block ( Indeed,
suggestion by @bertsky
Yes you could decompose it like this but you're losing some of the semantics of
Yeah, I'm not sure how to do this well. AFAIK there's no good document introducing the standard, and the schema comments are often lacking. We should probably get around to writing down the semantics of most constructs a bit more explicitly.
Almost certainly. It doesn't really make sense otherwise.
I'm mostly talking about the 'target' rotation. Does a perfectly vertical
Of course. You actually need both to rotate a line correctly into the plane as the polygonal boundary can be deceiving when curvature and messy or differently sized letters come in combination.
I would argue for the latter, because 90/270/left/right is different from vertical writing. So
I have been terribly disconnected lately but am happy to try to align this discussion with a Board meeting. @mittagessen, @bertsky, @cneud - would Thursday, Nov. 18 (9am - 10:30am EST) be a possible meeting date/time for you? @cipriandinu - would that work for you as well? We could consider an earlier date if it works for @cipriandinu, I just can't guarantee a network connection until that point.
* artunit :: 2021-10-08 22:29 Fri:
I have been terribly disconnected lately but am happy to try to align
this discussion with a Board meeting. @mittagessen, @bertsky, @cneud
- would Thursday, Nov. 18 (9am - 10:30am EST) be a possible meeting
date/time for you?
Unfortunately I'm teaching during those exact hours. Otherwise my
November is still free though, so any other date (or even on 18/11 in
the afternoon) would work.
Hi Art,
The 18th of November works for me, or earlier.
Best,
Cip
Apologies, I will follow up by email instead of overloading the issue thread.
Sry, did not see this earlier: I wonder if Also, IMHO the formulation for
Co-authored-by: Robert Sachunsky <[email protected]>
I've created some examples of how to use these extensions: alto_ro_examples.
This pull request bundles multiple backward-compatible changes to the schema that resolve issues related to line orientation, direction, and reading order. While I would usually split them into separate PRs, they've been discussed jointly in the past (see #12 (comment)) and the addressed deficiencies are somewhat complementary.
Principal inline text direction
The first part of the proposal takes up #12 and #73. It adds an attribute `BASEDIRECTION` on the `*Block` and `TextLine` elements which indicates the base text direction of the lines/text contained therein (`ltr|rtl|ttb|btt`). This is helpful not only for rendering purposes of many East Asian scripts that can be written both vertically and horizontally but also to correctly set the base text direction of the BiDi algorithm during processing.
Example of the use of this new attribute:
Different settings on lower levels of the hierarchy override those inherited from higher levels.
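A minimal sketch of how a consumer might resolve the override rule, assuming a simplified fragment without namespaces or geometry attributes. The sample document and the helper name `effective_base_directions` are hypothetical and not part of the proposal; only `BASEDIRECTION`, `TextBlock`, and `TextLine` come from the text above.

```python
import xml.etree.ElementTree as ET

# Hypothetical ALTO-like fragment; namespaces and coordinates omitted for brevity.
SAMPLE = """
<PrintSpace>
  <TextBlock ID="b1" BASEDIRECTION="rtl">
    <TextLine ID="l1"/>
    <TextLine ID="l2" BASEDIRECTION="ltr"/>
  </TextBlock>
  <TextBlock ID="b2">
    <TextLine ID="l3"/>
  </TextBlock>
</PrintSpace>
"""


def effective_base_directions(root, default=None):
    """Map each TextLine ID to the nearest BASEDIRECTION set on it or an ancestor."""
    result = {}

    def walk(elem, inherited):
        # A value on the element itself overrides whatever was inherited.
        current = elem.get("BASEDIRECTION", inherited)
        if elem.tag == "TextLine":
            result[elem.get("ID")] = current
        for child in elem:
            walk(child, current)

    walk(root, default)
    return result


directions = effective_base_directions(ET.fromstring(SAMPLE))
print(directions)
```

Here `l1` inherits `rtl` from its block, `l2` overrides it with `ltr`, and `l3` has no value anywhere in its ancestry, so it falls back to the caller-supplied default.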
Reading Order
This part is a fairly faithful adaptation of the example in #18 which in turn derives from PageXML. Some changes are made to allow the encoding of more complex historical documents and the serialization of multiple reading orders. The two principal changes are:
[…] `TextBlock` such as `TextLine`, `String`, and `Glyph` is possible.
In addition, elements in the reading order can be tagged with `TAGREFS` to indicate the role of a particular element in a reading order (such as an addition, a correction, or a particle that is the continuation of text on another line).
A single reading order with roles:
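As a hedged illustration of how such a structure could be read, the sketch below assumes element names `ReadingOrder`, `OrderedGroup`, and `ElementRef` with a `REF` attribute, following the PageXML pattern; these names, the IDs, and the tag labels are assumptions for illustration and may differ from the schema in the PR. `TAGREFS` is treated as a space-separated list of tag IDs resolved against `Tags`.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element names other than TAGREFS are assumed.
SAMPLE = """
<alto>
  <Tags>
    <OtherTag ID="t_add" LABEL="addition"/>
    <OtherTag ID="t_cont" LABEL="continuation"/>
  </Tags>
  <ReadingOrder>
    <OrderedGroup ID="ro1">
      <ElementRef ID="r1" REF="line1"/>
      <ElementRef ID="r2" REF="str7" TAGREFS="t_add"/>
      <ElementRef ID="r3" REF="line2" TAGREFS="t_cont"/>
    </OrderedGroup>
  </ReadingOrder>
</alto>
"""


def reading_order_with_roles(root):
    """Return (referenced ID, [role labels]) pairs in the order they are listed."""
    labels = {tag.get("ID"): tag.get("LABEL") for tag in root.find("Tags")}
    group = root.find("ReadingOrder/OrderedGroup")
    return [
        (ref.get("REF"), [labels[t] for t in ref.get("TAGREFS", "").split()])
        for ref in group.findall("ElementRef")
    ]


order = reading_order_with_roles(ET.fromstring(SAMPLE))
for target, roles in order:
    print(target, roles)
```

The position of each reference inside the ordered group carries the ordering, so no explicit index is needed.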
Multiple reading orders can be encoded through the nesting of unordered and ordered groups:
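One possible reading of this nesting, sketched under assumptions: a top-level `UnorderedGroup` holds several `OrderedGroup` children, each interpreted as one alternative reading order, and nested groups are flattened in document order. The element names, the IDs, and this interpretation are hypothetical; the schema in the PR is authoritative.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; element names and IDs are assumed for illustration.
SAMPLE = """
<ReadingOrder>
  <UnorderedGroup ID="alternatives">
    <OrderedGroup ID="main_text">
      <ElementRef ID="r1" REF="block1"/>
      <OrderedGroup ID="column2">
        <ElementRef ID="r2" REF="line5"/>
        <ElementRef ID="r3" REF="line6"/>
      </OrderedGroup>
    </OrderedGroup>
    <OrderedGroup ID="with_marginalia">
      <ElementRef ID="r4" REF="block1"/>
      <ElementRef ID="r5" REF="margin1"/>
    </OrderedGroup>
  </UnorderedGroup>
</ReadingOrder>
"""


def flatten(group):
    """Collect referenced IDs depth-first; nested groups are expanded in place."""
    refs = []
    for child in group:
        if child.tag == "ElementRef":
            refs.append(child.get("REF"))
        else:
            # Simplification: nested unordered groups are also taken in document order.
            refs.extend(flatten(child))
    return refs


def reading_orders(root):
    """Treat each OrderedGroup under the top-level UnorderedGroup as one reading order."""
    top = root.find("UnorderedGroup")
    return {g.get("ID"): flatten(g) for g in top.findall("OrderedGroup")}


orders = reading_orders(ET.fromstring(SAMPLE))
print(orders)
```

The same element (`block1` here) can appear in more than one reading order because the groups only hold references, not the layout elements themselves.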
While this can potentially result in ambiguity in the absence of a taxonomy of well-defined roles of elements in the reading order, I'd like to avoid specifying such a taxonomy until a later point in time, as it requires substantial input from users of the standard, especially those with more esoteric material. As can be seen in the last examples, groups can also be arbitrarily nested.
EDIT: Removed explicit indices as attributes and the corresponding indexed types.