-
Notifications
You must be signed in to change notification settings - Fork 131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not duplicate regions during serialization #287
Conversation
Relates to mittagessen#269. Only regions with lines are added. More work is needed to support non-text regions.
Not terribly convinced of that approach as it completely breaks the reading order with the current templates. If it's not too much to ask, could you add explicit reading order to the PageXML template? There's a meeting next month about adding explicit order to ALTO which will hopefully get approval and will allow us to serialize whatever order we like for the principal two formats. |
I'd be happy to look into this, but I wonder if a strange/breaking reading order is caused by this change. Combining the changes from #285 and this gives me the best overall results, but sometimes the order of lines is still a bit weird. Although I now realise my mistake – the regions in the argument may not be in reading order, and this new approach ignores the order of the lines (which should be in reading order at the time of serialisation). That is something I should be able to fix without adding explicit reading order. |
I think 08d55ab did it.
and the serialisation should be the same. |
I've added some basic tests that round trip the serialization/deserialization as well as checking for duplicate IDs and merged them into master. It seems your change fixes the issue for PageXML but not the ALTO output. |
Thanks for taking the time to test my changes and merging the PR. I'm surprised to hear that PageXML and ALTO are affected differently, because with these changes I don't see duplicate IDs in ALTO while without the changes I do get duplicate IDs. The issue won't appear if the order of lines (the Line 20 (ALGER) is in a different region (say B) than lines 19 (....), 21 (PRÉSIDENT.) and 22 (JUGES.) in region A. Without the changes in this PR, region A is output twice, producing a duplicate It's true that my approach outputs lines 21 and 22 before line 20. This is how it could fail the roundtrip tests. I can understand that you do not want this to happen, and it makes me wonder: should finding the region for each line be done during serialisation? |
It didn't fail with duplicate IDs but incorrect round tripping but the sample records the tests use(d) were created before your fix for #285. I updated them and everything seems to run through. As mentioned a better way to deal with this whole issue is to decouple line-region assignment and reading order. The idea is that the segmenter outputs a line-region hierarchy and an arbitrary explicit ordering of the lines. The serializer would then transform the unordered hierarchy into a pseudo-RO (obeying the continuity criterion inside a region) which is used as the implicit ordering for all the formats while at the same time serializing the actual order through whatever explicit ordering mechanism is provided by each standard (if any). This would largely disentangle serializer and segmenter. We can do that in principle for the PageXML template and hopefully soon for ALTO if altoxml/schema#74 gets approved. |
Fixes #269. Instead of creating a region dictionary every time the current region for a line changes, which may introduce duplicates when the reading order of lines is weird, this change creates the dictionaries once and adds lines to existing dictionaries'
lines
. All regions that contain lines are added to the page'sentities
at the end, just before loading and rendering the template.As discussed in #269, there is more to improve. For example, more work is needed to support non-text regions. And improving the reading order of lines is (partially) done in #285.