Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not duplicate regions during serialization #287

Merged
merged 2 commits into from
Oct 21, 2021

Conversation

bencomp
Copy link
Contributor

@bencomp bencomp commented Oct 20, 2021

Fixes #269. Instead of creating a region dictionary every time the current region for a line changes, which may introduce duplicates when the reading order of lines is weird, this change creates the dictionaries once and adds lines to existing dictionaries' lines. All regions that contain lines are added to the page's entities at the end, just before loading and rendering the template.

As discussed in #269, there is more to improve. For example, more work is needed to support non-text regions. And improving the reading order of lines is (partially) done in #285.

Relates to mittagessen#269.
Only regions with lines are added.
More work is needed to support non-text regions.
@mittagessen
Copy link
Owner

mittagessen commented Oct 20, 2021

Not terribly convinced of that approach as it completely breaks the reading order with the current templates. If it's not too much to ask, could you add explicit reading order to the PageXML template? There's a meeting next month about adding explicit order to ALTO which will hopefully get approval and will allow us to serialize whatever order we like for the principal two formats.

@bencomp
Copy link
Contributor Author

bencomp commented Oct 20, 2021

I'd be happy to look into this, but I wonder if a strange/breaking reading order is caused by this change. Combining the changes from #285 and this gives me the best overall results, but sometimes the order of lines is still a bit weird.

Although I now realise my mistake – the regions in the argument may not be in reading order, and this new approach ignores the order of the lines (which should be in reading order at the time of serialisation). That is something I should be able to fix without adding explicit reading order.

@bencomp
Copy link
Contributor Author

bencomp commented Oct 20, 2021

I think 08d55ab did it.
To really test this latest change, we should have test cases that:

  1. have lines in reading order and regions in reading order
  2. have lines in reading order and regions in non-reading order

and the serialisation should be the same.

@mittagessen mittagessen merged commit 02edbbe into mittagessen:master Oct 21, 2021
@mittagessen
Copy link
Owner

mittagessen commented Oct 21, 2021

I've added some basic tests that round trip the serialization/deserialization as well as checking for duplicate IDs and merged them into master. It seems your change fixes the issue for PageXML but not the ALTO output.

@bencomp
Copy link
Contributor Author

bencomp commented Oct 22, 2021

Thanks for taking the time to test my changes and merging the PR. I'm surprised to hear that PageXML and ALTO are affected differently, because with these changes I don't see duplicate IDs in ALTO while without the changes I do get duplicate IDs.

The issue won't appear if the order of lines (the ocr_records) puts all lines for each region together. Then every time a line is in a different region, you know that that region wasn't added to page['entities'] before.
However, even with #285 merged, sometimes I see that the reading order is off. See these results for example:

Detected regions and lines

Line 20 (ALGER) is in a different region (say B) than lines 19 (....), 21 (PRÉSIDENT.) and 22 (JUGES.) in region A. Without the changes in this PR, region A is output twice, producing a duplicate block_A ID. Of course I would like to make sure that lines are ordered better, but if this happens I would still like to get valid XML output.

It's true that my approach outputs lines 21 and 22 before line 20. This is how it could fail the roundtrip tests. I can understand that you do not want this to happen, and it makes me wonder: should finding the region for each line be done during serialisation?

@mittagessen
Copy link
Owner

It didn't fail with duplicate IDs but incorrect round tripping but the sample records the tests use(d) were created before your fix for #285. I updated them and everything seems to run through.

As mentioned a better way to deal with this whole issue is to decouple line-region assignment and reading order. The idea is that the segmenter outputs a line-region hierarchy and an arbitrary explicit ordering of the lines. The serializer would then transform the unordered hierarchy into a pseudo-RO (obeying the continuity criterion inside a region) which is used as the implicit ordering for all the formats while at the same time serializing the actual order through whatever explicit ordering mechanism is provided by each standard (if any). This would largely disentangle serializer and segmenter.

We can do that in principle for the PageXML template and hopefully soon for ALTO if altoxml/schema#74 gets approved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Region IDs are not unique in PageXML and ALTO outputs after baseline segmentation
2 participants