Do not duplicate regions during serialization #287

bencomp · 2021-10-20T14:05:41Z

Fixes #269. Instead of creating a region dictionary every time the current region for a line changes, which may introduce duplicates when the reading order of lines is weird, this change creates the dictionaries once and adds lines to existing dictionaries' lines. All regions that contain lines are added to the page's entities at the end, just before loading and rendering the template.

As discussed in #269, there is more to improve. For example, more work is needed to support non-text regions. And improving the reading order of lines is (partially) done in #285.

Relates to mittagessen#269. Only regions with lines are added. More work is needed to support non-text regions.

mittagessen · 2021-10-20T15:38:41Z

Not terribly convinced of that approach as it completely breaks the reading order with the current templates. If it's not too much to ask, could you add explicit reading order to the PageXML template? There's a meeting next month about adding explicit order to ALTO which will hopefully get approval and will allow us to serialize whatever order we like for the principal two formats.

bencomp · 2021-10-20T20:41:52Z

I'd be happy to look into this, but I wonder if a strange/breaking reading order is caused by this change. Combining the changes from #285 and this gives me the best overall results, but sometimes the order of lines is still a bit weird.

Although I now realise my mistake – the regions in the argument may not be in reading order, and this new approach ignores the order of the lines (which should be in reading order at the time of serialisation). That is something I should be able to fix without adding explicit reading order.

bencomp · 2021-10-20T21:15:38Z

I think 08d55ab did it.
To really test this latest change, we should have test cases that:

have lines in reading order and regions in reading order
have lines in reading order and regions in non-reading order

and the serialisation should be the same.

mittagessen · 2021-10-21T22:04:09Z

I've added some basic tests that round trip the serialization/deserialization as well as checking for duplicate IDs and merged them into master. It seems your change fixes the issue for PageXML but not the ALTO output.

bencomp · 2021-10-22T08:47:52Z

Thanks for taking the time to test my changes and merging the PR. I'm surprised to hear that PageXML and ALTO are affected differently, because with these changes I don't see duplicate IDs in ALTO while without the changes I do get duplicate IDs.

The issue won't appear if the order of lines (the ocr_records) puts all lines for each region together. Then every time a line is in a different region, you know that that region wasn't added to page['entities'] before.
However, even with #285 merged, sometimes I see that the reading order is off. See these results for example:

Line 20 (ALGER) is in a different region (say B) than lines 19 (....), 21 (PRÉSIDENT.) and 22 (JUGES.) in region A. Without the changes in this PR, region A is output twice, producing a duplicate block_A ID. Of course I would like to make sure that lines are ordered better, but if this happens I would still like to get valid XML output.

It's true that my approach outputs lines 21 and 22 before line 20. This is how it could fail the roundtrip tests. I can understand that you do not want this to happen, and it makes me wonder: should finding the region for each line be done during serialisation?

mittagessen · 2021-10-22T11:58:30Z

It didn't fail with duplicate IDs but incorrect round tripping but the sample records the tests use(d) were created before your fix for #285. I updated them and everything seems to run through.

As mentioned a better way to deal with this whole issue is to decouple line-region assignment and reading order. The idea is that the segmenter outputs a line-region hierarchy and an arbitrary explicit ordering of the lines. The serializer would then transform the unordered hierarchy into a pseudo-RO (obeying the continuity criterion inside a region) which is used as the implicit ordering for all the formats while at the same time serializing the actual order through whatever explicit ordering mechanism is provided by each standard (if any). This would largely disentangle serializer and segmenter.

We can do that in principle for the PageXML template and hopefully soon for ALTO if altoxml/schema#74 gets approved.

Do not duplicate regions during serialization

e64b2b7

Relates to mittagessen#269. Only regions with lines are added. More work is needed to support non-text regions.

Add region entities in order of lines

08d55ab

mittagessen merged commit 02edbbe into mittagessen:master Oct 21, 2021

mittagessen mentioned this pull request Oct 21, 2021

Revert "Do not duplicate regions during serialization" #290

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do not duplicate regions during serialization #287

Do not duplicate regions during serialization #287

bencomp commented Oct 20, 2021

mittagessen commented Oct 20, 2021 •

edited

Loading

bencomp commented Oct 20, 2021

bencomp commented Oct 20, 2021

mittagessen commented Oct 21, 2021 •

edited

Loading

bencomp commented Oct 22, 2021

mittagessen commented Oct 22, 2021

Do not duplicate regions during serialization #287

Do not duplicate regions during serialization #287

Conversation

bencomp commented Oct 20, 2021

mittagessen commented Oct 20, 2021 • edited Loading

bencomp commented Oct 20, 2021

bencomp commented Oct 20, 2021

mittagessen commented Oct 21, 2021 • edited Loading

bencomp commented Oct 22, 2021

mittagessen commented Oct 22, 2021

mittagessen commented Oct 20, 2021 •

edited

Loading

mittagessen commented Oct 21, 2021 •

edited

Loading