Processor result object #8

kba · 2024-08-14T17:44:34Z

Just a quick draft, to be refined tomorrow.

kba · 2024-08-14T17:51:22Z

OCR-D/ocrd_kraken#44 is adapted.

bertsky

Thanks for starting this and spotting all these typing errors!

I would prefer calling the new class OcrdPageResult instead of OcrdProcessResult, because

- this is about PcGts / OcrdPage objects primarily
- this is about the single-page function, while "processing" and "processor" in general refer to workspace operations
- we have no OcrdProcess

Also, I wonder if it is really necessary to raise this to ocrd_models – it's meant to be an internal interface between ocrd.Processor.process_page_pcgts and ocrd.Processor.process_page_file.

Instead of making images a list of tuples again, why not define a data class with members like pil / file_id / file_path?

More importantly, let's go one step further and

- replace the file ID with just the file ID suffix to be added to the PcGts file ID (that way, process_page_pcgts does not need to know about the output file ID at all)
- replace the image path with the reference to the generated/annotated AlternativeImageType in the resulting PcGtsType, so the calling process_page_file can simply set its pathname after writing the image file (that way, process_page_pcgts does not need to know the output file path in advance)

BTW, I think you forgot to add ocrd_models.ocrd_process_result.py.
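The proposal in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the PR: `AlternativeImageStub`, `PageStub`, `PageResult`, `PageResultImage`, `process_page` and `save_result` are hypothetical stand-ins for AlternativeImageType, PcGtsType, the result classes and the two Processor methods, and the `.png` naming is invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AlternativeImageStub:
    """Hypothetical stand-in for the PAGE AlternativeImageType element."""
    comments: str = ""
    filename: Optional[str] = None  # left unset by the page-level function


@dataclass
class PageStub:
    """Hypothetical stand-in for a parsed PAGE document (PcGtsType)."""
    pcgts_id: str
    alternative_images: List[AlternativeImageStub] = field(default_factory=list)


@dataclass
class PageResultImage:
    """One derived image: pixel data, an ID suffix, and the element referencing it."""
    pil: object  # a PIL image in practice; kept generic in this sketch
    file_id_suffix: str
    alternative_image: AlternativeImageStub


@dataclass
class PageResult:
    """What the page-level function hands back to its caller."""
    pcgts: PageStub
    images: List[PageResultImage] = field(default_factory=list)


def process_page(page: PageStub) -> PageResult:
    """Page-level step: knows nothing about output file IDs or paths."""
    alt = AlternativeImageStub(comments="binarized")
    page.alternative_images.append(alt)
    result = PageResult(pcgts=page)
    result.images.append(
        PageResultImage(pil=b"fake-pixels", file_id_suffix=".IMG-BIN", alternative_image=alt)
    )
    return result


def save_result(result: PageResult, output_file_id: str, output_dir: str) -> List[str]:
    """Caller: derives image IDs and paths, then fills in the filename references."""
    written = []
    for image in result.images:
        image_file_id = output_file_id + image.file_id_suffix
        image_path = f"{output_dir}/{image_file_id}.png"
        # A real caller would write image.pil to image_path at this point.
        image.alternative_image.filename = image_path
        written.append(image_path)
    return written
```

The point of the split is that only the caller touches file IDs and paths; the page-level function returns the image together with the element that will later point at it.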

kba · 2024-08-15T11:10:53Z

> BTW, I think you forgot to add ocrd_models.ocrd_process_result.py.

~~Well, crap, I just git checkouted that away after switching branches. I'll rewrite it.~~

> I would prefer calling the new class OcrdPageResult instead of OcrdProcessResult, because

~~Agree.~~ Done

> Instead of making images a list of tuples again, why not define a data class with members like pil / file_id / file_path?

~~No reason really, I'll add OcrdPageResultImage~~ Done

> replace the file ID with just the file ID suffix to be added to the PcGts file ID (that way, process_page_pcgts does not need to know about the output file ID at all)

~~Will do.~~ Done

> replace the image path with the reference to the generated/annotated AlternativeImageType in the resulting PcGtsType, so the calling process_page_file can simply set its pathname after writing the image file (that way, process_page_pcgts does not need to know the output file path in advance)

~~Will also do.~~ Done

kba · 2024-08-15T11:19:13Z

> Also, I wonder if it is really necessary to raise this to ocrd_models – it's meant to be an internal interface between ocrd.Processor.process_page_pcgts and ocrd.Processor.process_page_file.

No reason now, I just like that all the "dumb" data classes are in one spot. But I'll move it/reimplement it closer to processor.base.

kba · 2024-08-15T12:25:41Z

> BTW, I think you forgot to add ocrd_models.ocrd_process_result.py.

Well, crap, I just git checkouted that away after switching branches. I'll rewrite it.

> I would prefer calling the new class OcrdPageResult instead of OcrdProcessResult, because

Agree

> Instead of making images a list of tuples again, why not define a data class with members like pil / file_id / file_path?

Done and OCR-D/ocrd_kraken#44 adapted accordingly. Now to change the interface.

kba · 2024-08-15T13:09:03Z

OK, I think I have everything together now, interface-wise. Now adapting kraken, looking forward to simplifying binarize in particular ;)

kba · 2024-08-15T13:14:29Z

I think we can go even further with simplifying the handling of alternative images, but I'll do that after the a1.

kba · 2024-08-15T13:23:23Z

> OK, I think I have everything together now, interface-wise. Now adapting kraken, looking forward to simplifying binarize in particular ;)

Done

bertsky

Excellent!

bertsky · 2024-08-15T14:38:49Z

src/ocrd/processor/base.py

+        input_pcgts : List[Optional[OcrdPage]] = [None] * len(input_files)
+        assert isinstance(input_files[0], (OcrdFile, ClientSideOcrdFile))


Why Optional (also in function prototype)?

kba

Here: Because we're instantiating a list of None values, which are not OcrdPage.

In the function signature of process_page_pcgts: Same situation, there might be "holes" in the list of input_pcgts when any of the input_files in process_page_files cannot be parsed as PAGE-XML.

And for process_page_files: The input_files can be hole-y, if the workspace.download_file fails for any of the files (beyond the first?).

But really, I was trying to make sure that static type checking had no more complaints. I tried to add assert statements where I know that variables must be defined or of a certain type to mitigate the "everything might be None" problem somewhat.

bertsky

Oh, right, I forgot about the holes returned by zip_input_files for multiple fileGrps but incomplete PAGE-XML coverage per page!

Maybe we should document this more loudly.
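To make the "holes" concrete, here is a minimal sketch of the pattern discussed in this thread. It is not the code in processor.base: `Page`, `parse_page`, `load_pages` and `describe` are hypothetical names, and the ".xml" check merely stands in for real PAGE-XML parsing.

```python
from typing import List, Optional


class Page:
    """Hypothetical stand-in for a parsed PAGE document."""

    def __init__(self, name: str) -> None:
        self.name = name


def parse_page(path: Optional[str]) -> Page:
    """Hypothetical parser: raises ValueError for anything that is not PAGE-XML."""
    if path is None or not path.endswith(".xml"):
        raise ValueError(f"not a PAGE-XML file: {path!r}")
    return Page(path)


def load_pages(paths: List[Optional[str]]) -> List[Optional[Page]]:
    """Return one slot per input; slots that cannot be filled stay None."""
    pages: List[Optional[Page]] = [None] * len(paths)
    for i, path in enumerate(paths):
        try:
            pages[i] = parse_page(path)
        except ValueError:
            # Keep the hole so positions still line up with the input groups.
            continue
    return pages


def describe(pages: List[Optional[Page]]) -> List[str]:
    """Consumer that checks every slot before using it."""
    return [page.name if page is not None else "<missing>" for page in pages]
```

Because the list keeps its length, a consumer can still tell which input group a page came from, at the price of having to check each slot for None.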

bertsky · 2024-08-15T14:40:27Z

src/ocrd/processor/base.py

-                input_pcgts[i] = page_from_file(input_file)
+                page_ = page_from_file(input_file)
+                assert isinstance(page_, PcGtsType)
+                input_pcgts[i] = page_
            except ValueError as e:


Suggested change

-            except ValueError as e:
+            except (AssertionError, ValueError) as e:

kba

Can this ever happen, i.e. can page_from_file(with_etree=False) ever return anything other than a PcGtsType? I think if that was ever the case, we'd want that AssertionError to be raised because then we'd have broken something.

bertsky

You're right – it cannot happen. But then what is the assertion good for – satisfying the type checker?

kba

First, to satisfy my curiosity that I understand the behavior correctly. But secondly, yes, the type checker ;)

But reading this again, I should have used OcrdPage, not PcGtsType, which is just an alias, since we use OcrdPage in the method typing.

bertsky

Yes. Feel free to change in OCR-D#1240.
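The conclusion of this thread (assert for the type checker, but do not catch the AssertionError) might look like the sketch below. All names are hypothetical: `parse` only imitates a loader whose return type depends on a flag, and it is not the actual signature of page_from_file.

```python
from typing import Optional, Tuple, Union


class Page:
    """Hypothetical stand-in for OcrdPage."""


class Tree:
    """Hypothetical stand-in for a parsed XML tree."""


def parse(path: str, with_tree: bool = False) -> Union[Page, Tuple[Page, Tree]]:
    """Hypothetical parser whose return type depends on a flag."""
    if not path.endswith(".xml"):
        raise ValueError(f"cannot parse {path!r}")
    page = Page()
    return (page, Tree()) if with_tree else page


def load(path: str) -> Page:
    result = parse(path)
    # Narrows the Union for the type checker. A failure here would mean
    # parse() itself is broken, so the AssertionError should propagate.
    assert isinstance(result, Page)
    return result


def try_load(path: str) -> Optional[Page]:
    try:
        return load(path)
    except ValueError:
        # Only the expected parse failure is handled; AssertionError is not.
        return None
```

Catching only ValueError keeps a genuine programming error visible instead of turning it into a silently skipped page.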

kba · 2024-08-15T16:53:22Z

Merged into OCR-D#1240 for the a1 release, we can still discuss the Optional weirdness and AssertionError here or there of course.

bertsky · 2024-08-15T22:21:57Z

> we can still discuss the Optional weirdness and AssertionError here or there of course.

no need to – thanks!

kba added 4 commits August 14, 2024 16:17

.

5117684

fix make spec

456cc6d

Merge branch 'new-processor-api' of https://github.com/bertsky/core into new-processor-api

e03a906

process_page_pcgts must return OcrdProcessResult

90afb8a

kba added a commit to OCR-D/ocrd_kraken that referenced this pull request Aug 14, 2024

adapt to bertsky/core#8

c0c1eb7

bertsky reviewed Aug 15, 2024

bertsky mentioned this pull request Aug 15, 2024

Port to ocrd core version 3.0.0 bertsky/ocrd_cis#5

Open

kba added 3 commits August 15, 2024 13:30

reimplement OcrdPageResult

3d094d6

update spec to v3.25.0, ocrd_tool.schema.yml

72eb75b

process_page_file: fix handling of images

75cb20c

kba added 2 commits August 15, 2024 14:57

process_page_pcgts: remove output_file_id, replace OcrdPageResult.file_id with OcrdPageResult.file_id_suffix

9a1c7ad

OcrdPageResultImage requires passing alternative_image w/o filename set

60ad424

kba marked this pull request as ready for review August 15, 2024 13:43

kba added a commit to kba/ocrd_ocropus that referenced this pull request Aug 15, 2024

update ocrd-cis-binarize to be compatible with bertsky/core#8

fbaafcb

bertsky approved these changes Aug 15, 2024

kba and others added 7 commits August 15, 2024 17:06

Processor.verify: handle -1 case

50dfdd6

Co-authored-by: Robert Sachunsky <[email protected]>

processor.base: remove obsolete FIXME

53f2634

Co-authored-by: Robert Sachunsky <[email protected]>

Processor.process_page_pcgts: update docstring for file_path/alternative_image

d210afa

Co-authored-by: Robert Sachunsky <[email protected]>

export OcrdPageResult{Image} from ocrd.processor

5718cf9

Processor.process.page_pcgts: simplify references in docstring

f5f3145

Merge branch 'processor-result-object' of https://github.com/OCR-D/core into processor-result-object

db68bb5

allow "from ocrd_models import OcrdPage

7045318

kba merged commit a9dba73 into bertsky:new-processor-api Aug 15, 2024

kba deleted the processor-result-object branch August 15, 2024 16:53

MehmedGIT added a commit to MehmedGIT/ocrd_cis that referenced this pull request Aug 16, 2024

Merge pull request #3 from kba/port-to-v3-return-object

6b19f35

update ocrd-cis-binarize to be compatible with bertsky/core#8
