Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Processor result object #8

Merged
merged 16 commits into from
Aug 15, 2024

Conversation

kba
Copy link

@kba kba commented Aug 14, 2024

Just a quick draft, to be refined tomorrow.

kba added a commit to OCR-D/ocrd_kraken that referenced this pull request Aug 14, 2024
@kba
Copy link
Author

kba commented Aug 14, 2024

OCR-D/ocrd_kraken#44 is adapted.

Copy link
Owner

@bertsky bertsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for starting this and spotting all these typing errors!

I would prefer calling the new class OcrdPageResult instead of OcrdProcessResult, because

  • this is about PcGts / OcrdPage objects primarily
  • this is about the single-page function, while "processing" and "processor" in general refer to workspace operations
  • we have no OcrdProcess

Also, I wonder if it is really necessary to raise this to ocrd_models – it's meant to be an internal interface between ocrd.Processor.process_page_pcgts and ocrd.Processor.process_page_file.

Instead of making images a list of tuples again, why not define a data class with members like pil / file_id / file_path?

More importantly, let's go one step further and

  • replace the file ID with just the file ID suffix to be added to the PcGts file ID (that way, process_page_pcgts does not need to know about the output file ID at all)
  • replace the image path with the reference to the generated/annotated AlternativeImageType in the resulting PcGtsType, so the calling process_page_file can simply set its pathname after writing the image file (that way, process_page_pcgts does not need to know the output file path in advance)

BTW, I think you forgot to add ocrd_models.ocrd_process_result.py.

@kba
Copy link
Author

kba commented Aug 15, 2024

BTW, I think you forgot to add ocrd_models.ocrd_process_result.py.

Well, crap, I just git checkouted that away after switching branches. I'll rewrite it.

I would prefer calling the new class OcrdPageResult instead of OcrdProcessResult, because

Agree. Done

Instead of making images a list of tuples again, why not define a data class with members like pil / file_id / file_path?

No reason really, I'll add OcrdPageResultImage Done

  • replace the file ID with just the file ID suffix to be added to the PcGts file ID (that way, process_page_pcgts does not need to know about the output file ID at all)

Will do. Done

  • replace the image path with the reference to the generated/annotated AlternativeImageType in the resulting PcGtsType, so the calling process_page_file can simply set its pathname after writing the image file (that way, process_page_pcgts does not need to know the output file path in advance)

Will also do. Done

@kba
Copy link
Author

kba commented Aug 15, 2024

Also, I wonder if it is really necessary to raise this to ocrd_models – it's meant to be an internal interface between ocrd.Processor.process_page_pcgts and ocrd.Processor.process_page_file.

No reason now, I just like that all the "dumb" data classes are in one spot. But I'll move it/reimplement it closer to processor.base.

@kba
Copy link
Author

kba commented Aug 15, 2024

BTW, I think you forgot to add ocrd_models.ocrd_process_result.py.

Well, crap, I just git checkouted that away after switching branches. I'll rewrite it.

I would prefer calling the new class OcrdPageResult instead of OcrdProcessResult, because

Agree

Instead of making images a list of tuples again, why not define a data class with members like pil / file_id / file_path?

Done and OCR-D/ocrd_kraken#44 adapted accordingly. Now to change the interface.

@kba
Copy link
Author

kba commented Aug 15, 2024

OK, I think I have everything together now, interface-wise. Now adapting kraken, looking forward to simplify binarize in particular ;)

@kba
Copy link
Author

kba commented Aug 15, 2024

I think we can go even further with simplifying the handling of alternative images, but I'll do that after the a1.

@kba
Copy link
Author

kba commented Aug 15, 2024

OK, I think I have everything together now, interface-wise. Now adapting kraken, looking forward to simplify binarize in particular ;)

Done

@kba kba marked this pull request as ready for review August 15, 2024 13:43
kba added a commit to kba/ocrd_ocropus that referenced this pull request Aug 15, 2024
Copy link
Owner

@bertsky bertsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent!

src/ocrd/processor/base.py Outdated Show resolved Hide resolved
Comment on lines +328 to +329
input_pcgts : List[Optional[OcrdPage]] = [None] * len(input_files)
assert isinstance(input_files[0], (OcrdFile, ClientSideOcrdFile))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why Optional (also in function prototype)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here: Because we're instantiating a list of None values, which are not OcrdPage.

In the function signature of process_page_pcgts: Same situation, there might be "holes" in the list of input_pcgts when any of the input_files in process_page_files cannot be parsed as PAGE-XML.

And for process_page_files: The input_files can be hole-y, if the workspace.download_file fails for any of the files (beyond the first?).

But really, I was trying to make sure that static type checking had no more complaints. I tried to add assert statements where I know that variables must be defined or of a certain type to mitigate the "everything might be None" problem somewhat.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right, I forgot about the holes returned by zip_input_files for multiple fileGrps but incomplete PAGE-XML coverage per page!

Maybe we should document this more loudly.

src/ocrd/processor/base.py Outdated Show resolved Hide resolved
input_pcgts[i] = page_from_file(input_file)
page_ = page_from_file(input_file)
assert isinstance(page_, PcGtsType)
input_pcgts[i] = page_
except ValueError as e:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
except ValueError as e:
except (AssertionError, ValueError) as e:

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this ever happen, ie. can page_from_file(with_etree=False) ever return anything other than a PcGtsType? I think if that was ever the case, we'd want that AssertionError to be raised because then we'd have broken something.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right – it cannot happen. But then what is the assertion good for – satisfying the type checker?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, my curiosity that I understand the behavior correctly. But secondly, yes, the type checker ;)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But reading this again, I should have used OcrdPage not PcGtsType, which is just an alias but we use OcrdPage in the method typing.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Feel free to change in OCR-D#1240.

src/ocrd/processor/base.py Show resolved Hide resolved
src/ocrd/processor/base.py Outdated Show resolved Hide resolved
src/ocrd/processor/base.py Outdated Show resolved Hide resolved
src/ocrd/processor/builtin/dummy_processor.py Show resolved Hide resolved
src/ocrd/processor/builtin/dummy_processor.py Show resolved Hide resolved
src/ocrd/processor/builtin/dummy_processor.py Show resolved Hide resolved
@kba kba merged commit a9dba73 into bertsky:new-processor-api Aug 15, 2024
@kba
Copy link
Author

kba commented Aug 15, 2024

Merged into OCR-D#1240 for the a1 release, we can still discuss the Optional weirdness and AssertionError here or there of course.

@kba kba deleted the processor-result-object branch August 15, 2024 16:53
@bertsky
Copy link
Owner

bertsky commented Aug 15, 2024

we can still discuss the Optional weirdness and AssertionError here or there of course.

no need to – thanks!

MehmedGIT added a commit to MehmedGIT/ocrd_cis that referenced this pull request Aug 16, 2024
update ocrd-cis-binarize to be compatible with bertsky/core#8
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants