plausibilize and sanitize are too broad terms #18

Open
mikegerber opened this issue Nov 26, 2019 · 5 comments

Comments

mikegerber commented Nov 26, 2019

ocrd-segment-repair has the optional operations "plausibilize" and "sanitize" – I have no idea what this exactly does :) I would prefer something like this:

  • shrink-regions-to-hull-of-lines
  • whatever-plausibilize-does

There seems to also be another thing ocrd-segment-repair does.

In other words: Make operations explicit.

Collaborator

bertsky commented Nov 26, 2019

> ocrd-segment-repair has the optional operations "plausibilize" and "sanitize" – I have no idea what this exactly does :)

I agree, these are not expressive enough, or even memorable (which is what...)

> I would prefer something like this:
>
> * shrink-regions-to-hull-of-lines

...or just shrink-regions?

> * whatever-plausibilize-does

ATM all it does is remove regions fully contained by others or nearly equal to them (and fix the ReadingOrder afterwards).
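To make the behaviour described above concrete, here is a minimal sketch, not the processor's actual code. It assumes regions are simplified to axis-aligned boxes (the real processor works on PAGE polygons), and the names `remove_redundant`, `Box` and the `threshold` value are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical simplification: a region is an axis-aligned box (x0, y0, x1, y1).
Box = Tuple[float, float, float, float]


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as empty."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def intersection_area(a: Box, b: Box) -> float:
    """Area shared by two boxes (0.0 if they do not overlap)."""
    return area((max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3])))


def remove_redundant(regions: Dict[str, Box],
                     reading_order: List[str],
                     threshold: float = 0.9):
    """Drop regions (almost) covered by another one, then fix the order.

    `threshold` is an assumed parameter: the share of a region's area
    that must lie inside another region for it to count as redundant.
    """
    kept: List[str] = []
    # Visit larger regions first, so of two near-equal regions the larger stays.
    for rid in sorted(regions, key=lambda r: area(regions[r]), reverse=True):
        box = regions[rid]
        covered = any(
            area(box) > 0
            and intersection_area(box, regions[other]) / area(box) >= threshold
            for other in kept
        )
        if not covered:
            kept.append(rid)
    survivors = set(kept)
    # "Fixing the ReadingOrder" here just means dropping removed region IDs.
    new_regions = {rid: box for rid, box in regions.items() if rid in survivors}
    new_order = [rid for rid in reading_order if rid in survivors]
    return new_regions, new_order
```

With this sketch, a region lying entirely inside another, or covering nearly the same area as another, is removed, and its ID disappears from the reading order while the relative order of the remaining regions is preserved.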

It's intended to become much more though, like merging or shrinking overlapping neighbouring regions, or fixing reading order via basic heuristics (e.g. no arbitrary jumps back and forth).

Since this processor started out under the name repair but received a default behaviour of just warning about likely errors, we needed some verb for the actual action.

Maybe separate-neighbours?

@wrznr?

Collaborator

wrznr commented Nov 26, 2019

Right, they have very generic names since they are intended to do various things. Right now, they do not do very much and are not ready for production use or even testing. I would rather keep the current names and see what the processors will become. Let us discuss a proper name when implementation and documentation are finished. (ocrd_segment will be my main focus in December)

Author

mikegerber commented Oct 16, 2020

Documentation from https://ocr-d.de/en/workflows:

  • plausibilize = Remove redundant (almost equal or almost contained) regions, and merge overlapping regions
  • sanitize = Shrink and/or expand a region in such a way that its coordinates include those of all its lines
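
As an illustration of the "sanitize" description, here is a minimal sketch that recomputes a region outline as the convex hull of all points of its lines (the "hull of lines" idea from the first comment). Whether the processor actually uses a convex hull is an assumption; the names `convex_hull` and `sanitize_region` are hypothetical, and the hull is computed with Andrew's monotone chain algorithm.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def cross(o: Point, a: Point, b: Point) -> int:
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Convex hull via Andrew's monotone chain; collinear points are dropped."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def sanitize_region(line_polygons: List[List[Point]]) -> List[Point]:
    """New region outline that includes the coordinates of all its lines."""
    return convex_hull([p for polygon in line_polygons for p in polygon])


# Two text lines of different width inside one region.
lines = [
    [(10, 10), (90, 10), (90, 20), (10, 20)],
    [(10, 30), (60, 30), (60, 40), (10, 40)],
]
outline = sanitize_region(lines)
```

The result depends only on the lines, so a region that was drawn too large shrinks and one that cut off part of a line expands, matching the "shrink and/or expand" wording.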

Collaborator

bertsky commented Oct 16, 2020

> Documentation from https://ocr-d.de/en/workflows:
>
>   • plausibilize = Remove redundant (almost equal or almost contained) regions, and merge overlapping regions
>   • sanitize = Shrink and/or expand a region in such a way that its coordinates include those of all its lines

This is actually from the ocrd-tool.json description of these parameters; see ocrd-segment-repair -h
