ocrd-segment-repair: handle case where points is empty #60

Open
stefanCCS opened this issue Jun 8, 2022 · 6 comments

Comments

@stefanCCS

Version 0.1.20, ocrd/core 2.33.0

I have a PAGE file that does not have any real content, like this:

    <pc:Page imageFilename="OCR-D-IMG/0038_IMAGE000918_00001.tif" imageWidth="1420" imageHeight="2313" orientation="0.">
        <pc:AlternativeImage filename="OCR-D-BIN/OCR-D-BIN_0038_IMAGE000918_00001.IMG-BIN.png" comments=",binarized"/>
        <pc:TextRegion id="TR-1" orientation="0.">
            <pc:Coords points=""/>
        </pc:TextRegion>
    </pc:Page>

If I call ocrd-segment-extract-lines, I get an exception like this:

09:19:19.733 DEBUG ocrd.workspace.image_from_page - page 'P_0038_IMAGE000918_00001' has  orientation=0 skew=0.00
09:19:19.733 DEBUG ocrd.workspace.image_from_page - Using AlternativeImage 1 {'', 'binarized'} for page 'P_0038_IMAGE000918_00001'
09:19:19.734 DEBUG ocrd.workspace.download_file - download_file <OcrdFile fileGrp=OCR-D-BIN ID=OCR-D-BIN_0038_IMAGE000918_00001.IMG-BIN, mimetype=image/png, url=OCR-D-BIN/OCR-D-BIN_0038_IMAGE000918_00001.IMG-BIN.png, local_filename=OCR-D-BIN/OCR-D-BIN_0038_IMAGE000918_00001.IMG-BIN.png]/>  [_recursion_count=0]
09:19:19.735 DEBUG PIL.PngImagePlugin - STREAM b'IHDR' 16 13
09:19:19.735 DEBUG PIL.PngImagePlugin - STREAM b'IDAT' 41 65536
Traceback (most recent call last):
  File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/bin/ocrd-segment-extract-lines", line 8, in <module>
    sys.exit(ocrd_segment_extract_lines())
  File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/ocrd_segment/cli.py", line 65, in ocrd_segment_extract_lines
    return ocrd_cli_wrap_processor(ExtractLines, *args, **kwargs)
  File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/ocrd/decorators/__init__.py", line 88, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/ocrd/processor/helpers.py", line 88, in run_processor
    processor.process()
  File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/ocrd_segment/extract_lines.py", line 171, in process
    transparency=self.parameter['transparency'])
  File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/ocrd/workspace.py", line 829, in image_from_segment
    fill=fill, transparency=transparency)
  File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/ocrd/workspace.py", line 1012, in _crop
    segment_polygon = coordinates_of_segment(segment, parent_image, parent_coords)
  File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/ocrd_utils/image.py", line 136, in coordinates_of_segment
    polygon = np.array(polygon_from_points(segment.get_Coords().points))
  File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/ocrd_utils/image.py", line 148, in polygon_from_points
    polygon.append([float(x_y[0]), float(x_y[1])])
ValueError: could not convert string to float: 

My expectation would be that this PAGE file would simply be ignored.
Please clarify.
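
The last two frames of the traceback suggest why an empty `points` attribute fails: splitting an empty string still yields one (empty) token, and `float('')` raises `ValueError`. The following is a minimal sketch reproducing that behaviour. The parsing loop is reconstructed from the traceback, not copied from `ocrd_utils`, and `polygon_from_points_tolerant` is a hypothetical helper used only for illustration.

```python
def polygon_from_points_sketch(points):
    """Parse 'x1,y1 x2,y2 ...' the way the traceback suggests (reconstruction)."""
    polygon = []
    for pair in points.split(' '):
        x_y = pair.split(',')
        polygon.append([float(x_y[0]), float(x_y[1])])
    return polygon


def polygon_from_points_tolerant(points):
    """Hypothetical variant: return an empty polygon for empty or blank input."""
    if not points or not points.strip():
        return []
    return polygon_from_points_sketch(points.strip())


# ''.split(' ') is [''], so the loop runs once and float('') raises ValueError.
try:
    polygon_from_points_sketch('')
except ValueError as err:
    print('failed as in the traceback:', err)

print(polygon_from_points_tolerant(''))
print(polygon_from_points_tolerant('0,0 10,0 10,5'))
```

A caller would still have to decide what to do with an empty polygon, for example skip the segment instead of cropping it.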

@kba
Member

kba commented Jun 8, 2022

The problem is that you have a text region with empty Coords. This is not allowed by the PAGE-XML schema, so you should get

Value '' is not facet-valid with respect to pattern '([0-9]+,[0-9]+ )+([0-9]+,[0-9]+)' for type 'PointsType'.

as error message.

How was the empty PAGE generated? If it's by an OCR-D processor, we need to fix it.
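
As a rough illustration of the schema constraint quoted above, the `PointsType` pattern from the error message can be checked with a regular expression before a file is handed to a processor. This is only a sketch: `is_valid_points` is a hypothetical helper, and it tests the pattern alone, not full schema validity.

```python
import re

# Pattern quoted in the schema error message (PointsType).
POINTS_PATTERN = re.compile(r'([0-9]+,[0-9]+ )+([0-9]+,[0-9]+)')


def is_valid_points(points):
    """Return True if the whole string matches the PointsType pattern."""
    return POINTS_PATTERN.fullmatch(points) is not None


print(is_valid_points(''))                   # empty attribute
print(is_valid_points('0,0 10,0 10,5 0,5'))  # four points
print(is_valid_points('0,0'))                # a single point
```

Because the pattern demands at least two coordinate pairs, both the empty string and a single point are rejected; only the four-point string passes.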

@stefanCCS
Author

I made this error myself ;-) I know that I need to correct something in my code. Still, even though it only occurs once, it stops me from going on with all the other regions. It would just be nice if extract-lines were a bit more robust.

@bertsky
Collaborator

bertsky commented Jun 8, 2022

We've discussed in the past whether OCR-D processors should be robust to invalid or unconventional PAGE. IIRC the general consensus was that it would cost too much in both coding effort (much more boilerplate, and more things one can do wrong or forget to do) and performance.

So the idea is to selectively use ocrd-segment-repair if you know you have problems in your input (or after some processor's output). Not sure if your particular case (missing @points) is already covered though.
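
To find out whether an input is affected before deciding on a repair step, one could scan the PAGE file for segments whose `Coords` carry no points. The sketch below uses only the standard library and matches elements by local name, so it does not depend on a particular PAGE namespace version. `segments_with_empty_points` is a hypothetical helper, not part of ocrd-segment-repair, and the namespace URI in the sample is a placeholder.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit('}', 1)[-1]


def segments_with_empty_points(page_xml):
    """Return the ids of elements whose Coords child has blank or missing points."""
    root = ET.fromstring(page_xml)
    bad_ids = []
    for element in root.iter():
        for child in element:
            if local_name(child.tag) == 'Coords' and not child.get('points', '').strip():
                bad_ids.append(element.get('id'))
    return bad_ids


# Namespace URI is a placeholder for illustration only.
SAMPLE = '''<pc:PcGts xmlns:pc="http://example.org/page">
  <pc:Page imageFilename="page.tif" imageWidth="1420" imageHeight="2313">
    <pc:TextRegion id="TR-1"><pc:Coords points=""/></pc:TextRegion>
    <pc:TextRegion id="TR-2"><pc:Coords points="0,0 10,0 10,5"/></pc:TextRegion>
  </pc:Page>
</pc:PcGts>'''

print(segments_with_empty_points(SAMPLE))
```

For the sample, only `TR-1` is reported; the listed ids could then be fixed or removed upstream.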

@stefanCCS
Author

Understood, of course. In this particular case, I have already fixed the root cause. Therefore, there is no need for something like ocrd-segment-repair.
I will close this issue here, now.

@bertsky
Collaborator

bertsky commented Jun 8, 2022

> Therefore, no need to do something like ocrd-segment-repair.

Too bad – I was quite curious how it would handle that case, you know :-)

@bertsky
Collaborator

bertsky commented Jun 8, 2022

> I was quite curious how it would handle that case, you know :-)

You guessed it: it wouldn't work!

I created OCR-D/core#877 for the core side, but we also have to handle that case differently in the repair code here. So let's keep this open, and I'll rename the issue.

@bertsky changed the title from 'ocrd-segment-extract-lines creates expection, when PAGE file is "empty"' to 'ocrd-segment-repair: handle case where points is empty' on Jun 8, 2022