Port to v3 #44

kba · 2024-08-11T12:55:54Z

No description provided.

bertsky

Perfect!

bertsky

now needs update

bertsky · 2024-08-30T13:47:19Z

tests/conftest.py

+CONFIGS = ['', 'pageparallel', 'metscache', 'pageparallel+metscache']
+
+@pytest.fixture(params=CONFIGS)
+def workspace(tmpdir, pytestconfig, request):
+    def _make_workspace(workspace_path):
+        initLogging()
+        if pytestconfig.getoption('verbose') > 0:
+            setOverrideLogLevel('DEBUG')
+        with pushd_popd(tmpdir):
+            directory = str(tmpdir)
+            resolver = Resolver()
+            workspace = resolver.workspace_from_url(workspace_path, dst_dir=directory, download=True)
+            config.OCRD_MISSING_OUTPUT = "ABORT"
+            if 'metscache' in request.param:
+                config.OCRD_METS_CACHING = True
+                print("enabled METS caching")
+            if 'pageparallel' in request.param:
+                config.OCRD_MAX_PARALLEL_PAGES = 4
+                print("enabled page-parallel processing")
+                def _start_mets_server(*args, **kwargs):
+                    print("running with METS server")
+                    server = OcrdMetsServer(*args, **kwargs)
+                    server.startup()
+                process = Process(target=_start_mets_server,
+                                  kwargs={'workspace': workspace, 'url': 'mets.sock'})
+                process.start()
+                sleep(1)
+                workspace = Workspace(resolver, directory, mets_server_url='mets.sock')
+                yield {'workspace': workspace, 'mets_server_url': 'mets.sock'}
+                process.terminate()
+            else:
+                yield {'workspace': workspace}
+        config.reset_defaults()
+    return _make_workspace
+
+
+@pytest.fixture
+def workspace_manifesto(workspace):
+    yield from workspace(assets.path_to('communist_manifesto/data/mets.xml'))
+
+@pytest.fixture
+def workspace_aufklaerung(workspace):
+    yield from workspace(assets.path_to('kant_aufklaerung_1784/data/mets.xml'))


BTW, this could be a template for all processor tests. Testing both with and without a METS Server is important IMO.

We can easily add more configuration scenarios there.
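
As a rough illustration of how further scenarios could be added, here is a minimal sketch that derives the `+`-joined parameter strings and their config overrides from a lookup table. The names `SCENARIO_OVERRIDES`, `overrides_for` and `all_configs` are hypothetical and not part of this PR; only the setting names and values (`OCRD_METS_CACHING = True`, `OCRD_MAX_PARALLEL_PAGES = 4`) are taken from the fixture above. Unlike the fixture, which checks for substrings, this sketch splits the parameter on `+`.

```python
from itertools import combinations

# Hypothetical lookup table: scenario flag -> config overrides.
# The two entries mirror what the fixture sets by hand.
SCENARIO_OVERRIDES = {
    'metscache': {'OCRD_METS_CACHING': True},
    'pageparallel': {'OCRD_MAX_PARALLEL_PAGES': 4},
}

def overrides_for(param):
    """Collect the config overrides for a '+'-joined scenario string."""
    overrides = {}
    # filter(None, ...) drops the empty string produced by the baseline ''
    for flag in filter(None, param.split('+')):
        overrides.update(SCENARIO_OVERRIDES[flag])
    return overrides

def all_configs(flags):
    """Enumerate every combination of flags, starting with the empty baseline."""
    return ['+'.join(combo)
            for size in range(len(flags) + 1)
            for combo in combinations(flags, size)]

# Reproduces the CONFIGS list from tests/conftest.py
CONFIGS = all_configs(['pageparallel', 'metscache'])
```

With such a table, a new scenario would only need one more entry in `SCENARIO_OVERRIDES` and one more name in the list passed to `all_configs`. Scenarios that need extra setup, like starting the METS Server, would still need their own branch in the fixture.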

bertsky · 2024-08-30T13:47:51Z

tests/test_recognize.py

+def test_recognize(workspace_aufklaerung):
+    # some models (like default en) require binarized images
+    run_processor(KrakenBinarize,
+                  input_file_grp="OCR-D-GT-PAGE",
+                  output_file_grp="OCR-D-GT-PAGE-BIN",
+                  **workspace_aufklaerung,
+    )
+    run_processor(KrakenRecognize,
+                  # re-use layout, overwrite text:
+                  input_file_grp="OCR-D-GT-PAGE-BIN",
+                  output_file_grp="OCR-D-OCR-KRAKEN",
+                  parameter={'overwrite_text': True},
+                  **workspace_aufklaerung,
+    )
+    ws = workspace_aufklaerung['workspace']
+    ws.save_mets()
+    assert os.path.isdir(os.path.join(ws.directory, 'OCR-D-OCR-KRAKEN'))
+    results = ws.find_files(file_grp='OCR-D-OCR-KRAKEN', mimetype=MIMETYPE_PAGE)
+    result0 = next(results, False)
+    assert result0, "found no output PAGE file"
+    result0 = page_from_file(result0)
+    text0 = result0.etree.xpath('//page:Glyph/page:TextEquiv/page:Unicode', namespaces=NAMESPACES)
+    assert len(text0) > 0, "found no glyph text in output PAGE file"


And here is the consumer part.
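
For readers unfamiliar with the final assertion, the glyph-text check can be sketched without OCR-D installed. This version swaps in the standard library's `xml.etree.ElementTree` for the lxml tree the test gets from `page_from_file`, and runs against a hand-written, heavily abridged PAGE fragment (not schema-valid, for illustration only). The helper `glyph_texts` is hypothetical.

```python
import xml.etree.ElementTree as ET

# PAGE-XML namespace (2019-07-15 schema version)
NAMESPACES = {'page': 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15'}

# Hand-written, abridged PAGE fragment (coordinates etc. omitted)
SAMPLE_PAGE = """\
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <TextRegion id="r1">
      <TextLine id="l1">
        <Word id="w1">
          <Glyph id="g1"><TextEquiv><Unicode>A</Unicode></TextEquiv></Glyph>
          <Glyph id="g2"><TextEquiv><Unicode>n</Unicode></TextEquiv></Glyph>
        </Word>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""

def glyph_texts(page_xml):
    """Return the text of every Glyph-level TextEquiv/Unicode element."""
    root = ET.fromstring(page_xml)
    # ElementTree needs a relative path ('.//') where lxml's xpath accepts '//'
    nodes = root.findall('.//page:Glyph/page:TextEquiv/page:Unicode', NAMESPACES)
    return [node.text for node in nodes]

texts = glyph_texts(SAMPLE_PAGE)
assert len(texts) > 0, "found no glyph text in output PAGE file"
```

Checking at glyph level is a fairly strict smoke test: it only passes if the recognizer wrote results all the way down the hierarchy, not just line-level text.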

bertsky · 2024-08-30T13:49:29Z

Makefile

@@ -68,7 +68,7 @@ docker:

 # Run test
 test: tests/assets
-	$(PYTHON) -m pytest tests $(PYTEST_ARGS)
+	$(PYTHON) -m pytest  tests --durations=0 $(PYTEST_ARGS)


And with this we get to see what difference in performance these settings make:

93.35s call tests/test_recognize.py::test_recognize[pageparallel+metscache]
92.28s call tests/test_recognize.py::test_recognize[pageparallel]
76.19s call tests/test_recognize.py::test_recognize[]
74.83s call tests/test_recognize.py::test_recognize[metscache]
55.92s call tests/test_segment.py::test_run_blla[metscache]
55.11s call tests/test_segment.py::test_run_blla[]
48.43s call tests/test_segment.py::test_run_blla[pageparallel+metscache]
41.80s call tests/test_segment.py::test_run_blla[pageparallel]

(In this case, it was only 2 pages – the scaling factor is not so great.)
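
To make the comparison easier to read, the durations reported above can be normalized against the unparametrized run (`[]`) of each test. The following is a minimal sketch; the helper names `parse` and `relative_to_baseline` are made up for illustration, and the numbers are copied verbatim from the comment.

```python
import re

# Durations as reported by `pytest --durations=0` in the comment above
REPORT = """\
93.35s call tests/test_recognize.py::test_recognize[pageparallel+metscache]
92.28s call tests/test_recognize.py::test_recognize[pageparallel]
76.19s call tests/test_recognize.py::test_recognize[]
74.83s call tests/test_recognize.py::test_recognize[metscache]
55.92s call tests/test_segment.py::test_run_blla[metscache]
55.11s call tests/test_segment.py::test_run_blla[]
48.43s call tests/test_segment.py::test_run_blla[pageparallel+metscache]
41.80s call tests/test_segment.py::test_run_blla[pageparallel]
"""

# <seconds>s call <test id>[<fixture param>]
LINE = re.compile(r'^(?P<secs>[\d.]+)s call (?P<test>\S+?)\[(?P<config>[^\]]*)\]$')

def parse(report):
    """Map test id -> {config: seconds}."""
    timings = {}
    for line in report.splitlines():
        match = LINE.match(line.strip())
        if match:
            timings.setdefault(match['test'], {})[match['config']] = float(match['secs'])
    return timings

def relative_to_baseline(timings):
    """Divide each duration by the duration of the '' (baseline) config."""
    return {test: {cfg: secs / by_cfg[''] for cfg, secs in by_cfg.items()}
            for test, by_cfg in timings.items()}

ratios = relative_to_baseline(parse(REPORT))
for test, by_cfg in ratios.items():
    for cfg, ratio in sorted(by_cfg.items(), key=lambda item: item[1]):
        print(f"{test}[{cfg}]: {ratio:.2f}x baseline")
```

On this two-page sample, page-parallel processing takes roughly 0.76 times the baseline for segmentation but roughly 1.21 times the baseline for recognition, while METS caching alone stays within a few percent of the baseline in both tests.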

kba added 4 commits August 11, 2024 14:42

bump requirement to ocrd >= 3.0.0a1

901098a

port binarize to v3

78849a9

port segment to v3

30db9a4

port recognize to v3

9ea80c7

kba requested review from MehmedGIT and bertsky August 11, 2024 12:55

bertsky approved these changes Aug 11, 2024

MehmedGIT approved these changes Aug 12, 2024

bertsky mentioned this pull request Aug 12, 2024

New processor API OCR-D/core#1240

Open

bertsky requested changes Aug 13, 2024

ocrd_kraken/binarize.py (outdated, resolved)

ocrd_kraken/recognize.py (outdated, resolved)

ocrd_kraken/recognize.py (outdated, resolved)

ocrd_kraken/segment.py (outdated, resolved)

bertsky added 5 commits August 14, 2024 00:16

ocrd-tool.json: add cardinality specs

163ee7d

test_binarize.py: use stable API

41b0045

test_recognize.py: use stable API

340f513

test_segment.py: use stable API

cd0ce01

remove fileGrp cardinality assertions

4671e98

bertsky self-requested a review August 13, 2024 22:25

bertsky approved these changes Aug 13, 2024

MehmedGIT reviewed Aug 13, 2024

ocrd_kraken/binarize.py (outdated, resolved)

bertsky and others added 3 commits August 14, 2024 08:12

binarize: re-instate setup for logger

a497287

adapt to bertsky/core#8

c0c1eb7

Merge branch 'port-to-v3' of https://github.com/OCR-D/ocrd_kraken into port-to-v3

712d1d3

(Conflicts: ocrd_kraken/binarize.py)

kba mentioned this pull request Aug 14, 2024

Processor result object bertsky/core#8

Merged

kba and others added 8 commits August 15, 2024 14:24

require regex

e8ec7fe

update to OcrdPageResult change

e76d708

update to latest OcrdPageResult and process_page_pcgts

2832722

CI: switch back to Ubuntu

a8a859b

(after MacOS fails with `torch ... not supported on this platform` 🙄 )

self.logger: adapt to bertsky/core#10

0e30138

tests: migrate unittest→pytest, simplify

6d287b0

tests: base→conftest

316eedb

tests: also w/ METS server + page-parallel and w/ METS caching

43c600f

bertsky added 6 commits August 30, 2024 13:55

remove v2 tool facility

32b2e9c

tests: use workspace manifesto→aufklaerung (1→2 pages), binarize ad hoc where needed

c73b3ef

tests: avoid running into 'too many failures'

a23d4c3

update v3 requirement

ae6445b

tests: add actual assertions

fd15e2a

update v3 requirement

43a88ea

bertsky reviewed Aug 30, 2024

bertsky mentioned this pull request Sep 16, 2024

migrate to core v3 OCR-D/ocrd_calamari#117

Draft
