Calamari 2.2 #61

Open
mikegerber opened this issue Mar 2, 2021 · 26 comments
Labels
blocked dependencies Pull requests that update a dependency file

Comments

@mikegerber
Collaborator

mikegerber commented Mar 2, 2021

Calamari 2.0 is out.

I don't see benefits from updating the dependency, other than staying up to date and compatible.

@mikegerber mikegerber self-assigned this Mar 2, 2021
@mikegerber mikegerber added the dependencies Pull requests that update a dependency file label Mar 2, 2021
@kba
Member

kba commented Mar 2, 2021

@maxnth @andbue @ChWick can answer this more accurately but there were quite a few refactorings and performance enhancements since the last 1.x release v1.0.5. It is not time-critical to upgrade as soon as possible but I think it would be good to get started assessing what the benefits are, what has changed and how to proceed with adapting.

@mikegerber
Collaborator Author

#54 is possibly related

@andbue

andbue commented Mar 2, 2021

Hi, I don't think it should be too complicated to update ocr_calamari to 2.0. The whole preprocessing stuff is loaded according to the definition in the model, so I was wrong to assume you're somehow circumventing it – sorry about that!
The cleanest and probably most efficient way would be to implement a custom DataReader class that somehow handles the workspace data. I'm not sure, however, if your data classes can be pickled and sent to worker threads without problems.

If it helps, you could have a look at my client that caches preprocessed lines in a hdf5 file. I wrote a reader and included it in the DataReaderFactory here. Prediction happens here. Other than setting up the dataset and the removal of the preprocessing (which you should avoid), this is in most parts taken from predict.py anyway.

@bertsky
Contributor

bertsky commented Jun 19, 2021

Just wanted to note that Calamari 2 depends on tfaip which requires Python >= 3.7 – which would remove support for the default Python version on OCR-D's (still) default target Ubuntu 18.

@mikegerber
Collaborator Author

mikegerber commented Feb 23, 2022

I have opened Calamari-OCR/calamari#304 because Calamari 2.1.x depended on TF 2.4.x (PyPI-incompatible with Python 3.9...), but @andbue already updated Calamari 2.2(!) to remove this restriction. 👍

With this TF version hell I think I'll first update the test rigging to test on all Python versions 3.7-3.9 (maybe even 3.10).

@mikegerber mikegerber changed the title Calamari 2.0 Calamari 2.2 Feb 23, 2022
@mikegerber
Collaborator Author

Heads up: I'm working on this

@bertsky
Contributor

bertsky commented Apr 27, 2022

@mikegerber, do you already have something you could share (as a feature branch)? I guess there are multiple API changes to cope with, plus perhaps a need to deal with model migration?

@stefanCCS

Hi everybody,
any update concerning this, now that OCR-D supports Python 3.7?

@mikegerber
Collaborator Author

Sorry for not keeping anyone up to date: I plan to work on this further in the coming week!

@bertsky
Contributor

bertsky commented May 19, 2022

Do you have any news for us @mikegerber? Have you looked at the native PAGE-XML output of Calamari 2 – is it re/usable?

@mikegerber
Collaborator Author

Sorry... I have been neglecting this. I will try to finish this soon after my vacation. PRs are welcome though, if they come in the meantime.

@mikegerber
Collaborator Author

A combination of bad time management, serious illness (for months!) and the resulting backlog led to more delay...

@mikegerber
Collaborator Author

Blocked by #84 (the GPL issues), as it seems. 🙄

@mikegerber
Collaborator Author

@bertsky in #87 (comment):

I would say that 2.x support is quite urgent (because most/best models are trained on 2.x). Given that Calamari 2.x now has good native PAGE support, this should actually be easy IIUC.

We have 2 workarounds for that:

1. extracting line pairs via ocrd-segment-extract-lines, running the 2.x calamari-predict on them, and then re-importing with ocrd-segment-replace-text
   ```
    ocrd-segment-extract-lines -I $IGRP -O LINES
    calamari-predict --pipeline.num_processes 4 --checkpoint /path/to/\*.json --data.images "LINES/*.png"
    ocrd-segment-replace-text -I $IGRP -O $OGRP -P file_glob "LINES/*.pred.txt"
   ```

2. running the 2.x calamari-predict on the PAGE files directly and then reimporting the resulting PAGE files into the METS via bulk-add
   ```
    calamari-predict --checkpoint /path/to/deep3_lsh4/\*.json --data PageXML --data.xml_files "$IGRP/*.xml" --data.images "$IMGGRP/*.png" --data.output_glyphs True --data.max_glyph_alternatives 5 --data.output_confidences True
    ocrd workspace find -m application/vnd.prima.page+xml -G $IGRP -k page_id -k file_id -k url | while read page_id file_id url; do out=${url%.xml}.pred.xml; file_id=${file_id//$IGRP/$OGRP}; url=${url//$IGRP/$OGRP}; url=${url//pred.}; mv $out $url; echo $page_id $file_id $url; done | ocrd workspace bulk-add -r '(?P<pageid>.*) (?P<fileid>.*) (?P<url>.*)' -G $OGRP -g '{{ pageid }}' -i '{{ fileid }}' -S '{{ url }}' -
   ```
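For readers untangling the one-liner in the second workaround, the per-file renaming it performs can be sketched in Python. This only illustrates the shell parameter expansions and is not part of ocrd_calamari or Calamari; the function name `plan_reimport` and the example group names are made up, and it assumes the file group names contain no glob characters.

```python
def plan_reimport(page_id, file_id, url, igrp, ogrp):
    """Mirror the shell expansions of the bulk-add recipe for one PAGE file.

    Returns (pred_path, new_file_id, new_url):
    - pred_path: where calamari-predict is expected to have written its output
    - new_file_id / new_url: what the result is registered as in the output group
    page_id is passed through unchanged by the recipe, so it is not returned.
    """
    # out=${url%.xml}.pred.xml  (strip a trailing ".xml", append ".pred.xml")
    stem = url[:-len(".xml")] if url.endswith(".xml") else url
    pred_path = stem + ".pred.xml"
    # file_id=${file_id//$IGRP/$OGRP}  (replace every occurrence)
    new_file_id = file_id.replace(igrp, ogrp)
    # url=${url//$IGRP/$OGRP}; url=${url//pred.}
    new_url = url.replace(igrp, ogrp).replace("pred.", "")
    return pred_path, new_file_id, new_url


if __name__ == "__main__":
    # Hypothetical group names, chosen only for illustration.
    print(plan_reimport("P0001", "OCR-D-SEG_0001",
                        "OCR-D-SEG/OCR-D-SEG_0001.xml",
                        "OCR-D-SEG", "OCR-D-OCR"))
```

The shell version then moves `pred_path` to `new_url` and feeds the three fields to `ocrd workspace bulk-add`.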

But in both cases we lose any information below the line level, including confidence (and we get no model provenance here). Also, with these recipes we cannot use the regular specialised workflow formats.

This is all interesting, but does Calamari 2.x now have a valid license? Otherwise this is still blocked and I will not work on this.

@bertsky
Contributor

bertsky commented Aug 16, 2023

This is all interesting, but does Calamari 2.x now have a valid license? Otherwise this is still blocked and I will not work on this.

I don't get why the switch to GPL would be such a blocker. Is it for you personally or by requirement?

Anyway, if you give me a definite answer then I could decide if I want to take over from here.

@mikegerber
Collaborator Author

mikegerber commented Aug 16, 2023

There are multiple issues:

  • Calamari 2.x's Apache 2.0 license is invalid as it uses tfaip.
  • I do not think we can use it without also violating licenses.
  • Even if it were legally OK to use, I would say we would need to relicense to GPL.

IF Calamari 2.x

  1. CAN relicense to GPL (see License Calamari-OCR/calamari#3)
  2. and they DO relicense to GPL

THEN we could relicense to GPL.
IF my employer agrees. (Personally I would agree to this. Not sure if that even matters if I agree.)

(Calamari 1.x is fine as it does not use tfaip.)

@mikegerber
Collaborator Author

I'll discuss abandoning maintainership of ocrd_calamari with @cneud, but this is going to wait until at least October (I have major surgery in September).

The situation with Calamari 2.x is such that I won't use it (and a future ocrd_calamari for 2.x) due to the unresolved licensing problems; I can't legally use it.

@bertsky
Contributor

bertsky commented Aug 16, 2023

The situation with Calamari 2.x is such that I won't use it (and a future ocrd_calamari for 2.x) due to the unresolved licensing problems; I can't legally use it.

Just to be as precise as possible here: you cannot use it as long as there are legal inconsistencies, or as soon as GPL kicks in?

@mikegerber
Collaborator Author

mikegerber commented Aug 16, 2023

The situation with Calamari 2.x is such that I won't use it (and a future ocrd_calamari for 2.x) due to the unresolved licensing problems; I can't legally use it.

Just to be as precise as possible here: you cannot use it as long as there are legal inconsistencies, or as soon as GPL kicks in?

I'm not a lawyer, and perhaps we should discuss this (after my health stuff :)) in a video call soon, but this is how I see the situation:

a. Calamari 2.x's license is invalid. It simply can't have an Apache license while using the GPL library tfaip.*
b. Therefore - I believe - I can't use it as a user (or depend on it, as a developer, in my "own" project ocrd_calamari)

In the hypothetical situation of Calamari going GPL**, I personally do not have a problem with a GPL'ed ocrd_calamari. There's some potentially blocking red tape involved (my employer and all contributors must agree), but I - as one of the main contributors - would do it.

* This is clear IMHO, and it doesn't matter that it's Python and not using ld
** If it can, given legacy licensing stuff from Kraken(?) or whatever the issue was

@mikegerber
Collaborator Author

mikegerber commented Aug 16, 2023

(Deliberately avoiding terms like "enforcing"; I think that was used in the wrong way in discussions.

I'd also like to stress that I do not care that much about licenses; it just seems to be a serious and show-stopping "bug" that there's this problem. As long as it's open source it doesn't matter to me, except that I would avoid GPL for library code, because it would cause exactly this kind of legal situation.)

@mikegerber
Collaborator Author

Another way around it: isolate the GPL'ed code. If you just call os.system('gpl-binary'), then you also have no problem, even when gpl-binary is GPL. Maybe worth checking, but it will be a pain to maintain properly.
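A minimal sketch of what such isolation could look like, using `subprocess.run` from the standard library instead of `os.system` so that output and exit status can be captured. `gpl-binary` is the placeholder name from the comment above, `run_isolated` is a hypothetical helper, and whether process separation actually resolves the licensing question is a legal matter this sketch does not settle.

```python
import subprocess
import sys


def run_isolated(command, timeout=60):
    """Run an external program in its own process and return its stdout.

    command is a list such as ["gpl-binary", "--some-flag"]; nothing from the
    external program is imported into this process. Raises
    subprocess.CalledProcessError if the program exits with a non-zero status.
    """
    completed = subprocess.run(
        command,
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,            # decode output as str
        check=True,           # fail loudly on a non-zero exit status
        timeout=timeout,
    )
    return completed.stdout


if __name__ == "__main__":
    # Stand-in for 'gpl-binary': the current Python interpreter as a child process.
    print(run_isolated([sys.executable, "-c", "print('ok')"]))
```

For Calamari 2.x this would presumably mean invoking the `calamari-predict` CLI as in the workarounds quoted earlier and reading its output files back in, which is where the maintenance pain comes from.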

@mikegerber
Collaborator Author

The situation with Calamari 2.x is such that I won't use it (and a future ocrd_calamari for 2.x) due to the unresolved licensing problems; I can't legally use it.
Just to be as precise as possible here: you cannot use it as long as there are legal inconsistencies, or as soon as GPL kicks in?

Short answer: Because the license is invalid. If it were GPL, we (the developers of ocrd_calamari) could update IF we move to GPL too.

@bertsky
Contributor

bertsky commented Jul 31, 2024

Thanks for being so precise and sharing your concerns!

I suggest we try to convince @andbue and @chreul to go GPL with Calamari-OCR, and then proceed with the OCR-D wrapper for 2.x here.

Native PAGE-XML support (via dataset type for input and output) does help, but I'm not sure how we can guarantee OCR-D's incremental annotation principle: we must not throw away information, even if it's irrelevant to the OCR. Also complicating the matter is the fact that OCR-D requires using AlternativeImage on all hierarchy levels and adhering to @orientation etc.

Can you please elaborate on the state of your migration (esp. around these issues) so far (or back when you were working on it)?

@mikegerber
Collaborator Author

Thanks for being so precise and sharing your concerns!

I suggest we try to convince @andbue and @chreul to go GPL with Calamari-OCR, and then proceed with the OCR-D wrapper for 2.x here.

There seems to be another issue with that: Calamari-OCR/calamari#3

@mikegerber
Collaborator Author

mikegerber commented Aug 20, 2024

Can you please elaborate on the state of your migration (esp. around these issues) so far (or back when you were working on it)?

This is 100% blocked by these licensing issues; I will not work on it further until these are resolved.

@bertsky
Contributor

bertsky commented Aug 20, 2024

There seems to be another issue with that: Calamari-OCR/calamari#3

Like I said on that thread:

I also don't think the licensing deviation from Ocropy is of concern. Calamari by being GPLed cannot in any way violate Apache'd old Ocropy.

So it's not another issue AFAICS.

This is 100% blocked by these licensing issues; I will not work on it further until these are resolved.

I understood that much, but I would really like to know how far you got. (It would help in gauging the best way to proceed currently: bringing TF SavedModel format to old Calamari versions vs. bringing OCR-D to 2.x next.)

5 participants