[OCR] Convert jupyter notebook (with bash script) to Python script #254

Open
cuducos opened this issue Jun 19, 2017 · 7 comments

Comments

@cuducos
Collaborator

cuducos commented Jun 19, 2017

Currently the code that generates a dataset with the text from CEAP receipts is in this notebook.

As it uses shell scripts here and there, it would be great to have this as a standard src/ Python file, without shell scripts, to automate this data collection without so many dependencies, such as binaries available in one's $PATH.

@jtemporal
Collaborator

#207 documents how to generate the dataset that is already on our S3, but there's no Python script for it. Thanks @fgrehm for the data btw ;)

@tuliocasagrande
Contributor

Hello @cuducos and @jtemporal

I took a quick look at the problem and it seems we're pending on the ReimbursementOCR to be implemented. Is that correct?
The google-cloud-vision Python library could also be a good option.

@cuducos
Collaborator Author

cuducos commented Jun 20, 2017

> I took a quick look at the problem and it seems we're pending on the ReimbursementOCR to be implemented. Is that correct?

Exactly — a bit of background just in case ; )

@fgrehm
Contributor

fgrehm commented Jun 20, 2017

I actually did the OCR with Python, and it even runs things in parallel 😄 :mindblown:

Just check the stuff I linked on #188 (comment) and LMK if u need any help with that!
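
For readers following along, running OCR over many receipts in parallel can be sketched roughly as below. This is only an illustration and not the code linked above: `ocr_receipt` is a hypothetical stand-in for whatever OCR call is actually used, and the file names are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_receipt(path):
    """Hypothetical stand-in for a real OCR call (e.g. a request to an OCR API)."""
    return f"text extracted from {path}"


def ocr_in_parallel(paths, max_workers=4):
    """Run ocr_receipt over many files concurrently, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs
        texts = list(pool.map(ocr_receipt, paths))
    return dict(zip(paths, texts))


if __name__ == "__main__":
    results = ocr_in_parallel(["receipt-1.pdf", "receipt-2.pdf"])
    for path, text in results.items():
        print(path, "->", text)
```

Threads are a reasonable default when the OCR step is a network request; for CPU-bound local OCR, `ProcessPoolExecutor` from the same module may be a better fit.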

@jandersoncoelho

Have you seen http://ocrmypdf.readthedocs.io/ ? I use that in my research. It's an interface to Tesseract OCR, and the results seem good.

@fgrehm
Contributor

fgrehm commented Nov 27, 2017

Heads up: I've been hacking away on a better approach for OCRing receipts at https://github.com/fgrehm/serenata-ocr and one of the ideas is that it will support a "pluggable provider interface", meaning people can choose between Google Cloud Vision, https://ocr.space/, Microsoft Azure and maybe even some self-hosted Tesseract infra.
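
To make the "pluggable provider interface" idea concrete, here is a minimal sketch of what such a design could look like in Python. It is an assumption for illustration only, not the actual serenata-ocr code: `OCRProvider`, `register`, `get_provider` and `DummyProvider` are all hypothetical names.

```python
from abc import ABC, abstractmethod


class OCRProvider(ABC):
    """Common interface every OCR backend would implement (hypothetical)."""

    @abstractmethod
    def extract_text(self, document: bytes) -> str:
        ...


# Maps a provider name to the class that implements it
PROVIDERS = {}


def register(name):
    """Class decorator that adds a provider to the registry under `name`."""
    def decorator(cls):
        PROVIDERS[name] = cls
        return cls
    return decorator


@register("dummy")
class DummyProvider(OCRProvider):
    """Placeholder backend; a real one would call an OCR service or Tesseract."""

    def extract_text(self, document: bytes) -> str:
        return document.decode("utf-8", errors="replace")


def get_provider(name) -> OCRProvider:
    """Instantiate the provider registered under `name`."""
    try:
        return PROVIDERS[name]()
    except KeyError:
        raise ValueError(f"unknown OCR provider: {name!r}") from None
```

With this shape, calling code would only depend on `get_provider("dummy").extract_text(data)`, so switching backends would come down to changing the provider name.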

@fgrehm
Contributor

fgrehm commented Nov 29, 2017

See #298
