New dataset: OCR for meal expenses since 2015 using google's cloud vision API #188

Closed
fgrehm opened this issue Feb 11, 2017 · 29 comments

@fgrehm (Contributor) commented Feb 11, 2017

Some stats:

/cc @Irio @anaschwendler

@cuducos (Collaborator) commented Feb 11, 2017

Wow… great progress, great data extracted from the PDFs ; )

Some general comments trying to help in the direction of PRs to this repo:

  • We already have a script to fetch all the PDFs, so we need to find a way to convert PDFs to PNG (or whatever) without using os.system("pdftoppm …"), for greater compatibility (Windows users? Or even different *nix users: brew install vs. apt-get install vs. macports install vs. yum install, and so on). But maybe that's too utopian…
  • A script to actually use the Google service and create the data/txt (a rough sketch of both steps follows this list)
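
For reference, a minimal sketch of both steps, assuming the pdf2image and google-cloud-vision Python packages (neither is part of this repo; pdf2image still needs poppler installed, it only replaces the raw os.system call with a subprocess wrapper):

```python
# Sketch only: convert a receipt PDF to PNG pages and OCR them with Google Cloud Vision.
# Assumes `pip install pdf2image google-cloud-vision` and GOOGLE_APPLICATION_CREDENTIALS set.
from pdf2image import convert_from_path
from google.cloud import vision


def ocr_receipt(pdf_path):
    client = vision.ImageAnnotatorClient()
    texts = []
    for index, page in enumerate(convert_from_path(pdf_path, dpi=300)):
        png_path = "{}-{}.png".format(pdf_path, index)
        page.save(png_path, "PNG")  # pdf2image returns PIL images
        with open(png_path, "rb") as image_file:
            image = vision.Image(content=image_file.read())
        response = client.text_detection(image=image)
        if response.text_annotations:
            # the first annotation holds the full detected text for the page
            texts.append(response.text_annotations[0].description)
    return "\n".join(texts)
```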

What do you think?

@fgrehm (Contributor, Author) commented Feb 11, 2017 via email

@cuducos (Collaborator) commented Feb 12, 2017

No worries at all, mate. I think this Issue is pretty useful as it is. People interested in OCR can learn from your experience, try it, get access to data etc. I'll leave it as it is ; )

@fgrehm (Contributor, Author) commented Feb 13, 2017

No worries. I've finished the OCR of "Publicity of parliamentary activity" and "Taxi, toll and parking" since 2015: https://gist.github.com/fgrehm/572ba814d617e831f4b1faac5e0b9165

I still have $100 left 😱 but that's not enough to OCR the 140k meal reimbursements, which is what @anaschwendler suggested I do next in a chat over Telegram. I guess for now I'll proceed with the subquotas that have fewer reimbursements until I'm done with those credits, and I'll share a single zip file with you all containing everything at some point this week 🤘

As a side note, I know that the generated text is not 100% accurate, but I think we could still have this data on an Elasticsearch instance somewhere for full-text searching 💭
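
Just to illustrate that idea, a minimal sketch assuming the official elasticsearch Python client and a local instance (index and field names are made up):

```python
# Sketch only: index OCRed receipt texts in Elasticsearch and run a full-text query.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# index one OCRed receipt (document_id and text are hypothetical field names)
es.index(index="receipts", id="5635048",
         document={"document_id": "5635048", "text": "RESTAURANTE EXEMPLO LTDA ..."})

# even with imperfect OCR, a match query gives usable full-text search
results = es.search(index="receipts", query={"match": {"text": "restaurante"}})
for hit in results["hits"]["hits"]:
    print(hit["_source"]["document_id"], hit["_score"])
```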

@fgrehm (Contributor, Author) commented Feb 13, 2017

Also, I've "merged" the code from the 3 initial notebooks into a couple of classes that you can find in the gist above. If that's good enough for a PR to the toolbox, let me know and I'll try to get it going when I have a chance.

@pedrommone (Contributor) commented:

@fgrehm is there no way to ask Google for a free tier for use in open source projects? I guess they are very welcoming in cases like that.

@fgrehm (Contributor, Author) commented Feb 13, 2017

@pedrommone Probably, yes, but that's something that Serenata's core team will have to do once this has been wired up with Rosie / Jarbas / etc. My idea with this is to provide enough ammo for analysis to check if it is going to be worth the trouble having this in the first place 😄

@cuducos (Collaborator) commented Feb 13, 2017

provide enough ammo for analysis to check if it is going to be worth the trouble having this in the first place

❤️

@fgrehm (Contributor, Author) commented Mar 1, 2017

Here's the dataset with the receipt texts and some numbers about it:

  • https://we.tl/p1GdsMxPH1
  • Out of 57460 meal reimbursements since 2015, 56710 have OCR data
  • Out of 65455 fuels and lubricants reimbursements in 2016, 64989 have OCR data
  • Out of 70341 "publicity of parliamentary activity" and "taxi, toll and parking" since 2015, 56309 have OCR data

The process of obtaining those texts can be seen in the following gists:

I believe that @Irio has already uploaded the zip file with the raw JSON responses to S3, but I couldn't find the "easy to use CSV" version linked above. Once that file is uploaded to S3, I guess we can close this issue and GH-173. An example of usage is coming up as a PR in a bit 🎉 🍻


EDIT Here is the breakdown of OCRed documents per subquota:

Fuels and lubricants                         64989
Congressperson meal                          56715
Taxi, toll and parking                       41922
Publicity of parliamentary activity          14387
Consultancy, research and technical work      5082
Flight tickets                                5010
Postal services                               4804
Terrestrial, maritime and fluvial tickets     1786
Aircraft renting or charter of aircraft        589
Watercraft renting or charter                   69

@cuducos (Collaborator) commented Mar 10, 2017

@fgrehm I'm so sorry. I had a couple of unforeseen situations this week and couldn't follow up with you in time. Now it looks like the file at WeTransfer is not available anymore. Can you re-upload it or send it to me via private message so I can upload it (and the one from #173) to S3?

Also, I haven't checked, but just as a reminder if that's the case: would you mind documenting these new datasets in CONTRIBUTING.md? (I just detailed that in a comment on #197.)

@michelpereira (Contributor) commented:

Hi, all. My recommendation is to run the API again, because Google updated the API to v1.1 with new features, for example Document Text Detection.
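
For whoever reruns it, the newer feature is exposed by the Python client roughly like this (a sketch, not tested against these receipts):

```python
# Sketch only: Document Text Detection is tuned for dense text such as receipts.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("receipt.png", "rb") as image_file:
    image = vision.Image(content=image_file.read())

response = client.document_text_detection(image=image)
# full OCR output; pages/blocks/paragraphs are also available on full_text_annotation
print(response.full_text_annotation.text)
```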

@fgrehm (Contributor, Author) commented May 3, 2017

@michelpereira yeah, thanks for the heads up. That feature was released after I was done with this initial processing, and I only found out about it while working on the dataset documentation I have in the works 😄 I'm on a work trip right now, but I'm going to wrap up that PR as soon as I'm back home.

@silviodc commented May 7, 2017

Hi guys,
After analyzing some data and checking the CEAP rules (translated from the Portuguese):
3. The document proving the payment cannot have erasures, additions, amendments or interlineations; it must contain a date and must describe the services or materials item by item, without generalizations or abbreviations.

So, we have some receipts like this one:
http://www.camara.gov.br/cota-parlamentar/documentos/publ/2398/2015/5635048.pdf

According to the article mentioned above, reimbursements can't be generalizations. So, could we consider this an invalid reimbursement?

@cuducos (Collaborator) commented May 10, 2017

According to the article mentioned above, reimbursements can't be generalizations. So, could we consider this an invalid reimbursement?

Sure thing. If we could identify all receipts with handwritten descriptions such as that one, we would have thousands of new suspicions ; )

@silviodc commented:

Hi @cuducos

For these reimbursements with handwritten descriptions, I built a classifier using deep learning (Keras) which achieved interesting results. It is in #238.
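
Not the model from #238, but a minimal sketch of the general idea with tf.keras (input size and architecture are illustrative assumptions only):

```python
# Sketch only: a tiny CNN that labels a receipt image as handwritten (1) or printed (0).
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(...) would take labelled receipt images split into training and validation sets
```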

@gustavo-momente commented:

@fgrehm is there no way to ask Google for a free tier for use in open source projects? I guess they are very welcoming in cases like that.

@fgrehm, @pedrommone Maybe you could try the TensorFlow Research Cloud.

@fgrehm (Contributor, Author) commented Jul 30, 2017

The datasets are on S3 and docs have been merged. Please follow up on ☝️ for additional efforts on this

@fgrehm fgrehm closed this as completed Jul 30, 2017
Irio pushed a commit that referenced this issue Feb 27, 2018
Hide availability in the Chamber's dataset while we don't update the db
@agzukoff commented:

Is this dataset still available for download? The WeTransfer link (https://we.tl/i1C2z6sBJX) looks expired. Would love to get my hands on this.

@cuducos (Collaborator) commented Oct 16, 2018

Is this dataset still available for download?

Yep.

the WeTransfer link (https://we.tl/i1C2z6sBJX) looks expired

That's expected. We use sites such as WeTransfer just to quickly exchange files between collaborators and core developers. Later, as @fgrehm did, we upload them to our file storage.

would love to get my hands on this.

Hell yeah 🤘 Just go ahead and use the toolbox mentioned in the README.md to download any dataset we have : )

@Mageswaran1989 commented Jan 12, 2019

@cuducos the toolbox link is broken :(
How can I get the dataset? It would be a great help.

@michaelyan-coupa commented:

Hello! Can someone direct me to a link containing the JSON files for the 57K meal reimbursements? Also where can I find the images corresponding to the JSON files? Thanks!!

@cuducos (Collaborator) commented Jun 28, 2019

Can someone direct me to a link containing the JSON files for the 57K meal reimbursements?

@michaelyan-coupa, you can easily create a CSV using serenata-toolbox to download all reimbursements and then filter out the ones not related to meals. We don't provide tools to parse JSON data in this project.

Also where can I find the images corresponding to the JSON files?

Yes. With the data from the CSV mentioned in my last paragraph, you can download the receipts from the source by concatenating the URL, as we do in Jarbas.
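
A rough sketch of both answers with pandas; the reimbursements filename, the column names (subquota_description, applicant_id, year, document_id) and the applicant_id/year/document_id path segments are assumptions based on the example PDF linked earlier in this thread:

```python
# Sketch only: keep meal reimbursements and build the receipt URL for each one.
import pandas as pd

df = pd.read_csv("reimbursements.csv", dtype=str, low_memory=False)
meals = df[df["subquota_description"] == "Congressperson meal"].copy()

URL = "http://www.camara.gov.br/cota-parlamentar/documentos/publ/{applicant_id}/{year}/{document_id}.pdf"
meals["receipt_url"] = [
    URL.format(applicant_id=row.applicant_id, year=row.year, document_id=row.document_id)
    for row in meals.itertuples()
]
meals.to_csv("meal-reimbursements.csv", index=False)
```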

@liminghao1630 commented:

@fgrehm
Hi, I have tried serenata-toolbox and I can download the PDFs now, but I can't find the OCR results in the file "2018-01-05-reimbursements.csv", and there is no other related file in the serenata-toolbox file list. Could you please tell me where I can get the OCR results for the receipts?

@cuducos (Collaborator) commented Sep 10, 2019

@liminghao1630 the file you're looking for is 2017-02-15-receipts-texts.xz ; )
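
For anyone landing here later, a short sketch of fetching and opening it with serenata-toolbox and pandas (assuming the file is a compressed CSV; column names may differ):

```python
# Sketch only: download the OCR dataset with serenata-toolbox and take a peek.
import pandas as pd
from serenata_toolbox.datasets import Datasets

datasets = Datasets("data/")
datasets.downloader.download("2017-02-15-receipts-texts.xz")

receipts_texts = pd.read_csv("data/2017-02-15-receipts-texts.xz", dtype=str)
print(receipts_texts.head())
```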

@sks4world commented:

Is it possible to download these receipt images?

@cuducos (Collaborator) commented Oct 25, 2019

Is it possible to download these receipt images?

Yes. I described the process three messages above your question, @sks4world.

@imadcat commented Mar 2, 2020

Some stats:

/cc @Irio @anaschwendler

Could you or anyone please make the raw Cloud Vision JSON API responses available for download again? I have tried the datasets.downloader.download('2017-02-15-receipts-texts-raw.tar.xz') method from serenata_toolbox.datasets, but the resource is no longer there.
And the WeTransfer link has expired.

@dotsinspace commented:

How can I download this dataset, i.e. all the receipts in PDF format?
