New dataset: OCR for meal expenses since 2015 using google's cloud vision API #188

Closed
fgrehm opened this issue Feb 11, 2017 · 29 comments

@fgrehm (Contributor) commented Feb 11, 2017

Some stats:

/cc @Irio @anaschwendler

@cuducos (Collaborator) commented Feb 11, 2017

Wow… great progress, great data extracted from the PDFs ; )

Some general comments trying to help in the direction of PRs to this repo:

  • We already have a script to fetch all the PDFs, so we need to find a way to convert PDFs to PNG (or whatever) without using os.system("pdftoppm …"), for greater compatibility (Windows users? Or even different *nix users: brew install vs. apt-get install vs. macports install vs. yum install, and so on). But maybe that's too utopian…
  • A script to actually use the Google service and create the data/txt (a rough sketch of both steps follows this list)
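
For reference, a minimal sketch of both steps, assuming the pdf2image and google-cloud-vision Python packages (neither is part of this repo; pdf2image still needs poppler installed, it only replaces the raw os.system call with a subprocess wrapper):

```python
# Sketch only: convert a receipt PDF to PNG pages and OCR them with Google Cloud Vision.
# Assumes `pip install pdf2image google-cloud-vision` and GOOGLE_APPLICATION_CREDENTIALS set.
from pdf2image import convert_from_path
from google.cloud import vision


def ocr_receipt(pdf_path):
    client = vision.ImageAnnotatorClient()
    texts = []
    for index, page in enumerate(convert_from_path(pdf_path, dpi=300)):
        png_path = "{}-{}.png".format(pdf_path, index)
        page.save(png_path, "PNG")  # pdf2image returns PIL images
        with open(png_path, "rb") as image_file:
            image = vision.Image(content=image_file.read())
        response = client.text_detection(image=image)
        if response.text_annotations:
            # the first annotation holds the full detected text for the page
            texts.append(response.text_annotations[0].description)
    return "\n".join(texts)
```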

What do you think?

@fgrehm (Contributor, Author) commented Feb 11, 2017 via email

@cuducos (Collaborator) commented Feb 12, 2017

No worries at all, mate. I think this Issue is pretty useful as it is. People interested in OCR can learn from your experience, try it, get access to data etc. I'll leave it as it is ; )

@fgrehm (Contributor, Author) commented Feb 13, 2017

No worries. I've finished the OCR of "Publicity of parliamentary activity" and "Taxi, toll and parking" since 2015: https://gist.github.com/fgrehm/572ba814d617e831f4b1faac5e0b9165

I still have $100 left 😱 but that's not enough to OCR the 140k meal reimbursements, which is what @anaschwendler suggested I do next in a chat over Telegram. I guess for now I'll proceed with the subquotas that have fewer reimbursements until I'm done with those credits, and I'll share a single zip file with you all containing everything at some point this week 🤘

As a side note, I know that the generated text is not 100% accurate, but I think we could still have this data on an Elasticsearch instance somewhere for full-text searching 💭
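
Just to illustrate that idea, a minimal sketch assuming the official elasticsearch Python client and a local instance (index and field names are made up):

```python
# Sketch only: index OCRed receipt texts in Elasticsearch and run a full-text query.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# index one OCRed receipt (document_id and text are hypothetical field names)
es.index(index="receipts", id="5635048",
         document={"document_id": "5635048", "text": "RESTAURANTE EXEMPLO LTDA ..."})

# even with imperfect OCR, a match query gives usable full-text search
results = es.search(index="receipts", query={"match": {"text": "restaurante"}})
for hit in results["hits"]["hits"]:
    print(hit["_source"]["document_id"], hit["_score"])
```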

@fgrehm (Contributor, Author) commented Feb 13, 2017

Also, I've "merged" the code from the 3 initial notebooks into a couple of classes that you can find in the gist above. If that's good enough for a PR to the toolbox, let me know and I'll try to get it going when I have a chance.

@pedrommone (Contributor) commented:

@fgrehm is there no way to ask Google for a free tier for use in open source projects? I guess they are very welcoming in cases like that.

@fgrehm (Contributor, Author) commented Feb 13, 2017

@pedrommone Probably, yes, but that's something that Serenata's core team will have to do once this has been wired up with Rosie / Jarbas / etc. My idea with this is to provide enough ammo for analysis to check if it is going to be worth the trouble having this in the first place 😄

@cuducos (Collaborator) commented Feb 13, 2017

provide enough ammo for analysis to check if it is going to be worth the trouble having this in the first place

❤️

@fgrehm (Contributor, Author) commented Mar 1, 2017

Here's the dataset with the receipt texts and some numbers about it:

  • https://we.tl/p1GdsMxPH1
  • Out of 57460 meal reimbursements since 2015, 56710 have OCR data
  • Out of 65455 fuels and lubricants reimbursements in 2016, 64989 have OCR data
  • Out of 70341 "publicity of parliamentary activity" and "taxi, toll and parking" since 2015, 56309 have OCR data

The process of obtaining those texts can be seen in the following gists:

I believe that @Irio has already uploaded the zip file with the raw JSON responses to S3, but I couldn't find the "easy to use CSV" version linked above. Once that file is uploaded to S3, I guess we can close this issue and GH-173. An example of usage is coming up as a PR in a bit 🎉 🍻


EDIT Here is the breakdown of OCRed documents per subquota:

Fuels and lubricants                         64989
Congressperson meal                          56715
Taxi, toll and parking                       41922
Publicity of parliamentary activity          14387
Consultancy, research and technical work      5082
Flight tickets                                5010
Postal services                               4804
Terrestrial, maritime and fluvial tickets     1786
Aircraft renting or charter of aircraft        589
Watercraft renting or charter                   69

@cuducos (Collaborator) commented Mar 10, 2017

@fgrehm I'm so sorry. I had a couple of unforeseen situations this week and couldn't follow up with you in time. Now it looks like the file at WeTransfer is not available anymore. Can you re-upload it or send it to me via private message so I can upload it (and the one from #173) to S3?

Also, I haven't checked, but just as a reminder if that's the case: would you mind documenting these new datasets in CONTRIBUTING.md? (I just detailed that in a comment on #197.)

@michelpereira (Contributor) commented:

Hi, all. My recommendation is to run the API again, because Google updated the API to v1.1 with new features, for example Document Text Detection.
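
For whoever reruns it, the newer feature is exposed by the Python client roughly like this (a sketch, not tested against these receipts):

```python
# Sketch only: Document Text Detection is tuned for dense text such as receipts.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("receipt.png", "rb") as image_file:
    image = vision.Image(content=image_file.read())

response = client.document_text_detection(image=image)
# full OCR output; pages/blocks/paragraphs are also available on full_text_annotation
print(response.full_text_annotation.text)
```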

@fgrehm (Contributor, Author) commented May 3, 2017

@michelpereira yeah, thanks for the heads up. That feature was released after I was done with this initial processing, and I only found out about it while working on the dataset documentation I have in the works 😄 I'm on a work trip right now, but I'm going to wrap up that PR as soon as I'm back home.

@silviodc commented May 7, 2017

Hi guys,
After analyzing some data and checking the CEAP rules (translated from the Portuguese):
3. The document proving the payment cannot have erasures, additions, amendments or interlineations; it must contain a date and must describe the services or materials item by item, without generalizations or abbreviations.

So, we have some receipts like this one:
http://www.camara.gov.br/cota-parlamentar/documentos/publ/2398/2015/5635048.pdf

According to the article mentioned above, reimbursements can't be generalizations. So, could we consider this an invalid reimbursement?

@cuducos (Collaborator) commented May 10, 2017

According to the article mentioned above, reimbursements can't be generalizations. So, could we consider this an invalid reimbursement?

Sure thing. If we could identify all receipts with handwritten descriptions such as that one, we would have thousands of new suspicions ; )

@silviodc commented:

Hi @cuducos

For these reimbursements with handwritten descriptions, I built a classifier using deep learning (Keras) which achieved interesting results. It is in #238.
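
Not the model from #238, but a minimal sketch of the general idea with tf.keras (input size and architecture are illustrative assumptions only):

```python
# Sketch only: a tiny CNN that labels a receipt image as handwritten (1) or printed (0).
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(...) would take labelled receipt images split into training and validation sets
```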

@gustavo-momente commented:

@fgrehm is there no way to ask Google for a free tier for use in open source projects? I guess they are very welcoming in cases like that.

@fgrehm, @pedrommone Maybe you could try the TensorFlow Research Cloud.

@fgrehm (Contributor, Author) commented Jul 30, 2017

The datasets are on S3 and docs have been merged. Please follow up on ☝️ for additional efforts on this

@fgrehm fgrehm closed this as completed Jul 30, 2017
Irio pushed a commit that referenced this issue Feb 27, 2018
Hide availability in the Chamber's dataset while we don't update the db
@agzukoff commented:

Is this dataset still available for download? The WeTransfer link (https://we.tl/i1C2z6sBJX) looks expired. Would love to get my hands on this.

@cuducos (Collaborator) commented Oct 16, 2018

Is this dataset still available for download?

Yep.

the WeTransfer link (https://we.tl/i1C2z6sBJX) looks expired

That's expected. We use sites such as WeTransfer just to quickly exchange files between collaborators and core developers. Later, as @fgrehm did, we upload them to our file storage.

would love to get my hands on this.

Hell yeah 🤘 Just go ahead and use the toolbox mentioned in the README.md to download any dataset we have : )

@Mageswaran1989 commented Jan 12, 2019

@cuducos the toolbox link is broken :(
How can I get the dataset? It would be a great help.

@michaelyan-coupa commented:

Hello! Can someone direct me to a link containing the JSON files for the 57K meal reimbursements? Also where can I find the images corresponding to the JSON files? Thanks!!

@cuducos (Collaborator) commented Jun 28, 2019

Can someone direct me to a link containing the JSON files for the 57K meal reimbursements?

@michaelyan-coupa, you can easily create a CSV using serenata-toolbox to download all reimbursements and then filter out the ones not related to meals. We don't provide tools to parse JSON data in this project.

Also where can I find the images corresponding to the JSON files?

Yes. With the data from the CSV mentioned in my last paragraph, you can download the receipts from the source by concatenating the URL, as we do in Jarbas.
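
A rough sketch of both answers with pandas; the reimbursements filename, the column names (subquota_description, applicant_id, year, document_id) and the applicant_id/year/document_id path segments are assumptions based on the example PDF linked earlier in this thread:

```python
# Sketch only: keep meal reimbursements and build the receipt URL for each one.
import pandas as pd

df = pd.read_csv("reimbursements.csv", dtype=str, low_memory=False)
meals = df[df["subquota_description"] == "Congressperson meal"].copy()

URL = "http://www.camara.gov.br/cota-parlamentar/documentos/publ/{applicant_id}/{year}/{document_id}.pdf"
meals["receipt_url"] = [
    URL.format(applicant_id=row.applicant_id, year=row.year, document_id=row.document_id)
    for row in meals.itertuples()
]
meals.to_csv("meal-reimbursements.csv", index=False)
```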

@liminghao1630 commented:

@fgrehm
Hi, I have tried serenata-toolbox and I can download the PDFs now, but I can't find the OCR results in the file "2018-01-05-reimbursements.csv", and there is no other related file in the serenata-toolbox file list. Could you please tell me where I can get the OCR results for the receipts?

@cuducos (Collaborator) commented Sep 10, 2019

@liminghao1630 the file you're looking for is 2017-02-15-receipts-texts.xz ; )
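
For anyone landing here later, a short sketch of fetching and opening it with serenata-toolbox and pandas (assuming the file is a compressed CSV; column names may differ):

```python
# Sketch only: download the OCR dataset with serenata-toolbox and take a peek.
import pandas as pd
from serenata_toolbox.datasets import Datasets

datasets = Datasets("data/")
datasets.downloader.download("2017-02-15-receipts-texts.xz")

receipts_texts = pd.read_csv("data/2017-02-15-receipts-texts.xz", dtype=str)
print(receipts_texts.head())
```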

@sks4world commented:

Is it possible to download these receipt images?

@cuducos (Collaborator) commented Oct 25, 2019

Is it possible to download these receipt images?

Yes. I described the process three messages above your question, @sks4world.

@imadcat commented Mar 2, 2020

Some stats:

/cc @Irio @anaschwendler

Could you or anyone please make the raw Cloud Vision JSON API responses available for download again? I have tried the datasets.downloader.download('2017-02-15-receipts-texts-raw.tar.xz') method from serenata_toolbox.datasets, but the resource is no longer there.
And the WeTransfer link has expired.

@dotsinspace commented:

How can I download this dataset, i.e. all the receipts in PDF format?
