Detect multiple reimbursements using the same receipt #32

Open
Irio opened this issue Sep 1, 2016 · 14 comments

Comments

@Irio
Collaborator

Irio commented Sep 1, 2016

For instance, we detected the same receipt being used in 2 distinct reimbursements:
http://www.camara.gov.br/cota-parlamentar/documentos/publ/2437/2015/5645173.pdf
http://www.camara.gov.br/cota-parlamentar/documentos/publ/2437/2015/5645177.pdf

@cuducos
Collaborator

cuducos commented Sep 2, 2016

If you remember offhand (otherwise I'll look for it in the .ipynb): do they have exactly the same document_number in the dataset? Or was this number mocked?

Just asking because if the real document_number differs (image vs dataset) we'll have to rely on OCR and stuff. If they are the same I think it's easier to spot.

@Irio
Collaborator Author

Irio commented Sep 6, 2016

@cuducos In this example they have distinct document_ids.

Before going to OCR, I'd try SIFT, which I believe is much faster since it does not depend on a vocabulary of words, just plain linear algebra.

@cuducos
Collaborator

cuducos commented Sep 6, 2016

Sounds great. SIFT is new to me but looks like something very effective for this kind of stuff. Awesome!

@paulo-raca

I feel like SIFT is great for finding similar stuff (e.g., receipts with the same layout), but it is probably not going to be a good option for deciding whether 2 receipts are the same or not.

@Irio
Collaborator Author

Irio commented Oct 18, 2016

Check the paper "Region Duplication Forgery Detection Technique Based on SURF and HAC" for references (https://sci-hub.cc/ is your friend). Here's an example of Python code to run SIFT.
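
For readers without access to that example, here is a minimal sketch of SIFT matching between two receipt images. It is an illustration under assumptions, not the code referenced above: it assumes OpenCV 4.4 or newer (where cv2.SIFT_create is available) and receipts already saved as image files. The names ratio_test and count_good_matches and the 0.75 ratio are hypothetical choices made for this sketch.

```python
def ratio_test(knn_distances, ratio=0.75):
    """Return indices of keypoints whose best match passes Lowe's ratio test.

    knn_distances: list of (best, second_best) descriptor distances.
    """
    return [i for i, (best, second) in enumerate(knn_distances)
            if best < ratio * second]


def count_good_matches(path_a, path_b, ratio=0.75):
    """Count SIFT matches between two images that pass the ratio test."""
    import cv2  # imported lazily so ratio_test stays usable without OpenCV

    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    if img_a is None or img_b is None:
        raise FileNotFoundError("could not read one of the images")

    sift = cv2.SIFT_create()
    _, desc_a = sift.detectAndCompute(img_a, None)
    _, desc_b = sift.detectAndCompute(img_b, None)
    if desc_a is None or desc_b is None:
        return 0  # no keypoints found in at least one image

    # For each descriptor in A, fetch its two nearest neighbours in B.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    pairs = matcher.knnMatch(desc_a, desc_b, k=2)
    distances = [(p[0].distance, p[1].distance) for p in pairs if len(p) == 2]
    return len(ratio_test(distances, ratio))
```

A high count only says the two images share many local features, which receipts with the same layout will also do, so the number is a hint rather than proof of a duplicate.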

@weslleymberg

weslleymberg commented Oct 18, 2016

Came across 2 examples where 2 distinct reimbursements have the same document_number, but do not have the same receipt.

In the first one, the value presented as the document_number is actually the congressperson's subscription number with the water company that issued the bills.

Here are the document_ids: 5886345 and 5886361. And the document_number is 0010100910378000. You can see this is the same number that is in the field "Inscrição" on both documents.

A similar thing happens with these other 2 documents, 5780419 and 5880166, where the operator's number (t00408151) of a highway toll is used as the document_number. Note that these two documents also have distinct applicant_ids (3044 and 1133).

@cuducos
Collaborator

cuducos commented Oct 18, 2016

> Came across 2 examples where 2 distinct reimbursements have the same document_number, but do not have the same receipt.

I'm not sure this is a problem per se. I mean, AFAIK the document_number is the number of the receipt, the number controlled by the supplier (each supplier has its own sequential numbering of receipts). In other words, it can be just a coincidence. But… document_number and supplier both coinciding is strange…

That said, it seems to me that it's a matter of typing the wrong data, not sure if it's compromising… 
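
To make the distinction concrete, here is a minimal sketch that groups reimbursements by document_number alone and then by document_number plus supplier. The field names document_id and document_number come from this thread; the supplier field, the function name shared_key_groups, and the rows themselves are assumptions invented for illustration, not dataset values.

```python
from collections import defaultdict


def shared_key_groups(rows, keys=("document_number",)):
    """Group reimbursements by `keys`; keep groups with 2+ distinct document_ids."""
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[k] for k in keys)].add(row["document_id"])
    return {key: sorted(ids) for key, ids in groups.items() if len(ids) > 1}


# Toy rows, invented for illustration (not taken from the dataset).
rows = [
    {"document_id": 1, "document_number": "A100", "supplier": "water-co"},
    {"document_id": 2, "document_number": "A100", "supplier": "water-co"},
    {"document_id": 3, "document_number": "A100", "supplier": "toll-co"},
    {"document_id": 4, "document_number": "B200", "supplier": "toll-co"},
]

# Same number anywhere: may be a coincidence across suppliers.
by_number = shared_key_groups(rows)
# Same number AND same supplier: the stranger case worth a manual look.
by_number_and_supplier = shared_key_groups(
    rows, keys=("document_number", "supplier")
)
```

Either grouping only produces candidates for review, since a mistyped document_number can create or hide a match.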

@weslleymberg

Understood. I didn't know that the receipt number can be duplicated just by coincidence.

My thought at the time was that typing the wrong data might be a very common mistake thus making document_number not very reliable to spot a possible fraud with duplicated receipt.

@cuducos
Collaborator

cuducos commented Oct 18, 2016

> My thought at the time was that typing the wrong data might be a very common mistake thus making document_number not very reliable to spot a possible fraud with duplicated receipt.

Good point!

@silviodc

Good news for the detection of duplicate reimbursements: I made a notebook that converts PDF files to PNG and then detects common regions with SIFT.
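
One possible way to do the PDF-to-PNG step, sketched under assumptions: this is not necessarily what the notebook does. It assumes Poppler's pdftoppm command-line tool is installed and on the PATH, and the function names and the default of 150 DPI are hypothetical.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, out_dir, dpi=150):
    """Build the pdftoppm call that renders page 1 of a PDF to <out_dir>/<stem>.png."""
    pdf_path = Path(pdf_path)
    prefix = Path(out_dir) / pdf_path.stem  # pdftoppm appends ".png" itself
    return ["pdftoppm", "-png", "-singlefile", "-r", str(dpi),
            str(pdf_path), str(prefix)]


def pdf_to_png(pdf_path, out_dir, dpi=150):
    """Run pdftoppm and return the path of the PNG it should have written."""
    cmd = pdftoppm_command(pdf_path, out_dir, dpi)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return Path(cmd[-1] + ".png")
```

The -singlefile flag keeps only the first page, which fits single-page receipts; multi-page documents would need that flag dropped and the numbered output files handled.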

Receipts used: 5645173 and 5645177.

Receipt in PNG: [image: 5645173]

SIFT keypoints: [image: sift_keypoints]

Matched regions: [image: macht_keypoints]

So, after some experiments, I found that using SIFT alone will give us a lot of false positives.
Look at this case mentioned by @weslleymberg:

> Here are the document_ids: 5886345 and 5886361.

Receipt in PNG: [image: 5886345]

SIFT keypoints: [image: sift_keypoints]

Common regions between them: [image: macht_keypoints]

So, I'm still working on the script to predict multiple reimbursements. I will try to combine the SIFT output with the OCR data we have in issue #188 to achieve better results.

I will share more news with you guys as soon as possible :D
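
As a rough illustration of combining the two signals, here is a minimal sketch that flags a pair only when both the SIFT evidence and the OCR text agree. Everything in it is an assumption for illustration: the function names, the two thresholds, and the use of difflib for text similarity are not taken from the notebook or from issue #188.

```python
from difflib import SequenceMatcher


def text_similarity(text_a, text_b):
    """Similarity in [0, 1] between two OCR transcripts, ignoring case and extra whitespace."""
    norm_a = " ".join(text_a.lower().split())
    norm_b = " ".join(text_b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def looks_like_same_receipt(good_matches, keypoints, text_a, text_b,
                            min_match_ratio=0.5, min_text_similarity=0.9):
    """Flag a pair only when BOTH the image and the text evidence agree.

    good_matches: number of SIFT matches that passed filtering.
    keypoints: number of keypoints in the first image.
    The thresholds are placeholders and would need tuning on labelled pairs.
    """
    if keypoints == 0:
        return False
    match_ratio = good_matches / keypoints
    return (match_ratio >= min_match_ratio
            and text_similarity(text_a, text_b) >= min_text_similarity)
```

Requiring both conditions targets the false positives described above, where two bills from the same company share a layout (many SIFT matches) but differ in the printed values.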

@cuducos
Collaborator

cuducos commented May 12, 2017

That's awesome progress @silviodc! Many thanks for that. Even if the results still include lots of false positives, IMHO it would be great to have this notebook of yours in our master branch. Just add to the conclusions the issues your analysis raised, for future researchers ; ) Do you fancy opening a PR?
Cheers

@silviodc

Hi @cuducos

Yes, I can open a PR, just let me play a little with this data this weekend :D
After that I will open the PR.
I spent too much time finding a way to convert the PDFs :/
I've only just got the insights to play with the prediction.

@silviodc

silviodc commented May 18, 2017

Hi everyone,

PR #238, covering the conversion of PDF to image and the use of SIFT, is up.
I also included a plan that I think could be interesting to follow to build the ML approach to detect duplicates.
In the near future I will try to do steps 3 and 4 that I mentioned there. However, if someone feels motivated, just go for it, I want to see it working!!

Irio pushed a commit that referenced this issue Feb 27, 2018
…work-3.5.1-to-3.5.2

Update djangorestframework to 3.5.2
cuducos pushed a commit that referenced this issue Feb 28, 2018
…print-improvements

Simplifying InvalidCnpjCpfClassifier implementation