Sample Validation Datasets -- What is Augraphy Trying to Reproduce? #43

Open
jboarman opened this issue Jul 26, 2021 · 10 comments
Labels: discussion, sample images

Comments

@jboarman
Member

jboarman commented Jul 26, 2021

In general, Augraphy is trying to simplify the process of creating realistic synthetic datasets using only ground truth documents.

Often, training data is not accompanied by clean ground truth sources, which leads to inaccurate training and severely limited volumes of available training data. By starting with clean ground truth data, training sets can be created that represent printed, scanned, copied and faxed documents encountered in the real world AND have 100% accurate training data.

In order to recreate data from these real-world scenarios, we need to create a validation set that is inspired by examples from the real world. Below are sources that may serve as useful source material for attempting to use Augraphy to reproduce the styles and detail seen in these data sets.

Real-World Data Sets

Synthetic Data Sets

@jboarman added the discussion and test labels Jul 26, 2021
@kwcckw
Collaborator

kwcckw commented Jul 27, 2021

Creating the style should be feasible. Take this example:
https://www.kaggle.com/patrickaudriaz/tobacco3482jpg
image

From here we can see several matching augmentations, such as page borders, ink bleed, and dusty ink. But the content itself could be a problem, since it might not be feasible to create each of those documents on our own. So maybe we can find some open-source digital documents and apply the relevant augmentations to them so that their style matches?

For example, apply augmentations in a similar style to the letter below:

image

@jboarman
Member Author

we can see several matching augmentations, such as page borders, ink bleed, and dusty ink. But the content itself could be a problem, since it might not be feasible to create each of those documents on our own

Good point. We definitely want to have simple and clean inputs that would be representative of something one might have in their ground truth dataset. In theory, a practitioner would generate the input from text so that the ground truth can be programmatically known in advance, much like you are doing in your "Lorem ipsum dolo" sample notebook. But, for our test suite, we just need some simple and clean images that could plausibly be generated from a script.

The other consideration is the inclusion of images like the provided sample.

@kwcckw
Collaborator

kwcckw commented Jul 27, 2021

Okay, so I think we can do it either way. If the source document has a specific format, such as a formal email, quotation, or newspaper, then we need a digital copy of that document, or at least lines and boxes that reproduce a similar layout. Otherwise, generating the text from code would be the better choice since it is more straightforward.

@proofconstruction
Contributor

It shouldn't be too hard to write some functions that produce different documents given some strings. We can already easily write strings into images, and if we had a decent specification of some different kinds of documents (where lines should be drawn, where different areas of text should go), we could turn these into classes that produce ground-truthed documents, which could then be fed into Augraphy to produce the larger training set, which will all have ground-truths.

This collection of document-generation classes should probably be its own package though.
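A minimal sketch of what such a specification could look like, using only the Python standard library. Every name here (`TextRegion`, `Rule`, `DocumentTemplate`, `fill`) is hypothetical and not part of Augraphy; the sketch only records where text and lines belong and returns the matching ground truth, leaving the actual drawing to whatever image library the package would use.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TextRegion:
    """A named rectangular area of the page that should hold text."""
    name: str
    x: int
    y: int
    width: int
    height: int


@dataclass(frozen=True)
class Rule:
    """A straight line to draw from (x0, y0) to (x1, y1)."""
    x0: int
    y0: int
    x1: int
    y1: int


@dataclass
class DocumentTemplate:
    """Hypothetical layout spec: page size, text regions, and ruled lines."""
    page_width: int
    page_height: int
    regions: list = field(default_factory=list)
    rules: list = field(default_factory=list)

    def fill(self, strings):
        """Pair every region with its string and return ground-truth records.

        A renderer would draw exactly these records, so the clean image and
        its ground truth cannot drift apart.
        """
        missing = [r.name for r in self.regions if r.name not in strings]
        if missing:
            raise KeyError(f"no text supplied for regions: {missing}")
        return [
            {
                "region": r.name,
                "box": (r.x, r.y, r.width, r.height),
                "text": strings[r.name],
            }
            for r in self.regions
        ]


# Example: a simple letter layout with one horizontal rule under the header.
letter = DocumentTemplate(
    page_width=850,
    page_height=1100,
    regions=[
        TextRegion("sender", 40, 40, 300, 60),
        TextRegion("date", 600, 40, 200, 30),
        TextRegion("body", 40, 160, 770, 800),
    ],
    rules=[Rule(40, 130, 810, 130)],
)

ground_truth = letter.fill(
    {"sender": "ACME Corp.", "date": "Jul 27, 2021", "body": "Lorem ipsum"}
)
```

Each document type (letter, form, newspaper column) would then be one template instance, and the same `ground_truth` records could accompany every augmented copy of the rendered page.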

@jboarman
Member Author

I need your help.

We need to identify 5-10 archetypes, or sample documents, that we aim to reproduce using Augraphy. For each type of sample document, we then need to create a clean version that closely resembles the original document. This will allow us to validate how well Augraphy can reproduce the noise we hope to be producing.

For each sample archetype, we can produce a notebook that demonstrates using Augraphy to go full circle, showing how that noise can be synthesized. The initial notebooks might not look perfect, but they could serve as a sort of starting point for a competition that betters Augraphy towards faithful reproductions of each of these archetypes.

HELP: Can you help identify example documents that show variety in the types of noise and issues seen in these documents?

(Don't worry about creating the clean reproduction -- let's sort out the archetypes first!)

Original Target Document

This example document comes from the RVL-CDIP dataset and represents the kind of noisy document that we intend to reproduce with Augraphy:

image

Clean Reproduction

This is a reproduction made using Google Documents of what the original document might look like prior to being deteriorated by copying and scanning, etc:

image

@kwcckw
Collaborator

kwcckw commented Aug 23, 2021

So right now we are only looking at the RVL-CDIP dataset, right? I need some time to download it since the dataset is large. As a first step, I will go through those images and compile a list of augmentations with sample images from the dataset. Once I have a sufficient number of samples, I will share the details here.

@jboarman
Member Author

So right now we are only looking at RVL-CDIP dataset right?

I’d say any of the sources listed in this issue could be used as a reference for identifying sample documents. It will likely take a lot of process-of-elimination work to narrow it down.

@proofconstruction
Contributor

proofconstruction commented Aug 24, 2021

I started a collection of representative images we might reproduce. The five currently there were pulled from the Randomly Collected Documents set.

They're numbered 1.jpg, 2.jpg, and so on in the folder, and the notes below correspond to those file names.

Images

Here's why I like them, as a list of features we can aim to reproduce:

Image 1

  1. the text is faded in many places and to varying degrees
  2. the source document was slightly rotated when scanning
  3. there is a page border effect on one side
  4. some hand-written text is visible on the bottom, and faded on the side

Image 2

  1. distortion in the top-border
  2. hand-written redaction done with a marker (new effect?)
  3. visible print lines
  4. "snow"/random noise in places around the image

Image 3

  1. the bleedthrough effect is very pronounced here, and importantly the reverse side does not contain the same text as the front
  2. distorted page edges
  3. lighting gradient on page edges
  4. stains
  5. noise along the bottom edge
  6. the texture of the page is apparent

Image 4

  1. tons of noise
  2. redactions
  3. complex page border geometry
  4. multiple styles of handwritten text
  5. stamps!

Image 5

  1. page borders
  2. redactions
  3. punched holes on the top edge!
  4. noise along the top and bottom, noise in the top right
  5. more handwritten text, some of which overlaps printed text

Replication

I think most of what I've listed here is reproducible with existing augmentations; in particular, I think all of the handwriting can be produced by (some straightforward generalization of) the new pencil augmentation, and we can already do serviceable page borders, fading, noise, brightness texturization, bleedthrough, and so on wherever we like.

What we don't have, though, is a way to mimic stamps, redaction, or binder holes. We should be able to do binder holes synthetically rather easily, and the challenge with redaction would be correctly placing the mark on the text, but I'm not sure how to approach stamps without having a collection of them included, like the paper textures are. Maybe stamps are out of scope.
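To make the binder-hole and redaction ideas concrete, here is a rough sketch in plain Python that treats a grayscale page as a list of pixel rows (0 is black, 255 is white). The functions `blank_page`, `punch_holes`, and `redact` are hypothetical illustrations, not existing Augraphy augmentations, and a real implementation would more likely work on NumPy/OpenCV arrays. The redaction part assumes the bounding box of the text is already known, which is the case when the clean document is generated from a script.

```python
def blank_page(width, height, value=255):
    """Create a white grayscale page as a list of rows."""
    return [[value] * width for _ in range(height)]


def punch_holes(page, count=3, radius=5, margin=10, edge="top", value=0):
    """Draw `count` evenly spaced filled circles along one edge, in place."""
    height, width = len(page), len(page[0])
    if edge in ("top", "bottom"):
        fixed = margin if edge == "top" else height - 1 - margin
        centers = [
            (round(width * (i + 1) / (count + 1)), fixed) for i in range(count)
        ]
    elif edge in ("left", "right"):
        fixed = margin if edge == "left" else width - 1 - margin
        centers = [
            (fixed, round(height * (i + 1) / (count + 1))) for i in range(count)
        ]
    else:
        raise ValueError(f"unknown edge: {edge!r}")

    for cx, cy in centers:
        # Only visit the bounding square of each circle, clipped to the page.
        for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
                if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                    page[y][x] = value
    return page


def redact(page, box, pad=1, value=0):
    """Fill the (x, y, width, height) box, grown by `pad` pixels, in place."""
    x, y, w, h = box
    height, width = len(page), len(page[0])
    for row in range(max(0, y - pad), min(height, y + h + pad)):
        for col in range(max(0, x - pad), min(width, x + w + pad)):
            page[row][col] = value
    return page


page = blank_page(120, 160)
punch_holes(page, count=3, radius=5, margin=10, edge="top")
redact(page, box=(20, 60, 40, 8))  # box of a word known from the ground truth
```

A hand-drawn marker look would need irregular edges and uneven ink on top of this; the sketch only covers placement.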

@kwcckw
Collaborator

kwcckw commented Aug 25, 2021

Here's the compiled list of images, looking only at the imagesA folder of the RVL-CDIP dataset:
https://drive.google.com/drive/folders/17Z4VRqc1w9ZKpjoO3dulSqXTpxPVutMN?usp=sharing

Some are not fully reproducible; I included them here because they look interesting enough.

img_67.png (Not fully reproducible, some augmentations do not exist yet)
https://drive.google.com/file/d/1xJFoZ3XTA45MZ6LtC7sWMWOUGeUhHSKR/view?usp=sharing

  1. Badphotocopy
  2. Dirtydrum (half page with faded effect)
  3. Page border (unique in this case because it is not in the page border)
  4. Scribbles (Numbers) (we don't have this effect yet)
  5. Punch holes (we don't have this effect yet)
  6. Clip marks (we don't have this effect yet)

img_299.png (Fully reproducible with some changes in the current code)
https://drive.google.com/file/d/17ev5mpuqMWd3l90nPCH9wbkNyekAhzB7/view?usp=sharing

  1. Badphotocopy (very random noise)
  2. Page border on top

img_34123.png (Fully reproducible with some changes in the current code)
https://drive.google.com/file/d/1T34aLs58_V6xaHB0WfpttFMyyXYDQFB5/view?usp=sharing

  1. Badphotocopy (surrounding the page at page border only)

img_34212.png (Fully reproducible)
https://drive.google.com/file/d/1ctNWE9Fx7zOlBNfKoFCnRRhDuABu-OAL/view?usp=sharing

  1. Dirty roller
  2. Page borders

img_34462.png (Not fully reproducible, some augmentations do not exist yet)
https://drive.google.com/file/d/1IKAlc0ejGLFKiuKMhnSSPbOfXtM95hEU/view?usp=sharing

  1. Scribbles (words) (we don't have this effect yet)
  2. Page borders

img_109076.png (Not fully reproducible, some augmentations do not exist yet)
https://drive.google.com/file/d/1u1FSbsNYAXcfp3GLD_WkjmSuJpbSt6aB/view?usp=sharing

  1. Incomplete page (torn off?) (we don't have this effect yet)
  2. Page border
  3. Bad photocopy
  4. Punch holes (we don't have this effect yet)

img_136566.png (Fully reproducible)
https://drive.google.com/file/d/1MbcKbPDUNQlcVwFw4GoIPCZLJHcmUXdt/view?usp=sharing

  1. Book binding
  2. Dirty roller (minor)

img_166297.png (Fully reproducible)
https://drive.google.com/file/d/1Z5RAP_DkUMnVAqd2R0g4m2tEGHvjEIGK/view?usp=sharing

  1. Letterpress
  2. Page border (very minor at bottom)

img_171248.png (Not fully reproducible, some augmentations do not exist yet)
https://drive.google.com/file/d/1iYzC3NbfIMMx7GcJ6JQUgOJouC_NM1ek/view?usp=sharing

  1. Blobs of noise (we don't have this effect yet)

img_179298.png (Fully reproducible)
https://drive.google.com/file/d/1kRQarbuQF96tVUndHQUAyBqSFfW3qs0T/view?usp=sharing

  1. Page border
  2. Bookbinding
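The list above can also be kept as data, which makes it easy to see which missing effects block the most samples. The sketch below is only a transcription of the ten entries into a hypothetical `SAMPLES` dictionary, with each effect marked as available or not according to the notes above; the effect names are normalized labels, not Augraphy class names.

```python
from collections import Counter

# (effect, currently available?) for each sample, transcribed from the list above.
SAMPLES = {
    "img_67": [
        ("bad photocopy", True),
        ("dirty drum", True),
        ("page border", True),
        ("scribbles", False),
        ("punch holes", False),
        ("clip marks", False),
    ],
    "img_299": [("bad photocopy", True), ("page border", True)],
    "img_34123": [("bad photocopy", True)],
    "img_34212": [("dirty roller", True), ("page border", True)],
    "img_34462": [("scribbles", False), ("page border", True)],
    "img_109076": [
        ("incomplete page", False),
        ("page border", True),
        ("bad photocopy", True),
        ("punch holes", False),
    ],
    "img_136566": [("book binding", True), ("dirty roller", True)],
    "img_166297": [("letterpress", True), ("page border", True)],
    "img_171248": [("noise blobs", False)],
    "img_179298": [("page border", True), ("book binding", True)],
}


def missing_effects(samples):
    """Count how many samples need each effect that does not exist yet."""
    return Counter(
        effect
        for effects in samples.values()
        for effect, available in effects
        if not available
    )


def fully_reproducible(samples):
    """Return the samples whose effects are all available today."""
    return sorted(
        name
        for name, effects in samples.items()
        if all(available for _, available in effects)
    )


missing = missing_effects(SAMPLES)
reproducible = fully_reproducible(SAMPLES)
```

With this transcription, scribbles and punch holes each block two samples, and six of the ten samples need no new augmentation.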

@shaheryar1
Contributor

shaheryar1 commented Aug 26, 2021

Looking at the Resume category of the Tobacco3482 dataset, I found that almost all of the resumes are grayscale images with plain text on them, so the Ink phase is very useful for reproducing such documents.

Here is a small list of resumes extracted from the dataset which collectively represent the noise in the overall dataset.

https://drive.google.com/file/d/1OXdjEhbNE6moWaia0ORxUGPeU9tx69J8/view?usp=sharing

  1. InkBleed (with high intensity ranges)
  2. LowInkLine (Periodic)
  3. DustyInk
  4. PageBorder (Left)

https://drive.google.com/file/d/108OO9dGNF3FZmv-PpMFrrDYUfDBGZzD9/view?usp=sharing

  1. InkBleed (low intensity)
  2. Punch holes (not present in Augraphy; discussion continues in "Add new augmentation - Binding holes, punch holes, clip/pin mark" #62)
  3. Page Border (top and bottom)

https://drive.google.com/file/d/1AdZyOpMtecIPRswVd6kwDooToKjY8G3u/view?usp=sharing

  1. InkBleed (with high intensity ranges)
  2. Jpeg Compression
  3. Subtle Noise
  4. PencilScribbles (of very small size and full-black color)
  5. PageBorder on top (with a margin and diminish effect)
  6. Applying noisy-blobs in a straight vertical line of the page (similar to Folding effect)
