Sample Validation Datasets -- What is Augraphy Trying to Reproduce? #43

jboarman · 2021-07-26T21:33:19Z

In general, Augraphy is trying to simplify the process of creating realistic synthetic datasets using only ground truth documents.

Often, training data is not accompanied by clean ground truth sources, which leads to inaccurate training and severely limited volumes of available training data. By starting with clean ground truth data, training sets can be created that represent printed, scanned, copied and faxed documents encountered in the real world AND have 100% accurate training data.

In order to recreate data from these real-world scenarios, we need to create a validation set that is inspired by examples from the real world. Below are sources that may serve as useful source material for attempting to use Augraphy to reproduce the styles and detail seen in these data sets.

Real-World Data Sets

RVL-CDIP dataset consists of 400,000 B/W low-resolution (~100 DPI) images in 16 classes, with 25,000 images per class
https://www.cs.cmu.edu/~aharley/rvl-cdip/
NIST-SFRS (Structured Forms Reference Set) consists of 5,590 pages of binary, black-and-white images of synthesized documents from 12 different tax forms from the IRS 1040 Package X for the year 1988. These include Forms 1040, 2106, 2441, 4562, and 6251 together with Schedules A, B, C, D, E, F, and SE.
https://www.nist.gov/srd/nist-special-database-2
Tobacco3482 dataset from Kaggle offers 10 different classes of forms, letters, reports, etc.
https://www.kaggle.com/patrickaudriaz/tobacco3482jpg
FUNSD (Form Understanding Noisy Scanned Documents) dataset on Kaggle comprises 199 real, fully annotated, scanned forms that are noisy and vary widely in appearance.
https://www.kaggle.com/sharmaharsh/form-understanding-noisy-scanned-documentsfunsd
Randomly Collected Documents is a Google Drive share that contains randomly selected public domain documents.
https://drive.google.com/drive/folders/1JMwmRko1gZ_VYtwXkP7CXPPztsNa_3nv?usp=sharing

Synthetic Data Sets

NoisyOffice data set from University of California, Irvine contains noisy grayscale printed text images and their corresponding ground truth for both real and simulated documents with 4 types of noise: folded sheets, wrinkled sheets, coffee stains, and footprints. For each type of font, one type of Noise: 17 files * 4 types of noise = 72 images.
https://archive.ics.uci.edu/ml/datasets/NoisyOffice
DDI-100 (Distorted Document Images) is a synthetic dataset by Ilia Zharikov et al. based on 7,000 unique real document pages, and consists of more than 100,000 augmented images. Ground truth comprises text and stamp masks, plus text and character bounding boxes with relevant annotations.
https://arxiv.org/abs/1912.11658
https://github.com/machine-intelligence-laboratory/DDI-100/tree/master/dataset
https://paperswithcode.com/paper/ddi-100-dataset-for-text-detection-and

kwcckw · 2021-07-27T00:42:46Z

Creating the style should be feasible. Take this example:
https://www.kaggle.com/patrickaudriaz/tobacco3482jpg

From here we can see several matching augmentations, such as page borders, ink bleed, dusty ink, etc. But the content itself could be a problem, since it might not be possible to create each of those documents on our own. So maybe we can find some open source digital documents and apply the relevant augmentations to them so that their style matches?

For example, apply augmentations in a similar style to the letter below:

jboarman · 2021-07-27T01:26:30Z

we can see several matching augmentations, such as page borders, ink bleed, dusty ink, etc. But the content itself could be a problem, since it might not be possible to create each of those documents on our own.

Good point. We definitely want to have simple and clean inputs that would be representative of something one might have in their ground truth dataset. In theory, a practitioner would generate the input from text so that the ground truth can be programmatically known in advance, much like you are doing in your "Lorem ipsum dolo" sample notebook. But, for our test suite, we just need some simple and clean images that could plausibly be generated from a script.

The other consideration is the inclusion of images like the provided sample.

kwcckw · 2021-07-27T01:56:35Z

Okay, so I think we can do it either way. If the source document has a specific format, such as a formal email, quotation, or newspaper, then we need to get a digital copy of that document, or at least add lines or boxes to reproduce the document in a similar format. Otherwise, generating text from code would be the better choice since it is more straightforward.

proofconstruction · 2021-07-28T03:22:01Z

It shouldn't be too hard to write some functions that produce different documents given some strings. We can already easily write strings into images, and if we had a decent specification of some different kinds of documents (where lines should be drawn, where different areas of text should go), we could turn these into classes that produce ground-truthed documents, which could then be fed into Augraphy to produce the larger training set, which will all have ground-truths.

This collection of document-generation classes should probably be its own package though.
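
As a rough illustration of the idea, here is a minimal sketch of what one such document-generation class might look like. Everything in it is an assumption made for illustration: the names `TextRegion` and `DocumentTemplate`, the slot layout, and the sample strings are hypothetical, and the actual rendering of text and lines into an image (for example with an imaging library) is deliberately left out. The point is only that the layout spec and the ground truth come from the same object.

```python
from dataclasses import dataclass, field


@dataclass
class TextRegion:
    """One block of text and where it sits on the page (pixel coordinates)."""
    name: str
    text: str
    x: int
    y: int
    width: int
    height: int


@dataclass
class DocumentTemplate:
    """Hypothetical spec for one document type: page size plus named text slots."""
    page_width: int
    page_height: int
    # slot name -> (x, y, width, height)
    slots: dict = field(default_factory=dict)
    # horizontal rules to draw, as (x_start, y, x_end)
    rules: list = field(default_factory=list)

    def fill(self, strings):
        """Place each string into its slot and return the ground-truth regions."""
        regions = []
        for name, (x, y, w, h) in self.slots.items():
            if x + w > self.page_width or y + h > self.page_height:
                raise ValueError(f"slot {name!r} falls outside the page")
            regions.append(TextRegion(name, strings.get(name, ""), x, y, w, h))
        return regions


# A made-up "letter" layout on an 850 x 1100 pixel page.
letter = DocumentTemplate(
    page_width=850,
    page_height=1100,
    slots={
        "sender": (60, 60, 400, 80),
        "date": (600, 60, 190, 30),
        "body": (60, 200, 730, 700),
        "signature": (60, 950, 300, 60),
    },
    rules=[(60, 160, 790)],
)

ground_truth = letter.fill({
    "sender": "ACME Corp.",
    "date": "1988-04-15",
    "body": "Lorem ipsum dolor sit amet.",
    "signature": "J. Doe",
})
```

A renderer would then draw each region's text inside its box and each rule as a line, and the same `ground_truth` list would travel with every augmented copy of the page.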

jboarman · 2021-08-22T18:16:39Z

I need your help.

We need to identify 5-10 archetypes, or sample documents, that we aim to reproduce using Augraphy. For each type of sample document, we need to then create a clean version that closely resembles the original document. This will then allow us to validate how well Augraphy can reproduce the noise we hope to be producing.

For each sample archetype, we can produce a notebook that demonstrates using Augraphy to go full circle, showing how that noise can be synthesized. The initial notebooks might not look perfect, but they could serve as a sort of starting point for a competition that betters Augraphy towards faithful reproductions of each of these archetypes.
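
One crude way such a notebook could put a number on "how close" a synthesized image is to its target is to compare grayscale intensity distributions. The sketch below is an assumption, not something this thread settles on: the function names are hypothetical, images are plain nested lists of 0-255 values rather than real image arrays, and histogram intersection ignores where on the page the noise appears, so it could only ever be a first sanity check.

```python
def intensity_histogram(image, bins=8):
    """Normalised histogram of 0-255 grayscale values from a nested list."""
    counts = [0] * bins
    total = 0
    for row in image:
        for value in row:
            counts[min(value * bins // 256, bins - 1)] += 1
            total += 1
    return [c / total for c in counts]


def histogram_intersection(a, b, bins=8):
    """Similarity in [0, 1]; 1.0 means identical intensity distributions."""
    hist_a = intensity_histogram(a, bins)
    hist_b = intensity_histogram(b, bins)
    return sum(min(x, y) for x, y in zip(hist_a, hist_b))


# Tiny toy images: the target has one mid-gray (faded) pixel.
target = [[0, 255], [128, 255]]
clean = [[0, 255], [255, 255]]        # clean reproduction, no fading
synthesized = [[0, 255], [130, 255]]  # clean image after a fading effect

score_clean = histogram_intersection(target, clean)
score_synth = histogram_intersection(target, synthesized)
print(score_clean, score_synth)
```

In this toy case the synthesized image scores higher against the target than the clean one does, which is the direction a useful augmentation should move the score.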

HELP: Can you help identify example documents that show variety in the types of noise and issues seen in these documents?

(Don't worry about creating the clean reproduction -- let's sort out the archetypes first!)

Original Target Document

This example document comes from the RVL-CDIP dataset and represents the kind of noisy document that we intend to reproduce with Augraphy:

Clean Reproduction

This is a reproduction made using Google Documents of what the original document might look like prior to being deteriorated by copying and scanning, etc:

kwcckw · 2021-08-23T01:34:36Z

So right now we are only looking at the RVL-CDIP dataset, right? I need some time to download it since the dataset is large. As a first step, I will go through the images and compile a list of augmentations with a sample image from the dataset for each. Once I have a sufficient number of samples, I will share the details here.

jboarman · 2021-08-23T01:55:50Z

So right now we are only looking at the RVL-CDIP dataset, right?

I’d say any of the sources listed in this issue could be used as a reference for identifying sample documents. It will likely take a lot of process-of-elimination work to narrow it down.

proofconstruction · 2021-08-24T11:10:11Z

I started a collection of representative images we might reproduce. The five currently there were pulled from the Randomly Collected Documents set.

They're numbered 1.jpg, 2.jpg, and so on in the folder, and the notes below correspond to those file names.

Images

Here's why I like them, as a list of features we can aim to reproduce:

Image 1

The text is faded in many places and to varying degrees
the source document was slightly rotated when scanning
there is a page border effect on one side
some hand-written text is visible on the bottom, and faded on the side.

Image 2

distortion in the top-border
hand-written redaction done with a marker (new effect?)
visible print lines
"snow"/random noise in places around the image

Image 3

the bleedthrough effect is very pronounced here, and importantly the reverse side does not contain the same text as the front
distorted page edges
lighting gradient on page edges
stains
noise along the bottom edge
the texture of the page is apparent

Image 4

tons of noise
redactions
complex page border geometry
multiple styles of handwritten text
stamps!

Image 5

page borders
redactions
punched holes on the top edge!
noise along the top and bottom, noise in the top right
more handwritten text, some of which overlaps printed text

Replication

I think most of what I've listed here is reproducible with existing augmentations; in particular, I think all of the handwriting can be produced by (some straightforward generalization of) the new pencil augmentation, and we can already do serviceable page borders, fading, noise, brightness texturization, bleedthrough, and so on wherever we like.

What we don't have, though, is a way to mimic stamps, redaction, or binder holes. We should be able to do binder holes synthetically rather easily, and the challenge with redaction would be correctly placing the mark on the text, but I'm not sure how to approach stamps without including a collection of them, the way the paper textures are included. Maybe stamps are out of scope.
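
To make the binder-hole idea concrete, here is a minimal sketch of drawing evenly spaced holes along the top edge of a page. It is only an illustration under stated assumptions: `add_punch_holes` and `evenly_spaced_centers` are hypothetical names, not Augraphy functions; the page is a nested list of grayscale values rather than a real image array; and holes are assumed to scan as solid dark discs, which will not hold for every scanner.

```python
def add_punch_holes(image, centers, radius, hole_value=0):
    """Return a copy of a grayscale image (nested lists) with filled discs drawn."""
    out = [row[:] for row in image]
    height = len(out)
    width = len(out[0]) if height else 0
    for cx, cy in centers:
        # Only visit the bounding box of each disc, clipped to the page.
        for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
                if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                    out[y][x] = hole_value
    return out


def evenly_spaced_centers(width, count, y, margin):
    """Hole centres spread evenly along a horizontal line at height y."""
    if count == 1:
        return [(width // 2, y)]
    step = (width - 2 * margin) / (count - 1)
    return [(round(margin + i * step), y) for i in range(count)]


# A blank white 40 x 20 page with three holes near the top edge.
page = [[255] * 40 for _ in range(20)]
centers = evenly_spaced_centers(width=40, count=3, y=4, margin=8)
punched = add_punch_holes(page, centers, radius=2)
```

Randomizing the radius, spacing, and a small positional jitter per hole would be the natural next step toward something that looks scanned rather than drawn.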

kwcckw · 2021-08-25T08:29:34Z

Here's the compiled list of images, drawn only from the imagesA folder of the RVL-CDIP dataset:
https://drive.google.com/drive/folders/17Z4VRqc1w9ZKpjoO3dulSqXTpxPVutMN?usp=sharing

Some are not fully reproducible, but I included them here since they look interesting enough.

img_67.png (Not fully reproducible, some augmentations do not exist yet)
https://drive.google.com/file/d/1xJFoZ3XTA45MZ6LtC7sWMWOUGeUhHSKR/view?usp=sharing

Badphotocopy
Dirtydrum (half page with faded effect)
Page border (unusual in this case because the effect is not at the page edge)
Scribbles (Numbers) (we don't have this effect yet)
Punch holes (we don't have this effect yet)
Clip marks (we don't have this effect yet)

img_299.png (Fully reproducible with some changes in the current code)
https://drive.google.com/file/d/17ev5mpuqMWd3l90nPCH9wbkNyekAhzB7/view?usp=sharing

Badphotocopy (Very random noises)
Page border on top

img_34123.png (Fully reproducible with some changes in the current code)
https://drive.google.com/file/d/1T34aLs58_V6xaHB0WfpttFMyyXYDQFB5/view?usp=sharing

Badphotocopy (surrounding the page at page border only)

img_34212.png (Fully reproducible)
https://drive.google.com/file/d/1ctNWE9Fx7zOlBNfKoFCnRRhDuABu-OAL/view?usp=sharing

Dirty roller
Page borders

img_34462.png (Not fully reproducible, some augmentations do not exist yet)
https://drive.google.com/file/d/1IKAlc0ejGLFKiuKMhnSSPbOfXtM95hEU/view?usp=sharing

Scribbles (words)(we don't have this effect yet)
Page borders

img_109076.png (Not fully reproducible, some augmentations do not exist yet)
https://drive.google.com/file/d/1u1FSbsNYAXcfp3GLD_WkjmSuJpbSt6aB/view?usp=sharing

Incomplete page (torn off?) (we don't have this effect yet)
Page border
Bad photocopy
Punch holes (we don't have this effect yet)

img_136566.png (Fully reproducible)
https://drive.google.com/file/d/1MbcKbPDUNQlcVwFw4GoIPCZLJHcmUXdt/view?usp=sharing

Book binding
Dirty roller (minor)

img_166297.png (Fully reproducible)
https://drive.google.com/file/d/1Z5RAP_DkUMnVAqd2R0g4m2tEGHvjEIGK/view?usp=sharing

Letterpress
Page border (very minor at bottom)

img_171248.png (Not fully reproducible, some augmentations do not exist yet)
https://drive.google.com/file/d/1iYzC3NbfIMMx7GcJ6JQUgOJouC_NM1ek/view?usp=sharing

Blobs of noise (we don't have this effect yet)

img_179298.png (Fully reproducible)
https://drive.google.com/file/d/1kRQarbuQF96tVUndHQUAyBqSFfW3qs0T/view?usp=sharing

Page border
Bookbinding
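
To see which missing effects would unlock the most samples, the entries marked "we don't have this effect yet" in the list above can be tallied. This is just a small bookkeeping sketch: the dictionary is transcribed by hand from that list, and the short effect labels are my own shorthand rather than Augraphy class names.

```python
from collections import Counter

# Effects flagged as not yet available, per sample image in the list above.
missing_effects = {
    "img_67.png": ["scribbles", "punch holes", "clip marks"],
    "img_34462.png": ["scribbles"],
    "img_109076.png": ["torn page", "punch holes"],
    "img_171248.png": ["noise blobs"],
}

# Count how many sample images each missing effect is blocking.
tally = Counter(
    effect for effects in missing_effects.values() for effect in effects
)

for effect, count in tally.most_common():
    print(f"{effect}: {count}")
```

Scribbles and punch holes each block two of the ten samples, so they look like the most useful effects to add first, at least going by this small sample.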

shaheryar1 · 2021-08-26T11:00:24Z

Looking at the Resume category of the Tobacco3482 dataset, I found that almost all of the resumes are grayscale images with plain text on them, so the Ink phase is very useful for reproducing such documents.

Here is a short list of resumes extracted from the dataset which collectively represent the noise found in the dataset overall.

https://drive.google.com/file/d/1OXdjEhbNE6moWaia0ORxUGPeU9tx69J8/view?usp=sharing

InkBleed (with high intensity ranges)
LowInkLine (Periodic)
DustyInk
PageBorder (Left)

https://drive.google.com/file/d/108OO9dGNF3FZmv-PpMFrrDYUfDBGZzD9/view?usp=sharing

InkBleed (low intensity)
Punch Holes (not present in Augraphy; discussion continues in "Add new augmentation - Binding holes, punch holes, clip/pin mark" #62)
Page Border (top and bottom)

https://drive.google.com/file/d/1AdZyOpMtecIPRswVd6kwDooToKjY8G3u/view?usp=sharing

InkBleed (with high intensity ranges)
Jpeg Compression
Subtle Noise
PencilScribbles (of very small size and full-black color)
PageBorder on top (with a margin and diminish effect)
Applying noisy-blobs in a straight vertical line of the page (similar to Folding effect)

jboarman added discussion and test labels Jul 26, 2021

proofconstruction mentioned this issue Aug 26, 2021

Add new Augmentation - Text Strikethrough #63

Closed

jboarman added sample images and removed test labels Mar 11, 2023
