Sample Validation Datasets -- What is Augraphy Trying to Reproduce? #43
Creating the style should be feasible. Take this example: from it we can see several matching augmentations, such as page borders, ink bleed, and dusty ink. The content itself could be a problem, though, since it may not be practical to create all of that content on our own. So maybe we can find some open-source digital documents and apply the relevant augmentations to them so that their style matches? For example, apply similar augmentations to the letter below:
Good point. We definitely want simple, clean inputs that are representative of something one might have in a ground truth dataset. In theory, a practitioner would generate the input from text so that the ground truth is programmatically known in advance, much like you are doing in your "Lorem ipsum dolo" sample notebook. But for our test suite, we just need some simple, clean images that could plausibly be generated from a script. The other consideration is the inclusion of images like the provided sample.
Okay, so I think we can do it either way. If the source document has a specific format, such as a formal email, quotation, or newspaper, then we need to get a digital copy of that document, or at least use lines or boxes to reproduce it in a similar format. Otherwise, generating text from code would be the better choice since it is more straightforward.
It shouldn't be too hard to write some functions that produce different documents given some strings. We can already easily write strings into images, and if we had a decent specification of some different kinds of documents (where lines should be drawn, where different areas of text should go), we could turn these into classes that produce ground-truthed documents, which could then be fed into Augraphy to produce the larger training set, which will all have ground truths. This collection of document-generation classes should probably be its own package though.
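To make the "specification of a document" idea a bit more concrete, here is a rough sketch of what one such class might look like. Everything in it is hypothetical (`TextRegion`, `LetterTemplate`, the page size, and the spacing are all made-up names and values, not anything that exists in Augraphy). It only computes where each string and rule line would go and records the ground truth; actually drawing the text onto an image would be left to whatever imaging library we pick.

```python
from dataclasses import dataclass


@dataclass
class TextRegion:
    """A rectangular area of the page holding one known string."""
    name: str
    text: str
    box: tuple  # (left, top, right, bottom) in pixels


@dataclass
class LetterTemplate:
    """Hypothetical layout spec for a simple letter (all values are assumptions)."""
    width: int = 850
    height: int = 1100
    margin: int = 75
    line_height: int = 30

    def _box(self, y):
        # One full-width text line starting at vertical offset y.
        return (self.margin, y, self.width - self.margin, y + self.line_height)

    def build(self, sender, recipient, body_lines):
        """Place each string on the page; return text regions and rule lines."""
        regions = []
        y = self.margin
        for name, text in [("sender", sender), ("recipient", recipient)]:
            regions.append(TextRegion(name, text, self._box(y)))
            y += self.line_height
        # A horizontal rule separates the header from the body.
        rule_y = y + self.line_height // 2
        rules = [(self.margin, rule_y, self.width - self.margin, rule_y)]
        y += self.line_height
        for i, line in enumerate(body_lines):
            regions.append(TextRegion(f"body_{i}", line, self._box(y)))
            y += self.line_height
        return regions, rules


def ground_truth(regions):
    """Flatten regions into records that an OCR evaluation could compare against."""
    return [{"name": r.name, "text": r.text, "box": r.box} for r in regions]


regions, rules = LetterTemplate().build(
    "Jane Doe",
    "John Smith",
    ["Lorem ipsum dolor sit amet,", "consectetur adipiscing elit."],
)
truth = ground_truth(regions)
```

Because the text and its position are known before any pixels are drawn, every augmented copy made from the rendered page would inherit the same ground truth records.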
I need your help. We need to identify 5-10 archetypes, or sample documents, that we aim to reproduce using Augraphy. For each type of sample document, we then need to create a clean version that closely resembles the original document. This will allow us to validate how well Augraphy can reproduce the noise we hope to be producing. For each sample archetype, we can produce a notebook that demonstrates using Augraphy to go full circle, showing how that noise can be synthesized. The initial notebooks might not look perfect, but they could serve as a starting point for a competition that improves Augraphy towards faithful reproductions of each of these archetypes.
HELP: Can you help identify example documents that show variety in the types of noise and issues seen in these documents? (Don't worry about creating the clean reproduction -- let's sort out the archetypes first!)
Original Target Document
This example document comes from the RVL-CDIP dataset and represents the kind of noisy document that we intend to reproduce with Augraphy:
Clean Reproduction
This is a reproduction, made using Google Documents, of what the original document might have looked like prior to being deteriorated by copying, scanning, etc.:
So right now we are only looking at the RVL-CDIP dataset, right? I need some time to download it since the dataset is large. As a first step, I will go through those images and compile a list of augmentations along with sample images from the dataset. Once I have a sufficient number of samples, I will share the details here.
I’d say any of the sources listed in this issue could be used as a reference for identifying sample documents. It will likely take a lot of process-of-elimination work to narrow it down.
I started a collection of representative images we might reproduce. The five currently there were pulled from the Randomly Collected Documents set. They're numbered 1.jpg, 2.jpg, and so on in the folder, and the notes below correspond to those file names.
Images
Here's why I like them, as a list of features we can aim to reproduce:
Image 1
Image 2
Image 3
Image 4
Image 5
Replication
I think most of what I've listed here is reproducible with existing augmentations; in particular, I think all of the handwriting can be produced by (some straightforward generalization of) the new pencil augmentation, and we can already do serviceable page borders, fading, noise, brightness texturization, bleedthrough, and so on wherever we like. What we don't have, though, is a way to mimic stamps, redaction, or binder holes. We should be able to do binder holes synthetically rather easily, and the challenge with redaction would be correctly placing the mark on the text, but I'm not sure how to approach stamps without having a collection of them included like the paper textures are. Maybe stamps are out of scope.
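As a rough illustration of why binder holes seem easy to synthesize, here is a minimal sketch that stamps filled discs down the left margin of a grayscale page. It is not an existing Augraphy augmentation: `add_binder_holes` and `three_hole_centers` are hypothetical helpers, the page is a plain list of pixel rows to keep the example dependency-free, and the three-hole spacing and the choice to render holes as dark discs (as they often appear in scans) are assumptions.

```python
def add_binder_holes(page, centers, radius, ink=0):
    """Return a copy of a grayscale page (list of pixel rows) with filled discs."""
    out = [row[:] for row in page]  # copy so the clean page stays untouched
    height, width = len(out), len(out[0])
    for cx, cy in centers:
        # Only visit the bounding square of each disc, clipped to the page.
        for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
                if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                    out[y][x] = ink
    return out


def three_hole_centers(width, height, offset=None):
    """Evenly space three holes down the left margin (an assumed layout)."""
    x = offset if offset is not None else width // 20
    return [(x, height * k // 4) for k in (1, 2, 3)]


page = [[255] * 200 for _ in range(260)]  # small blank white page
centers = three_hole_centers(200, 260)
punched = add_binder_holes(page, centers, radius=5)
```

A real version would presumably operate on image arrays and add some jitter to the hole positions and shading, but the placement logic would stay about this simple.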
Here's the compiled list of images, drawn only from the RVL-CDIP dataset. Some are not fully reproducible; I included them here since they look interesting enough.
Looking at the Resume category of the Tobacco3482 dataset, I found that almost all of the resumes are grayscale images with plain text on them. Here is a small list of resumes extracted from the dataset which collectively represent the noise in the overall dataset.
https://drive.google.com/file/d/1OXdjEhbNE6moWaia0ORxUGPeU9tx69J8/view?usp=sharing
https://drive.google.com/file/d/108OO9dGNF3FZmv-PpMFrrDYUfDBGZzD9/view?usp=sharing
https://drive.google.com/file/d/1AdZyOpMtecIPRswVd6kwDooToKjY8G3u/view?usp=sharing
In general, Augraphy is trying to simplify the process of creating realistic synthetic datasets using only ground truth documents.
Often, training data is not accompanied by clean ground truth sources, which leads to inaccurate training and severely limited volumes of available training data. By starting with clean ground truth data, training sets can be created that represent printed, scanned, copied and faxed documents encountered in the real world AND have 100% accurate training data.
In order to recreate data from these real-world scenarios, we need to create a validation set that is inspired by examples from the real world. Below are sources that may serve as useful source material for attempting to use Augraphy to reproduce the styles and detail seen in these data sets.
Real-World Data Sets
RVL-CDIP dataset consists of 400,000 B/W low-resolution (~100 DPI) images in 16 classes, with 25,000 images per class
https://www.cs.cmu.edu/~aharley/rvl-cdip/
NIST-SFRS (Structured Forms Reference Set) consists of 5,590 pages of binary, black-and-white images of synthesized documents from 12 different tax forms from the IRS 1040 Package X for the year 1988. These include Forms 1040, 2106, 2441, 4562, and 6251 together with Schedules A, B, C, D, E, F, and SE.
https://www.nist.gov/srd/nist-special-database-2
Tobacco3482 dataset from Kaggle offers 10 different classes of forms, letters, reports, etc.
https://www.kaggle.com/patrickaudriaz/tobacco3482jpg
FUNSD (Form Understanding Noisy Scanned Documents) dataset on Kaggle comprises 199 real, fully annotated, scanned forms that are noisy and vary widely in appearance.
https://www.kaggle.com/sharmaharsh/form-understanding-noisy-scanned-documentsfunsd
Randomly Collected Documents is a Google Drive share that contains randomly selected public domain documents.
https://drive.google.com/drive/folders/1JMwmRko1gZ_VYtwXkP7CXPPztsNa_3nv?usp=sharing
Synthetic Data Sets
NoisyOffice data set from the University of California, Irvine contains noisy grayscale printed text images and their corresponding ground truth for both real and simulated documents with 4 types of noise: folded sheets, wrinkled sheets, coffee stains, and footprints. For each type of font, one type of noise: 18 files * 4 types of noise = 72 images.
https://archive.ics.uci.edu/ml/datasets/NoisyOffice
DDI-100 (Distorted Document Images) is a synthetic dataset by Ilia Zharikov et al. based on 7,000 real unique document pages; it consists of more than 100,000 augmented images. Ground truth comprises text and stamp masks, and text and character bounding boxes with relevant annotations.
https://arxiv.org/abs/1912.11658
https://github.com/machine-intelligence-laboratory/DDI-100/tree/master/dataset
https://paperswithcode.com/paper/ddi-100-dataset-for-text-detection-and