Introduction

Google Vision provides two options for optical character recognition (OCR).

- Option 1: TEXT_DETECTION - Words with coordinates
- Option 2: DOCUMENT_TEXT_DETECTION - OCR on dense text to extract lines and paragraph information

The second option suits data extraction from dense text such as newspapers and books. It uses an intelligent segmentation method that merges nearby words into lines and paragraphs.

This behaviour is undesirable for images with sparse text, such as retail invoices, where data belonging to the same line sits at opposite edges of the page (for example, a large gap of whitespace between the product name and the price). For these images the OCR returns the lines in a different order: if two words on the same line are too far apart, Google Vision identifies them as separate paragraphs or lines.

The images below show sample output from Google Vision for a typical invoice.

This behaviour creates a problem for information extraction. For example, to extract the price of a product from a retail invoice, the system needs a way to match words that are on the same line. The algorithm proposed below performs line segmentation for data extraction based on the bounding polygon coordinates of the characters.
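As a rough illustration of the input the algorithm works on, the sketch below pulls word-level text and polygon vertices out of a Vision response object. The function name `wordsFromAnnotation` and the sample data are made up for illustration and are not taken from this repository's code. The sketch assumes the `textAnnotations` layout of a Vision response, where the first entry holds the full text and the later entries hold individual words.

```javascript
// Hypothetical helper: converts one Vision annotation result into a flat word list.
function wordsFromAnnotation(annotation) {
  const entries = annotation.textAnnotations || [];
  // entries[0] covers the whole detected text; the rest are individual words.
  return entries.slice(1).map((entry) => ({
    text: entry.description,
    // JSON responses can omit a coordinate that equals 0, so default it.
    vertices: entry.boundingPoly.vertices.map((v) => ({ x: v.x || 0, y: v.y || 0 })),
  }));
}

// Made-up sample shaped like a Vision response for one invoice line.
const sampleAnnotation = {
  textAnnotations: [
    {
      description: "Milk $3.40",
      boundingPoly: {
        vertices: [{ y: 100 }, { x: 550, y: 100 }, { x: 550, y: 122 }, { y: 122 }],
      },
    },
    {
      description: "Milk",
      boundingPoly: {
        vertices: [{ y: 100 }, { x: 60, y: 100 }, { x: 60, y: 120 }, { y: 120 }],
      },
    },
    {
      description: "$3.40",
      boundingPoly: {
        vertices: [{ x: 500, y: 102 }, { x: 550, y: 102 }, { x: 550, y: 122 }, { x: 500, y: 122 }],
      },
    },
  ],
};

const words = wordsFromAnnotation(sampleAnnotation);
console.log(words.map((w) => w.text)); // [ 'Milk', '$3.40' ]
```

Each word keeps its four vertices, which is the information the line segmentation relies on.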

Usage Guide

Usage instructions for each programming language are located in the README files inside the relevant folders.

Proposed Algorithm

The implemented algorithm runs in two stages:

Stage 1 - Groups nearby words to form a longer strip of a line
Stage 2 - Connects words which are far apart using the bounding polygon approach

Explanation.

Stage 1 merges words and characters that are very close together into a single text block or word. This step is necessary because Google Vision can present price text such as $3.40 as two words (word 1: $3. and word 2: 40). Merging these fragments first also reduces the computation needed in Stage 2.
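A minimal sketch of the Stage 1 idea follows. It is not the repository's implementation: the word shape, the axis-aligned `box` (a simplification of Vision's four-vertex polygon), the function names, and the `maxGap` value are all assumptions chosen for illustration. Fragments are joined without a space, which suits split prices but would need adjusting for other text.

```javascript
// Hypothetical word shape: { text, box: { x0, y0, x1, y1 } }.

// True when two boxes share some vertical extent, so they could sit on one line.
function verticalOverlap(a, b) {
  return Math.min(a.y1, b.y1) - Math.max(a.y0, b.y0) > 0;
}

// Merge words whose horizontal gap is at most maxGap pixels (an assumed tuning value).
function mergeNearbyWords(words, maxGap) {
  // Process left to right so each word can only attach to a block on its left.
  const sorted = [...words].sort((a, b) => a.box.x0 - b.box.x0);
  const blocks = [];
  for (const word of sorted) {
    const target = blocks.find(
      (b) => verticalOverlap(b.box, word.box) && word.box.x0 - b.box.x1 <= maxGap
    );
    if (target) {
      // Append the fragment and grow the block's box to cover both pieces.
      target.text += word.text;
      target.box = {
        x0: Math.min(target.box.x0, word.box.x0),
        y0: Math.min(target.box.y0, word.box.y0),
        x1: Math.max(target.box.x1, word.box.x1),
        y1: Math.max(target.box.y1, word.box.y1),
      };
    } else {
      blocks.push({ text: word.text, box: { ...word.box } });
    }
  }
  return blocks;
}

// Made-up sample: a price split into two fragments, far from the product name.
const merged = mergeNearbyWords(
  [
    { text: "40", box: { x0: 532, y0: 100, x1: 550, y1: 120 } },
    { text: "Milk", box: { x0: 10, y0: 101, x1: 60, y1: 119 } },
    { text: "$3.", box: { x0: 500, y0: 100, x1: 530, y1: 120 } },
  ],
  5
);
console.log(merged.map((b) => b.text)); // [ 'Milk', '$3.40' ]
```

The product name stays separate here because its gap to the price is far larger than `maxGap`; connecting such distant blocks is left to Stage 2.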

Stage 2 draws an imaginary bounding polygon (with a threshold) over each word and determines which words belong to each line.
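One way to read the bounding polygon idea is sketched below, under stated assumptions: the top and bottom edges of a block are extended across the page, widened by a threshold, and any block whose centre falls inside that strip is treated as part of the same line. The block shape, the vertex order (top-left, top-right, bottom-right, bottom-left), and all function names are hypothetical; the repository's actual code may differ.

```javascript
// Hypothetical block shape: { text, vertices: [tl, tr, br, bl] } with {x, y} points.

// Centre of a polygon, taken as the mean of its vertices.
function center(vertices) {
  const n = vertices.length;
  return {
    x: vertices.reduce((sum, v) => sum + v.x, 0) / n,
    y: vertices.reduce((sum, v) => sum + v.y, 0) / n,
  };
}

// y-value at x on the infinite line through p and q.
function yOnLine(p, q, x) {
  if (q.x === p.x) return p.y; // degenerate edge: treat it as horizontal
  return p.y + ((q.y - p.y) / (q.x - p.x)) * (x - p.x);
}

// True when point lies between the anchor's extended top and bottom edges.
function inExtendedStrip(anchor, point, threshold) {
  const [tl, tr, br, bl] = anchor.vertices;
  const top = yOnLine(tl, tr, point.x) - threshold;
  const bottom = yOnLine(bl, br, point.x) + threshold;
  return point.y >= top && point.y <= bottom;
}

// Group blocks into text lines, top to bottom, each line read left to right.
function groupIntoLines(blocks, threshold = 0) {
  const remaining = [...blocks].sort(
    (a, b) => center(a.vertices).y - center(b.vertices).y
  );
  const lines = [];
  while (remaining.length > 0) {
    const anchor = remaining.shift();
    const members = [anchor];
    // Walk backwards so splice does not disturb the indices still to visit.
    for (let i = remaining.length - 1; i >= 0; i--) {
      if (inExtendedStrip(anchor, center(remaining[i].vertices), threshold)) {
        members.push(remaining.splice(i, 1)[0]);
      }
    }
    members.sort((a, b) => center(a.vertices).x - center(b.vertices).x);
    lines.push(members.map((m) => m.text).join(" "));
  }
  return lines;
}

// Made-up invoice rows with a large gap between product name and price.
const rect = (x0, y0, x1, y1) => [
  { x: x0, y: y0 }, { x: x1, y: y0 }, { x: x1, y: y1 }, { x: x0, y: y1 },
];
const lines = groupIntoLines([
  { text: "$2.10", vertices: rect(500, 142, 550, 162) },
  { text: "Milk", vertices: rect(10, 100, 60, 120) },
  { text: "Bread", vertices: rect(10, 140, 70, 160) },
  { text: "$3.40", vertices: rect(500, 102, 550, 122) },
]);
console.log(lines); // [ 'Milk $3.40', 'Bread $2.10' ]
```

Because the strip follows the slope of the anchor's own edges, this reading tolerates a slanted line, while a fold or heavy crumple that bends the line would push distant words out of the strip.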

Issues.

The algorithm works for most slanted and slightly crumpled images, but it fails on highly crumpled or folded images.

Test

Node JS

cd nodejs
npm install
npm test

Future Work

Implement the water-flow algorithm for line segmentation and compare its accuracy with the bounding polygon approach.

jamesrowe08/line-segmentation-algorithm-to-gcp-vision
