OCR Engine to extract Food-items and Prices from Receipt Images via Pattern matching and heuristics approach

Rafi Ullah, Ali Sohani, Faraz Ali, Athaul Rai


@inproceedings{ullah2017ocr,
author = {Ullah, Rafi and Sohani, Ali and Ali, Faraz and Rai, Athaul},
year = {2017},
month = {12},
pages = {},
title = {OCR Engine to extract Food-items and Prices from Receipt Images via Pattern matching and heuristics approach}
}

Pipeline

Receipt detection | Receipt localization | Receipt normalization | Text line segmentation | Optical character recognition | Semantic analysis
✔️ ✔️ ✔️

Receipt localization

  • Image Background Removal

Receipt normalization

Optical character recognition

  • Tesseract OCR Library

Semantic analysis

  • Fields extracted:
    • item names,
    • item prices,
    • item quantities.
  • Regular expressions are used for pattern matching, i.e. to find item names, item prices and item quantities.
  • Words such as "change due", "Walmart" and "total" are removed.
  • Words and lines containing "constant words" such as total, discount and sub total are removed.
  • Heuristic: only lines that contain an item name and a number (price and quantity) are kept; all other lines are discarded.
  • Some of the observed receipts have a different structure, with items and prices on separate lines. The heuristic is still applied to them, but it performs poorly on this kind of receipt.
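
The filtering and line heuristic listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `CONSTANT_WORDS` list, the `LINE_PATTERN` regular expression, the function name `parse_receipt_lines`, the assumed line layout (name, optional quantity, price) and the default quantity of 1 are all assumptions made for this example.

```python
import re

# Hypothetical "constant words"; the source only names a few of them.
CONSTANT_WORDS = ("sub total", "subtotal", "total", "discount", "change due", "walmart")

# Assumed layout: item name, optional integer quantity, price with two decimals.
LINE_PATTERN = re.compile(
    r"^(?P<name>[A-Za-z][A-Za-z .&'-]*?)\s+"
    r"(?:(?P<qty>\d+)\s+)?"
    r"\$?(?P<price>\d+\.\d{2})$"
)


def parse_receipt_lines(lines):
    """Keep only OCR lines that look like 'name [quantity] price'."""
    items = []
    for raw in lines:
        line = raw.strip()
        # Drop any line that contains a constant word (total, discount, ...).
        if any(word in line.lower() for word in CONSTANT_WORDS):
            continue
        match = LINE_PATTERN.match(line)
        if match is None:
            # No item name followed by a number on the same line: discard.
            continue
        # Assumption: a missing quantity means a single item.
        quantity = int(match.group("qty")) if match.group("qty") else 1
        items.append(
            {
                "name": match.group("name").strip(),
                "quantity": quantity,
                "price": float(match.group("price")),
            }
        )
    return items


if __name__ == "__main__":
    ocr_lines = ["WALMART", "Chicken Burger 2 5.99", "Fries 1.50", "Milk", "4.25", "TOTAL 11.74"]
    for item in parse_receipt_lines(ocr_lines):
        print(item)
```

In this sketch the lines "Milk" and "4.25" are both discarded, because the name and the price sit on separate lines. That is the same weakness the last bullet describes for receipts with a different structure.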

Notes

  • Before localization, image stitching is performed if more than one photo of the receipt is available.

  • Generic receipt parser.
