OCR Engine to extract Food-items and Prices from Receipt Images via Pattern matching and heuristics approach
@inproceedings{ullah2017ocr,
author = {Ullah, Rafi and Sohani, Ali and Ali, Faraz and Rai, Athaul},
year = {2017},
month = {12},
pages = {},
title = {OCR Engine to extract Food-items and Prices from Receipt Images via Pattern matching and heuristics approach}
}
Receipt detection | Receipt localization | Receipt normalization | Text line segmentation | Optical character recognition | Semantic analysis |
---|---|---|---|---|---|
❌ | ✔️ | ✔️ | ❌ | ❗ | ✔️ |
- Image Background Removal
- Otsu’s Binarization
- Image de-skewing, following https://www.pyimagesearch.com/2017/02/20/text-skew-correction-opencv-python/
- Image resizing
  - Images with a resolution above 300 DPI (dots per inch) were observed to give good results.
  - Bicubic interpolation is used for resizing.
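The Otsu binarization step listed above can be illustrated with a minimal, dependency-free sketch. It is not the paper's implementation: the function names `otsu_threshold` and `binarize` and the tiny sample image are assumptions made for illustration. The threshold is chosen to maximize the between-class variance of the gray-level histogram.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for an iterable of 8-bit gray values (0-255)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0      # number of pixels at or below the candidate threshold
    sum_bg = 0    # sum of gray values at or below the candidate threshold
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue          # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break             # no foreground pixels left
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor 1 / total**2)
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map a 2-D list of gray values to 0 / 255 using the Otsu threshold."""
    t = otsu_threshold(p for row in image for p in row)
    return [[255 if p > t else 0 for p in row] for row in image]


# Hypothetical 2x3 image: dark ink values next to bright paper values
sample = [[20, 22, 200],
          [21, 210, 205]]
binary = binarize(sample)
```

In practice a library routine would normally be preferred over a hand-written loop, for example OpenCV's `cv2.threshold` with the `cv2.THRESH_OTSU` flag. The sketch only makes the selection criterion explicit.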
- Tesseract OCR Library
- Fields extracted:
  - item names
  - item prices
  - item quantities
- Regular expressions are used for pattern matching, i.e. to find item names, item prices and item quantities.
- Words such as "change due", "walmart" and "total" are removed.
- Words and lines containing “constant words” such as total, discount and sub total are removed.
- Heuristic: only lines that contain both an item name and a number (price and quantity) are kept; all other lines are discarded.
- Some of the receipts observed have a different structure, with items and prices on separate lines. The heuristic is still applied to them, but it performs poorly on this kind of receipt.
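The filtering and pattern-matching steps above can be sketched as follows. This is a hedged illustration, not the authors' code: the stop-word list `STOP_WORDS`, the pattern `LINE_RE`, the function `extract_items` and the sample receipt text are all assumptions. The sketch assumes the OCR output is plain text with one receipt line per text line, an optional leading quantity, and a price with two decimals at the end of the item line.

```python
import re

# Assumed list of "constant words"; the paper names total, discount,
# sub total, change due and walmart as examples.
STOP_WORDS = ("sub total", "subtotal", "total", "discount",
              "change due", "walmart")

# Optional quantity, then an item name, then a price such as 1.20 or $1.20.
LINE_RE = re.compile(
    r"^(?:(?P<qty>\d+)\s*[xX]?\s+)?"      # optional quantity, e.g. "2" or "2 x"
    r"(?P<name>[A-Za-z][A-Za-z \-&']*?)"  # item name (letters, spaces, - & ')
    r"\s+\$?(?P<price>\d+\.\d{2})\b"      # price with two decimals
)


def extract_items(ocr_text):
    """Return a list of {name, quantity, price} dicts from OCR text."""
    items = []
    for line in ocr_text.splitlines():
        line = line.strip()
        lowered = line.lower()
        # Drop empty lines and lines containing a constant word
        if not line or any(word in lowered for word in STOP_WORDS):
            continue
        # Keep only lines that hold both an item name and a price
        match = LINE_RE.match(line)
        if match is None:
            continue
        items.append({
            "name": match.group("name").strip(),
            "quantity": int(match.group("qty")) if match.group("qty") else 1,
            "price": float(match.group("price")),
        })
    return items


# Hypothetical OCR output used only to exercise the sketch
sample_text = """WALMART SUPERCENTER
2 BANANAS 1.20
BREAD $2.50
SUBTOTAL 3.70
TOTAL 3.70
CHANGE DUE 0.00
THANK YOU"""

items = extract_items(sample_text)
```

Because every line is matched on its own, this sketch shares the weakness noted above: a receipt that prints the item name and its price on separate lines yields no matches.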