Intelligent document processing with AI

ML-based PDF information extraction system with storage and search functions.

Currently pretrained on Hungarian EKR documents (a type of official national contract), but you can train 6 different models on your own data.

Services:

upload PDFs
- store the PDF file in AWS S3
- extract text with Tesseract OCR
- extract information using the model service
- store text data in Elasticsearch
search
- return PDF text data by query (match/Levenshtein/regex/...)
download
- download a PDF file by filename
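
The three search modes named above map naturally onto Elasticsearch query types. The following is a minimal sketch of how such request bodies could be built, not the backend's actual code: the function name `build_search_body`, the mode names, and the `text` field are all assumptions made for illustration. It only constructs the JSON body, so it runs without an Elasticsearch cluster.

```python
from typing import Any, Dict


def build_search_body(mode: str, query: str, field: str = "text",
                      size: int = 10) -> Dict[str, Any]:
    """Build an Elasticsearch request body for one search mode.

    The mode names and the default field name are hypothetical; adjust
    them to whatever the index mapping actually uses.
    """
    if mode == "match":
        # Full-text search: the query string is analyzed before matching.
        clause = {"match": {field: query}}
    elif mode == "levenshtein":
        # The fuzzy query matches terms within a Levenshtein edit distance;
        # "AUTO" picks the allowed distance from the term length.
        clause = {"fuzzy": {field: {"value": query, "fuzziness": "AUTO"}}}
    elif mode == "regex":
        # The regexp query matches indexed terms against a regular expression.
        clause = {"regexp": {field: {"value": query}}}
    else:
        raise ValueError(f"unsupported search mode: {mode!r}")
    return {"query": clause, "size": size}
```

Note that `fuzzy` and `regexp` are term-level queries, so they compare against individual indexed terms rather than the analyzed query string. A typo-tolerant lookup would then be `build_search_body("levenshtein", "szerzodes")`, passed as the body of a search request.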

Tech: JavaScript, Express.js, Pdf-Poppler, Tesseract-OCR, Elasticsearch, AWS S3

The backend can run any .py or .ipynb file, as long as it follows the expected input/output formats.

predict
- batch text information extraction with CRFSuite ML model (Conditional Random Fields)
- many other models were tried, but they reached lower accuracy on this amount of data
train
- todo
tested models (dataset):
- Custom neural networks:
  - Embedding + bi-LSTM
  - Embedding + bi-LSTM + LSTM
  - Embedding + bi-LSTM + LSTM + CRF
- BERT
- XGBoost
- CRFSuite
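
A CRF tagger labels each token using hand-crafted features of the token and its neighbours. The sketch below shows one common way to build such features. It is an illustration only: the feature set, the function names `token_features` and `sequence_features`, and the sample tokens are assumptions, not the features this project actually uses.

```python
from typing import Dict, List, Union

Feature = Union[str, bool, float]


def token_features(tokens: List[str], i: int) -> Dict[str, Feature]:
    """Return a feature dict for the token at position i (hypothetical feature set)."""
    tok = tokens[i]
    feats: Dict[str, Feature] = {
        "bias": 1.0,
        "lower": tok.lower(),
        "suffix3": tok[-3:].lower(),
        "is_upper": tok.isupper(),
        "is_title": tok.istitle(),
        "has_digit": any(c.isdigit() for c in tok),
    }
    # Context features: the neighbouring tokens, or sequence-boundary markers.
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True
    return feats


def sequence_features(tokens: List[str]) -> List[Dict[str, Feature]]:
    """Return one feature dict per token, in order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# Made-up sample tokens, for illustration only.
example = sequence_features(["Szerződés", "száma", "EKR000123"])
```

Each resulting list of dicts, together with a matching list of labels, is the kind of input a CRF trainer such as python-crfsuite accepts as one training sequence.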

Tech: Python, Flask, Keras, PyTorch, BERT, XGBoost, PyCRFSuite

Tech: JavaScript, React
