ML-based PDF information extraction system with storage and search functions.
Currently pretrained on Hungarian EKR documents (a type of official national contract), but you can train 6 different models on your own data.
upload pdfs
- store pdf file in AWS S3
- extract text with Tesseract OCR
- extract information using the model service
- store text data in Elasticsearch
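The order of the upload steps can be sketched as below. This is only an illustration in Python, not the project's Express.js code: `handle_upload` and its four callable parameters (`store_file`, `ocr`, `extract_info`, `index_document`) are hypothetical names standing in for the S3, Tesseract, model service and Elasticsearch calls, and the shape of the indexed document is an assumption.

```python
from typing import Callable, Dict


def handle_upload(
    filename: str,
    pdf_bytes: bytes,
    store_file: Callable[[str, bytes], None],
    ocr: Callable[[bytes], str],
    extract_info: Callable[[str], Dict[str, str]],
    index_document: Callable[[Dict[str, object]], None],
) -> Dict[str, object]:
    """Run the four upload steps in order and return the indexed document."""
    # 1. Keep the original PDF (AWS S3 in the real system).
    store_file(filename, pdf_bytes)
    # 2. Turn the PDF into plain text (Tesseract OCR in the real system).
    text = ocr(pdf_bytes)
    # 3. Ask the model service for the structured fields.
    fields = extract_info(text)
    # 4. Store text and fields so they can be searched later.
    #    The document layout here is assumed, not taken from the project.
    document = {"filename": filename, "text": text, "fields": fields}
    index_document(document)
    return document
```

Passing the steps in as functions keeps the sketch runnable without any external service; in practice each one would wrap the corresponding client.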
search
- return pdf text data by query (match/Levenshtein/regex/...)
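As a rough illustration of how the three named query modes could map onto Elasticsearch's query DSL, here is a minimal Python sketch that only builds the request bodies. The helper `build_query`, the mode names and the field name `text` are assumptions made for this example; the project's actual index mapping and endpoint parameters are not shown in this README.

```python
from typing import Any, Dict


def build_query(mode: str, text: str, field: str = "text") -> Dict[str, Any]:
    """Return an Elasticsearch request body for one search mode."""
    if mode == "match":
        # Standard analyzed full-text match.
        clause = {"match": {field: {"query": text}}}
    elif mode == "levenshtein":
        # "fuzziness" is an edit (Levenshtein) distance; AUTO scales it
        # with the length of each term.
        clause = {"match": {field: {"query": text, "fuzziness": "AUTO"}}}
    elif mode == "regex":
        # A regexp query is matched against individual indexed terms,
        # not against the whole stored text.
        clause = {"regexp": {field: {"value": text}}}
    else:
        raise ValueError(f"unsupported search mode: {mode!r}")
    return {"query": clause}
```

The returned dictionary is what would be sent as the body of a search request, for example `build_query("levenshtein", "szerzodes")` for a typo-tolerant lookup.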
download
- download pdf file by filename
Tech: JavaScript, Express.js, Pdf-Poppler, Tesseract-OCR, Elasticsearch, AWS S3
The backend can run any .py and .ipynb files as long as they follow the expected input/output formats
predict
- batch text information extraction with a CRFSuite model (Conditional Random Fields)
- several other models were tried, but they reached lower accuracy on this amount of data
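A CRF tagger labels each token from a set of features describing the token and its neighbours. The sketch below shows one common way to build such per-token feature dictionaries for a batch of texts. The feature set, the whitespace tokenizer and the function names `token_features` and `batch_features` are assumptions for illustration, not the features this project actually uses.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Describe token i by its own shape and its immediate neighbours."""
    word = tokens[i]
    features: Dict[str, object] = {
        "bias": 1.0,
        "lower": word.lower(),
        "suffix3": word[-3:],
        "is_upper": word.isupper(),
        "is_title": word.istitle(),
        "is_digit": word.isdigit(),
    }
    if i > 0:
        features["prev_lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sequence
    if i < len(tokens) - 1:
        features["next_lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True  # end of sequence
    return features


def batch_features(texts: List[str]) -> List[List[Dict[str, object]]]:
    """Build one feature sequence per input text (naive whitespace split)."""
    batch = []
    for text in texts:
        tokens = text.split()
        batch.append([token_features(tokens, i) for i in range(len(tokens))])
    return batch
```

Each inner list is one sequence of feature dictionaries, which is the kind of input a trained CRF tagger would label token by token.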
train
- todo
tested models (dataset):
- Custom neural networks:
  - Embedding + bi-LSTM
  - Embedding + bi-LSTM + LSTM
  - Embedding + bi-LSTM + LSTM + CRF
- BERT
- XGBoost
- CRFSuite
Tech: Python, Flask, Keras, PyTorch, BERT, XGBoost, PyCRFSuite
- draft
Tech: JavaScript, React