This repo contains experiments from the article about PDF text layer correctness.

alexander1999-hub/txt_layer_correctness

Text layer correctness experiments

This repository lets you run experiments with all the methods described in the paper.

Requirements

This repository requires Python 3.9.
You can create a virtual environment and install the dependencies listed in requirements.txt.

To use RuBert, you also need to install versions of torch and torchvision that match your GPU and CUDA version.
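Before running the RuBert experiments, it can help to confirm that the interpreter and the optional GPU stack are set up as described above. The script below is a minimal sketch and is not part of this repository; the function name `check_environment` is hypothetical. It only reports what it finds and does not install anything.

```python
import importlib.util
import sys


def check_environment():
    """Report the Python version and whether torch/torchvision and CUDA are usable.

    Hypothetical helper for illustration; it is not shipped with this repository.
    """
    report = {
        "python_version": "{}.{}".format(*sys.version_info[:2]),
        "python_is_3_9": sys.version_info[:2] == (3, 9),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torchvision_installed": importlib.util.find_spec("torchvision") is not None,
        "cuda_available": None,  # stays None when torch cannot be imported
    }
    if report["torch_installed"]:
        try:
            import torch

            report["cuda_available"] = torch.cuda.is_available()
        except (ImportError, OSError):
            # torch is present on disk but failed to load (e.g. a mismatched build).
            report["cuda_available"] = None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `cuda_available` is reported as `False` on a machine with a GPU, the installed torch build most likely does not match the local CUDA version.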

Dataset

The synthetic training dataset and the benchmark dataset are downloaded automatically when you run main.py.
All data is stored in the ./data folder, which is also created automatically.
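To see what the automatic download actually placed in ./data, you can list the folder after the first run. The snippet below is a small sketch with a hypothetical helper name, `summarize_data_dir`; it makes no assumption about the file names or layout that main.py produces.

```python
from pathlib import Path


def summarize_data_dir(data_dir="data"):
    """Return a sorted list of (relative path, size in bytes) for files under data_dir.

    Returns an empty list if the folder does not exist yet,
    for example before main.py has been run for the first time.
    """
    root = Path(data_dir)
    if not root.is_dir():
        return []
    return sorted(
        (path.relative_to(root).as_posix(), path.stat().st_size)
        for path in root.rglob("*")
        if path.is_file()
    )


if __name__ == "__main__":
    for name, size in summarize_data_dir():
        print(f"{size:>12}  {name}")
```

An empty listing after running main.py would suggest that the download did not complete.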

Experiments

You can run experiments with XGBoost, Random Forest, Logistic Regression, N-Gram, and RuBert with the following command:

python main.py

By default, it runs experiments with all methods except RuBert, using the TF-IDF feature extractor.

  • You can select the models for the experiments by changing the models list in main.py.
  • You can also select the feature extractor by changing the value of final_feature_extractor in main.py.
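The two settings above can be pictured roughly as follows. This is only an illustrative sketch: the variable names `models` and `final_feature_extractor` come from this README, but the string labels, the `KNOWN_MODELS` tuple, and the `select_models` helper are assumptions, and the values that main.py actually expects may differ. Check main.py before editing.

```python
# Labels copied from this README; the identifiers main.py really uses may differ.
KNOWN_MODELS = ("XGBoost", "Random Forest", "Logistic Regression", "N-Gram", "RuBert")


def select_models(requested):
    """Validate a list of model labels against KNOWN_MODELS (hypothetical helper)."""
    unknown = [name for name in requested if name not in KNOWN_MODELS]
    if unknown:
        raise ValueError(f"Unknown model(s): {', '.join(unknown)}")
    return list(requested)


# Mirrors the documented default: every method except RuBert.
models = select_models([name for name in KNOWN_MODELS if name != "RuBert"])

# Hypothetical value standing in for the TF-IDF feature extractor.
final_feature_extractor = "TF-IDF"

if __name__ == "__main__":
    print("models:", models)
    print("feature extractor:", final_feature_extractor)
```

Validating the list up front means a misspelled model name fails immediately instead of partway through a long run.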
