qdrl

This repository contains the training code for my master's thesis, Joint Multi-Modal Query-Document Representation Learning, which you can read here.

Training

Model training is configured by a config.yaml file that holds the training parameters.

Model training runs on Google Cloud Platform using Vertex AI Training with a custom image. Configs, datasets, and models are stored on Google Cloud Storage; gcsfuse is required.

The training flow is as follows:

1. Create a local training config, local_config.yaml.
2. Build the training Docker image using the dedicated script. Point the training script to local_config.yaml and to your GCP project. The script outputs your ${image_name}.
3. Publish the Docker image to GCR: docker push ${image_name}
4. Upload the training config to GCS.
5. Run the GCP training script with the correct CONTAINER_IMAGE_URI and the GCS path to config.yaml.

A model checkpoint is saved after each epoch; the model from the last epoch is saved separately.

Evaluation metrics (recall@k and mrr@k) are saved and can be visualized in TensorBoard.

Embedding visualization can optionally be turned on if you want to explore the embeddings in the TensorBoard projector.
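As a rough illustration of what these two metrics measure, here is a minimal sketch in plain Python. It assumes each evaluation query has exactly one relevant document id, as the evaluation queries dataset suggests. The function names and toy ids are hypothetical and are not taken from the repository's code.

```python
from typing import Dict, List, Sequence


def recall_at_k(ranked_ids: Sequence[str], relevant_id: str, k: int) -> float:
    """1.0 if the relevant document is among the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def mrr_at_k(ranked_ids: Sequence[str], relevant_id: str, k: int) -> float:
    """Reciprocal rank of the relevant document, or 0.0 if it is outside the top-k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


def evaluate(rankings: Dict[str, List[str]], relevant: Dict[str, str], k: int) -> Dict[str, float]:
    """Average both metrics over all evaluation queries."""
    n = len(rankings)
    recall = sum(recall_at_k(rankings[q], relevant[q], k) for q in rankings) / n
    mrr = sum(mrr_at_k(rankings[q], relevant[q], k) for q in rankings) / n
    return {f"recall@{k}": recall, f"mrr@{k}": mrr}


# Toy example with hypothetical ids.
rankings = {
    "q1": ["d3", "d1", "d7"],  # relevant document d1 is at rank 2
    "q2": ["d5", "d2", "d9"],  # relevant document d4 was not retrieved
}
relevant = {"q1": "d1", "q2": "d4"}
metrics = evaluate(rankings, relevant, k=3)
print(metrics)  # {'recall@3': 0.5, 'mrr@3': 0.25}
```

With one relevant document per query, recall@k reduces to a hit rate, while mrr@k additionally rewards placing the relevant document closer to the top.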

Datasets

Three datasets are required for training and evaluation (details are in the thesis).

Training dataset - pairs of query, relevant document
Evaluation queries dataset (recall_validation_queries_dataset) - pairs of query, relevant document id
Evaluation documents dataset (recall_validation_items_dataset) - candidate pool for evaluation
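To make the relationship between the three datasets concrete, the sketch below models one record of each as a Python dataclass and checks that every evaluation query points at a document in the candidate pool. The class names, field names, and sample records are assumptions for illustration only; the actual feature structure is defined by the dataset section of the config and described in the thesis.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingPair:
    query: str
    document: str  # hypothetical: real documents may carry richer features


@dataclass
class EvalQuery:
    query: str
    relevant_document_id: str


@dataclass
class EvalDocument:
    document_id: str
    text: str  # hypothetical field name


def missing_relevant_ids(queries: List[EvalQuery], documents: List[EvalDocument]) -> List[str]:
    """Return relevant document ids that are absent from the candidate pool."""
    pool = {d.document_id for d in documents}
    return sorted({q.relevant_document_id for q in queries} - pool)


# Hypothetical sample records.
train = [TrainingPair("red running shoes", "Red running shoes, size 42")]
queries = [EvalQuery("red running shoes", "d1"), EvalQuery("usb c cable", "d9")]
documents = [EvalDocument("d1", "Red running shoes, size 42"), EvalDocument("d2", "Blue jeans")]

missing = missing_relevant_ids(queries, documents)
print(missing)  # ['d9']
```

A check like this is worth running before evaluation: a query whose relevant document is missing from the pool can never be counted as a hit, which silently lowers recall@k.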

Config

Example config

Supported training parameters

task_id
run_id
num_epochs
dataset_dir
batch_size
learning_rate
reuse_epoch
dataloader_workers
dataset - structure of training features
loss - can be batch_softmax or triplet
text_vectorizer - path to the token dictionary and tokenization config (word_unigram, word_bigram, char_trigram + oov)
model - can be SimpleTextEncoder, TwoTower, or MultiModalTwoTower
recall_validation - the values of 'k' at which validation is run and whether to generate a dataset with typos
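The two values accepted by the loss parameter correspond to well-known objectives for query-document models. The sketch below shows a common formulation of each in plain Python, operating on precomputed similarity scores. It is not the repository's implementation: the function signatures, the default margin, and the use of raw similarity scores are assumptions made for illustration.

```python
import math
from typing import List


def batch_softmax_loss(scores: List[List[float]]) -> float:
    """In-batch softmax loss.

    scores[i][j] is the similarity of query i and document j in the batch.
    The diagonal entry is the positive pair; every other document in the
    row acts as a negative.
    """
    total = 0.0
    for i, row in enumerate(scores):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_sum_exp - row[i]  # equals -log softmax(row)[i]
    return total / len(scores)


def triplet_loss(pos_score: float, neg_score: float, margin: float = 0.2) -> float:
    """Hinge loss on similarities: the positive must beat the negative by `margin`."""
    return max(0.0, margin - pos_score + neg_score)


# With identical scores everywhere, each row is a uniform distribution over
# two documents, so the loss is log(2).
uniform = batch_softmax_loss([[0.0, 0.0], [0.0, 0.0]])

# A well-separated triplet incurs no loss; a tie is penalized by the margin.
separated = triplet_loss(pos_score=0.9, neg_score=0.1)
tied = triplet_loss(pos_score=0.5, neg_score=0.5)
```

The practical difference is where negatives come from: in-batch softmax reuses the other documents in the batch as negatives, while a triplet loss needs an explicit negative for each query.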

Acknowledgments

Research papers can be found in the thesis. For the code part, special thanks go to:

FAQ

1. The codebase is awful and does not have tests, why?

Best engineering practices do not apply to a master's thesis, sorry.

2. What does 'qdrl' mean?

qdrl stands for Query Document Representation Learning.

3. No distributed training?

No.
