
HOWTO: Training the Self-Supervised Model


The configuration parameters for training can be found in settings.toml; the code loads the training/inference paths directly from this file. If you want to train on a custom dataset, copy settings.toml to settings.local.toml and specify the following parameters:

[default.path]
    data_path='/path/to/the/csv/train'
    checkpoint_path='/path/to/store/the/model'
    logging_dir='/path/to/tensorboard/logging_dir'
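
The ENV_FOR_DYNACONF variable used later on this page indicates that these settings are managed with Dynaconf. For reference, here is a minimal sketch of how such a settings file is typically loaded; the actual loading code inside verb_cluster may differ:

    from dynaconf import Dynaconf

    # Sketch only: read settings.toml plus the local override file.
    # environments=True enables the [default] / [<env_var>] sections.
    settings = Dynaconf(
        settings_files=["settings.toml", "settings.local.toml"],
        environments=True,
    )

    print(settings.path.data_path)        # '/path/to/the/csv/train'
    print(settings.path.checkpoint_path)  # '/path/to/store/the/model'
    print(settings.path.logging_dir)      # '/path/to/tensorboard/logging_dir'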

The file at data_path should be a CSV file separated by | (pipe), with the following columns:

  • sentence: the original sentence to replace
  • verb: the lemmatised verb
  • verbSeedStart: the starting position of the verb
  • verbSeedEnd: the ending position of the verb

Please note that verbSeedStart and verbSeedEnd should denote the verb as it actually appears in the sentence, not the lemmatised form.
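
For illustration only, a single row of such a file could look like the following; whether verbSeedEnd is inclusive is not documented here, so the values below assume character offsets with an exclusive end:

    sentence|verb|verbSeedStart|verbSeedEnd
    The committee approved the proposal.|approve|14|22

Here verbSeedStart and verbSeedEnd point at the surface form "approved" in the sentence, while the verb column holds the lemma "approve".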

To run the training, execute the following line:

python -m verb_cluster.flows.train_flow

Once the training is done, you can find the saved models in checkpoint_path. The code stores the top 3 models with the highest entailment score; you can choose which one you want to use for inference.
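
If you want to see what was saved, you can simply list the contents of checkpoint_path; the file names and formats depend on the training code, so this is only a sketch:

    from pathlib import Path

    # Replace with the checkpoint_path from your settings.local.toml.
    checkpoint_dir = Path("/path/to/store/the/model")
    for ckpt in sorted(checkpoint_dir.iterdir()):
        print(ckpt.name)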

The logging_dir stores the logged data, including the validation loss and entailment percentage. You can keep an eye on the training by using TensorBoard as follows:

tensorboard --logdir=<logging_dir>

If you want to keep multiple configurations at the same time, you can add new sections to the settings.local.toml file with different <env_var> names as follows:

[<env_var>.path]
    data_path='/path/to/the/csv/train'
    checkpoint_path='/path/to/store/the/model'
    logging_dir='/path/to/tensorboard/logging_dir'

and run the training code as follows:

ENV_FOR_DYNACONF=<env_var> python -m verb_cluster.flows.train_flow
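
For example, with a hypothetical environment named experiment1 (the paths below are placeholders), settings.local.toml would contain:

[experiment1.path]
    data_path='/path/to/experiment1/train.csv'
    checkpoint_path='/path/to/experiment1/checkpoints'
    logging_dir='/path/to/experiment1/logs'

and the training would be started with:

ENV_FOR_DYNACONF=experiment1 python -m verb_cluster.flows.train_flow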