
HOWTO: Training the Self-Supervised Model


The configuration parameters for training can be found in settings.toml; the code loads the training/inference paths directly from this file. If you want to train on a custom dataset, copy settings.toml to settings.local.toml and specify the following parameters:

[default.path]
    data_path='/path/to/the/csv/train'
    checkpoint_path='/path/to/store/the/model'
    logging_dir='/path/to/tensorboard/logging_dir'
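
The ENV_FOR_DYNACONF variable used later on this page indicates that these settings are managed with Dynaconf. For reference, here is a minimal sketch of how such a settings file is typically loaded; the actual loading code inside verb_cluster may differ:

    from dynaconf import Dynaconf

    # Sketch only: read settings.toml plus the local override file.
    # environments=True enables the [default] / [<env_var>] sections.
    settings = Dynaconf(
        settings_files=["settings.toml", "settings.local.toml"],
        environments=True,
    )

    print(settings.path.data_path)        # '/path/to/the/csv/train'
    print(settings.path.checkpoint_path)  # '/path/to/store/the/model'
    print(settings.path.logging_dir)      # '/path/to/tensorboard/logging_dir'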

The file at data_path should be a CSV file separated by | (pipe), with the following columns:

  • sentence: the original sentence to replace
  • verb: the lemmatised verb
  • verbSeedStart: the starting position of the verb
  • verbSeedEnd: the ending position of the verb

Please note that verbSeedStart and verbSeedEnd should denote the verb as it actually appears in the sentence, not the lemmatised form.
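
For illustration only, a single row of such a file could look like the following; whether verbSeedEnd is inclusive is not documented here, so the values below assume character offsets with an exclusive end:

    sentence|verb|verbSeedStart|verbSeedEnd
    The committee approved the proposal.|approve|14|22

Here verbSeedStart and verbSeedEnd point at the surface form "approved" in the sentence, while the verb column holds the lemma "approve".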

To run the training, execute the following line:

python -m verb_cluster.flows.train_flow

Once the training is done, you can find the saved models in checkpoint_path. The code stores the top 3 models with the highest entailment score; you can choose which one you want to use for inference.
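
If you want to see what was saved, you can simply list the contents of checkpoint_path; the file names and formats depend on the training code, so this is only a sketch:

    from pathlib import Path

    # Replace with the checkpoint_path from your settings.local.toml.
    checkpoint_dir = Path("/path/to/store/the/model")
    for ckpt in sorted(checkpoint_dir.iterdir()):
        print(ckpt.name)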

The logging_dir stores the logged data, including the validation loss and entailment percentage. You can keep an eye on the training by using TensorBoard as follows:

tensorboard --logdir=<logging_dir>

If you want to keep multiple configurations at the same time, you can add new sections to the settings.local.toml file with different <env_var> names as follows:

[<env_var>.path]
    data_path='/path/to/the/csv/train'
    checkpoint_path='/path/to/store/the/model'
    logging_dir='/path/to/tensorboard/logging_dir'

and run the training code as follows:

ENV_FOR_DYNACONF=<env_var> python -m verb_cluster.flows.train_flow
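
For example, with a hypothetical environment named experiment1 (the paths below are placeholders), settings.local.toml would contain:

[experiment1.path]
    data_path='/path/to/experiment1/train.csv'
    checkpoint_path='/path/to/experiment1/checkpoints'
    logging_dir='/path/to/experiment1/logs'

and the training would be started with:

ENV_FOR_DYNACONF=experiment1 python -m verb_cluster.flows.train_flow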