HOWTO: Training the Self-Supervised Model
The configuration parameters for training can be found in settings.toml. The code loads the training/inference paths directly from this file. If you want to train on a custom dataset, copy settings.toml to settings.local.toml and specify the following parameters:
[default.path]
data_path='/path/to/the/csv/train'
checkpoint_path='/path/to/store/the/model'
logging_dir='/path/to/tensorboard/logging_dir'
The data path used for training should point to a CSV file separated by |, with the following columns:
- sentence: the original sentence to replace
- verb: the lemmatised verb
- verbSeedStart: the starting position of the verb
- verbSeedEnd: the ending position of the verb
Please note that verbSeedStart and verbSeedEnd should denote the actual usage of the verb in the sentence, not the lemmatised verb.
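As an illustration of the expected format, here is a minimal sketch that parses one |-separated row and recovers the verb span. The example row is invented, and it assumes verbSeedStart/verbSeedEnd are 0-based character offsets with an exclusive end; verify both assumptions against your own data.

```python
import csv
import io

# Invented example row in the column layout described above.
# Note the verb column holds the lemma ("run"), while the offsets
# point at the inflected form in the sentence ("running").
raw = (
    "sentence|verb|verbSeedStart|verbSeedEnd\n"
    "She was running to the station.|run|8|15\n"
)

for row in csv.DictReader(io.StringIO(raw), delimiter="|"):
    start, end = int(row["verbSeedStart"]), int(row["verbSeedEnd"])
    span = row["sentence"][start:end]
    print(span)  # → running
```

A quick check like this is an easy way to catch off-by-one offsets before launching a training run.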
To run the training, execute the following line:
python -m verb_cluster.flows.train_flow
Once the training is done, you can find the saved models in checkpoint_path. The code stores the top 3 models with the highest entailment score; you can choose which one to use for inference.
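This HOWTO does not specify how the top-3 checkpoints are named, so here is a minimal, hypothetical sketch of browsing checkpoint_path to pick one. The directory and filenames below are invented stand-ins, and sorting by modification time is just one convenient way to see which checkpoint was written last.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical stand-in for the checkpoint_path configured in settings.local.toml.
checkpoint_path = Path(tempfile.mkdtemp())

# Simulate three saved checkpoints; the filenames are invented for illustration.
for i, name in enumerate(["model_a.ckpt", "model_b.ckpt", "model_c.ckpt"]):
    f = checkpoint_path / name
    f.touch()
    os.utime(f, (1000 + i, 1000 + i))  # give each file a distinct mtime

# Most recently written checkpoint first.
ckpts = sorted(checkpoint_path.glob("*.ckpt"), key=lambda p: p.stat().st_mtime, reverse=True)
print([p.name for p in ckpts])  # → ['model_c.ckpt', 'model_b.ckpt', 'model_a.ckpt']
```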
The logging_dir stores the logged data, including the validation loss and the entailment percentage. You can keep an eye on the training by using TensorBoard as follows:
tensorboard --logdir=<logging_dir>
If you want to store multiple different configs at the same time, you can add new sections in the settings.local.toml file with a different <env_var> as follows:
[<env_var>.path]
data_path='/path/to/the/csv/train'
checkpoint_path='/path/to/store/the/model'
logging_dir='/path/to/tensorboard/logging_dir'
and run the training code as follows:
ENV_FOR_DYNACONF=<env_var> python -m verb_cluster.flows.train_flow
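For example, a settings.local.toml with both a default section and a custom environment might look like this (the environment name experiment_a and the paths are invented for illustration):

```toml
[default.path]
data_path='/path/to/the/csv/train'
checkpoint_path='/path/to/store/the/model'
logging_dir='/path/to/tensorboard/logging_dir'

[experiment_a.path]
data_path='/path/to/another/csv/train'
checkpoint_path='/path/to/store/another/model'
logging_dir='/path/to/another/logging_dir'
```

You would then select the custom environment with ENV_FOR_DYNACONF=experiment_a when launching the training command; with the variable unset, the default section is used.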