English vs French validation losses (French training losses in the zoom-in, early loss values cropped for readability).
Install requirements using pip install -r requirements.txt.
NOTE: On Windows, this might install torch without CUDA support. If this is the case, first install PyTorch with CUDA by following the instructions on the official website, then run pip install -r requirements.txt.
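To check whether the installed torch build can actually see a GPU, a quick sanity check along these lines may help. This is a minimal sketch and not part of the repository; the function name cuda_status is hypothetical.

```python
import importlib.util


def cuda_status():
    """Return a short, human-readable description of the local torch/CUDA setup."""
    # Avoid an ImportError if the requirements have not been installed yet.
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch

    if torch.cuda.is_available():
        return f"CUDA available (torch {torch.__version__}, CUDA {torch.version.cuda})"
    # Typical symptom of a CPU-only wheel, or of a missing/incompatible GPU driver.
    return f"torch {torch.__version__} is installed, but CUDA is not available"


print(cuda_status())
```

If the last message is printed on a machine that does have an NVIDIA GPU, reinstalling PyTorch with CUDA as described in the note above is the likely fix.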
Read the following section to learn how to reproduce experiments.
We recommend always using the latest commit on the 'main' branch, as it will always be the cleanest and most up-to-date branch. Tags are present on the git tree to restore the project to previous versions.
Main branch, may be regularly updated to increase code readability/usability. Use this branch to run experiments similar to the natural language ones in the paper. Contains only a few example .json files for training; to get the exact specifications of previous experiments, see the tags below.
This tag restores the codebase to a snapshot as it was for the first submission of the paper. Uses an earlier version of torchenhanced (a custom library used for training, akin to PyTorch Lightning). Contains the .json files for all the natural language experiments.
This tag restores the codebase to a snapshot as it was for the submission of the rebuttal, during the paper review. Contains slightly updated code, as well as additional .json files for experiments on Arabic, Hebrew and Tagalog, as well as reversed French.
The script tokenize_to_h5.py can be used to prepare a dataset for training. Given a .txt file, it trains a BPE tokenizer on it, uses that tokenizer to tokenize the text, and saves the tokenized dataset in .h5 format.
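For readers unfamiliar with BPE (byte-pair encoding), the toy sketch below illustrates the core idea: repeatedly merge the most frequent pair of adjacent symbols. It is only an illustration and is not the repository's tokenizer training code; the function names and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, or None if there is none."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    if not pairs:
        return None
    # Ties are broken by the pair itself so the result is deterministic.
    return max(pairs, key=lambda p: (pairs[p], p))


def merge_pair(words, pair):
    """Replace every occurrence of `pair` by a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules from whitespace-separated text."""
    words = dict(Counter(tuple(w) for w in text.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges


print(train_bpe("low low low lower lowest", 2))
```

Production tokenizers add many details (byte-level alphabets, special tokens, efficient pair counting) that this sketch leaves out.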
CC100 datasets can be downloaded here.
To use tokenize_to_h5.py, first put a standalone .txt file inside a folder. Then run tokenize_to_h5.py with the following arguments:
usage: tokenize_to_h5.py [-h] --txt_path TXT_PATH
Script for preparing a .txt cc-100 dataset for training. Creates the
custom tokenizer, and tokenizes the text with it to generate the .h5
file for training.
To make one of those things independently (e.g., only make the custom
tokenizer), see modules/tok_utils.
options:
-h, --help show this help message and exit
--txt_path TXT_PATH, -t TXT_PATH
The input file to be tokenized. This script will save the following
items:
1) given the path of a source plain text file, a folder of the same
name as the containing folder of txt_path, with '_h5' appended at the
end, as well as raw Pytorch tensors. Example:
-t my_dataset/input.txt -> my_dataset_h5/input.h5
my_dataset_pt/input_tokenized.pt
2) a tokenizer in modules/tokenizers called after the folder containing
the txt dataset. Example:
-t code_dataset/input.txt -> modules/tokenizers/code_dataset_tokenizer/
Then run the script.
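The naming convention described in the help text can be summarized with a small helper. This is a sketch that only mirrors the two examples given above; expected_outputs is a hypothetical name, and the actual script may produce additional or differently named files (for instance when a large input is split into subfiles).

```python
from pathlib import PurePosixPath


def expected_outputs(txt_path):
    """Predict output locations from the help text's examples (illustrative only)."""
    p = PurePosixPath(txt_path)
    folder = p.parent  # the folder containing the .txt file, e.g. my_dataset
    if not folder.name:
        # The instructions above require the .txt file to live inside a folder.
        raise ValueError("txt_path must be located inside a folder")
    return {
        "h5": str(folder.with_name(folder.name + "_h5") / (p.stem + ".h5")),
        "pt": str(folder.with_name(folder.name + "_pt") / (p.stem + "_tokenized.pt")),
        "tokenizer": f"modules/tokenizers/{folder.name}_tokenizer",
    }


for kind, path in expected_outputs("my_dataset/input.txt").items():
    print(kind, "->", path)
```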
NOTE: Tokenization of large .txt files (>100GB) might take a while (1-2 days). This script is NOT designed to pick up where it left off if it crashes. For bigger datasets, consider writing your own script (include from modules.tok_utils import *) that runs the following steps in order:
create_tokenizer(txt_path, tokenizer_folder, tokenizer_name): Trains the BPE tokenizer on the given .txt file and saves it in <tokenizer_folder>/<tokenizer_name>.
tokenize_folder(os.path.dirname(txt_path), os.path.join(tokenizer_folder, tokenizer_name)): Tokenizes the text file, splitting it into subfiles if necessary for memory reasons. Saves the tokenized tensors as .pt files. If it crashes midway, it can be restarted and will pick up from the last checkpoint.
make_h5(os.path.dirname(txt_path)+'_pt', dataset_name, destination_folder, toki): Converts a folder containing .pt files into a single .h5 dataset, ready for training. toki is an AutoTokenizer instance, used only for visualization of the process.
For more information on these functions, see the docstrings in modules/tok_utils.
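The three steps can be chained in a short script, sketched below. It assumes it is run from the repository root so that the modules package is importable, and it assumes that the object returned by tokenizer.get_tokenizer is acceptable as the toki argument of make_h5; check the docstrings before relying on either. The names pipeline_paths and run_pipeline are hypothetical.

```python
import os


def pipeline_paths(txt_path, tokenizer_folder, tokenizer_name):
    """Derive the folder arguments used by the three steps listed above."""
    txt_folder = os.path.dirname(txt_path)
    return {
        "txt_folder": txt_folder,
        "tokenizer_path": os.path.join(tokenizer_folder, tokenizer_name),
        "pt_folder": txt_folder + "_pt",
    }


def run_pipeline(txt_path, tokenizer_folder, tokenizer_name, dataset_name, destination_folder):
    """Run the three steps in order. Must be launched from the repository root."""
    # Imported lazily so that this file can be loaded without the repository.
    from modules.tok_utils import create_tokenizer, tokenize_folder, make_h5
    from modules import tokenizer

    paths = pipeline_paths(txt_path, tokenizer_folder, tokenizer_name)

    # Step 1: train and save the BPE tokenizer.
    create_tokenizer(txt_path, tokenizer_folder, tokenizer_name)
    # Step 2: tokenize to .pt files; per the description above, this step can be restarted.
    tokenize_folder(paths["txt_folder"], paths["tokenizer_path"])
    # Step 3: pack the .pt files into a single .h5 dataset.
    # Assumption: the loaded tokenizer is a suitable value for `toki`.
    toki = tokenizer.get_tokenizer(m_path=paths["tokenizer_path"])
    make_h5(paths["pt_folder"], dataset_name, destination_folder, toki)


print(pipeline_paths("my_dataset/input.txt", "modules/tokenizers", "my_dataset_tokenizer"))
```

Because only the second step is described as restartable, a crash during tokenization can be handled by commenting out the first call and re-running the script.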
The tokenizer class we use throughout the project is defined in modules/tokenizer.py. It is a wrapper on top of the Hugging Face tokenizer.
Here is all you need to know to use the tokenizers:
from modules import tokenizer
toki = tokenizer.get_tokenizer(m_path='modules/tokenizers/en_tokenizer') # Load a saved tokenizer by specifying saved folder
# A saved tokenizer is created by using create_tokenizer in modules/tok_utils/create_custom_tokenizer.py
tokenized = toki.tokenize("Hello, world!") # Tokenize a string
print(tokenized) # Get a tensor of ints shape [1, seq_len]
print(toki.detokenize(tokenized)) # Detokenize a tensor of ints, prints "Hello, world!"
Note: the scripts in modules/tok_utils/ can, to some degree, be run independently, provided the path to the module folder is added to the PYTHONPATH: PYTHONPATH="/path/to/modules:$PYTHONPATH" python modules/tok_utils/pt_to_h5.py --help.
For training, 4 scripts are provided. All are designed to train models on the dataset generated with the above method.
train_gpt.py: Trains a GPT model on a single GPU.
train_gru.py: Trains a GRU model.
train_lstm.py: Trains an LSTM model.
train_parallel.py: Trains a GPT model on multiple GPUs, using torch.nn.DataParallel.
For all 4 scripts, usage is as follows:
usage: train_xxx.py [-h] [-d DEVICE] [-t TOKENIZER_PATH] [-p PROJECT_NAME] [-r RUN_NAME] [-s] file_location
Starts training of Predictor model given a JSON config file.
positional arguments:
file_location Path to the JSON config file. Relative to where you launch the script from.
options:
-h, --help show this help message and exit
-d DEVICE, --device DEVICE
Device string, e.g. 'cuda:0' or 'cpu'. For parallel, list of devices.
-t TOKENIZER_PATH, --tokenizer_path TOKENIZER_PATH
Path for the tokenizer to use (only used for logging snippets). Relative to the script folder.
-p PROJECT_NAME, --project_name PROJECT_NAME
Name of the project to log to.
-r RUN_NAME, --run_name RUN_NAME
Name of the run. Defaults to the '.json' filename
-s, --no_step_pickup
If set, train steps_to_train steps more. Otherwise, will train UP TO steps_to_train TOTAL steps.
Example :
python train_script.py path/to/config.json -d cuda:0 -t path/to/tokenizer -p MyTrainingProject -s
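The effect of the -s / --no_step_pickup flag when resuming a run can be illustrated as follows. This is a sketch of the behaviour as worded in the help text, not code taken from the training scripts; the function name steps_remaining is hypothetical.

```python
def steps_remaining(steps_done, steps_to_train, no_step_pickup):
    """Number of gradient steps a resumed run would still perform."""
    if no_step_pickup:
        # -s set: train steps_to_train MORE steps, regardless of past progress.
        return steps_to_train
    # Default: train UP TO steps_to_train steps in total.
    return max(steps_to_train - steps_done, 0)


# A run resumed from a checkpoint saved at step 3000, with steps_to_train = 10000:
print(steps_remaining(3000, 10000, no_step_pickup=False))  # 7000
print(steps_remaining(3000, 10000, no_step_pickup=True))   # 10000
```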
To run a training script, provide it with the path to a JSON config file. The format depends slightly on whether you are training a GPT, GRU or LSTM model. In a nutshell, the config file contains all the hyperparameters needed for a training run.
Here is a description of each entry:
{
    "model_params": {  # Model parameters
        "vocab_size": 50257,  # Vocabulary size
        "n_layers": 12,  # Number of Transformer blocks
        "n_heads": 12,  # Number of attention heads
        "embed_dim": 768,  # Number of hidden/embedding dimensions
        "attn_length": 256,  # Attention length
        "mlp_ratio": 4.0,  # MLP ratio
        "dropout": 0.1,  # Dropout inside transformer blocks
        "embd_dropout": null  # Dropout for the token embeddings. Defaults to 0.
    },
    "training_params": {
        "dataset_folder": "english/english.h5",  # Location of the .h5 dataset to train on
        "batch_size": 180,  # Batch size
        "aggregate": 1,  # Number of times to aggregate gradients before a gradient step (effective batch_size = aggregate*batch_size)
        "backwards": false,  # Whether to train in the backwards direction
        "steps_to_train": null,  # Number of gradient steps to train. Defaults to one epoch of the dataset.
        "save_every": 3000,  # Number of steps between each save of the training state
        "backup_every": 15000,  # Number of steps between each backup of the training state
        "step_log": 400,  # Number of steps between each log of the training loss in wandb
        "valid_steps": 1000,  # Number of batches seen during one validation
        "save_loc": "datavol/vassilis/runs",  # Folder in which to save the training state
        "fast_scrambling": true  # If true, speeds up dataset scrambling, at the cost of truncating the dataset if it has more than 3 billion examples (otherwise, RAM usage > 100 GB)
    },
    "optim_params": {
        "lr": 0.0001,  # Base learning rate
        "warmup_steps": 4000,  # Number of batches until the learning rate warms up
        "oscil_steps": 300000,  # Number of steps between warm restarts
        "lr_shrink": 0.85,  # Shrinking factor of lr between warm restarts
        "lr_init": 1e-07,  # Initial learning rate, for warmup
        "lr_min": 1e-06  # Minimum learning rate reached during cosine annealing
    }
}
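To make the training and optimizer entries more concrete, the sketch below parses a comment-free excerpt of such a config, computes the effective batch size, and evaluates one plausible reading of the learning rate schedule. The exact schedule used by the training scripts is not specified here, so the linear warmup, the restart counting that starts after warmup, and the function name lr_at are all assumptions. Note that standard JSON parsers reject # comments, so the annotations above are presumably for explanation only.

```python
import json
import math

# Comment-free excerpt of the config described above.
CONFIG_TEXT = """
{
  "training_params": {"batch_size": 180, "aggregate": 1},
  "optim_params": {"lr": 0.0001, "warmup_steps": 4000, "oscil_steps": 300000,
                   "lr_shrink": 0.85, "lr_init": 1e-07, "lr_min": 1e-06}
}
"""


def lr_at(step, lr, warmup_steps, oscil_steps, lr_shrink, lr_init, lr_min):
    """Learning rate at a given step, under the assumptions stated in the lead-in."""
    if step < warmup_steps:
        # Assumed linear warmup from lr_init to lr.
        return lr_init + (lr - lr_init) * step / warmup_steps
    cycle, pos = divmod(step - warmup_steps, oscil_steps)
    # Each warm restart starts from a peak shrunk by lr_shrink, never below lr_min.
    peak = max(lr * lr_shrink ** cycle, lr_min)
    # Cosine annealing from the current peak towards lr_min within one cycle.
    return lr_min + 0.5 * (peak - lr_min) * (1 + math.cos(math.pi * pos / oscil_steps))


config = json.loads(CONFIG_TEXT)
train = config["training_params"]
effective_batch_size = train["aggregate"] * train["batch_size"]
print("effective batch size:", effective_batch_size)

for step in (0, 4000, 154000, 304000):
    print(step, lr_at(step, **config["optim_params"]))
```

Under these assumptions the rate starts at lr_init, reaches lr after warmup_steps, decays towards lr_min over oscil_steps, and restarts at 0.85 times the previous peak.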