Update: this paper has been accepted at AAAI 2020.
LaNMT implements a latent-variable framework for non-autoregressive neural machine translation. As you can guess from the code, it has a simple architecture but powerful performance. For the details of this model, you can check our paper on arXiv: https://arxiv.org/abs/1908.07181 . To cite the paper:
@article{Shu2020LaNMT,
title={Latent-Variable Non-Autoregressive Neural Machine Translation with Deterministic Inference Using a Delta Posterior},
author={Raphael Shu and Jason Lee and Hideki Nakayama and Kyunghyun Cho},
journal={AAAI},
year={2020}
}
In conventional neural machine translation models, the decoder side is a language model. That means the model generates a single word in each time step. So you have to run the neural network N times in order to get a translation of N words. See the illustration below:
Such models can't fully exploit the parallelism of GPUs, as you have to wait for the preceding words to be generated before finding the next word. In contrast, non-autoregressive models generate all target words in just one run of neural computation. As all target tokens are predicted simultaneously, translation can be much faster.
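The difference in the number of sequential network calls can be sketched with a toy example. The functions `toy_step` and `toy_parallel` below are hypothetical stand-ins for a trained network, not part of this repository; the point is only the control flow.

```python
from typing import Callable, List, Tuple

def autoregressive_decode(step_fn: Callable[[List[str]], str],
                          max_len: int) -> Tuple[List[str], int]:
    """Generate tokens one at a time; every token needs its own network call."""
    tokens: List[str] = []
    calls = 0
    for _ in range(max_len):
        next_token = step_fn(tokens)  # depends on all previously generated tokens
        calls += 1
        if next_token == "</s>":
            break
        tokens.append(next_token)
    return tokens, calls

def non_autoregressive_decode(parallel_fn: Callable[[int], List[str]],
                              length: int) -> Tuple[List[str], int]:
    """Predict every target position in a single network call."""
    return parallel_fn(length), 1

# Toy stand-ins for a trained network (hypothetical, for illustration only).
REFERENCE = ["das", "ist", "ein", "Test"]

def toy_step(prefix: List[str]) -> str:
    return REFERENCE[len(prefix)] if len(prefix) < len(REFERENCE) else "</s>"

def toy_parallel(length: int) -> List[str]:
    return REFERENCE[:length]

ar_tokens, ar_calls = autoregressive_decode(toy_step, max_len=10)
nar_tokens, nar_calls = non_autoregressive_decode(toy_parallel, length=4)
print(ar_tokens, ar_calls)    # 4 tokens, 5 calls (one extra call emits the end token)
print(nar_tokens, nar_calls)  # 4 tokens, 1 call
```

In the autoregressive loop the calls cannot be run in parallel, because each one consumes the output of the previous one.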
We learn a set of continuous latent variables z to capture the information and dependencies among the target tokens. Intuitively, if the model is perfectly trained and the target sequence can be fully reconstructed from the latent variables without error, then the translation problem becomes a problem of finding an adequate z. This is illustrated in the picture below, which shows the relations among x, y and z.
In practice, we force the latent variables to have very low dimensions, such as 8. Handling things in a low-dimensional continuous space is easier than in a high-dimensional discrete space.
Our model is trained by maximizing the following objective, which is a lower bound of the log-likelihood. We call it the evidence lower bound (ELBO). The first part is a reconstruction loss that makes sure you can predict the target sequence from z. The second part is a KL divergence, which makes z more predictable given the source sequence.
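For reference, a generic conditional ELBO with these two parts can be written as below. The notation is an assumption made here (x is the source, y the target, z the latent variables, q the approximate posterior and p(z | x) the prior); the full objective in the paper may contain additional terms, such as one for length prediction.

```latex
\log p(y \mid x) \;\ge\;
\underbrace{\mathbb{E}_{z \sim q(z \mid x, y)}\big[\log p(y \mid x, z)\big]}_{\text{reconstruction}}
\;-\;
\underbrace{\mathrm{KL}\big(q(z \mid x, y)\,\|\,p(z \mid x)\big)}_{\text{KL divergence}}
```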
Now for the parameterization, the model is implemented with the architecture in the picture below. Does it appear to be more complicated than a standard Transformer? Well, you are now computing four probabilities instead of only p(y|x). However, as the model is basically reusing Transformer modules such as self-attention and cross-attention, it's still pretty easy to implement.
One special thing about this model is that the number of latent variables always equals the number of source tokens, as you can guess from the second figure in this post. As each z_i is a continuous vector, z is an L_x × D matrix, where L_x is the length of the source sequence and D is the dimension of the latent variables. For the Transformer decoder to predict target tokens whose length is longer or shorter than L_x, we need a function to adjust the length of the latent variables, just like this:
As a result, the length-adjusted z will be an L_y × D matrix, where L_y is the target length. The implementation of this length-transforming function can be found in lib_lanmt_modules.py (class LengthConverter).
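As a rough illustration of what such a length-transforming function can look like, here is a minimal pure-Python sketch that resamples a source-length latent matrix to a target length with position-based softmax weights. It is a sketch under stated assumptions and not the repository's LengthConverter: the function name `convert_length` and the fixed bandwidth `sigma` are hypothetical choices made for this example.

```python
import math
from typing import List

def convert_length(z: List[List[float]], target_len: int,
                   sigma: float = 1.0) -> List[List[float]]:
    """Resample a (source_len x D) latent matrix to (target_len x D).

    Each output row is a softmax-weighted average of the input rows, with
    weights that peak at the proportionally aligned source position.
    """
    source_len = len(z)
    dim = len(z[0])
    scale = source_len / target_len
    out = []
    for j in range(target_len):
        center = scale * j  # source position aligned with target position j
        scores = [-((k - center) ** 2) / (2.0 * sigma ** 2) for k in range(source_len)]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * z[k][d] for k, w in enumerate(weights)) for d in range(dim)])
    return out

z = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]    # 3 source positions, D = 2
z_long = convert_length(z, target_len=5)    # stretch to 5 positions
z_short = convert_length(z, target_len=2)   # shrink to 2 positions
print(len(z_long), len(z_long[0]))
print(len(z_short), len(z_short[0]))
```

Because every output row is a convex combination of the input rows, the result keeps the latent dimension D and only changes the number of rows.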
The code depends on PyTorch, torchtext for data loading, nmtlab for Transformer modules and horovod for multi-GPU training.
Note that although you can train the model on a single GPU, training on a large dataset such as WMT14 takes a lot of time without multi-GPU support. We recommend using 4 to 8 GPUs for this task.
We recommend installing with conda.
-1. (If you don't have conda) Download and install Miniconda for Python 3
mkdir -p ~/apps; cd ~/apps
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
Reload your bash/zsh shell and run python to check that it is using the Python in Miniconda.
-2. Install PyTorch following https://pytorch.org/get-started/locally/
-3. (Only for multi-GPU training) Install horovod following https://github.com/horovod/horovod#install . The commands below first build Open MPI 4.0.1 from source:
mkdir -p ~/apps; cd ~/apps
wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.1.tar.gz
tar xzvf openmpi-4.0.1.tar.gz
cd openmpi-4.0.1
# Suppose you have Miniconda3 in your home directory
./configure --prefix=$HOME/miniconda3 --disable-mca-dso
make -j 8
make install
Check whether Open MPI is correctly installed by running mpirun. Then install horovod with:
conda install -y gxx_linux-64
# If you don't have NCCL
pip install horovod
# If you have NCCL in /usr/local/nccl
HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_NCCL_HOME=/usr/local/nccl pip install horovod
Check horovod by running horovodrun.
-4. Run pip install torchtext nmtlab
-5. Clone this GitHub repository:
cd ~/
git clone https://github.com/zomux/lanmt
cd lanmt
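After installation, a quick way to check that the dependencies are visible to the current Python interpreter is a small script like the one below. This is a convenience sketch and not part of the repository; it assumes the packages are importable under the top-level names torch, torchtext, nmtlab and horovod.

```python
import importlib.util

REQUIRED = ["torch", "torchtext", "nmtlab"]
OPTIONAL = ["horovod"]  # only needed for multi-GPU training

def missing_packages(names):
    """Return the top-level package names that cannot be found by the interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing_required = missing_packages(REQUIRED)
missing_optional = missing_packages(OPTIONAL)
print("missing required packages:", missing_required)
print("missing optional packages:", missing_optional)
```

An empty list for the required packages means that the imports should at least resolve; it does not verify versions or GPU support.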
We pre-processed the WMT14 dataset with sentencepiece and set the vocabulary size to 32k for both the source and target sides. For knowledge distillation, we use a baseline Transformer model to generate translations for the whole dataset. To save time, you can just download the pre-processed dataset from our link.
-1. Create the mydata folder if it's not there
mkdir mydata
cd mydata
-2. Download the pre-processed WMT14 dataset from https://drive.google.com/file/d/16w3ZmxbiRzRG8vtBh26am-GUldHYYvLv/view . After downloading, uncompress the dataset inside the mydata folder.
./gdown.pl https://drive.google.com/file/d/16w3ZmxbiRzRG8vtBh26am-GUldHYYvLv/view lanmt_wmt14.tgz
tar xzvf lanmt_wmt14.tgz
-3. (Optional) Download the pre-processed ASPEC Ja-En dataset. Due to copyright restrictions, we only provide the test set and the extracted vocabularies.
./gdown.pl https://drive.google.com/file/d/1PhjJS1-NycqbW-LRSLiAZLVvZ00Xh5GW/view lanmt_aspec.tgz
tar xzvf lanmt_aspec.tgz
-4. Download teacher Transformer models (735MB) for rescoring candidate translations when performing latent search.
./gdown.pl https://drive.google.com/file/d/1xB81cmSQ7l66zZjWPEBhoc4nzjgFSWZW/view lanmt_teacher_models.tgz
tar xzvf lanmt_teacher_models.tgz
Here, we start training the non-autoregressive model. If you are short on time and just want to play with a pre-trained model, jump to https://github.com/zomux/lanmt#use-our-pre-trained-models .
-1. Go back to the lanmt folder
-2. (Single GPU) Run this command:
# If you have 16GB GPU memory
python run.py --opt_dtok wmt14_ende --opt_batchtokens 4092 --opt_distill --opt_annealbudget --train
# If you have 32GB GPU memory
python run.py --opt_dtok wmt14_ende --opt_batchtokens 8192 --opt_distill --opt_annealbudget --train
-2. (Multi-GPU) Run this command if you have 8 GPUs:
# If you have 16GB GPU memory
horovodrun -np 8 -H localhost:8 python run.py --opt_dtok wmt14_ende --opt_batchtokens 4096 \
--opt_distill --opt_annealbudget --train
# If you have 32GB GPU memory
horovodrun -np 8 -H localhost:8 python run.py --opt_dtok wmt14_ende --opt_batchtokens 8192 \
--opt_distill --opt_annealbudget --train
There are some options you can use for training the model:
- --opt_batchtokens specifies the number of tokens in a batch
- --opt_distill enables knowledge distillation, which means the model is trained to predict the output of a teacher Transformer
- --opt_annealbudget enables annealing of the budget of the KL divergence
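To make the idea of a KL budget more concrete, here is a minimal sketch of one way such a budget can work. Both the schedule (hold the budget for the first half of training, then decay linearly to zero) and the per-token floor max(KL, budget) are assumptions made for illustration; the function names are hypothetical, and the repository may implement --opt_annealbudget differently.

```python
def kl_budget(step, max_steps, initial_budget=1.0):
    """Assumed schedule: constant for the first half of training, then linear decay to 0."""
    half = max_steps / 2.0
    if step <= half:
        return initial_budget
    rate = (step - half) / half
    return max(0.0, initial_budget * (1.0 - rate))

def budgeted_kl(kl_per_token, budget):
    """Assumed floor: a token's KL only contributes gradient once it exceeds the budget."""
    return [max(kl, budget) for kl in kl_per_token]

print(kl_budget(0, 100), kl_budget(75, 100), kl_budget(100, 100))
print(budgeted_kl([0.2, 1.7], budget=1.0))
```

With a floor like this, the model is not rewarded for pushing a token's KL below the budget, so the latent variables keep carrying information early in training; annealing the budget then gradually tightens the constraint.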
In our experiments, we train the model with 8 GPUs, putting 8192 tokens in each batch. If the script is successfully launched, you will see outputs like this:
[nmtlab] Running with 8 GPUs (Tesla V100-SXM2-32GB)
[valid] len_loss=2.77 len_acc=0.12 loss=194.92 word_acc=0.16 KL_budget=1.00 kl=27.87 tok_kl=1.00 nll=164.28 * (epoch 1, step 471)
...
[valid] len_loss=1.57 len_acc=0.40 loss=69.53 word_acc=0.66 KL_budget=1.00 kl=28.41 tok_kl=1.02 nll=39.55 * (epoch 1, step 3761)
[nmtlab] Ending epoch 1, spent 53 minutes
...
In the training log, loss shows the total loss value, nll shows the cross-entropy value, kl shows the KL divergence, tok_kl shows the average KL value per token, and len_loss and len_acc show the loss and prediction accuracy of the length predictor.
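If you want to track these values programmatically, a validation line can be parsed with a few lines of Python. The helper `parse_valid_line` is a hypothetical name for this sketch. In the sample lines shown above, the numbers are consistent with loss = nll + kl + len_loss up to rounding; this is an observation about the sample log rather than a statement about how the code computes the loss.

```python
import re

def parse_valid_line(line):
    """Extract key=value pairs from a [valid] log line into a dict of floats."""
    pairs = re.findall(r"(\w+)=(-?\d+(?:\.\d+)?)", line)
    return {key: float(value) for key, value in pairs}

sample = ("[valid] len_loss=2.77 len_acc=0.12 loss=194.92 word_acc=0.16 "
          "KL_budget=1.00 kl=27.87 tok_kl=1.00 nll=164.28 * (epoch 1, step 471)")
metrics = parse_valid_line(sample)

# Compare the reported total with the sum of its apparent components.
residual = metrics["loss"] - (metrics["nll"] + metrics["kl"] + metrics["len_loss"])
print(metrics)
print(abs(residual) < 0.02)
```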
After the model training finishes, we also find it helpful to fix the KL budget at zero and finetune the model for only one epoch. You can do this by running:
# Single GPU
python run.py --opt_dtok wmt14_ende --opt_batchtokens 4092 --opt_distill --opt_annealbudget \
--opt_finetune --train
# Multi-GPU
horovodrun -np 8 -H localhost:8 python run.py --opt_dtok wmt14_ende --opt_batchtokens 4096 \
--opt_distill --opt_annealbudget --opt_finetune --train
To generate translations and measure the decoding time, simply run
python run.py --opt_dtok wmt14_ende --opt_batchtokens 8192 --opt_distill --opt_annealbudget \
--opt_finetune --test --evaluate
You will see the decoding time and the evaluated BLEU score at the end of the output. Next, let's refine the latent variables with deterministic inference for just one step:
python run.py --opt_dtok wmt14_ende --opt_batchtokens 8192 --opt_distill --opt_annealbudget \
--opt_finetune --opt_Trefine_steps 1 --test --evaluate
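The control flow behind a refinement step can be sketched as an alternation between decoding and re-estimating the latent variables. In the sketch below, `decode` and `posterior_mean` are hypothetical callables standing in for the model's decoder and for the mean of its approximate posterior (the source sentence is assumed to be captured inside them); the toy functions are for illustration only and are not the repository's implementation.

```python
import math
from typing import Callable, List, Sequence

Latent = List[float]

def refine_translation(z_init: Latent,
                       decode: Callable[[Latent], Sequence[str]],
                       posterior_mean: Callable[[Sequence[str]], Latent],
                       steps: int = 1) -> Sequence[str]:
    """Alternate between decoding a translation and re-estimating the latents."""
    z = z_init
    y = decode(z)              # initial translation from the starting latents
    for _ in range(steps):
        z = posterior_mean(y)  # deterministic update, no sampling involved
        y = decode(z)          # decode again from the refined latents
    return y

# Toy stand-ins (hypothetical): latents drift halfway toward fixed values.
ATTRACTOR = [1.0, 2.0]

def toy_decode(z: Latent) -> List[str]:
    return ["tok%d" % math.floor(v + 0.5) for v in z]

def toy_posterior_mean(y: Sequence[str]) -> Latent:
    return [(int(tok[3:]) + a) / 2.0 for tok, a in zip(y, ATTRACTOR)]

for steps in (0, 1, 2):
    print(steps, refine_translation([0.0, 0.0], toy_decode, toy_posterior_mean, steps))
```

Each additional step costs one more decoder pass, which matches the trade-off between decoding time and quality that --opt_Trefine_steps exposes.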
We can also sample multiple latent variables from the prior to get multiple candidate translations, and then use an autoregressive Transformer model to rescore them. You can do this by running:
python run.py --opt_dtok wmt14_ende --opt_batchtokens 8192 --opt_distill --opt_annealbudget \
--opt_finetune --opt_Trefine_steps 1 --opt_Tlatent_search --opt_Tteacher_rescore --test --evaluate
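Sampling candidates and rescoring them can be sketched as follows. This is an illustrative outline only: it assumes a diagonal Gaussian prior given by `prior_mean` and `prior_std`, and the callables `decode` and `teacher_score` are hypothetical stand-ins for the non-autoregressive decoder and the autoregressive teacher's scoring function.

```python
import math
import random

def latent_search(prior_mean, prior_std, decode, teacher_score,
                  num_samples=5, seed=0):
    """Sample latents from a diagonal Gaussian prior, decode each sample,
    and return the candidate that the teacher scores highest."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(num_samples):
        z = [rng.gauss(m, s) for m, s in zip(prior_mean, prior_std)]
        candidates.append(decode(z))
    best = max(candidates, key=teacher_score)
    return best, candidates

# Toy stand-ins (hypothetical): the "teacher" counts matches with a reference.
REFERENCE = ["tok1", "tok2"]

def toy_decode(z):
    return ["tok%d" % math.floor(v + 0.5) for v in z]

def toy_teacher_score(y):
    return sum(1 for a, b in zip(y, REFERENCE) if a == b)

best, candidates = latent_search([1.0, 2.0], [0.5, 0.5], toy_decode, toy_teacher_score)
print(best, len(candidates))
```

The candidates are independent of each other, so in a real system they can be decoded as one batch; only the rescoring step involves the autoregressive model.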
With the --evaluate option, the script will evaluate the BLEU scores with sacrebleu. Once the script finishes, you should see the decoding time and BLEU score like this:
Average decoding time: 89ms, std: 22
BLEU = 25.166677019716257
If you just want to test out the model and check the decoding speed and quality of translations, you can download our pre-trained models. By running the script with these models, you will get exactly the same BLEU scores as we reported in the paper.
-1. Download the pre-trained models (1GB)
cd mydata
./gdown.pl https://drive.google.com/file/d/1DcTHZYuhJhxxh0153qRx6BkBNHDK_f3b/view lanmt_pretrained_models.tgz
tar xzvf lanmt_pretrained_models.tgz
cd ..
-2. Translate using pre-trained models
# Lightning fast translation
python run.py --opt_dtok wmt14_ende --use_pretrain --test --evaluate
# With one refinement step
python run.py --opt_dtok wmt14_ende --use_pretrain --opt_Trefine_steps 1 --test --evaluate
# With latent search and teacher rescoring
python run.py --opt_dtok wmt14_ende --use_pretrain --opt_Trefine_steps 1 --opt_Tlatent_search --opt_Tteacher_rescore --test --evaluate
-3. (Optional) Evaluate the pre-trained model on the ASPEC Ja-En dataset
# Lightning fast translation
python run.py --opt_dtok aspec_jaen --use_pretrain --test --evaluate
# With one refinement step
python run.py --opt_dtok aspec_jaen --use_pretrain --opt_Trefine_steps 1 --test --evaluate
# With latent search and teacher rescoring
python run.py --opt_dtok aspec_jaen --use_pretrain --opt_Trefine_steps 1 --opt_Tlatent_search --opt_Tteacher_rescore --test --evaluate
Dataset | Options | BLEU | Decode Time (avg/std) | Speedup |
---|---|---|---|---|
WMT14 En-De | Our baseline Transformer (beam size=3) | 26.10 | 602ms / 274 | |
WMT14 En-De | --use_pretrain | 22.30 | 18ms / 4 | 33.4x |
WMT14 En-De | --use_pretrain --opt_Trefine_steps 1 | 24.14 | 46ms / 4 | 13.0x |
WMT14 En-De | --use_pretrain --opt_Trefine_steps 1 --opt_Tlatent_search | 25.01 | 67ms / 18 | 8.9x |
WMT14 En-De | --use_pretrain --opt_Trefine_steps 1 --opt_Tlatent_search --opt_Tteacher_rescore | 25.16 | 89ms / 22 | 6.7x |
ASPEC Ja-En | Our baseline Transformer (beam size=3) | 27.15 | 415ms / 159 | |
ASPEC Ja-En | --use_pretrain | 25.28 | 21ms / 4 | 19.7x |
ASPEC Ja-En | --use_pretrain --opt_Trefine_steps 1 | 27.53 | 47ms / 8 | 8.8x |
ASPEC Ja-En | --use_pretrain --opt_Trefine_steps 1 --opt_Tlatent_search | 28.08 | 69ms / 18 | 6.0x |
ASPEC Ja-En | --use_pretrain --opt_Trefine_steps 1 --opt_Tlatent_search --opt_Tteacher_rescore | 28.26 | 93ms / 23 | 4.5x |
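The Speedup column appears to be the baseline's average decoding time divided by the model's average decoding time. The short check below recomputes it from the rounded averages in the table; the results agree with the reported values to within 0.1, and the small differences are presumably due to rounding of the times.

```python
def speedup(baseline_ms, model_ms):
    """Speedup over the baseline as a ratio of average decoding times."""
    return baseline_ms / model_ms

# dataset -> (baseline time in ms, [(model time in ms, speedup reported in the table)])
TABLE = {
    "WMT14 En-De": (602, [(18, 33.4), (46, 13.0), (67, 8.9), (89, 6.7)]),
    "ASPEC Ja-En": (415, [(21, 19.7), (47, 8.8), (69, 6.0), (93, 4.5)]),
}

for dataset, (baseline_ms, rows) in TABLE.items():
    for model_ms, reported in rows:
        print(dataset, model_ms, round(speedup(baseline_ms, model_ms), 2), reported)
```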
- Training is slow
Try installing horovod with NCCL support. Training will be much faster with NCCL for gradient synchronization.
- Support half precision training
- Validation with BLEU criteria
- Update the distillation data with a new baseline model