Machine Translation
Models
Currently we support the following models:
| Model description | SacreBLEU (cased) | Config file | Checkpoint |
|---|---|---|---|
| Transformer-big | 28.0 | transformer-nvgrad.py | link |
| Transformer | 26.4 | transformer-base.py | link |
| ConvS2S | 25.0 | en-de-convs2s-8-gpu.py | link |
| GNMT | 23.0 | en-de-gnmt-like-4GPUs.py | TBD |
These models were trained with a BPE vocabulary for text tokenization, which is available in wmt16.tar.gz. Note that to use a pretrained model you will need the same vocabulary that was used during training. The model and training parameters can be found in the corresponding config file. We measure BLEU scores using SacreBLEU on detokenized output (cased).
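To verify that you have the matching vocabulary, you can load the SentencePiece model shipped in wmt16.tar.gz (a minimal sketch, assuming a recent sentencepiece Python package and the extracted m_common.model file):

```python
import sentencepiece as spm

# m_common.model is extracted from wmt16.tar.gz
sp = spm.SentencePieceProcessor(model_file="m_common.model")
print(sp.vocab_size())  # should be 32768 for these models
```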
Getting started
For the simplest example, using toy data (a string-reversal task), please refer to the toy models.
Next, let's build a small English-German translation model. This model should train in a reasonable amount of time on a single GPU.
Get data
Download (this will take some time):
scripts/get_en_de.sh
This script will download English-German training data from WMT, clean it, and tokenize it using Google's SentencePiece library. By default, we use a vocabulary size of 32,768 for both English and German.
You can also download the pre-processed dataset that we used for training: wmt16.tar.gz.
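For reference, the tokenization step of scripts/get_en_de.sh amounts to training one SentencePiece BPE model shared by English and German. A minimal sketch, assuming the sentencepiece Python package; the input file names are illustrative:

```python
import sentencepiece as spm

# Train one shared BPE model on the cleaned English and German corpora
spm.SentencePieceTrainer.train(
    input="train.clean.en,train.clean.de",  # illustrative file names
    model_prefix="m_common",  # writes m_common.model and m_common.vocab
    vocab_size=32768,
    model_type="bpe",
)
```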
Training
To train a small English-German model:
- change data_root inside en-de-nmt-small.py to the WMT data location (see the sketch after this list)
- adjust num_gpus to train on more than one GPU (if available)
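After these edits, the relevant part of example_configs/text2text/en-de-nmt-small.py would look roughly like this (a sketch; only the two fields mentioned above are shown, and the path is illustrative):

```python
data_root = "/data/wmt16_de_en/"  # illustrative: wherever scripts/get_en_de.sh put the data

base_params = {
    "num_gpus": 1,  # increase to use more GPUs
    # ...
}
```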
Start training:
python run.py --config_file=example_configs/text2text/en-de-nmt-small.py --mode=train_eval
If your GPU does not have enough memory, reduce batch_size_per_gpu. Also, you might want to disable parallel evaluation by using --mode=train.
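For example, to train without the parallel evaluation pass:
python run.py --config_file=example_configs/text2text/en-de-nmt-small.py --mode=train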
Inference
Once training is done (this can take a while on a single GPU), you can run inference:
python run.py --config_file=example_configs/text2text/en-de-nmt-small.py --mode=infer --infer_output_file=raw.txt --num_gpus=1
Note that the model output is tokenized; in our case it consists of BPE segments instead of words. Therefore, the next step is to detokenize:
python tokenizer_wrapper.py --mode=detokenize --model_prefix=.../Data/wmt16_de_en/m_common --decoded_output=result.txt --text_input=raw.txt
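Conceptually, detokenization just asks SentencePiece to merge the BPE pieces back into words. A minimal sketch of the equivalent operation, assuming the sentencepiece Python package; the model path is illustrative, and tokenizer_wrapper.py itself may differ in details:

```python
import sentencepiece as spm

# Load the same model that produced the BPE segments
sp = spm.SentencePieceProcessor(model_file="m_common.model")

with open("raw.txt") as fin, open("result.txt", "w") as fout:
    for line in fin:
        # decode_pieces merges BPE segments back into plain words
        fout.write(sp.decode_pieces(line.split()) + "\n")
```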
Computing BLEU scores
We measure BLEU scores using the SacreBLEU package (see "A Call for Clarity in Reporting BLEU Scores"). Run SacreBLEU on the detokenized data:
cat result.txt | sacrebleu -t wmt14 -l en-de > result.txt.BLEU
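The same score can also be computed from Python via sacrebleu's API (a sketch; the reference file name here is hypothetical, and the command-line form above remains the canonical way):

```python
import sacrebleu

hyps = [line.strip() for line in open("result.txt", encoding="utf-8")]
# Hypothetical file holding the wmt14 en-de reference translations
refs = [line.strip() for line in open("wmt14.ref.de", encoding="utf-8")]

bleu = sacrebleu.corpus_bleu(hyps, [refs])  # takes a list of reference streams
print(bleu.score)
```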
Using pretrained models
All models were trained with a specific version of the tokenizer, so the first step is to copy m_common.model and m_common.vocab to the current folder.
To translate your English text source.txt to German, you should:
1. Tokenize source.txt into source.tok:
python tokenizer_wrapper.py --mode=encode --model_prefix=m_common --text_input=source.txt --tokenized_output=source.tok --vocab_size=32768
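For intuition, --mode=encode is the inverse of the detokenization step shown earlier; roughly (again a sketch, assuming the sentencepiece Python package):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="m_common.model")

with open("source.txt") as fin, open("source.tok", "w") as fout:
    for line in fin:
        # encode_as_pieces splits each sentence into BPE segments
        fout.write(" ".join(sp.encode_as_pieces(line.strip())) + "\n")
```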
Next, modify the model config.py:

```python
base_params = {
    "use_horovod": False,
    "num_gpus": 1,
    ...
    "logdir": "checkpoint/model",
}
...
infer_params = {
    "batch_size_per_gpu": 256,
    "data_layer": ParallelTextDataLayer,
    "data_layer_params": {
        "src_vocab_file": "m_common.vocab",
        "tgt_vocab_file": "m_common.vocab",
        "source_file": "source.tok",
        "target_file": "source.tok",  # this line will be ignored
        "delimiter": " ",
        "shuffle": False,
        "repeat": False,
        "max_length": 1024,
    },
}
...
```
2. Translate source.tok into output.tok:
python run.py --config_file=config.py --mode=infer --logdir=checkpoint/model --infer_output_file=output.tok --num_gpus=1
3. Detokenize output.tok:
python tokenizer_wrapper.py --mode=detokenize --model_prefix=m_common --text_input=output.tok --decoded_output=output.txt