Machine Translation
Models
Currently we support the following models:
| Model description | SacreBLEU (cased) | Config file | Checkpoint |
|---|---|---|---|
| Transformer-big | 28.0 | transformer-nvgrad.py | link |
| Transformer | 26.4 | transformer-base.py | link |
| ConvS2S | 25.0 | en-de-convs2s-8-gpu.py | link |
| GNMT | 23.0 | en-de-gnmt-like-4GPUs.py | TBD |
These models were trained with a BPE vocabulary for text tokenization, which is available in wmt16.tar.gz. Note that to use a pretrained model you will need the same vocabulary that was used during training. The model and training parameters can be found in the corresponding config file. We measure BLEU scores using SacreBLEU on detokenized output (cased).
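To verify that you have the matching vocabulary, you can load the SentencePiece model shipped in wmt16.tar.gz (a minimal sketch, assuming a recent sentencepiece Python package and the extracted m_common.model file):

```python
import sentencepiece as spm

# m_common.model is extracted from wmt16.tar.gz
sp = spm.SentencePieceProcessor(model_file="m_common.model")
print(sp.vocab_size())  # should be 32768 for these models
```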
Getting started
For the simplest example, using toy data (a string-reversal task), please refer to the toy models.
Next, let's build a small English-German translation model. This model should train in a reasonable amount of time on a single GPU.
Get data
Download (this will take some time):
scripts/get_en_de.sh
This script will download English-German training data from WMT, clean it, and tokenize it using Google's SentencePiece library. By default, we use a vocabulary size of 32,768 for both English and German.
You can also download the pre-processed dataset that we used for training: wmt16.tar.gz.
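For reference, the tokenization step of scripts/get_en_de.sh amounts to training one SentencePiece BPE model shared by English and German. A minimal sketch, assuming the sentencepiece Python package; the input file names are illustrative:

```python
import sentencepiece as spm

# Train one shared BPE model on the cleaned English and German corpora
spm.SentencePieceTrainer.train(
    input="train.clean.en,train.clean.de",  # illustrative file names
    model_prefix="m_common",  # writes m_common.model and m_common.vocab
    vocab_size=32768,
    model_type="bpe",
)
```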
Training
To train a small English-German model:
- change data_root inside en-de-nmt-small.py to the WMT data location (see the sketch after this list)
- adjust num_gpus to train on more than one GPU (if available)
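After these edits, the relevant part of example_configs/text2text/en-de-nmt-small.py would look roughly like this (a sketch; only the two fields mentioned above are shown, and the path is illustrative):

```python
data_root = "/data/wmt16_de_en/"  # illustrative: wherever scripts/get_en_de.sh put the data

base_params = {
    "num_gpus": 1,  # increase to use more GPUs
    # ...
}
```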
Start training:
python run.py --config_file=example_configs/text2text/en-de-nmt-small.py --mode=train_eval
If your GPU does not have enough memory, reduce batch_size_per_gpu. Also, you might want to disable parallel evaluation by using --mode=train.
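For example, to train without the parallel evaluation pass:
python run.py --config_file=example_configs/text2text/en-de-nmt-small.py --mode=train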
Inference
Once training is done (this can take a while on a single GPU), you can run inference:
python run.py --config_file=example_configs/text2text/en-de-nmt-small.py --mode=infer --infer_output_file=raw.txt --num_gpus=1
Note that the model output is tokenized; in our case it consists of BPE segments instead of words. Therefore, the next step is to detokenize:
python tokenizer_wrapper.py --mode=detokenize --model_prefix=.../Data/wmt16_de_en/m_common --decoded_output=result.txt --text_input=raw.txt
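Conceptually, detokenization just asks SentencePiece to merge the BPE pieces back into words. A minimal sketch of the equivalent operation, assuming the sentencepiece Python package; the model path is illustrative, and tokenizer_wrapper.py itself may differ in details:

```python
import sentencepiece as spm

# Load the same model that produced the BPE segments
sp = spm.SentencePieceProcessor(model_file="m_common.model")

with open("raw.txt") as fin, open("result.txt", "w") as fout:
    for line in fin:
        # decode_pieces merges BPE segments back into plain words
        fout.write(sp.decode_pieces(line.split()) + "\n")
```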
Computing BLEU scores
We measure BLEU scores using the SacreBLEU package (see "A Call for Clarity in Reporting BLEU Scores"). Run SacreBLEU on the detokenized data:
cat result.txt | sacrebleu -t wmt14 -l en-de > result.txt.BLEU
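The same score can also be computed from Python via sacrebleu's API (a sketch; the reference file name here is hypothetical, and the command-line form above remains the canonical way):

```python
import sacrebleu

hyps = [line.strip() for line in open("result.txt", encoding="utf-8")]
# Hypothetical file holding the wmt14 en-de reference translations
refs = [line.strip() for line in open("wmt14.ref.de", encoding="utf-8")]

bleu = sacrebleu.corpus_bleu(hyps, [refs])  # takes a list of reference streams
print(bleu.score)
```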
Using pretrained models
All models were trained with a specific version of the tokenizer, so the first step is to copy m_common.model and m_common.vocab to the current folder.
To translate your English text source.txt to German, you should:
1. Tokenize source.txt into source.tok:
python tokenizer_wrapper.py --mode=encode --model_prefix=m_common --text_input=source.txt --tokenized_output=source.tok --vocab_size=32768
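For intuition, --mode=encode is the inverse of the detokenization step shown earlier; roughly (again a sketch, assuming the sentencepiece Python package):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="m_common.model")

with open("source.txt") as fin, open("source.tok", "w") as fout:
    for line in fin:
        # encode_as_pieces splits each sentence into BPE segments
        fout.write(" ".join(sp.encode_as_pieces(line.strip())) + "\n")
```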
Next, modify the model config.py:

```python
base_params = {
    "use_horovod": False,
    "num_gpus": 1,
    ...
    "logdir": "checkpoint/model",
}
...
infer_params = {
    "batch_size_per_gpu": 256,
    "data_layer": ParallelTextDataLayer,
    "data_layer_params": {
        "src_vocab_file": "m_common.vocab",
        "tgt_vocab_file": "m_common.vocab",
        "source_file": "source.tok",
        "target_file": "source.tok",  # this line will be ignored
        "delimiter": " ",
        "shuffle": False,
        "repeat": False,
        "max_length": 1024,
    },
}
...
```
2. Translate source.tok into output.tok:
python run.py --config_file=config.py --mode=infer --logdir=checkpoint/model --infer_output_file=output.tok --num_gpus=1
3. Detokenize output.tok:
python tokenizer_wrapper.py --mode=detokenize --model_prefix=m_common --text_input=output.tok --decoded_output=output.txt