GNMT¶
Model¶
- We have two RNN-based models (a minimal architecture sketch of the small model follows this list):
  - small NMT model (config en-de-nmt-small.py):
    - the embedding size for source and target is 512
    - 2 bidirectional LSTM layers in the encoder and 2 LSTM layers in the decoder, with state size 512
    - an attention mechanism of size 512
  - GNMT-like model based on Google NMT (config en-de-gnmt-like-4GPUs.py):
    - the embedding size for source and target is 1024
    - 8 LSTM layers in the encoder and 8 LSTM layers in the decoder, with state size 1024
    - residual connections in the encoder and decoder
    - the first encoder layer is bidirectional
    - GNMTv2 attention mechanism
    - attention layer size of 1024
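The following is a minimal Keras sketch of the small model's shape, not the OpenSeq2Seq implementation (which is built from the Python config files above): the vocabulary sizes are placeholders, and attention is applied here as a single pass over the full decoder outputs rather than step by step inside the decoder. It only illustrates how the 512-sized embeddings, the 2 bidirectional encoder layers, the 2 decoder layers, and the 512-sized additive attention fit together.

```python
import tensorflow as tf

# Hypothetical sizes; the real vocabularies come from the tokenizer/BPE setup.
SRC_VOCAB, TGT_VOCAB = 32000, 32000
EMB, UNITS, ATTN = 512, 512, 512

# Encoder: embedding + 2 bidirectional LSTM layers (512 units per direction).
src = tf.keras.Input(shape=(None,), dtype="int32", name="source_ids")
x = tf.keras.layers.Embedding(SRC_VOCAB, EMB)(src)
for _ in range(2):
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(UNITS, return_sequences=True))(x)

# Decoder: embedding + 2 LSTM layers with 512-unit state.
tgt = tf.keras.Input(shape=(None,), dtype="int32", name="target_ids")
y = tf.keras.layers.Embedding(TGT_VOCAB, EMB)(tgt)
for _ in range(2):
    y = tf.keras.layers.LSTM(UNITS, return_sequences=True)(y)

# Additive attention of size 512 over the (projected) encoder states.
memory = tf.keras.layers.Dense(ATTN)(x)
context = tf.keras.layers.AdditiveAttention()([y, memory])
logits = tf.keras.layers.Dense(TGT_VOCAB)(
    tf.keras.layers.Concatenate()([y, context]))

model = tf.keras.Model([src, tgt], logits)
model.summary()
```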
Training¶
- Both models were trained with Adam. The small model uses the following training parameters (the LARC rule and the learning-rate decay are sketched after this list):
  - initial learning rate of 0.001
  - Layer-wise Adaptive Rate Clipping (LARC) for gradient clipping
  - dropout of 0.2
- The large model was trained with the following parameters:
  - learning rate starting from 0.0008 with staircase decay of 0.5 (the Luong10 scheme)
  - dropout of 0.2
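The two pieces above that are easiest to misread are the LARC clipping rule and the staircase decay. The sketch below shows only the underlying arithmetic: the trust coefficient, decay start, and decay interval are illustrative placeholders rather than values taken from the configs, and the real training applies the clipped rate through Adam, not the plain SGD update used here for brevity.

```python
import numpy as np

def larc_clipped_step(param, grad, global_lr, trust_coef=0.001, eps=1e-8):
    """LARC in 'clipping' mode (sketch): cap the per-layer learning rate at
    trust_coef * ||w|| / ||g||, so no layer takes a step that is large
    relative to its own weight norm. trust_coef/eps are illustrative
    defaults; the plain SGD update below is only for demonstration."""
    local_lr = trust_coef * np.linalg.norm(param) / (np.linalg.norm(grad) + eps)
    effective_lr = min(global_lr, local_lr)
    return param - effective_lr * grad

def staircase_lr(step, lr0=0.0008, begin_decay_at=20000, decay_steps=10000, rate=0.5):
    """Luong10-style staircase decay (sketch): hold the initial rate, then
    halve it every `decay_steps` steps once `begin_decay_at` is reached.
    begin_decay_at/decay_steps are placeholders, not the configured values."""
    if step < begin_decay_at:
        return lr0
    return lr0 * rate ** ((step - begin_decay_at) // decay_steps)
```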
Mixed Precision¶
Convergence of the GNMT-like model in float32 and in mixed precision is almost exactly the same.
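As a rough sketch, mixed precision is switched on in the model config rather than in the model code; the key names below ("dtype", "loss_scaling") follow the usual OpenSeq2Seq convention but should be verified against the shipped en-de configs.

```python
# Fragment of a model config (sketch): enable mixed precision training.
# Key names and the "Backoff" loss-scaling policy are assumptions to check
# against the actual en-de-gnmt-like-4GPUs.py config.
base_params = {
    # ... other model, optimizer, and data parameters ...
    "dtype": "mixed",           # float16 compute with float32 master weights
    "loss_scaling": "Backoff",  # dynamic loss scaling to avoid float16 underflow
}
```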