Multi-GPU and Distributed Training

OpenSeq2Seq supports two modes for parallel training: a simple multi-tower approach and a Horovod-based approach.

Standard TensorFlow distributed training

For multi-GPU training with the native distributed TensorFlow approach, you need to set use_horovod: False and the num_gpus parameter in the configuration file. To start training, use the run.py script:

python run.py --config_file=... --mode=train_eval
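
For reference, here is a minimal sketch of the relevant configuration entries. OpenSeq2Seq config files are Python files; the num_gpus value of 4 is only an example, and the remaining base_params entries depend on your model:

base_params = {
    "use_horovod": False,   # multi-tower mode uses the native distributed TensorFlow path
    "num_gpus": 4,          # number of GPUs to use (example value)
    # ... model, optimizer and data layer parameters go here
}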

Horovod

To use Horovod you will need to set use_horovod: True in the config and launch run.py with mpiexec (or mpirun):

mpiexec -np <num_gpus> python run.py --config_file=... --mode=train_eval --use_horovod=True --enable_logs
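
The corresponding config change is a single flag; a minimal sketch (other base_params entries omitted):

base_params = {
    "use_horovod": True,    # hand GPU assignment over to Horovod/MPI
    # num_gpus is not needed here: with Horovod the GPU count comes from mpiexec -np
    # ... model, optimizer and data layer parameters go here
}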

You can use Horovod both for multi-GPU and for multi-node training.
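
For example, a multi-node run could list the hosts explicitly via standard MPI options; the hostnames and slot counts below are placeholders, not OpenSeq2Seq-specific settings:

mpiexec -np 8 -H node1:4,node2:4 python run.py --config_file=... --mode=train_eval --use_horovod=True --enable_logs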

Note

The num_gpus parameter will be ignored when use_horovod is set to True. In that case the number of GPUs to use is specified on the command line with the mpiexec -np argument.