Multi-GPU and Distributed Training
OpenSeq2Seq supports two modes of parallel training: a simple multi-tower approach and a Horovod-based approach.
Standard TensorFlow distributed training
For multi-GPU training with the native Distributed TensorFlow approach, set use_horovod: False and num_gpus to the desired number of GPUs in the configuration file. To start training, use the run.py script:

python run.py --config_file=... --mode=train_eval
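For reference, the relevant entries of a configuration file (OpenSeq2Seq configs are Python files) might look like the sketch below; it assumes the usual base_params dictionary, and all other model, optimizer and data layer parameters are omitted:

base_params = {
    "use_horovod": False,  # native Distributed TensorFlow mode
    "num_gpus": 4,         # number of GPUs on this machine to use
    # ... remaining model / optimizer / data layer parameters ...
}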
Horovod
To use Horovod, set use_horovod: True in the config and launch training with mpiexec (or mpirun):

mpiexec -np <num_gpus> python run.py --config_file=... --mode=train_eval --use_horovod=True --enable_logs
You can use Horovod both for multi-GPU and for multi-node training.
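For example, a multi-node launch might look like the following sketch (assuming Open MPI; node1 and node2 are placeholder hostnames, each contributing 4 GPU slots via the -H option):

mpiexec -np 8 -H node1:4,node2:4 python run.py --config_file=... --mode=train_eval --use_horovod=True --enable_logs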
Note

The num_gpus parameter is ignored when use_horovod is set to True. In that case the number of GPUs to use is specified on the command line through the mpiexec/mpirun arguments (e.g. -np).
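To illustrate the note, a Horovod-mode config only needs the flag itself; any num_gpus entry would be ignored (sketch with the surrounding parameters omitted):

base_params = {
    "use_horovod": True,  # GPU count is taken from mpiexec's -np, not from num_gpus
    # ... remaining parameters ...
}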