Image Classification

Model

Our ResNet-50 v2 model is a mixed precision replica of TensorFlow ResNet-50, which corresponds to the model defined in the paper "Identity Mappings in Deep Residual Networks" by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun (Jul 2016).

This model was trained with different optimizers to state-of-the-art accuracy for the ResNet-50 architecture. Our best model reached top-1=77.63%, top-5=93.73% accuracy on the ImageNet classification task.

Get data

You will need to download the ImageNet dataset and convert it to TFRecord format as described in `TensorFlow ResNet <https://github.com/tensorflow/models/tree/master/official/resnet>`_.
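The preprocessing scripts in that repository shard the converted dataset into many TFRecord files. A minimal sketch of the resulting file layout, assuming the script defaults of 1024 training and 128 validation shards (your counts may differ):

```python
# Sketch of the shard layout produced by the TensorFlow ImageNet
# preprocessing scripts. Shard counts below are the script defaults
# and are an assumption, not something this repo enforces.
def shard_names(prefix, num_shards):
    """Return the TFRecord file names for one dataset split."""
    return ["%s-%05d-of-%05d" % (prefix, i, num_shards)
            for i in range(num_shards)]

train_files = shard_names("train", 1024)    # train-00000-of-01024, ...
val_files = shard_names("validation", 128)  # validation-00000-of-00128, ...
print(train_files[0], val_files[-1])
```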

Training

Let’s train a model using SGD with momentum. To train the model on 1 GPU with float32 precision:

python run.py --config_file=example_configs/image2label/resnet-50-v2.py --mode=train_eval

If your GPU does not have enough memory, you can reduce the batch_size_per_gpu:

python run.py --config_file=example_configs/image2label/resnet-50-v2.py --mode=train_eval --batch_size_per_gpu=32
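Keep in mind that lowering batch_size_per_gpu also lowers the effective (global) batch size, which changes the number of steps per epoch and may call for retuning the learning rate. A quick back-of-the-envelope sketch (function names here are illustrative, not OpenSeq2Seq internals):

```python
# How batch_size_per_gpu interacts with the number of GPUs.
def effective_batch(batch_size_per_gpu, num_gpus):
    """Global batch size seen by the optimizer per step."""
    return batch_size_per_gpu * num_gpus

def steps_per_epoch(dataset_size, batch_size_per_gpu, num_gpus):
    """Training steps needed to see the whole dataset once."""
    total = effective_batch(batch_size_per_gpu, num_gpus)
    return (dataset_size + total - 1) // total  # ceiling division

# ImageNet has 1,281,167 training images.
print(steps_per_epoch(1281167, 32, 1))
```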

Multi-GPU training

If you have 2 GPUs, you can use "native" TensorFlow multi-GPU training by setting num_gpus:

python run.py --config_file=example_configs/image2label/resnet-50-v2.py --mode=train_eval --use_horovod=False --num_gpus=2

or you can use Horovod (the -np flag defines the number of GPUs):

mpirun --allow-run-as-root --mca orte_base_help_aggregate 0 -mca btl ^openib -np 2 -H localhost:8 -bind-to none --map-by slot -x LD_LIBRARY_PATH python run.py --config_file=example_configs/image2label/resnet-50-v2.py --mode=train_eval --use_horovod=True
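With Horovod, each process launched by mpirun drives one GPU and reads its own disjoint slice of the data, which is why -np sets the GPU count. A toy sketch of that rank-based sharding idea (shard_for_rank is illustrative, not Horovod's API):

```python
# Conceptually, each of the -np processes gets a rank in [0, size) and
# reads every size-th example starting at its rank, so no two processes
# see the same data within an epoch.
def shard_for_rank(items, rank, size):
    """The slice of the dataset that a process with this rank reads."""
    return items[rank::size]

dataset = list(range(10))
shards = [shard_for_rank(dataset, r, 2) for r in range(2)]
print(shards)  # two disjoint halves covering the whole dataset
```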

Training with Mixed Precision

If you have a Volta or Turing GPU, which supports float16, you can speed up training by using mixed precision:

python run.py --config_file=example_configs/image2label/resnet-50-v2-mp.py --mode=train_eval --use_horovod=False --num_gpus=2
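One reason mixed precision training needs care is that small gradient values underflow to zero in float16; loss scaling (multiplying the loss up before the backward pass and dividing the gradients back down in float32) keeps them representable. A minimal stdlib illustration of the underflow and the fix (this only shows the idea; the -mp config handles loss scaling internally):

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack('e', struct.pack('e', x))[0]

grad = 1e-8
print(to_fp16(grad))            # 0.0 -- the gradient underflows in float16

scale = 1024.0
scaled = to_fp16(grad * scale)  # survives as a float16 subnormal
restored = scaled / scale       # unscale in higher precision
print(restored)                 # close to the original 1e-8
```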

Checkpoints

We have trained ResNet-50 with 3 optimizers:

Optimizer          Training epochs   top-1, %   top-5, %   Config file   Checkpoint   Log
SGD with momentum  100               76.38      93.08      sgd_100       checkpoint   log
AdamW              100               76.36      93.01      adamw_100     checkpoint   log
NovoGrad           100               77.00      93.37      nvgrad_100    checkpoint   log
NovoGrad           300               77.63      93.73      nvgrad_300    checkpoint   log

Detailed training parameters are in the corresponding configuration files.
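As a reminder of what the top-1 / top-5 columns in the table measure: a prediction counts as top-k correct when the true label is among the k highest-scoring classes. A minimal sketch with plain Python lists standing in for logits:

```python
# Top-k correctness for a single example: rank the classes by score
# and check whether the true label appears in the first k.
def topk_correct(logits, label, k):
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return label in ranked[:k]

logits = [0.1, 0.5, 0.2, 0.9, 0.3]  # toy scores for 5 classes
print(topk_correct(logits, 3, 1))   # True: class 3 has the highest score
print(topk_correct(logits, 0, 3))   # False: class 0 is not in the top 3
```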