Image Classification¶
Model¶
Our ResNet-50 v2 model is a mixed-precision replica of TensorFlow ResNet-50, which corresponds to the model defined in the paper Identity Mappings in Deep Residual Networks by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Jul 2016.
This model was trained with several optimizers to state-of-the-art accuracy for the ResNet-50 model. Our best model reached top-1 = 77.63% and top-5 = 93.73% accuracy on the ImageNet classification task.
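For reference, a prediction counts toward top-1 accuracy when the true label is the highest-scoring class, and toward top-5 accuracy when it is among the five highest-scoring classes. The following is a minimal pure-Python sketch of that metric. The function name `top_k_accuracy` and the toy scores are illustrative only and are not part of this toolkit.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: list of per-class score lists, one list per sample.
    labels: list of integer class ids, one per sample.
    """
    if not labels or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equally long")
    hits = 0
    for sample_scores, label in zip(scores, labels):
        # Class indices ordered by descending score; keep the first k.
        ranked = sorted(range(len(sample_scores)),
                        key=lambda c: sample_scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy example: 3 samples, 4 classes (hypothetical values).
toy_scores = [
    [0.10, 0.60, 0.20, 0.10],  # best class: 1
    [0.50, 0.10, 0.30, 0.10],  # best class: 0, runner-up: 2
    [0.25, 0.15, 0.10, 0.50],  # best class: 3, runner-up: 0
]
toy_labels = [1, 2, 0]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
print(f"top-1 = {top1:.2%}, top-2 = {top2:.2%}")
```

With 1000 ImageNet classes the same function applies unchanged with k=1 and k=5.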
Get data¶
You will need to download the ImageNet dataset and convert it to TFRecord format as described in `TensorFlow ResNet <https://github.com/tensorflow/models/tree/master/official/resnet>`_.
Training¶
Let’s train a model using SGD with momentum. To train the model on 1 GPU in single (float32) precision:
python run.py --config_file=example_configs/image2label/resnet-50-v2.py --mode=train_eval
If your GPU does not have enough memory, you can reduce batch_size_per_gpu:
python run.py --config_file=example_configs/image2label/resnet-50-v2.py --mode=train_eval --batch_size_per_gpu=32
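As a rough guide to what this setting changes: in synchronous data-parallel training, the number of images consumed per step is batch_size_per_gpu multiplied by the number of GPUs, so a smaller value means more steps per epoch. The sketch below does that arithmetic. The helper name `steps_per_epoch` is hypothetical, 1,281,167 is the size of the standard ImageNet-1k training set, and counting the final partial batch as a step is an assumption; the actual data pipeline may handle it differently.

```python
IMAGENET_TRAIN_IMAGES = 1_281_167  # ImageNet-1k (ILSVRC-2012) training set size


def steps_per_epoch(batch_size_per_gpu, num_gpus=1,
                    num_images=IMAGENET_TRAIN_IMAGES):
    """Steps needed to see every image once, counting a final partial batch."""
    if batch_size_per_gpu < 1 or num_gpus < 1:
        raise ValueError("batch_size_per_gpu and num_gpus must be positive")
    global_batch = batch_size_per_gpu * num_gpus
    # Integer ceiling division avoids floating-point rounding.
    return (num_images + global_batch - 1) // global_batch


for gpus in (1, 2):
    print(f"batch_size_per_gpu=32, num_gpus={gpus}: "
          f"{steps_per_epoch(32, num_gpus=gpus)} steps per epoch")
```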
Multi-GPU training¶
If you have 2 GPUs, then you can use “native” TensorFlow multi-GPU training by setting num_gpus:
python run.py --config_file=example_configs/image2label/resnet-50-v2.py --mode=train_eval --use_horovod=False --num_gpus=2
or you can use Horovod (the -np flag defines the number of GPUs):
mpirun --allow-run-as-root --mca orte_base_help_aggregate 0 -mca btl ^openib -np 2 -H localhost:8 -bind-to none --map-by slot -x LD_LIBRARY_PATH python run.py --config_file=example_configs/image2label/resnet-50-v2.py --mode=train_eval --use_horovod=True
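If you launch Horovod runs with varying GPU counts, the command above can be assembled programmatically. The sketch below only builds and prints the argument list; it reuses the flags from the command above verbatim, including the `localhost:8` host entry as the default. The function name `horovod_command` and its parameters are hypothetical and not part of this toolkit.

```python
import shlex


def horovod_command(num_gpus, config_file, slots=8):
    """Build the mpirun argument list for a single-node Horovod run."""
    if not 1 <= num_gpus <= slots:
        raise ValueError("num_gpus must be between 1 and the slot count")
    return [
        "mpirun", "--allow-run-as-root",
        "--mca", "orte_base_help_aggregate", "0",
        "-mca", "btl", "^openib",
        "-np", str(num_gpus),            # one process per GPU
        "-H", f"localhost:{slots}",      # host entry, as in the command above
        "-bind-to", "none", "--map-by", "slot",
        "-x", "LD_LIBRARY_PATH",
        "python", "run.py",
        f"--config_file={config_file}",
        "--mode=train_eval",
        "--use_horovod=True",
    ]


cmd = horovod_command(2, "example_configs/image2label/resnet-50-v2.py")
# shlex.join (Python 3.8+) quotes arguments so the line can be pasted into a shell.
print(shlex.join(cmd))
```

The returned list can also be passed directly to `subprocess.run(cmd, check=True)` to start the run from Python.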
Training with Mixed Precision¶
If you have a Volta or Turing GPU, which supports float16, you can speed up training by using mixed precision:
python run.py --config_file=example_configs/image2label/resnet-50-v2-mp.py --mode=train_eval --use_horovod=False --num_gpus=2
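Mixed-precision setups typically pair float16 arithmetic with loss scaling, because very small gradient values are not representable in float16. The sketch below illustrates the effect with Python's `struct` half-precision format: a small value underflows to zero in float16 but survives when scaled up first. This is a standalone illustration, not the toolkit's implementation; the helper name and the scale factor 1024 are arbitrary choices, and the scaling strategy actually used for this model is whatever the mixed-precision configuration file specifies.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


tiny_grad = 1e-8     # below half of the smallest float16 subnormal (~6e-8)
loss_scale = 1024.0  # arbitrary power of two chosen for illustration

unscaled = to_float16(tiny_grad)             # underflows in float16
scaled = to_float16(tiny_grad * loss_scale)  # representable after scaling
recovered = scaled / loss_scale              # unscale in higher precision

print(f"float16 without scaling: {unscaled}")
print(f"float16 with scaling, then unscaled: {recovered:.3e}")
```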
Checkpoints¶
We have trained ResNet-50 with 3 optimizers:
Optimizer | Training epochs | top-1, % | top-5, % | Config file | Checkpoint | Log |
---|---|---|---|---|---|---|
SGD with momentum | 100 | 76.38 | 93.08 | sgd_100 | checkpoint | log |
AdamW | 100 | 76.36 | 93.01 | adamw_100 | checkpoint | log |
NovoGrad | 100 | 77.00 | 93.37 | nvgrad_100 | checkpoint | log |
NovoGrad | 300 | 77.63 | 93.73 | nvgrad_300 | checkpoint | log |
Detailed training parameters are in the corresponding configuration files.