[1]:
# Copyright 2020 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================


# Object Detection with TRTorch (SSD)


Overview

In PyTorch 1.0, TorchScript was introduced as a way to separate your PyTorch model from Python, making it portable and optimizable.

TRTorch is a compiler that uses TensorRT (NVIDIA’s Deep Learning Optimization SDK and Runtime) to optimize TorchScript code. It compiles standard TorchScript modules into ones that internally run with TensorRT optimizations.

TensorRT can take models from any major framework and tune them to perform better on specific NVIDIA target hardware, and TRTorch enables us to remain in the PyTorch ecosystem whilst doing so. This allows us to leverage the great features of PyTorch, including module composability, its flexible tensor implementation, data loaders and more. TRTorch is available for use with both PyTorch and LibTorch.

For more background, we suggest the lenet-getting-started notebook as a primer for getting started with TRTorch.

Learning objectives

This notebook demonstrates the steps for compiling a TorchScript module with TRTorch on a pretrained SSD network, and running it to test the speedup obtained.

Contents

  1. Requirements

  2. SSD Overview

  3. Creating TorchScript modules

  4. Compiling with TRTorch

  5. Running Inference

  6. Measuring Speedup

  7. Conclusion


## 1. Requirements

Follow the steps in notebooks/README to prepare a Docker container, within which you can run this demo notebook.

In addition, run the following cell to install the additional libraries this demo requires.

[ ]:
# Known working versions
!pip install numpy==1.21.2 scipy==1.5.2 Pillow==6.2.0 scikit-image==0.17.2 matplotlib==3.3.0

## 2. SSD Overview

Single Shot MultiBox Detector model for object detection


PyTorch has a model repository called the PyTorch Hub, which is a source for high quality implementations of common models. We can get our SSD model pretrained on COCO from there.

Model Description

This SSD300 model is based on the SSD: Single Shot MultiBox Detector paper, which describes SSD as “a method for detecting objects in images using a single deep neural network”. The input size is fixed to 300x300.

The main difference between this model and the one described in the paper is in the backbone: the original VGG backbone is replaced by a ResNet-50 model.

Following the Speed/accuracy trade-offs for modern convolutional object detectors paper, the following enhancements were made to the backbone:

  * The conv5_x, avgpool, fc and softmax layers were removed from the original classification model.

  * All strides in conv4_x are set to 1x1.

The backbone is followed by 5 additional convolutional layers. In addition to the convolutional layers, we attached 6 detection heads:

  * The first detection head is attached to the last conv4_x layer.

  * The other five detection heads are attached to the corresponding 5 additional layers.

The detection heads are similar to the ones referenced in the paper; however, they are enhanced by additional BatchNorm layers after each convolution.

More information about this SSD model is available in NVIDIA's DeepLearningExamples repository on GitHub: https://github.com/NVIDIA/DeepLearningExamples

[3]:
import torch

# Workaround: bypass Torch Hub's forked-repo validation so the listing below succeeds
torch.hub._validate_not_a_forked_repo = lambda a, b, c: True
[4]:
# List of available models in PyTorch Hub from Nvidia/DeepLearningExamples
torch.hub.list('NVIDIA/DeepLearningExamples:torchhub')
Downloading: "https://github.com/NVIDIA/DeepLearningExamples/archive/torchhub.zip" to /root/.cache/torch/hub/torchhub.zip
[4]:
['nvidia_ssd',
 'nvidia_ssd_processing_utils',
 'nvidia_tacotron2',
 'nvidia_tts_utils',
 'nvidia_waveglow']
[5]:
# load SSD model pretrained on COCO from Torch Hub
precision = 'fp32'
ssd300 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_ssd', model_math=precision);
Using cache found in /root/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub
Downloading checkpoint from https://api.ngc.nvidia.com/v2/models/nvidia/ssd_pyt_ckpt_amp/versions/20.06.0/files/nvidia_ssdpyt_amp_200703.pt

Setting precision="fp16" will load a checkpoint trained with mixed precision into an architecture enabling execution on Tensor Cores. Handling mixed-precision data requires the Apex library.
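To sanity-check the architecture described above (a ResNet-50 feature extractor followed by additional convolutional blocks and detection heads), we can list the model's top-level submodules. This is just an illustrative inspection; the submodule names are whatever the Hub implementation defines and may differ between releases.

[ ]:
# List the top-level submodules of the loaded SSD300 model.
# Names depend on the Hub implementation and may vary between releases.
for name, module in ssd300.named_children():
    print(name, "->", type(module).__name__)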

Sample Inference

We can now run inference on the model. This is demonstrated below using sample images from the COCO 2017 Validation set.

[6]:
# Sample images from the COCO validation set
uris = [
    'http://images.cocodataset.org/val2017/000000397133.jpg',
    'http://images.cocodataset.org/val2017/000000037777.jpg',
    'http://images.cocodataset.org/val2017/000000252219.jpg'
]

# For convenient and comprehensive formatting of input and output of the model, load a set of utility methods.
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_ssd_processing_utils')

# Format images to comply with the network input
inputs = [utils.prepare_input(uri) for uri in uris]
tensor = utils.prepare_tensor(inputs, False)  # False: do not convert inputs to FP16

# The model was trained on COCO dataset, which we need to access in order to
# translate class IDs into object names.
classes_to_labels = utils.get_coco_object_dictionary()
Using cache found in /root/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub
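For reference, the sketch below approximates what prepare_input does. This is an assumption-laden re-implementation for illustration only (the real utility may handle resizing, cropping and normalization differently); the /2 + 0.5 de-normalization used later in plot_results suggests inputs are scaled to [-1, 1].

[ ]:
import numpy as np
import requests
from io import BytesIO
from PIL import Image

def prepare_input_sketch(uri, size=300):
    # Hypothetical stand-in for utils.prepare_input, for illustration only.
    img = Image.open(BytesIO(requests.get(uri).content)).convert("RGB")
    img = img.resize((size, size))                   # SSD300 expects 300x300 inputs
    arr = np.asarray(img, dtype=np.float32) / 255.0  # scale to [0, 1]
    return arr * 2.0 - 1.0                           # scale to [-1, 1], HWC layout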
[7]:
# Next, we run object detection
model = ssd300.eval().to("cuda")
detections_batch = model(tensor)

# By default, the raw output of the SSD network contains 8732 boxes per input image,
# each with localization and a class probability distribution.
# Let's filter this output to keep only confident detections (confidence > 40%) in a more usable format.
results_per_input = utils.decode_results(detections_batch)
best_results_per_input = [utils.pick_best(results, 0.40) for results in results_per_input]
/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  ../c10/core/TensorImpl.h:1153.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)

Visualize results

[8]:
from matplotlib import pyplot as plt
import matplotlib.patches as patches

# The utility plots the images and predicted bounding boxes (with confidence scores).
def plot_results(best_results):
    for image_idx in range(len(best_results)):
        fig, ax = plt.subplots(1)
        # Show original, denormalized image...
        image = inputs[image_idx] / 2 + 0.5
        ax.imshow(image)
        # ...with detections
        bboxes, classes, confidences = best_results[image_idx]
        for idx in range(len(bboxes)):
            left, bot, right, top = bboxes[idx]
            x, y, w, h = [val * 300 for val in [left, bot, right - left, top - bot]]
            rect = patches.Rectangle((x, y), w, h, linewidth=1, edgecolor='r', facecolor='none')
            ax.add_patch(rect)
            ax.text(x, y, "{} {:.0f}%".format(classes_to_labels[classes[idx] - 1], confidences[idx]*100), bbox=dict(facecolor='white', alpha=0.5))
    plt.show()

[9]:
# Visualize results without TRTorch/TensorRT
plot_results(best_results_per_input)
[Output: the three sample images with predicted bounding boxes and confidence scores]

Benchmark utility

[10]:
import time
import numpy as np

import torch.backends.cudnn as cudnn
cudnn.benchmark = True

# Helper function to benchmark the model
def benchmark(model, input_shape=(1024, 1, 32, 32), dtype='fp32', nwarmup=50, nruns=1000):
    input_data = torch.randn(input_shape)
    input_data = input_data.to("cuda")
    if dtype=='fp16':
        input_data = input_data.half()

    print("Warm up ...")
    with torch.no_grad():
        for _ in range(nwarmup):
            features = model(input_data)
    torch.cuda.synchronize()
    print("Start timing ...")
    timings = []
    with torch.no_grad():
        for i in range(1, nruns+1):
            start_time = time.time()
            pred_loc, pred_label  = model(input_data)
            torch.cuda.synchronize()
            end_time = time.time()
            timings.append(end_time - start_time)
            if i%10==0:
                print('Iteration %d/%d, avg batch time %.2f ms'%(i, nruns, np.mean(timings)*1000))

    print("Input shape:", input_data.size())
    print("Output location prediction size:", pred_loc.size())
    print("Output label prediction size:", pred_label.size())
    print('Average batch time: %.2f ms'%(np.mean(timings)*1000))

We check how well the model performs before we use TRTorch/TensorRT.

[11]:
# Model benchmark without TRTorch/TensorRT
model = ssd300.eval().to("cuda")
benchmark(model, input_shape=(128, 3, 300, 300), nruns=100)
Warm up ...
Start timing ...
Iteration 10/100, avg batch time 382.30 ms
Iteration 20/100, avg batch time 382.72 ms
Iteration 30/100, avg batch time 382.63 ms
Iteration 40/100, avg batch time 382.83 ms
Iteration 50/100, avg batch time 382.90 ms
Iteration 60/100, avg batch time 382.86 ms
Iteration 70/100, avg batch time 382.88 ms
Iteration 80/100, avg batch time 382.86 ms
Iteration 90/100, avg batch time 382.95 ms
Iteration 100/100, avg batch time 382.97 ms
Input shape: torch.Size([128, 3, 300, 300])
Output location prediction size: torch.Size([128, 4, 8732])
Output label prediction size: torch.Size([128, 81, 8732])
Average batch time: 382.97 ms
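To turn the average batch time into throughput, divide the batch size by the measured mean latency; for the numbers above:

[ ]:
# Throughput of the baseline model: 128 images per ~0.383 s batch
batch_size = 128
avg_batch_time_s = 0.38297
print("Throughput: %.1f images/s" % (batch_size / avg_batch_time_s))  # ~334 images/s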

## 3. Creating TorchScript modules

To compile with TRTorch, the model must first be in TorchScript. TorchScript is a programming language included in PyTorch that removes the Python dependency normal PyTorch models have. This conversion is done via a JIT compiler which, given a PyTorch Module, will generate an equivalent TorchScript Module. There are two paths that can be used to generate TorchScript: tracing and scripting.

  * Tracing follows the execution of PyTorch code and generates the corresponding ops in TorchScript.

  * Scripting analyzes the Python code and generates TorchScript; this allows the resulting graph to include control flow, which tracing cannot capture.

However, due to its simplicity, tracing is more likely to compile successfully with TRTorch (though both paths are supported).
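As a minimal aside (not part of the SSD workflow), the toy module below shows the difference: tracing records only the branch taken for the example input, while scripting preserves the control flow in the generated TorchScript.

[ ]:
class Flip(torch.nn.Module):
    def forward(self, x):
        # Data-dependent control flow: tracing bakes in one branch,
        # scripting keeps the if/else in the TorchScript graph.
        if x.sum() > 0:
            return x
        return -x

traced = torch.jit.trace(Flip(), torch.ones(2))   # warns: branch fixed at trace time
scripted = torch.jit.script(Flip())               # control flow preserved
print(scripted(torch.full((2,), -1.0)))           # tensor([1., 1.])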

[12]:
model = ssd300.eval().to("cuda")
traced_model = torch.jit.trace(model, [torch.randn((1,3,300,300)).to("cuda")])

If required, we can also save this model and use it independently of Python.

[13]:
# This is just an example, and not required for the purposes of this demo
torch.jit.save(traced_model, "ssd_300_traced.jit.pt")
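The saved module can later be reloaded with torch.jit.load (or from C++ via torch::jit::load), with no Python model definition required:

[ ]:
# Reload the serialized TorchScript module; the original model class is not needed.
reloaded = torch.jit.load("ssd_300_traced.jit.pt").to("cuda")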
[14]:
# Obtain the average time taken by a batch of input with the TorchScript module
benchmark(traced_model, input_shape=(128, 3, 300, 300), nruns=100)
Warm up ...
Start timing ...
Iteration 10/100, avg batch time 382.67 ms
Iteration 20/100, avg batch time 382.54 ms
Iteration 30/100, avg batch time 382.73 ms
Iteration 40/100, avg batch time 382.53 ms
Iteration 50/100, avg batch time 382.56 ms
Iteration 60/100, avg batch time 382.50 ms
Iteration 70/100, avg batch time 382.54 ms
Iteration 80/100, avg batch time 382.54 ms
Iteration 90/100, avg batch time 382.57 ms
Iteration 100/100, avg batch time 382.62 ms
Input shape: torch.Size([128, 3, 300, 300])
Output location prediction size: torch.Size([128, 4, 8732])
Output label prediction size: torch.Size([128, 81, 8732])
Average batch time: 382.62 ms

## 4. Compiling with TRTorch

TorchScript modules behave just like normal PyTorch modules and are intercompatible. From TorchScript we can now compile a TensorRT-based module. This module will still be implemented in TorchScript, but all of the computation will be done in TensorRT.

[15]:
import trtorch

# The compiled module will run in the precisions listed in "enabled_precisions".
# Here, both FP32 and FP16 kernels are enabled, letting TensorRT use FP16 where beneficial.
trt_model = trtorch.compile(traced_model, {
    "inputs": [trtorch.Input((3, 3, 300, 300))], # batch of 3 sample images
    "enabled_precisions": {torch.float, torch.half}, # Run with FP16
    "workspace_size": 1 << 20
})

## 5. Running Inference

Next, we run object detection

[16]:
# Using a TRTorch module is exactly the same as running inference in PyTorch, i.e. model(inputs)
detections_batch = trt_model(tensor.to(torch.half)) # convert the input to half precision

# By default, the raw output of the SSD network contains 8732 boxes per input image,
# each with localization and a class probability distribution.
# Let's filter this output to keep only confident detections (confidence > 40%) in a more usable format.
results_per_input = utils.decode_results(detections_batch)
best_results_per_input_trt = [utils.pick_best(results, 0.40) for results in results_per_input]

Now, let’s visualize our predictions!

[17]:
# Visualize results with TRTorch/TensorRT
plot_results(best_results_per_input_trt)
[Output: the same three images, with detections from the TRTorch-compiled model]

We get similar results as before!
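Beyond eyeballing the plots, a simple sanity check is to compare the number of confident detections per image between the FP32 baseline and the TRTorch FP16 results; small differences are expected due to reduced precision.

[ ]:
# Compare detection counts per image between the baseline and TRTorch results.
for idx, (fp32_res, trt_res) in enumerate(zip(best_results_per_input, best_results_per_input_trt)):
    print("Image %d: %d boxes (FP32) vs %d boxes (TRTorch FP16)"
          % (idx, len(fp32_res[0]), len(trt_res[0])))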


## 6. Measuring Speedup

We can run the benchmark function again to see the speedup gained! Compare this result with the earlier benchmark for the same batch size without TRTorch/TensorRT.

[18]:
batch_size = 128

# Recompile with the batch size we use for evaluating performance
trt_model = trtorch.compile(traced_model, {
    "inputs": [trtorch.Input((batch_size, 3, 300, 300))],
    "enabled_precisions": {torch.float, torch.half}, # Run with FP16
    "workspace_size": 1 << 20
})

benchmark(trt_model, input_shape=(batch_size, 3, 300, 300), nruns=100, dtype="fp16")
Warm up ...
Start timing ...
Iteration 10/100, avg batch time 72.90 ms
Iteration 20/100, avg batch time 72.95 ms
Iteration 30/100, avg batch time 72.92 ms
Iteration 40/100, avg batch time 72.94 ms
Iteration 50/100, avg batch time 72.99 ms
Iteration 60/100, avg batch time 73.01 ms
Iteration 70/100, avg batch time 73.04 ms
Iteration 80/100, avg batch time 73.04 ms
Iteration 90/100, avg batch time 73.04 ms
Iteration 100/100, avg batch time 73.06 ms
Input shape: torch.Size([128, 3, 300, 300])
Output location prediction size: torch.Size([128, 4, 8732])
Output label prediction size: torch.Size([128, 81, 8732])
Average batch time: 73.06 ms
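The speedup is simply the ratio of the two measured average batch times:

[ ]:
# Measured above: 382.62 ms (TorchScript FP32) vs 73.06 ms (TRTorch FP16)
print("Speedup: %.2fx" % (382.62 / 73.06))  # ~5.24x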

## 7. Conclusion

In this notebook, we have walked through the complete process of compiling a TorchScript SSD300 model with TRTorch and tested the performance impact of the optimization. With the TRTorch-compiled model, we obtain a significant inference speedup (roughly 5x for this batch size, from ~383 ms to ~73 ms per batch) without any noticeable drop in detection quality!

Details

For detailed information on model input and output, training recipes, inference and performance, visit the DeepLearningExamples repository on GitHub (https://github.com/NVIDIA/DeepLearningExamples) and/or NGC.

References

  * Liu, Wei, et al. "SSD: Single Shot MultiBox Detector." ECCV 2016.

  * Huang, Jonathan, et al. "Speed/accuracy trade-offs for modern convolutional object detectors." CVPR 2017.