Introduction

Overview

When training object detection models, the loss usually includes a centerness term, which measures how close the center of the predicted bounding box is to the center of the ground-truth box. GaussianFocalLoss is used to compute this centerness term, as the following formula shows:

\[\begin{split}\mathrm{loss}(pred, target) = -\left( 1 - p_t \right)^\alpha \cdot \log(p_t) \cdot g_t, \quad \text{where} \quad \left\{ \begin{aligned} p_t & = pred, \quad g_t = 1, \quad \text{when } target = 1 \\ p_t & = 1 - pred, \quad g_t = (1 - target)^\gamma, \quad \text{when } target \neq 1 \end{aligned} \right.\end{split}\]

Both the pred and target tensors have shape (num_heatmaps, height, width). The values of pred lie in the open interval (0, 1), and the values of target lie in the closed interval [0, 1]. The target tensor is a Gaussian heatmap drawn from the ground-truth bounding boxes using a 2D Gaussian kernel.
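To make the case distinction in the formula concrete, the following minimal sketch evaluates the loss for a single heatmap location in plain Python. The function name and the default values of alpha and gamma are illustrative assumptions, not taken from this package.

```python
import math


def gaussian_focal_loss(pred, target, alpha=2.0, gamma=4.0):
    """Loss for one heatmap location, following the formula above.

    pred must lie in the open interval (0, 1) and target in [0, 1].
    The defaults for alpha and gamma are assumed for illustration only.
    """
    if target == 1:
        # Positive location (exact box center): g_t = 1.
        p_t, g_t = pred, 1.0
    else:
        # Negative location: down-weighted the closer it is to a center.
        p_t, g_t = 1.0 - pred, (1.0 - target) ** gamma
    return -((1.0 - p_t) ** alpha) * math.log(p_t) * g_t
```

Applying this function to every element of pred and target and summing the results gives the unnormalized loss. Negative locations near a box center (target close to 1) contribute very little because g_t approaches 0 there.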

The existing implementation for drawing Gaussian heatmaps (e.g., mmdet.models.utils.gaussian_target) involves CPU operations and handles each bounding box of each heatmap sequentially, which is inefficient and leaves the GPU underutilized. In this repo, we implement a GPU kernel for drawing Gaussian heatmaps and expose it as a PyTorch GPU operator, which computes the target tensor from the centers and radii of the bounding boxes. The GPU kernel batches the operation and achieves a significant speedup over the existing implementation.
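For reference, the sequential per-box drawing that the GPU kernel replaces can be sketched in plain Python as follows. This is a minimal illustration, not the actual code of this package or of mmdet. The function name, the argument names, the use of integer centers and radii, and the choice sigma = (2 * radius + 1) / 6 (a convention often used for CenterNet-style targets) are all assumptions.

```python
import math


def draw_heatmaps_reference(centers, radii, heatmap_indices,
                            num_heatmaps, height, width):
    """Draw one Gaussian per box, one box at a time (hypothetical helper).

    centers: list of integer (x, y) box centers
    radii: list of integer Gaussian radii, one per box
    heatmap_indices: index of the heatmap each box is drawn into
    """
    heatmaps = [[[0.0] * width for _ in range(height)]
                for _ in range(num_heatmaps)]
    for (cx, cy), radius, idx in zip(centers, radii, heatmap_indices):
        sigma = (2 * radius + 1) / 6.0  # assumed convention
        # Only visit the pixels inside the box's radius, clipped to the image.
        for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
                dist_sq = (x - cx) ** 2 + (y - cy) ** 2
                value = math.exp(-dist_sq / (2.0 * sigma ** 2))
                # Overlapping Gaussians are merged by taking the maximum.
                heatmaps[idx][y][x] = max(heatmaps[idx][y][x], value)
    return heatmaps
```

The work per box is independent except for the final maximum, which is what makes the operation a good fit for a single batched GPU kernel.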

Benchmark

Python Wrapper

This package provides convenient Python wrappers for our GPU kernel. In this section, we benchmark the Python wrappers against a PyTorch implementation. The package provides two implementations for drawing the heatmaps:

draw_heatmap(): this implementation is designed for the concatenated input format.
draw_heatmap_batched(): this implementation is designed for the batched input format.
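The difference between the two input formats can be illustrated with a small conversion sketch. It assumes that the batched format stores a padded, fixed-length list of boxes per sample together with the number of valid boxes, and that the concatenated format stores all valid boxes in one flat list together with the index of the heatmap each box belongs to. Both layouts and the helper name are assumptions made for illustration; the actual argument layout of draw_heatmap() and draw_heatmap_batched() may differ.

```python
def batched_to_concatenated(padded_centers, num_boxes_per_sample):
    """Flatten a padded per-sample list of centers (hypothetical helper).

    padded_centers: one list of (x, y) centers per sample, padded to equal length
    num_boxes_per_sample: number of valid (non-padding) entries per sample
    Returns the valid centers as one flat list plus a sample index per box.
    """
    centers, sample_indices = [], []
    for sample_idx, (row, count) in enumerate(
            zip(padded_centers, num_boxes_per_sample)):
        centers.extend(row[:count])                  # drop padding entries
        sample_indices.extend([sample_idx] * count)  # remember the owner
    return centers, sample_indices
```

For example, two samples with two and one valid boxes respectively yield three concatenated boxes with sample indices [0, 0, 1].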

The benchmark results are shown in the following table.

Note

The heatmap shape is (batch_size, height, width), and the performance was measured on a single NVIDIA A100 GPU.

Implementation	Heatmap shape	Performance	Speedup
PyTorch	48x20x50	201.1 ms	—
draw_heatmap (concat)	48x20x50	0.0482 ms	4189.42x
draw_heatmap_batched	48x20x50	0.0366 ms	5494.32x

In addition, this package provides a classwise implementation, which draws one heatmap per class. This is directly supported only by draw_heatmap_batched().

The benchmark results for this implementation are shown in the following table.

Note

The heatmap shape is (batch_size, num_classes, height, width), and the performance was measured on a single NVIDIA A100 GPU.

Implementation	Heatmap shape	Performance	Speedup
PyTorch	48x20x20x50	245.1 ms	—
draw_heatmap_batched	48x20x20x50	0.059 ms	4154.24x

Note

Although only draw_heatmap_batched() supports the classwise variant directly, the same result can be achieved with draw_heatmap(). There, the index of the heatmap to draw into is set manually for each bounding box, so the indices can be chosen to address both different samples and different classes within a sample.
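One way to set up such indices is to flatten each (sample, class) pair into a single heatmap index, as in the sketch below. The helper name and the row-major layout (sample first, then class) are assumptions for illustration, not part of this package.

```python
def classwise_heatmap_indices(sample_indices, class_indices, num_classes):
    """Map each box's (sample, class) pair to a flat heatmap index.

    Hypothetical helper: assumes the batch_size * num_classes heatmaps are
    stored in row-major order, i.e. all classes of sample 0 come first.
    """
    return [s * num_classes + c
            for s, c in zip(sample_indices, class_indices)]
```

With this layout, a result of shape (batch_size * num_classes, height, width) can be reinterpreted as (batch_size, num_classes, height, width) without moving any data.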

C++ Benchmark Implementation

There is also a C++ benchmark for directly measuring the performance of the C++ implementation without the PyTorch wrapper. It can be built by running the following commands:

cd packages/draw_heatmap/benchmark_cpp
mkdir build
cd build
cmake ..
make

Then, the performance measurements can be obtained by running the following executables:

./benchmark_flattened
./benchmark_batched
./benchmark_batched_classwise

Installation

This package is installed as part of the accvlab package. Please refer to the Installation Guide for more details.

Important

This package has a runtime dependency on the accvlab.batching_helpers package. Please install it before installing this package.