Evaluation

This page summarizes the performance impact of batching the loss computation using accvlab.batching_helpers in a StreamPETR training setup.

Setup

Experiment Setup

For this evaluation, training is performed on the NuScenes mini dataset using the StreamPETR model. The adaptations to the original StreamPETR implementation are as follows:

StreamPETR – Optimization Overview

  • HungarianAssigner3D & HungarianAssigner2D: Original matchers operate on a per‑sample basis.

    • Cost matrix computation (pre‑requisite for actual matching) → Optimized (following the approach outlined in the Example)

    • Matching itself (SciPy implementation on the CPU) → Not optimized.

    • HungarianAssigner3D: nan_to_num() was a bottleneck; moved to the GPU → Changed in both reference & optimized (see the sketch after this overview).

  • StreamPETR Head

    • Loss computation is batched over samples → Optimized.

    • Loss computation is also batched over the decoder layers → Optimized (using multiple batch dimensions).

    • Can use a batched assigner → Optimized.

  • Focal Head

    • Loss computation is already batched over samples & camera images.

    • Can use a batched assigner → Optimized.

    • Added use of the custom Gaussian heatmap generator → Changed in both reference & optimized.
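
The nan_to_num() relocation mentioned in the overview can be illustrated with a minimal sketch. This is not the actual StreamPETR or accvlab code; the function name, cost-matrix shape, and clamp values are illustrative assumptions. The key point is that the cost matrix is sanitized with torch.nan_to_num() while it is still on the GPU, before being copied to the CPU for SciPy's Hungarian solver.

```python
import torch
from scipy.optimize import linear_sum_assignment


def hungarian_match(cost: torch.Tensor):
    """Match one sample's (num_queries, num_gts) cost matrix (sketch only).

    The cost matrix is sanitized on the GPU first; only the cleaned-up
    matrix is transferred to the CPU for SciPy's solver.
    """
    # Clamp values are illustrative assumptions.
    cost = torch.nan_to_num(cost, nan=1e5, posinf=1e5, neginf=-1e5)
    row_ind, col_ind = linear_sum_assignment(cost.detach().cpu().numpy())
    return row_ind, col_ind
```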

Note

Some of the changes/optimizations are applied in both the reference and the optimized implementation (indicated in the overview above). These changes are not specific to the batching optimization, so applying them in both ensures a fair comparison: the obtained speedup reflects only the effect of the batching optimizations.

See also

For a general overview on how to use the batching helpers to optimize the loss computation, please refer to the Example.
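
As a rough illustration of the batching idea described in the overview above (independent of the accvlab.batching_helpers API covered in the Example), the sketch below compares a per-layer/per-sample loss loop with a single loss call that treats the decoder-layer and sample dimensions as additional batch dimensions. The tensor shapes and the plain L1 loss are illustrative assumptions, not the actual StreamPETR loss.

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"

# Illustrative shapes (assumptions): L decoder layers, B samples, Q queries, 10 box parameters.
L, B, Q = 6, 8, 900
pred = torch.randn(L, B, Q, 10, device=device)
target = torch.randn(L, B, Q, 10, device=device)
weight = torch.rand(L, B, Q, 1, device=device)  # per-query weights, e.g. from the matching

# Reference-style loss: loop over decoder layers and samples, one small kernel launch each.
loss_loop = sum(
    F.l1_loss(pred[l, b] * weight[l, b], target[l, b] * weight[l, b], reduction="sum")
    for l in range(L)
    for b in range(B)
)

# Batched loss: the layer and sample dimensions act as batch dimensions,
# so the whole loss is computed with a single kernel launch.
loss_batched = F.l1_loss(pred * weight, target * weight, reduction="sum")

# The two results differ only by floating-point accumulation order.
print(loss_loop.item(), loss_batched.item())
```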

The evaluation is performed with a batch size of 8 (in contrast to the original configuration of 2) to obtain a realistic setup and to highlight the performance impact of batching in this case.
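
Assuming StreamPETR's mmdetection3d-style configuration, where the per-GPU batch size is typically set via samples_per_gpu, the change could look as follows; the exact config keys are an assumption and may differ.

```python
# Hypothetical excerpt of an mmdetection3d-style StreamPETR config;
# only the per-GPU batch size is changed from 2 to 8.
data = dict(
    samples_per_gpu=8,   # original configuration: 2
    workers_per_gpu=4,
    # train/val/test dataset definitions unchanged ...
)
```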

Note

We are planning to add a demo for the Batching Helpers package in the future, including the implementation of the experiments performed in this evaluation.

Hardware Setup

System Configuration

GPU: NVIDIA A100-SXM4-80GB
CPU: 2x AMD EPYC 7742 64-core Processors

Results

The following breakdown shows the runtime of a training iteration. Note that the Optimization step and the Remaining entries do not contain any optimized code. The Optimization step does technically differ between the two implementations (the different loss implementation leads to different steps in the backward propagation), but the resulting runtime differences are negligible. The Remaining entries give the runtime of the parts of the implementation that are not measured directly: for the Forward pass, this mostly corresponds to the forward pass through the network itself (as opposed to the loss computation); for the Training Iteration, it covers additional overhead such as obtaining/waiting for the next batch of data.

Runtime Baseline → Runtime Optimized [ms] (Speedup ×-fold)

  • Training Iteration: 760 → 615 (× 1.24)

    • Forward (including Loss): 363 → 221 (× 1.64)

      • Remaining: 180 → 180

      • Loss: 183 → 41 (× 4.46)

        • StreamPETR Head: 82 → 25 (× 3.28)

        • Focal Head: 100 → 16 (× 6.25)

    • Optimization: 318 → 314

    • Remaining: 79 → 80

The batching of the loss computation leads to a speedup of × 4.46 for the loss computation itself, with different speedups achieved for the individual loss heads. This translates into an overall speedup of × 1.64 for the forward pass and × 1.24 for the full training iteration. Note that the expected speedup strongly depends on the batch size used.
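
For reference, the reported speedups can be reproduced from the runtimes in the breakdown above; the forward-pass number also illustrates that the overall speedup is bounded by the parts that contain no optimized code.

```python
# Consistency check of the reported speedups (runtimes in ms, taken from the breakdown above).
loss_speedup = 183 / 41              # ≈ 4.46 (loss computation)
forward_speedup = 363 / (180 + 41)   # ≈ 1.64 (unchanged network forward + optimized loss)
iteration_speedup = 760 / 615        # ≈ 1.24 (full training iteration)
print(round(loss_speedup, 2), round(forward_speedup, 2), round(iteration_speedup, 2))
```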