Evaluation
This page summarizes the performance impact of batching the loss computation using
accvlab.batching_helpers in a StreamPETR training setup.
Setup
Experiment Setup
For this evaluation, training is performed on the NuScenes mini dataset using the StreamPETR model. The adaptations to the original StreamPETR implementation are as follows:
StreamPETR – Optimization Overview

- HungarianAssigner3D & HungarianAssigner2D: The original matchers operate on a per-sample basis.
  - Cost matrix computation (prerequisite for the actual matching) → Optimized (following the approach outlined in the Example; see the sketch after this overview).
  - Matching itself (SciPy implementation on the CPU).
  - HungarianAssigner3D: nan_to_num() was a bottleneck; moved to the GPU → Changed in both reference & optimized.
- StreamPETR Head
  - Loss computation is batched over samples → Optimized.
  - Loss computation is also batched over the decoder layers → Optimized (using multiple batch dimensions; illustrated in the loss-batching sketch further below).
  - Can use a batched assigner → Optimized.
- Focal Head
  - Loss computation is already batched over samples & camera images.
  - Can use a batched assigner → Optimized.
  - Added use of the custom Gaussian heatmap generator → Changed in both reference & optimized.
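The sketch below illustrates the idea behind the assigner optimization: the cost matrices for all samples are computed in a single batched GPU operation, while the Hungarian matching itself remains a per-sample SciPy call on the CPU. This is a minimal plain-PyTorch sketch, not the actual HungarianAssigner3D/HungarianAssigner2D code and not the accvlab.batching_helpers API; the `batched_l1_cost_matching` name, the plain L1 cost, and the assumption of an equal (padded) number of ground-truth boxes per sample are simplifications for illustration only.

```python
import torch
from scipy.optimize import linear_sum_assignment


def batched_l1_cost_matching(pred_boxes, gt_boxes):
    """Hypothetical sketch: batched cost computation, per-sample matching.

    pred_boxes: (B, num_queries, D) predictions for all samples (on the GPU).
    gt_boxes:   (B, num_gts, D) ground-truth boxes, padded to a common num_gts.
    Returns one (pred_indices, gt_indices) pair per sample.
    """
    # Cost matrices for all samples in one batched GPU operation
    # (here a plain L1 cost; the real matchers combine several cost terms).
    cost = torch.cdist(pred_boxes, gt_boxes, p=1)  # (B, num_queries, num_gts)

    # Clean up NaNs/Infs on the GPU before transferring the cost matrices,
    # instead of calling nan_to_num inside a per-sample loop on the CPU.
    cost = torch.nan_to_num(cost, nan=1e8, posinf=1e8, neginf=-1e8)

    # The Hungarian matching itself stays a per-sample SciPy call on the CPU.
    cost_cpu = cost.detach().cpu().numpy()
    return [linear_sum_assignment(c) for c in cost_cpu]
```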
Note
Some of the changes/optimizations are applied in both the reference and the optimized implementation (indicated in the overview above). These changes are not specific to the batching optimization and are therefore applicable to both implementations. Applying them in both ensures a fair comparison, with the obtained speedup reflecting the effect of the batching optimizations.
See also
For a general overview on how to use the batching helpers to optimize the loss computation, please refer to the Example.
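As a conceptual illustration of this loss batching (over samples and, for the StreamPETR head, additionally over the decoder layers), the plain-PyTorch sketch below compares a per-layer/per-sample loop with a single loss call over multiple batch dimensions. The tensor shapes and the simple L1 regression loss are assumptions made for this example; the actual implementation uses the accvlab.batching_helpers utilities as shown in the Example, and the real losses additionally involve per-sample matching results and normalization.

```python
import torch
import torch.nn.functional as F

num_layers, batch_size, num_queries, box_dim = 6, 8, 900, 10
device = "cuda" if torch.cuda.is_available() else "cpu"

# Decoder outputs for all layers and samples: (num_layers, B, num_queries, box_dim).
pred = torch.randn(num_layers, batch_size, num_queries, box_dim, device=device)
# Matched regression targets with the same shape (as produced by the assigner).
target = torch.randn_like(pred)

# Looped variant: one small loss computation per decoder layer and per sample.
loss_looped = sum(
    F.l1_loss(pred[l, b], target[l, b], reduction="sum")
    for l in range(num_layers)
    for b in range(batch_size)
)

# Batched variant: treat (num_layers, batch_size) as batch dimensions and
# compute the loss for all layers and samples in a single kernel launch.
loss_batched = F.l1_loss(pred, target, reduction="sum")

assert torch.allclose(loss_looped, loss_batched, rtol=1e-4)
```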
The evaluation is performed with a batch size of 8 (in contrast to the original configuration of 2) to obtain a realistic setup and to highlight the performance impact of the batching in this case.
Note
We are planning to add a demo for the Batching Helpers package in the future, including the implementation of the experiments performed in this evaluation.
Hardware Setup
| GPU | CPU |
|---|---|
| NVIDIA A100-SXM4-80GB | 2x AMD EPYC 7742 64-core Processors |
Results
The following table shows the runtime breakdown of a training iteration.
Note that the rows without a listed speedup (marked "–") do not contain any optimized code. While, in theory, the Optimization
step contains changes (the different implementation of the loss leads to different steps
in the backward propagation), the resulting runtime differences in this step are negligible.
The Remaining entries in the table show the runtime of the parts of the implementation
whose runtime is not measured directly. For the Forward pass, this mostly corresponds to the forward
pass through the network itself (as opposed to the loss computation). For the Training Iteration, this
may correspond to additional overhead such as obtaining/waiting for the next batch of data.
| Step | Runtime Baseline → Optimized [ms] | Speedup |
|---|---|---|
| Training Iteration | 760 → 615 | × 1.24 |
| &nbsp;&nbsp;Forward (including Loss) | 363 → 221 | × 1.64 |
| &nbsp;&nbsp;&nbsp;&nbsp;Remaining | 180 → 180 | – |
| &nbsp;&nbsp;&nbsp;&nbsp;Loss | 183 → 41 | × 4.46 |
| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;StreamPETR Head | 82 → 25 | × 3.28 |
| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Focal Head | 100 → 16 | × 6.25 |
| &nbsp;&nbsp;Optimization | 318 → 314 | – |
| &nbsp;&nbsp;Remaining | 79 → 80 | – |
The batching of the loss computation leads to a speedup of × 4.46 for the loss computation itself, with different speedups achieved for the different loss heads. The loss optimization translates into an overall speedup of × 1.64 for the forward pass and × 1.24 for the full training iteration. Note that the expected speedup strongly depends on the batch size used.
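Obtaining the breakdown above requires timing individual parts of the iteration on the GPU. One common way to obtain such per-step timings is to use CUDA events, which correctly account for asynchronous kernel execution; the sketch below (with the hypothetical helper name time_gpu_step) only illustrates this approach and is not the measurement code used for this evaluation.

```python
import torch


def time_gpu_step(fn, *args, warmup=10, iters=50):
    """Return the average GPU runtime of fn(*args) in milliseconds, measured with CUDA events."""
    for _ in range(warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()  # wait until all recorded work has finished
    return start.elapsed_time(end) / iters
```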