Evaluation
This page summarizes the performance impact of batching the loss computation using
accvlab.batching_helpers in a StreamPETR training setup.
Setup
Experiment Setup
For this evaluation, training is performed on the NuScenes mini dataset using the StreamPETR model. The adaptations to the original StreamPETR implementation are as follows:
StreamPETR – Optimization Overview

- HungarianAssigner3D & HungarianAssigner2D: The original matchers operate on a per-sample basis.
  - Cost matrix computation (prerequisite for the actual matching) → Optimized (following the approach outlined in the Example; see the sketch after this overview).
  - Matching itself (SciPy implementation on the CPU).
  - HungarianAssigner3D: nan_to_num() was a bottleneck; moved to the GPU → Changed in both reference & optimized.
- StreamPETR Head
  - Loss computation is batched over samples → Optimized.
  - Loss computation is also batched over the decoder layers → Optimized (using multiple batch dimensions; illustrated in the loss-batching sketch further below).
  - Can use a batched assigner → Optimized.
- Focal Head
  - Loss computation is already batched over samples & camera images.
  - Can use a batched assigner → Optimized.
  - Added use of the custom Gaussian heatmap generator → Changed in both reference & optimized.
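The sketch below illustrates the idea behind the assigner optimization: the cost matrices for all samples are computed in a single batched GPU operation, while the Hungarian matching itself remains a per-sample SciPy call on the CPU. This is a minimal plain-PyTorch sketch, not the actual HungarianAssigner3D/HungarianAssigner2D code and not the accvlab.batching_helpers API; the `batched_l1_cost_matching` name, the plain L1 cost, and the assumption of an equal (padded) number of ground-truth boxes per sample are simplifications for illustration only.

```python
import torch
from scipy.optimize import linear_sum_assignment


def batched_l1_cost_matching(pred_boxes, gt_boxes):
    """Hypothetical sketch: batched cost computation, per-sample matching.

    pred_boxes: (B, num_queries, D) predictions for all samples (on the GPU).
    gt_boxes:   (B, num_gts, D) ground-truth boxes, padded to a common num_gts.
    Returns one (pred_indices, gt_indices) pair per sample.
    """
    # Cost matrices for all samples in one batched GPU operation
    # (here a plain L1 cost; the real matchers combine several cost terms).
    cost = torch.cdist(pred_boxes, gt_boxes, p=1)  # (B, num_queries, num_gts)

    # Clean up NaNs/Infs on the GPU before transferring the cost matrices,
    # instead of calling nan_to_num inside a per-sample loop on the CPU.
    cost = torch.nan_to_num(cost, nan=1e8, posinf=1e8, neginf=-1e8)

    # The Hungarian matching itself stays a per-sample SciPy call on the CPU.
    cost_cpu = cost.detach().cpu().numpy()
    return [linear_sum_assignment(c) for c in cost_cpu]
```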
Note
Some of the changes/optimizations are applied in both the reference and the optimized implementation (indicated in the overview above). These changes are not specific to the batching optimization and are therefore applicable to both implementations. Applying them in both ensures a fair comparison, with the obtained speedup reflecting the effect of the batching optimizations.
See also
For a general overview on how to use the batching helpers to optimize the loss computation, please refer to the Example.
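As a conceptual illustration of this loss batching (over samples and, for the StreamPETR head, additionally over the decoder layers), the plain-PyTorch sketch below compares a per-layer/per-sample loop with a single loss call over multiple batch dimensions. The tensor shapes and the simple L1 regression loss are assumptions made for this example; the actual implementation uses the accvlab.batching_helpers utilities as shown in the Example, and the real losses additionally involve per-sample matching results and normalization.

```python
import torch
import torch.nn.functional as F

num_layers, batch_size, num_queries, box_dim = 6, 8, 900, 10
device = "cuda" if torch.cuda.is_available() else "cpu"

# Decoder outputs for all layers and samples: (num_layers, B, num_queries, box_dim).
pred = torch.randn(num_layers, batch_size, num_queries, box_dim, device=device)
# Matched regression targets with the same shape (as produced by the assigner).
target = torch.randn_like(pred)

# Looped variant: one small loss computation per decoder layer and per sample.
loss_looped = sum(
    F.l1_loss(pred[l, b], target[l, b], reduction="sum")
    for l in range(num_layers)
    for b in range(batch_size)
)

# Batched variant: treat (num_layers, batch_size) as batch dimensions and
# compute the loss for all layers and samples in a single kernel launch.
loss_batched = F.l1_loss(pred, target, reduction="sum")

assert torch.allclose(loss_looped, loss_batched, rtol=1e-4)
```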
The evaluation is performed with a batch size of 8 (in contrast to the original configuration of 2) to obtain a realistic setup and to highlight the performance impact of the batching in this case.
Note
We are planning to add a demo for the Batching Helpers package in the future, including the implementation of the experiments performed in this evaluation.
Hardware Setup
| GPU | CPU |
|---|---|
| NVIDIA A100-SXM4-80GB | 2x AMD EPYC 7742 64-core Processors |
Results
The following table shows the runtime breakdown of a training iteration.
Note that the rows without a listed speedup (marked "–") do not contain any optimized code. While, in theory, the Optimization
step contains changes (the different implementation of the loss leads to different steps
in the backward propagation), the resulting runtime differences in this step are negligible.
The Remaining entries in the table show the runtime of the parts of the implementation
whose runtime is not measured directly. For the Forward pass, this mostly corresponds to the forward
pass through the network itself (as opposed to the loss computation). For the Training Iteration, this
may correspond to additional overhead such as obtaining/waiting for the next batch of data.
| Step | Runtime Baseline → Optimized [ms] | Speedup |
|---|---|---|
| Training Iteration | 760 → 615 | × 1.24 |
| &nbsp;&nbsp;Forward (including Loss) | 363 → 221 | × 1.64 |
| &nbsp;&nbsp;&nbsp;&nbsp;Remaining | 180 → 180 | – |
| &nbsp;&nbsp;&nbsp;&nbsp;Loss | 183 → 41 | × 4.46 |
| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;StreamPETR Head | 82 → 25 | × 3.28 |
| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Focal Head | 100 → 16 | × 6.25 |
| &nbsp;&nbsp;Optimization | 318 → 314 | – |
| &nbsp;&nbsp;Remaining | 79 → 80 | – |
The batching of the loss computation leads to a speedup of × 4.46 for the loss computation itself, with different speedups achieved for the different loss heads. The loss optimization translates into an overall speedup of × 1.64 for the forward pass and × 1.24 for the full training iteration. Note that the expected speedup strongly depends on the batch size used.
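Obtaining the breakdown above requires timing individual parts of the iteration on the GPU. One common way to obtain such per-step timings is to use CUDA events, which correctly account for asynchronous kernel execution; the sketch below (with the hypothetical helper name time_gpu_step) only illustrates this approach and is not the measurement code used for this evaluation.

```python
import torch


def time_gpu_step(fn, *args, warmup=10, iters=50):
    """Return the average GPU runtime of fn(*args) in milliseconds, measured with CUDA events."""
    for _ in range(warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()  # wait until all recorded work has finished
    return start.elapsed_time(end) / iters
```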