Evaluation
==========

This page summarizes the performance impact of batching the loss computation using
``accvlab.batching_helpers`` in a StreamPETR training setup.

Setup
-----

Experiment Setup
~~~~~~~~~~~~~~~~

For this evaluation, the training is performed on the nuScenes mini dataset, using the
StreamPETR model. The adaptations to the `original StreamPETR implementation `_ are as
follows:

.. rubric:: StreamPETR – Optimization Overview

- **HungarianAssigner3D & HungarianAssigner2D**: The original matchers operate on a
  per-sample basis.

  - Cost matrix computation (prerequisite for the actual matching) → **Optimized**
    (following the approach outlined in the :doc:`example`; see the first sketch after
    this list).
  - Matching itself (SciPy implementation on the CPU).
  - ``HungarianAssigner3D``: ``nan_to_num()`` was a bottleneck; moved to the GPU →
    Changed in both reference & optimized.

- **StreamPETR Head**

  - Loss computation is batched over samples → **Optimized**.
  - Loss computation is also batched over the decoder layers → **Optimized** (using
    multiple batch dimensions; see the second sketch below).
  - Can use a batched assigner → **Optimized**.

- **Focal Head**

  - Loss computation is already batched over samples & camera images.
  - Can use a batched assigner → **Optimized**.
  - Added use of the custom Gaussian heatmap generator → Changed in both reference &
    optimized (a generic sketch of Gaussian heatmap generation is given at the end of
    this subsection).

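To make the cost-matrix batching described above concrete, the following is a minimal
sketch of the idea. It is not the StreamPETR or ``accvlab`` code: the function and
variable names are chosen for this sketch, and a plain L1 cost stands in for the
combination of cost terms used by the actual assigners. The key point is that the cost
matrices for all samples are computed in a single batched GPU operation, while the
Hungarian matching itself remains a per-sample SciPy call on the CPU.

.. code-block:: python

   import torch
   from scipy.optimize import linear_sum_assignment


   def batched_assignment_sketch(pred_boxes, gt_boxes):
       """Hypothetical batched assigner sketch.

       pred_boxes: [B, num_queries, box_dim] predictions (on the GPU).
       gt_boxes:   [B, num_gts, box_dim] padded ground-truth boxes (on the GPU).
       """
       # 1) Cost matrices for *all* samples in one batched operation
       #    (the real assigners add classification and further cost terms
       #    in the same batched fashion).
       cost = torch.cdist(pred_boxes, gt_boxes, p=1)  # [B, num_queries, num_gts]

       # Sanitizing the cost on the GPU instead of via NumPy on the CPU
       # corresponds to the nan_to_num() change applied in both the
       # reference and the optimized implementation.
       cost = torch.nan_to_num(cost, nan=1e4, posinf=1e4, neginf=-1e4)

       # 2) The matching itself stays a per-sample SciPy call on the CPU.
       cost_cpu = cost.detach().cpu().numpy()
       return [linear_sum_assignment(sample_cost) for sample_cost in cost_cpu]


   device = "cuda" if torch.cuda.is_available() else "cpu"
   preds = torch.rand(8, 900, 10, device=device)
   gts = torch.rand(8, 32, 10, device=device)
   matches = batched_assignment_sketch(preds, gts)  # list of (row_ind, col_ind)

In a real setup, padded ground-truth entries would additionally have to be masked out
of the cost matrices before (or after) the matching; this is omitted here for brevity.
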
.. note::

   Some of the changes/optimizations are applied in both the reference and the
   optimized implementation (indicated in the overview above). These changes are not
   specific to the batching optimization and are therefore applicable to both
   implementations. Applying them in both ensures a fair comparison, with the obtained
   speedup reflecting the effect of the batching optimizations.

.. seealso::

   For a general overview of how to use the batching helpers to optimize the loss
   computation, please refer to the :doc:`example`.

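Independently of the actual ``accvlab.batching_helpers`` API (refer to the
:doc:`example` for that), the second sketch below illustrates the idea of batching the
loss computation over samples and decoder layers at once, i.e. of using multiple batch
dimensions. All shapes are hypothetical, and a plain cross-entropy loss stands in for
the losses of the StreamPETR head.

.. code-block:: python

   import torch
   import torch.nn.functional as F

   num_layers, batch_size, num_queries, num_classes = 6, 8, 900, 10

   # DETR-style head output: one set of classification logits per decoder
   # layer and per sample, plus the matched class targets from the assigner.
   logits = torch.randn(num_layers, batch_size, num_queries, num_classes)
   targets = torch.randint(0, num_classes, (num_layers, batch_size, num_queries))

   # Reference-style computation: a Python loop over decoder layers and
   # samples, launching many small kernels.
   loss_ref = sum(
       F.cross_entropy(logits[l, b], targets[l, b])
       for l in range(num_layers)
       for b in range(batch_size)
   ) / (num_layers * batch_size)

   # Batched computation: treat (layer, sample) as combined batch dimensions
   # and issue a single large kernel instead.
   loss_batched = F.cross_entropy(
       logits.flatten(0, 2),  # [L * B * Q, C]
       targets.flatten(),     # [L * B * Q]
   )

   assert torch.allclose(loss_ref, loss_batched, atol=1e-5)

In the actual head, the number of ground-truth objects (and therefore the set of
matched targets) varies per sample, so the batched loss additionally requires padding
and masking; the fixed-size targets above side-step this for the sake of a compact
illustration.
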
The evaluation is performed with a **batch size of 8** (in contrast to the original
configuration of 2) to obtain a realistic setup and to highlight the performance impact
of the batching in this case.

.. note::

   We are planning to add a demo for the Batching Helpers package in the future,
   including the implementation of the experiments performed in this evaluation.

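Finally, the Focal Head adaptation above mentions a custom Gaussian heatmap generator.
The generator used in this evaluation is not reproduced here; the sketch below only
shows, generically, what Gaussian heatmap target generation for a center-based head
typically looks like, with all names, shapes, and parameters chosen for illustration.

.. code-block:: python

   import torch


   def draw_gaussian_heatmap(height, width, centers, sigma=2.0):
       """Render one heatmap channel with a 2D Gaussian around each center.

       centers: [N, 2] tensor of (x, y) center coordinates in feature-map pixels.
       """
       ys = torch.arange(height, dtype=torch.float32)
       xs = torch.arange(width, dtype=torch.float32)
       grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")  # each [H, W]

       # Squared distance of every pixel to every center: [N, H, W].
       d2 = (
           (grid_x[None] - centers[:, 0, None, None]) ** 2
           + (grid_y[None] - centers[:, 1, None, None]) ** 2
       )
       gaussians = torch.exp(-d2 / (2.0 * sigma ** 2))

       # Overlapping objects keep the maximum response per pixel.
       return gaussians.amax(dim=0)


   # Example: two object centers on a 64 x 176 feature map.
   heatmap = draw_gaussian_heatmap(64, 176, torch.tensor([[30.0, 20.0], [100.0, 40.0]]))
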
Hardware Setup
~~~~~~~~~~~~~~

.. list-table:: System Configuration
   :header-rows: 1

   * - GPU
     - CPU
   * - NVIDIA A100-SXM4-80GB
     - 2x AMD EPYC 7742 64-core Processors

Results
-------

.. only:: html

   In the following table, the runtime breakdown of the training iteration is shown.
   Note that the grayed-out cells do not contain any optimized code. While the
   ``Optimization`` step theoretically contains changes (due to the different
   implementation of the loss, leading to different steps in the backward propagation),
   the resulting runtime differences in this step are negligible.

   The ``Remaining`` columns in the table show the runtime of the parts of the
   implementation whose runtime is not measured directly. For the ``Forward`` pass,
   this mostly corresponds to the forward pass through the network itself (as opposed
   to the loss computation). For the ``Training Iteration``, this may correspond to
   additional overhead, such as obtaining/waiting for the next batch of data.

   .. raw:: html

      <table border="1" style="border-collapse: collapse; text-align: center;">
        <tr>
          <th colspan="5">Runtime Baseline → Runtime Optimized [ms] (Speedup ×-fold)</th>
        </tr>
        <tr>
          <th colspan="5">Training Iteration</th>
        </tr>
        <tr>
          <td colspan="5">760 → 615 (× 1.24)</td>
        </tr>
        <tr>
          <th colspan="3">Forward (including Loss)</th>
          <th>Optimization</th>
          <th>Remaining</th>
        </tr>
        <tr>
          <td colspan="3">363 → 221 (× 1.64)</td>
          <td rowspan="5" style="background-color: #d3d3d3;">318 → 314</td>
          <td rowspan="5" style="background-color: #d3d3d3;">79 → 80</td>
        </tr>
        <tr>
          <th>Remaining</th>
          <th colspan="2">Loss</th>
        </tr>
        <tr>
          <td rowspan="3" style="background-color: #d3d3d3;">180 → 180</td>
          <td colspan="2">183 → 41 (× 4.46)</td>
        </tr>
        <tr>
          <th>StreamPETR Head</th>
          <th>Focal Head</th>
        </tr>
        <tr>
          <td>82 → 25 (× 3.28)</td>
          <td>100 → 16 (× 6.25)</td>
        </tr>
      </table>

.. only:: not html

   In the following table, the runtime breakdown of the training iteration is shown.
   The individual lines of the table correspond to different parts of the
   implementation, with an increasing level of detail for the lower entries; each lower
   entry is part of the implementation of an upper entry, indicated as
   ``[within <...>]``.

   .. note::

      A structured table showing the runtime breakdown of the training iteration
      visually is available in the HTML version of this document.

   .. list-table:: Runtime Summary (Baseline → Optimized)
      :header-rows: 1

      * - Component
        - Baseline [ms]
        - Optimized [ms]
        - Speedup
      * - Training Iteration
        - 760
        - 615
        - × 1.24
      * - Forward (including Loss) [within Training Iteration]
        - 363
        - 221
        - × 1.64
      * - Optimization [within Training Iteration]
        - 318
        - 314
        - —
      * - Remaining [within Training Iteration]
        - 79
        - 80
        - —
      * - Remaining [within Forward]
        - 180
        - 180
        - —
      * - Loss [within Forward]
        - 183
        - 41
        - × 4.46
      * - StreamPETR Head [within Loss]
        - 82
        - 25
        - × 3.28
      * - Focal Head [within Loss]
        - 100
        - 16
        - × 6.25

The batching of the loss computation leads to a speedup of **× 4.46** for the loss
computation itself, with different speedups achieved for the different types of loss.
The loss optimization translates into an overall speedup of **× 1.64** for the forward
pass and **× 1.24** for the training iteration. Note that the expected speedup strongly
depends on the batch size used.