Evaluation

This page summarizes the performance impact of using the DALI Pipeline Framework as a replacement for a conventional PyTorch Dataset + DataLoader when training StreamPETR on NuScenes mini.

Experiment Setup

Configuration

Training is performed on the NuScenes mini dataset using the StreamPETR model. To evaluate a realistic setup with high pre‑processing overhead, we increase both the image resolution and the batch size compared to the original configuration (a sketch of the corresponding config overrides follows the list):

  • Image size: 1024 × 372

  • Batch size: 8 per GPU
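
For reference, these overrides might look as follows in an mmdetection‑style StreamPETR config. The key names (ida_aug_conf, final_dim, samples_per_gpu) are assumptions based on typical StreamPETR configs and are not taken from this evaluation:

```python
# Hypothetical mmdetection-style config overrides matching the setup above.
# Key names follow common StreamPETR/mmdet conventions and are assumptions.
ida_aug_conf = dict(
    final_dim=(372, 1024),  # target (H, W) after image resizing
)
data = dict(
    samples_per_gpu=8,  # batch size per GPU
)
```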

Data

  • NuScenes mini

  • For multi‑GPU training, each GPU reads the whole dataset instead of a shard, since NuScenes mini is too small to shard meaningfully (see the reader sketch after this list)
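
How the sharding is bypassed depends on the loader; with DALI's standard file reader it could look like the sketch below. The path and reader choice are illustrative, and the real StreamPETR loading involves multi‑camera samples and annotations rather than a plain file reader:

```python
import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def

@pipeline_def
def reader_pipeline(file_root, shard_id, num_shards):
    # With real sharding, each rank would pass shard_id=rank and
    # num_shards=world_size and only see 1/num_shards of the data.
    jpegs, labels = fn.readers.file(
        file_root=file_root,
        shard_id=shard_id,
        num_shards=num_shards,
        random_shuffle=True,
        name="Reader",
    )
    return jpegs, labels

# Setup used here: every rank reads the full dataset, because
# NuScenes mini is too small to shard meaningfully.
pipe = reader_pipeline(
    file_root="/data/nuscenes",  # illustrative path
    shard_id=0, num_shards=1,
    batch_size=8, num_threads=4, device_id=0,
)
```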

Adaptation to DALI

  • Replace the PyTorch DataLoader with a DALI pipeline

  • Code changes are limited to the DALI pipeline definition, which takes the place of the PyTorch Dataset and DataLoader (see the sketch after this list)

  • The training loop implementation remains unchanged
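
To illustrate the scale of the change, below is a minimal sketch using DALI's public Python API. This is not the code of the DALI Pipeline Framework package; operator choices, normalization constants, and paths are assumptions. The point is that only the loader construction changes, while the loop that consumes batches stays the same:

```python
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali import pipeline_def
from nvidia.dali.plugin.pytorch import DALIGenericIterator

@pipeline_def(batch_size=8, num_threads=4, device_id=0)
def train_pipeline(file_root):
    jpegs, labels = fn.readers.file(file_root=file_root, name="Reader")
    images = fn.decoders.image(jpegs, device="mixed")        # JPEG decode on GPU
    images = fn.resize(images, resize_x=1024, resize_y=372)  # target resolution
    images = fn.crop_mirror_normalize(                       # HWC uint8 -> CHW float
        images,
        dtype=types.FLOAT,
        output_layout="CHW",
        mean=[123.675, 116.28, 103.53],  # illustrative ImageNet statistics
        std=[58.395, 57.12, 57.375],
    )
    return images, labels

pipe = train_pipeline(file_root="/data/nuscenes")  # illustrative path
loader = DALIGenericIterator([pipe], ["images", "labels"], reader_name="Reader")

# The training loop is untouched; only the loader changed.
for batch in loader:
    images = batch[0]["images"]  # CUDA tensor, ready for the forward pass
    labels = batch[0]["labels"]
    # loss = model(images, labels); loss.backward(); optimizer.step(); ...
```

Since this pipeline decodes and normalizes on the GPU, the images arrive as CUDA tensors and no separate host‑to‑device copy is needed in the loop.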

Note

We are planning to add a demo for the DALI Pipeline Framework package in the future, including the implementation of the experiments performed in this evaluation.

Hardware Setup

Test Device

  • GPU: 8 × NVIDIA A100‑SXM4‑80GB

  • CPU: 2 × AMD EPYC 7742 64‑Core Processors

Results & Discussion

Results

The numbers report the average runtime per batch after the warm‑up phase; a sketch of how such timings are typically measured follows the table.

Runtime and CPU Usage

Method                          Runtime [ms]   CPU usage [%]   Runtime [ms]   CPU usage [%]
                                (2-GPU)        (2-GPU)         (8-GPU)        (8-GPU)
-------------------------------------------------------------------------------------------
Reference (PyTorch DataLoader)  935            3.3             1110           12.3
DALI pipeline                   829            1.5             868            5.4
Speedup / Savings               × 1.13         55%             × 1.28         56%
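
For context, per‑batch timings of this kind are usually collected along the following lines. This is a generic sketch, not the exact benchmarking code behind the table; step_fn and the iteration counts are illustrative:

```python
import time
import torch

def average_batch_time_ms(loader, step_fn, warmup_iters=20, num_iters=100):
    """Average per-batch runtime in ms, excluding a warm-up phase."""
    it = iter(loader)
    # Warm-up: one-time costs (allocator growth, cuDNN autotuning,
    # pipeline prefetching) should not distort the measurement.
    for _ in range(warmup_iters):
        step_fn(next(it))
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(num_iters):
        step_fn(next(it))
    torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    return (time.perf_counter() - start) / num_iters * 1000.0
```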

Discussion

The results show that the DALI pipeline yields a speedup of × 1.13 in the 2‑GPU configuration and × 1.28 in the 8‑GPU configuration, with CPU usage reduced by roughly 55% in both cases. Note that for both DALI and the reference implementation, the per‑batch runtime increases when moving from 2 to 8 GPUs. However, the increase is much smaller for the DALI pipeline: although the CPU is far from fully utilized in the reference approach, its data loading already suffers from bottlenecks that the DALI pipeline overcomes.