Evaluation
This page summarizes the performance of multi-tensor-copier compared to standard PyTorch
.to() calls when copying training meta-data from CPU to GPU.
Setup
The benchmark uses the same data structure as the example: per-sample meta-data from a multi-camera 3D object detection pipeline, containing variable-size bounding boxes, class IDs, active flags, depths, and projection matrices for 6 cameras, plus ground truth 3D bounding boxes with associated attributes. See the example for the full data structure description.
| Parameter | Value |
|---|---|
| Batch size | 16 samples |
| Total tensors per batch | 528 |
| Total transfer size per batch | ~150 KB |
| Runs | 10 |
| Warmup iterations (per run) | 100 |
| Measured iterations (per run) | 1000 |
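The run, warmup, and iteration counts above suggest a timing loop roughly like the following. This is a minimal sketch, not the evaluation script itself: the function name `benchmark` and the `sync` hook are hypothetical, and it assumes that the reported +/- values are the standard deviation across runs, which this page does not state. When timing GPU copies, `sync` could be `torch.cuda.synchronize`, so that asynchronous work is finished before the clock is read.

```python
import statistics
import time


def benchmark(fn, runs=10, warmup=100, iterations=1000, sync=None):
    """Return (mean, std) of the per-iteration runtime in ms across runs.

    `fn` is the operation to time. `sync` is an optional callable that
    blocks until asynchronous work has finished (hypothetical hook; for
    GPU copies this could be `torch.cuda.synchronize`).
    """
    per_run_ms = []
    for _ in range(runs):
        # Warmup iterations are executed but not timed.
        for _ in range(warmup):
            fn()
        if sync is not None:
            sync()

        start = time.perf_counter()
        for _ in range(iterations):
            fn()
        if sync is not None:
            sync()
        elapsed = time.perf_counter() - start

        # Average runtime of one iteration in this run, in milliseconds.
        per_run_ms.append(elapsed / iterations * 1e3)

    return statistics.mean(per_run_ms), statistics.stdev(per_run_ms)


# Small stand-in workload so the sketch runs without a GPU.
mean_ms, std_ms = benchmark(lambda: sum(range(100)), runs=3, warmup=5, iterations=50)
print(f"{mean_ms:.4f} +/- {std_ms:.4f} ms")
```

Synchronizing once per run rather than once per iteration is a design choice of this sketch; it measures sustained throughput instead of the latency of a single isolated copy.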
Two baselines are compared against multi-tensor-copier:
- ``.to()`` hardcoded: per-tensor .to(device) calls with the data structure known at development time (representative of a manual implementation in a training pipeline).
- ``.to()`` generic: a recursive traversal that copies all tensors in an arbitrary nested structure using .to(device), with isinstance checks and dictionary key iteration at each level.
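To make the two baselines concrete, here is a minimal sketch of what each could look like. The function names and the field names `boxes` and `class_ids` are hypothetical and chosen for illustration only; the actual baselines live in the evaluation script and may differ in detail.

```python
import torch


def to_device_generic(obj, device):
    """Recursively copy every tensor in a nested dict/list/tuple to `device`."""
    if isinstance(obj, torch.Tensor):
        return obj.to(device)
    if isinstance(obj, dict):
        return {key: to_device_generic(value, device) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        # Rebuild the same container type (plain lists and tuples only;
        # namedtuples would need separate handling).
        return type(obj)(to_device_generic(item, device) for item in obj)
    # Non-tensor leaves (ints, strings, None, ...) are returned unchanged.
    return obj


def to_device_hardcoded(sample, device):
    """Copy a sample whose layout is known in advance (hypothetical fields)."""
    return {
        "boxes": sample["boxes"].to(device),
        "class_ids": sample["class_ids"].to(device),
    }


# Fall back to the CPU so the sketch also runs on machines without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
sample = {"boxes": torch.zeros(3, 4), "class_ids": torch.tensor([1, 2, 3])}
on_device_generic = to_device_generic(sample, device)
on_device_hardcoded = to_device_hardcoded(sample, device)
```

In both variants every tensor is transferred by its own .to(device) call; the variants differ only in how the tensors are located.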
Note
The evaluation measures only the copy time itself, without any concurrent work. In practice,
multi_tensor_copier allows overlapping the copy with other computation (see the
example), which can hide some of the latency. The speedups
reported here therefore reflect the improvement in raw copy throughput, not necessarily the full
potential benefit in an end-to-end training loop.
Hardware
| GPU | CPU |
|---|---|
| NVIDIA RTX 5000 Ada Generation | AMD Ryzen 9 7950X 16-Core Processor |
Results
| Method | Runtime [ms] | Speedup |
|---|---|---|
| ``.to()`` hardcoded | 3.035 +/- 0.006 | (baseline) |
| ``.to()`` generic | 3.172 +/- 0.006 | (baseline) |
| multi-tensor-copier | 0.375 +/- 0.008 | 8.10x +/- 0.16 vs hardcoded, 8.47x +/- 0.16 vs generic |
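As a rough consistency check, the speedups can be recomputed from the runtimes in the table. The sketch below assumes independent uncertainties and first-order error propagation for a ratio; the helper name `speedup` is hypothetical, and the evaluation script may instead compute the speedup per run and then average, which would give slightly different figures.

```python
import math


def speedup(baseline, baseline_err, candidate, candidate_err):
    """Ratio baseline/candidate with first-order propagated uncertainty."""
    ratio = baseline / candidate
    # Relative uncertainties of a quotient add in quadrature.
    rel_err = math.sqrt((baseline_err / baseline) ** 2 + (candidate_err / candidate) ** 2)
    return ratio, ratio * rel_err


# Runtimes in ms, taken from the results table.
baselines = {"hardcoded": (3.035, 0.006), "generic": (3.172, 0.006)}
copier = (0.375, 0.008)

for name, (mean_ms, err_ms) in baselines.items():
    ratio, err = speedup(mean_ms, err_ms, *copier)
    print(f"vs {name}: {ratio:.2f}x +/- {err:.2f}")
```

Under these assumptions the ratios come out close to the reported values, and almost all of the uncertainty stems from the multi-tensor-copier runtime, whose relative spread is about ten times that of the baselines.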
The multi-tensor-copier package achieves a speedup of approximately 8x over both baselines.
The generic traversal baseline is slightly slower than the hardcoded baseline due to Python overhead from
isinstance checks and dictionary key iteration, but the difference is small compared to the
overall runtime.
Note
In this example the absolute copy time of the baseline (~3 ms with .to()) is moderate. As the
complexity of the meta-data grows (e.g. with additional variable-length annotations such as lane
geometry with multiple lanes per sample that cannot be combined into single tensors), the number of
tensors and thus the overall transfer overhead increases, leading to larger optimization potential.
Similarly, larger batch sizes multiply the number of tensors proportionally.
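The figures from the setup and results tables can be broken down per tensor to illustrate this scaling argument. The following is back-of-the-envelope arithmetic only; it treats "~150 KB" as 150,000 bytes and assumes that every sample contributes the same number of tensors.

```python
batch_size = 16
tensors_per_batch = 528
transfer_bytes = 150e3   # "~150 KB" from the setup table, approximate
baseline_ms = 3.035      # .to() hardcoded
copier_ms = 0.375        # multi-tensor-copier

tensors_per_sample = tensors_per_batch // batch_size
bytes_per_tensor = transfer_bytes / tensors_per_batch
baseline_us_per_tensor = baseline_ms * 1e3 / tensors_per_batch
copier_us_per_tensor = copier_ms * 1e3 / tensors_per_batch

print(f"tensors per sample:       {tensors_per_sample}")
print(f"average bytes per tensor: {bytes_per_tensor:.0f}")
print(f"baseline time per tensor: {baseline_us_per_tensor:.2f} us")
print(f"copier time per tensor:   {copier_us_per_tensor:.2f} us")
```

With tensors this small on average, the baseline cost is plausibly dominated by per-call overhead rather than by the amount of data moved, which is consistent with the runtime growing with the tensor count as described above.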
See also
The evaluation script can be found at packages/multi_tensor_copier/example/evaluation.py.