Evaluation

This page summarizes the performance of multi-tensor-copier compared to standard PyTorch .to() calls when copying training meta-data from CPU to GPU.

Setup

The benchmark uses the same data structure as the example: per-sample meta-data from a multi-camera 3D object detection pipeline, containing variable-size bounding boxes, class IDs, active flags, depths, and projection matrices for 6 cameras, plus ground truth 3D bounding boxes with associated attributes. See the example for the full data structure description.

Benchmark Configuration
Parameter	Value
Batch size	16 samples
Total tensors per batch	528
Total transfer size per batch	~150 KB
Runs	10
Warmup iterations (per run)	100
Measured iterations (per run)	1000
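
The run/warmup/measurement scheme in the table can be sketched as a small timing harness. This is a minimal sketch, not the benchmark script itself: the function name `benchmark`, the use of `time.perf_counter`, and the choice of the sample standard deviation are assumptions made for illustration.

```python
import statistics
import time
from typing import Callable, List, Tuple


def benchmark(
    fn: Callable[[], None],
    runs: int = 10,
    warmup_iters: int = 100,
    measured_iters: int = 1000,
) -> Tuple[float, float]:
    """Return (mean, std) of the per-iteration runtime in ms over all runs."""
    per_run_ms: List[float] = []
    for _ in range(runs):
        # Warmup iterations are executed but not timed.
        for _ in range(warmup_iters):
            fn()
        start = time.perf_counter()
        for _ in range(measured_iters):
            fn()
        elapsed_s = time.perf_counter() - start
        # One value per run: average time of a single iteration in ms.
        per_run_ms.append(elapsed_s * 1e3 / measured_iters)
    # Assumption: sample standard deviation across runs (needs runs >= 2).
    return statistics.mean(per_run_ms), statistics.stdev(per_run_ms)
```

When timing CPU-to-GPU copies this way, `fn` would have to wait for the copy to finish before returning (for example by calling `torch.cuda.synchronize()`), since otherwise only the launch time of asynchronous operations is measured.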

Two baselines are compared against multi-tensor-copier:

`.to()` hardcoded – per-tensor .to(device) calls with the data structure known at development time (representative of a manual implementation in a training pipeline).
`.to()` generic – a recursive traversal that copies all tensors in an arbitrary nested structure using .to(device), with isinstance checks and dictionary key iteration at each level.
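
A generic traversal of this kind could look roughly as follows. This is a sketch under assumptions, not the code used in the benchmark: the function name `to_device_generic` is hypothetical, and the tensor type is passed in as a parameter so that the snippet does not depend on PyTorch being installed.

```python
from typing import Any


def to_device_generic(obj: Any, device: str, tensor_type: type) -> Any:
    """Recursively copy every tensor in a nested dict/list/tuple structure."""
    if isinstance(obj, tensor_type):
        # One separate copy call per tensor.
        return obj.to(device)
    if isinstance(obj, dict):
        return {
            key: to_device_generic(value, device, tensor_type)
            for key, value in obj.items()
        }
    if isinstance(obj, (list, tuple)):
        # Rebuild the same container type (plain list or tuple, not namedtuple).
        return type(obj)(
            to_device_generic(item, device, tensor_type) for item in obj
        )
    # Non-tensor leaves (ints, strings, None, ...) are returned unchanged.
    return obj
```

With PyTorch, a call would look like `to_device_generic(batch, "cuda", torch.Tensor)`. The hardcoded baseline avoids the isinstance checks by accessing each known key directly, but still issues one `.to(device)` call per tensor.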

Note

The evaluation measures only the copy time itself, without any concurrent work. In practice, multi_tensor_copier allows overlapping the copy with other computation (see the example), which can hide some of the latency. The speedups reported here therefore reflect the improvement in raw copy throughput, not necessarily the full potential benefit in an end-to-end training loop.

Hardware

System Configuration
GPU	CPU
NVIDIA RTX 5000 Ada Generation	AMD Ryzen 9 7950X 16-Core Processor

Results

Runtime and Speedup (mean +/- std over 10 runs)
Method	Runtime [ms]	Speedup
`.to()` hardcoded	3.035 +/- 0.006	(baseline)
`.to()` generic	3.172 +/- 0.006	(baseline)
`multi_tensor_copier`	0.375 +/- 0.008	8.10x +/- 0.16 vs hardcoded, 8.47x +/- 0.16 vs generic

The multi-tensor-copier package achieves a speedup of approximately 8x over both baselines (8.10x over the hardcoded and 8.47x over the generic baseline). The generic traversal baseline is slightly slower than the hardcoded baseline (by about 0.14 ms, or roughly 4.5%) due to Python overhead from isinstance checks and dictionary key iteration, but the difference is small compared to the overall runtime of about 3 ms.
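
As a quick sanity check, the speedups can be approximated from the mean runtimes in the table. The variable names below are illustrative. The ratios of the means differ slightly from the tabulated speedups, presumably because those were averaged over per-run ratios (an assumption; the page does not state how they were computed).

```python
# Mean runtimes in ms, taken from the results table.
hardcoded_ms = 3.035
generic_ms = 3.172
copier_ms = 0.375

# Ratio of mean runtimes (not the per-run average reported in the table).
speedup_vs_hardcoded = hardcoded_ms / copier_ms
speedup_vs_generic = generic_ms / copier_ms

# Relative extra cost of the generic traversal over the hardcoded baseline.
generic_overhead = (generic_ms - hardcoded_ms) / hardcoded_ms

print(f"speedup vs hardcoded: {speedup_vs_hardcoded:.2f}x")   # 8.09x
print(f"speedup vs generic:   {speedup_vs_generic:.2f}x")     # 8.46x
print(f"generic traversal overhead: {generic_overhead:.1%}")  # 4.5%
```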

Note

In this example, the absolute copy time of the baselines (~3 ms with .to()) is moderate. As the meta-data becomes more complex (e.g. with additional variable-length annotations such as lane geometry, where multiple lanes per sample cannot be combined into a single tensor), the number of tensors and thus the overall transfer overhead increases, which increases the optimization potential. Similarly, the number of tensors grows proportionally with the batch size.
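
The numbers from the benchmark configuration and the results table illustrate why the tensor count matters: the individual tensors are small, so the per-tensor cost, not the amount of data, dominates the baseline runtime. The following back-of-the-envelope calculation is a sketch that treats "~150 KB" as 150,000 bytes and assumes the number of tensors per sample stays fixed when the batch size changes.

```python
batch_size = 16
tensors_per_batch = 528
transfer_bytes = 150 * 1000  # "~150 KB", treated as 150,000 bytes (assumption)
baseline_ms = 3.035          # `.to()` hardcoded, mean runtime
copier_ms = 0.375            # multi_tensor_copier, mean runtime

tensors_per_sample = tensors_per_batch // batch_size
avg_tensor_bytes = transfer_bytes / tensors_per_batch
baseline_us_per_tensor = baseline_ms * 1e3 / tensors_per_batch
copier_us_per_tensor = copier_ms * 1e3 / tensors_per_batch

print(f"tensors per sample: {tensors_per_sample}")
print(f"average tensor size: {avg_tensor_bytes:.0f} bytes")
print(f".to() cost per tensor: {baseline_us_per_tensor:.2f} us")
print(f"multi_tensor_copier cost per tensor: {copier_us_per_tensor:.2f} us")

# Tensor count for larger batches, assuming a fixed count per sample.
for larger_batch in (32, 64):
    print(f"batch size {larger_batch}: {tensors_per_sample * larger_batch} tensors")
```

Under these assumptions, each sample contributes 33 tensors of roughly 280 bytes on average, and the `.to()` baseline spends several microseconds per tensor regardless of its size.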