Evaluation
==========

This page summarizes the performance of ``multi-tensor-copier`` compared to standard PyTorch ``.to()`` calls when copying training meta-data from CPU to GPU.

Setup
-----

The benchmark uses the same data structure as the :doc:`example `: per-sample meta-data from a multi-camera 3D object detection pipeline, containing variable-size bounding boxes, class IDs, active flags, depths, and projection matrices for 6 cameras, plus ground truth 3D bounding boxes with associated attributes. See the :doc:`example ` for the full data structure description.

.. list-table:: Benchmark Configuration
   :header-rows: 1

   * - Parameter
     - Value
   * - Batch size
     - 16 samples
   * - Total tensors per batch
     - 528
   * - Total transfer size per batch
     - ~150 KB
   * - Runs
     - 10
   * - Warmup iterations (per run)
     - 100
   * - Measured iterations (per run)
     - 1000

Two baselines are compared against ``multi-tensor-copier``:

- **``.to()`` hardcoded** -- per-tensor ``.to(device)`` calls with the data structure known at development time (representative of a manual implementation in a training pipeline).
- **``.to()`` generic** -- a recursive traversal that copies all tensors in an arbitrary nested structure using ``.to(device)``, with ``isinstance`` checks and dictionary key iteration at each level.

.. note::

   The evaluation measures only the copy time itself, without any concurrent work. In practice, ``multi_tensor_copier`` allows overlapping the copy with other computation (see the :doc:`example `), which can hide some of the latency. The speedups reported here therefore reflect the improvement in raw copy throughput, not necessarily the full potential benefit in an end-to-end training loop.

Hardware
~~~~~~~~

.. list-table:: System Configuration
   :header-rows: 1

   * - GPU
     - CPU
   * - NVIDIA RTX 5000 Ada Generation
     - AMD Ryzen 9 7950X 16-Core Processor

Results
-------

.. list-table:: Runtime and Speedup (mean +/- std over 10 runs)
   :header-rows: 1

   * - Method
     - Runtime [ms]
     - Speedup
   * - ``.to()`` hardcoded
     - 3.035 +/- 0.006
     - (baseline)
   * - ``.to()`` generic
     - 3.172 +/- 0.006
     - (baseline)
   * - ``multi_tensor_copier``
     - 0.375 +/- 0.008
     - **8.10x** +/- 0.16 vs hardcoded, **8.47x** +/- 0.16 vs generic

The ``multi-tensor-copier`` package achieves a speedup of approximately **8x** over both baselines. The generic traversal baseline is slightly slower than the hardcoded baseline due to Python overhead from ``isinstance`` checks and dictionary key iteration, but the difference is small compared to the overall runtime.

.. note::

   In this example the absolute copy time of the baseline (~3 ms with ``.to()``) is moderate. As the complexity of the meta-data grows (e.g. with additional variable-length annotations such as lane geometry with multiple lanes per sample that cannot be combined into single tensors), the number of tensors and thus the overall transfer overhead increases, leading to larger optimization potential. Similarly, larger batch sizes multiply the number of tensors proportionally.

.. seealso::

   The evaluation script can be found at ``packages/multi_tensor_copier/example/evaluation.py``.
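
For orientation, the ``.to()`` generic baseline described under Setup might look roughly like the sketch below. This is not the code from the evaluation script: the function name ``to_device_generic``, the ``FakeTensor`` stand-in, and the field names in ``sample`` are hypothetical. ``FakeTensor`` is used only so that the sketch runs without PyTorch; in a real pipeline the leaf check would presumably be ``isinstance(obj, torch.Tensor)``.

```python
from typing import Any


class FakeTensor:
    """Hypothetical stand-in for torch.Tensor so the sketch runs without PyTorch."""

    def __init__(self, name: str, device: str = "cpu") -> None:
        self.name = name
        self.device = device

    def to(self, device: str) -> "FakeTensor":
        # Like Tensor.to(), return a new object on the target device.
        return FakeTensor(self.name, device)


def to_device_generic(obj: Any, device: str) -> Any:
    """Recursively copy every tensor-like leaf of a nested structure."""
    if isinstance(obj, dict):
        # Dictionary key iteration at each level.
        return {key: to_device_generic(value, device) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        # Rebuild the same container type (plain lists and tuples only).
        return type(obj)(to_device_generic(item, device) for item in obj)
    if isinstance(obj, FakeTensor):
        # Assumption: with PyTorch this would check for torch.Tensor instead.
        return obj.to(device)
    # Non-tensor leaves (ints, strings, ...) are passed through unchanged.
    return obj


# Hypothetical miniature of the per-sample meta-data (field names are made up).
sample = {
    "cameras": [{"boxes": FakeTensor("boxes"), "class_ids": FakeTensor("class_ids")}],
    "gt": (FakeTensor("gt_boxes"),),
    "sample_id": 7,
}
moved = to_device_generic(sample, "cuda:0")
```

Each tensor leaf triggers its own ``.to(device)`` call, so the cost of this approach scales with the number of tensors (528 per batch in the benchmark) rather than with the total payload size alone.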
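
The benchmark configuration (10 runs, 100 warmup iterations, and 1000 measured iterations per run, reported as mean +/- std over the runs) could be implemented along the following lines. This is a minimal sketch under assumptions, not the actual evaluation script: the names ``benchmark`` and ``summarize`` and the ``sync`` parameter are hypothetical. For GPU copies, a synchronization call such as ``torch.cuda.synchronize`` would need to be passed as ``sync`` so that asynchronous work is finished before the clock is read.

```python
import statistics
import time
from typing import Callable, List, Optional, Tuple


def benchmark(
    fn: Callable[[], object],
    runs: int = 10,
    warmup: int = 100,
    iters: int = 1000,
    sync: Optional[Callable[[], None]] = None,
) -> List[float]:
    """Return the mean per-call runtime in milliseconds for each run."""
    per_run_ms = []
    for _ in range(runs):
        # Warmup iterations are executed but not timed.
        for _ in range(warmup):
            fn()
        if sync is not None:
            sync()  # make sure warmup work has finished before timing
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        if sync is not None:
            sync()  # wait for outstanding asynchronous work
        elapsed_s = time.perf_counter() - start
        per_run_ms.append(1e3 * elapsed_s / iters)
    return per_run_ms


def summarize(per_run_ms: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over the runs."""
    return statistics.mean(per_run_ms), statistics.stdev(per_run_ms)


# Tiny CPU-only demonstration with reduced counts; the workload is a placeholder.
demo_runs = benchmark(lambda: sum(range(100)), runs=3, warmup=2, iters=5)
demo_mean_ms, demo_std_ms = summarize(demo_runs)
```

Averaging within a run and then taking the standard deviation across runs, as assumed here, yields a spread that describes run-to-run variation rather than per-call jitter.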
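
The figures in the tables above can be cross-checked with some simple arithmetic. The sketch below uses only the rounded values printed on this page; the variable names are illustrative. The ratios of the rounded means come out slightly below the reported 8.10x and 8.47x, presumably because the reported speedups were computed from unrounded or per-run values (an assumption, since the page does not say).

```python
# Rounded values taken from the tables on this page.
to_hardcoded_ms = 3.035
to_generic_ms = 3.172
copier_ms = 0.375
num_tensors = 528
transfer_bytes = 150_000  # "~150 KB", so this is only approximate

# Speedup as a ratio of the rounded mean runtimes.
speedup_vs_hardcoded = to_hardcoded_ms / copier_ms
speedup_vs_generic = to_generic_ms / copier_ms

# Average cost per tensor in microseconds.
hardcoded_us_per_tensor = 1e3 * to_hardcoded_ms / num_tensors
copier_us_per_tensor = 1e3 * copier_ms / num_tensors

# Average payload per tensor in bytes.
avg_bytes_per_tensor = transfer_bytes / num_tensors

print(f"speedup vs hardcoded: {speedup_vs_hardcoded:.2f}x")
print(f"speedup vs generic:   {speedup_vs_generic:.2f}x")
print(f"per tensor: {hardcoded_us_per_tensor:.2f} us vs {copier_us_per_tensor:.2f} us")
print(f"average tensor size: {avg_bytes_per_tensor:.0f} bytes")
```

With an average payload of only a few hundred bytes per tensor, the baseline spends several microseconds per tensor, which suggests that per-call overhead rather than transfer bandwidth dominates. This is consistent with the note above that more tensors lead to larger optimization potential.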