Example

Here, we provide an example of how to use the multi-tensor-copier package to efficiently copy data containing many small tensors in a nested structure (here: training meta-data) from CPU to GPU.

The example consists of the following steps:

  1. Construction of per-sample meta-data with variable-size tensors (for illustration purposes; in a real use-case, the meta-data originates e.g. from a PyTorch DataLoader)

  2. Asynchronous copy of the entire batch of meta-data to the GPU

  3. Overlapping useful work with the transfer

  4. Retrieval and consumption of the GPU-resident meta-data

Important

You can run the example using the script packages/multi_tensor_copier/example/example.py.

Example Data Structure

The meta-data is organized as a list of per-sample dictionaries (one per batch element). Each sample dictionary contains:

  • "cams_gt": a list of 6 camera dicts, each holding:

    • "bounding_boxes": an (N, 4) tensor of 2D bounding boxes, where N varies per image (number of visible objects)

    • "class_ids": an (N,) tensor of integer class IDs

    • "active": an (N,) boolean tensor indicating active objects

    • "depths": an (N,) tensor of depth values

    • "proj_mat": a (3, 4) projection matrix from camera coordinates to image coordinates

  • "gt_data": a dict with:

    • "bounding_boxes_3d": an (N, 7) tensor of 3D ground truth bounding boxes, where N varies per sample (number of ground truth objects)

    • "class_ids": an (N,) tensor of integer class IDs

    • "active": an (N,) boolean tensor indicating active objects
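
For concreteness, one per-sample dictionary with this structure could be built as follows. This is a minimal sketch with hypothetical object counts and random values; the actual example script uses its own data-creation helpers:

```python
import torch

def make_sample(num_cams: int = 6) -> dict:
    """Build one sample dict with variable-size per-camera and per-sample tensors."""
    cams_gt = []
    for _ in range(num_cams):
        n = int(torch.randint(1, 8, ()))  # number of visible objects varies per camera
        cams_gt.append({
            "bounding_boxes": torch.rand(n, 4),         # (N, 4) 2D boxes
            "class_ids": torch.randint(0, 10, (n,)),    # (N,) integer class IDs
            "active": torch.ones(n, dtype=torch.bool),  # (N,) active-object mask
            "depths": torch.rand(n),                    # (N,) depth values
            "proj_mat": torch.eye(3, 4),                # (3, 4) projection matrix
        })
    m = int(torch.randint(1, 8, ()))  # number of ground-truth objects varies per sample
    gt_data = {
        "bounding_boxes_3d": torch.rand(m, 7),  # (N, 7) 3D ground-truth boxes
        "class_ids": torch.randint(0, 10, (m,)),
        "active": torch.ones(m, dtype=torch.bool),
    }
    return {"cams_gt": cams_gt, "gt_data": gt_data}

# A batch is simply a list of such sample dicts:
batch_meta_data = [make_sample() for _ in range(4)]
```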

This nested structure of lists, dicts, and variable-size tensors is representative of real-world training tasks (e.g. a multi-camera 3D object detection task like StreamPETR). It is also a scenario where standard PyTorch .to() calls are particularly inefficient: the batch contains many small tensors in non-pinned memory, and each individual .to() call incurs overhead that can dominate the actual transfer time for a small tensor. See the introduction for a detailed discussion of the motivation and the optimizations that multi-tensor-copier applies to ensure efficient copying in this scenario.
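
For comparison, the naive baseline can be sketched as a recursive traversal that calls .to(device) on every tensor leaf individually, so each small tensor pays its own launch and staging overhead. The `tree_map` helper below is a generic illustration, not part of the package's API:

```python
def tree_map(fn, obj):
    """Recursively apply `fn` to every non-container leaf of a nested
    structure of lists, tuples, and dicts, preserving the structure."""
    if isinstance(obj, dict):
        return {k: tree_map(fn, v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(tree_map(fn, v) for v in obj)
    return fn(obj)

# Naive per-tensor copy: one `.to()` call -- and thus one small, overhead-bound
# H2D transfer -- per tensor leaf. This is the pattern that multi-tensor-copier
# replaces with staged, packed transfers:
# gpu_meta_data = tree_map(lambda t: t.to("cuda:0"), batch_meta_data)
```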

Workflow

The optimizations described in the Introduction are applied automatically (all enabled by default).

The following snippet shows the core workflow. After the batch meta-data has been assembled (see the full script at packages/multi_tensor_copier/example/example.py for the data creation helpers used in this example), we pass it to start_copy() together with the target device. The function traverses the nested structure and returns an AsyncCopyHandle while the transfer proceeds in the background. Because the copy runs asynchronously, the main thread is free to perform other operations while the transfer is in flight (e.g. computations not involving the copied data, logging, etc.). Finally, get() blocks until the copy is complete and returns a nested structure corresponding to the input, but with all tensors now residing on the GPU.

packages/multi_tensor_copier/example/example.py
# ----------------------- Create the batch meta-data -----------------------

# @NOTE
# Here, the per-sample meta-data tensors are not combined into per-batch meta-data
# tensors, because e.g. the number of visible objects per camera varies per sample,
# which makes combining the tensors and handling the combined tensors cumbersome.
batch_meta_data = create_batch_meta_data()

# ------------------------ Start the asynchronous copy ------------------------

# @NOTE
# `start_copy()` traverses the nested structure of lists, tuples, and dicts, and
# asynchronously copies all contained tensors to the target device. It returns a
# handle before the copy is complete, so that the calling thread can continue with
# other work while the transfer is in progress.
# Under the hood, the package applies several optimizations automatically (all
# enabled by default): tensors are staged into pinned host memory for truly
# non-blocking H2D copies, and small tensors are packed into (one or more) staging
# buffers to reduce per-tensor overhead. All of this runs on a background thread,
# so this call returns before the copies complete.
#
# Note that the copy can be started anywhere the data is needed (i.e. not only when
# obtaining the data from a DataLoader), so it can be adopted with only local
# modifications to the training loop. For example, if the meta-data is needed on the
# GPU only for the loss computation, the copy can be started inside the loss
# computation implementation (ideally with some work done in the meantime to
# overlap with the asynchronous copy).
handle = mtc.start_copy(batch_meta_data, "cuda:0")

# @NOTE
# IMPORTANT: Because the copy runs asynchronously, the input tensors must not be
# freed or modified in-place until the copy has completed (i.e. until
# `handle.get()` returns or `handle.ready()` returns `True`). See the
# `start_copy()` function documentation for details.

# -------------------- Overlap with other work --------------------

# @NOTE
# Because `start_copy()` is asynchronous, we can overlap the CPU-to-GPU transfer
# with other computation. Note that overlapping work with the copy is not the only
# (and not the most important) optimization that is applied, so this step is
# beneficial but optional.
dummy_compute()

# -------------------- Retrieve and use the results --------------------

# @NOTE
# `handle.get()` blocks until the copy is complete and returns the same nested
# structure with all tensors now residing on the target device. Non-tensor leaves
# (if any) are passed through unchanged.
gpu_meta_data = handle.get()

# @NOTE
# The copied data can now be consumed by downstream GPU operations (e.g. the
# detection head, loss computation, etc.).
#
# Note on performance: For this simplified example, multi_tensor_copier achieves a
# significant speedup over naive per-tensor .to() calls (see the evaluation script
# for measurements). However, the absolute overhead of meta-data copying is
# moderate here. In more complex real-world pipelines, the meta-data can be more
# extensive (e.g. additional variable-length lane geometry, where multiple lanes
# per sample cannot be combined into single tensors due to their variable sizes;
# or additional sensor modalities), which increases the number of tensors and thus
# the per-tensor overhead. In such cases -- and with larger batch sizes -- the
# absolute time savings grow accordingly.
print("GPU meta-data ready:")
dummy_process(gpu_meta_data)

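The staging-buffer packing mentioned in the comments above can be illustrated with a small offset-planning sketch: instead of issuing one transfer per tensor, the byte sizes of many small tensors are laid out back-to-back (with alignment) in a single staging buffer, so that one large transfer replaces many small ones. This is a hypothetical illustration of the idea, not the package's actual implementation:

```python
def plan_packing(nbytes_per_tensor, alignment=256):
    """Assign each tensor a byte offset in one contiguous staging buffer.

    Returns (offsets, total_size): offsets[i] is where tensor i is staged,
    and total_size is the size of the single buffer -- i.e. of the single
    H2D transfer that replaces len(nbytes_per_tensor) small transfers.
    """
    offsets = []
    cursor = 0
    for n in nbytes_per_tensor:
        # Round the cursor up to the next aligned address.
        cursor = (cursor + alignment - 1) // alignment * alignment
        offsets.append(cursor)
        cursor += n
    return offsets, cursor

# E.g. three small tensors of 100, 40, and 512 bytes share one staging buffer:
offsets, total = plan_packing([100, 40, 512])
# offsets == [0, 256, 512], total == 1024 -- one 1 KiB copy instead of three.
```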