Example

Here, we provide an example of how to use the multi-tensor-copier package to efficiently copy data containing many small tensors in a nested structure (here: training meta-data) from CPU to GPU.

The example consists of the following steps:

  1. Construction of per-sample meta-data with variable-size tensors (for illustration purposes; in a real use-case, the meta-data originates e.g. from a PyTorch DataLoader)

  2. Asynchronous copy of the entire batch of meta-data to the GPU

  3. Overlapping useful work with the transfer

  4. Retrieval and consumption of the GPU-resident meta-data

Important

You can run the example using the script packages/multi_tensor_copier/example/example.py.

Example Data Structure

The meta-data is organized as a list of per-sample dictionaries (one per batch element). Each sample dictionary contains:

  • "cams_gt": a list of 6 camera dicts, each holding:

    • "bounding_boxes": an (N, 4) tensor of 2D bounding boxes, where N varies per image (number of visible objects)

    • "class_ids": an (N,) tensor of integer class IDs

    • "active": an (N,) boolean tensor indicating active objects

    • "depths": an (N,) tensor of depth values

    • "proj_mat": a (3, 4) projection matrix from camera coordinates to image coordinates

  • "gt_data": a dict with:

    • "bounding_boxes_3d": an (N, 7) tensor of 3D ground truth bounding boxes, where N varies per sample (number of ground truth objects)

    • "class_ids": an (N,) tensor of integer class IDs

    • "active": an (N,) boolean tensor indicating active objects
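
For concreteness, one per-sample dictionary with this structure could be built as follows. This is a minimal sketch with hypothetical object counts and random values; the actual example script uses its own data-creation helpers:

```python
import torch

def make_sample(num_cams: int = 6) -> dict:
    """Build one sample dict with variable-size per-camera and per-sample tensors."""
    cams_gt = []
    for _ in range(num_cams):
        n = int(torch.randint(1, 8, ()))  # number of visible objects varies per camera
        cams_gt.append({
            "bounding_boxes": torch.rand(n, 4),         # (N, 4) 2D boxes
            "class_ids": torch.randint(0, 10, (n,)),    # (N,) integer class IDs
            "active": torch.ones(n, dtype=torch.bool),  # (N,) active-object mask
            "depths": torch.rand(n),                    # (N,) depth values
            "proj_mat": torch.eye(3, 4),                # (3, 4) projection matrix
        })
    m = int(torch.randint(1, 8, ()))  # number of ground-truth objects varies per sample
    gt_data = {
        "bounding_boxes_3d": torch.rand(m, 7),  # (N, 7) 3D ground-truth boxes
        "class_ids": torch.randint(0, 10, (m,)),
        "active": torch.ones(m, dtype=torch.bool),
    }
    return {"cams_gt": cams_gt, "gt_data": gt_data}

# A batch is simply a list of such sample dicts:
batch_meta_data = [make_sample() for _ in range(4)]
```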

This nested structure of lists, dicts, and variable-size tensors is representative of real-world training tasks (e.g. a multi-camera 3D object detection task like StreamPETR). It is also a scenario where standard PyTorch .to() calls are particularly inefficient: the batch contains many small tensors in non-pinned memory, and each individual .to() call incurs overhead that can dominate the actual transfer time for a small tensor. See the introduction for a detailed discussion of the motivation and the optimizations that multi-tensor-copier applies to ensure efficient copying in this scenario.
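
For comparison, the naive baseline can be sketched as a recursive traversal that calls .to(device) on every tensor leaf individually, so each small tensor pays its own launch and staging overhead. The `tree_map` helper below is a generic illustration, not part of the package's API:

```python
def tree_map(fn, obj):
    """Recursively apply `fn` to every non-container leaf of a nested
    structure of lists, tuples, and dicts, preserving the structure."""
    if isinstance(obj, dict):
        return {k: tree_map(fn, v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(tree_map(fn, v) for v in obj)
    return fn(obj)

# Naive per-tensor copy: one `.to()` call -- and thus one small, overhead-bound
# H2D transfer -- per tensor leaf. This is the pattern that multi-tensor-copier
# replaces with staged, packed transfers:
# gpu_meta_data = tree_map(lambda t: t.to("cuda:0"), batch_meta_data)
```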

Workflow

The optimizations described in the Introduction are applied automatically (all enabled by default).

The following snippet shows the core workflow. After the batch meta-data has been assembled (see the full script at packages/multi_tensor_copier/example/example.py for the data creation helpers used in this example), we pass it to start_copy() together with the target device. The function traverses the nested structure and returns an AsyncCopyHandle while the transfer proceeds in the background. Because the copy runs asynchronously, the main thread is free to perform other operations while the transfer is in flight (e.g. computations not involving the copied data, logging, etc.). Finally, get() blocks until the copy is complete and returns a nested structure corresponding to the input, but with all tensors now residing on the GPU.

packages/multi_tensor_copier/example/example.py
# ----------------------- Create the batch meta-data -----------------------

# @NOTE
# Here, the per-sample meta-data tensors are not combined into per-batch meta-data
# tensors, because e.g. the number of visible objects per camera varies per sample,
# which makes combining the tensors and handling the combined tensors cumbersome.
batch_meta_data = create_batch_meta_data()

# ------------------------ Start the asynchronous copy ------------------------

# @NOTE
# `start_copy()` traverses the nested structure of lists, tuples, and dicts, and
# asynchronously copies all contained tensors to the target device. It returns a
# handle before the copy is complete, so that the calling thread can continue with
# other work while the transfer is in progress.
# Under the hood, the package applies several optimizations automatically (all
# enabled by default): tensors are staged into pinned host memory for truly
# non-blocking H2D copies, and small tensors are packed into (one or more) staging
# buffers to reduce per-tensor overhead. All of this runs on a background thread,
# so this call returns before the copies complete.
#
# Note that the copy can be started anywhere the data is needed (i.e. not only when
# obtaining the data from a DataLoader), so it can be adopted with only local
# modifications to the training loop. For example, if the meta-data is needed on the
# GPU only for the loss computation, the copy can be started inside the loss
# computation implementation (ideally with some work done in the meantime to
# overlap with the asynchronous copy).
handle = mtc.start_copy(batch_meta_data, "cuda:0")

# @NOTE
# IMPORTANT: Because the copy runs asynchronously, the input tensors must not be
# freed or modified in-place until the copy has completed (i.e. until
# `handle.get()` returns or `handle.ready()` returns `True`). See the
# `start_copy()` function documentation for details.

# -------------------- Overlap with other work --------------------

# @NOTE
# Because `start_copy()` is asynchronous, we can overlap the CPU-to-GPU transfer
# with other computation. Note that overlapping work with the copy is not the only
# (and not the most important) optimization that is applied, so this step is
# beneficial but optional.
dummy_compute()

# -------------------- Retrieve and use the results --------------------

# @NOTE
# `handle.get()` blocks until the copy is complete and returns the same nested
# structure with all tensors now residing on the target device. Non-tensor leaves
# (if any) are passed through unchanged.
gpu_meta_data = handle.get()

# @NOTE
# The copied data can now be consumed by downstream GPU operations (e.g. the
# detection head, loss computation, etc.).
#
# Note on performance: For this simplified example, multi_tensor_copier achieves a
# significant speedup over naive per-tensor .to() calls (see the evaluation script
# for measurements). However, the absolute overhead of meta-data copying is
# moderate here. In more complex real-world pipelines, the meta-data can be more
# extensive (e.g. additional variable-length lane geometry, where multiple lanes
# per sample cannot be combined into single tensors due to their variable sizes;
# or additional sensor modalities), which increases the number of tensors and thus
# the per-tensor overhead. In such cases -- and with larger batch sizes -- the
# absolute time savings grow accordingly.
print("GPU meta-data ready:")
dummy_process(gpu_meta_data)

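The staging-buffer packing mentioned in the comments above can be illustrated with a small offset-planning sketch: instead of issuing one transfer per tensor, the byte sizes of many small tensors are laid out back-to-back (with alignment) in a single staging buffer, so that one large transfer replaces many small ones. This is a hypothetical illustration of the idea, not the package's actual implementation:

```python
def plan_packing(nbytes_per_tensor, alignment=256):
    """Assign each tensor a byte offset in one contiguous staging buffer.

    Returns (offsets, total_size): offsets[i] is where tensor i is staged,
    and total_size is the size of the single buffer -- i.e. of the single
    H2D transfer that replaces len(nbytes_per_tensor) small transfers.
    """
    offsets = []
    cursor = 0
    for n in nbytes_per_tensor:
        # Round the cursor up to the next aligned address.
        cursor = (cursor + alignment - 1) // alignment * alignment
        offsets.append(cursor)
        cursor += n
    return offsets, cursor

# E.g. three small tensors of 100, 40, and 512 bytes share one staging buffer:
offsets, total = plan_packing([100, 40, 512])
# offsets == [0, 256, 512], total == 1024 -- one 1 KiB copy instead of three.
```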