Introduction

The multi_tensor_copier package provides functionality for efficient copying of tensors contained in nested structures (lists, tuples, dicts) between devices. Its primary goal is to optimize CPU to GPU transfers, especially for many small tensors in non-pinned memory. Other copy directions (GPU to CPU, GPU to GPU, CPU to CPU) are also supported and benefit from some of the optimizations, but are not the main focus.

Motivation

Standard PyTorch copy operations (e.g. torch.Tensor.to()) have two properties that make them inefficient when many small tensors, such as per-sample metadata, are transferred to the GPU:

Non-pinned memory cannot be copied asynchronously. PyTorch’s non_blocking=True only yields truly asynchronous host-to-device (H2D) transfers when the source tensor resides in pinned (page-locked) memory. In some workloads (for example, when tensors originate from a torch.utils.data.DataLoader with pin_memory=False, or when they are obtained by reading pickled numpy arrays) this precondition is not met, and every transfer blocks the calling thread.
Per-tensor overhead dominates for small tensors. Each call to .to() incurs a per-call overhead. For small tensors (e.g. variable-length annotations in object detection), this overhead can exceed the actual transfer time, so when many small tensors are copied, the accumulated overhead dominates the total copy time.

Note

Apart from improving copying efficiency, the package also makes copying multiple tensors more convenient by automatically traversing the input structure and copying all contained tensors to the target device.

Features

The package addresses the efficiency issues through the following optimizations, all of which are configurable (i.e. can be enabled or disabled):

Automatic packing of small tensors (pack_cpu_tensors, default: enabled): Multiple small contiguous CPU tensors (up to 256 KB each, mixed dtypes supported) are automatically packed into one or more fixed-size byte buffers and transferred with one H2D copy per buffer. On the GPU side, per-tensor views into the packed allocations are created with configurable alignment (min_packed_alignment_bytes) enforced for the individual outputs. This optimization is only applicable to CPU to GPU transfers.

Important

This feature accounts for a major part of the performance gain over standard PyTorch .to() calls on the individual tensors. It is only applied to input CPU tensors that are contiguous.
Parallel pinned memory staging (use_pinned_staging, default: enabled): For CPU to GPU transfers, input tensors are first copied into pinned host buffers (in parallel) so that the subsequent H2D transfer can use non_blocking=True. For GPU to CPU transfers, output is written directly into a pinned host buffer via an asynchronous D2H copy on a CUDA stream, and the pinned tensor is returned as the result.
Background-thread scheduling (use_background_thread, default: enabled): The copy orchestration (buffer allocation, staging, and CUDA copy submission) runs on a C++ background thread rather than the calling Python thread. start_copy() returns a handle before the copies complete; the caller can do other work and retrieve results via get(). Note that parallel CPU staging is used regardless of this setting. The background-thread scheduling benefits all copy directions, including CPU to CPU.
Nested structure traversal: Input may be an arbitrarily nested combination of list, tuple, and dict containers with torch.Tensor or numpy.ndarray leaves. The output preserves the original structure. Non-tensor, non-container leaves (e.g. strings) are passed through unchanged. Numpy arrays are converted to PyTorch tensors during traversal. Handling nested structures automatically simplifies copying and allows small tensors to be packed (see above) without manual bookkeeping.
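
The interplay of structure traversal and packing described above can be illustrated with a small standard-library sketch. It is not the package's implementation: array.array leaves stand in for CPU tensors, a single bytearray sized to fit stands in for the fixed-size packed buffers, and no device transfer takes place. The names pack_nested, is_packable, align_up and SMALL_LIMIT_BYTES are hypothetical and chosen for illustration only.

```python
from array import array

SMALL_LIMIT_BYTES = 256 * 1024  # per-tensor size limit for packing quoted above


def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def is_packable(node):
    """True for array leaves small enough to be packed (stand-in for small tensors)."""
    return isinstance(node, array) and len(node) * node.itemsize <= SMALL_LIMIT_BYTES


def pack_nested(data, alignment=16):
    """Pack all small array leaves of a nested structure into one byte buffer.

    Returns (buffer, structure). Each packed leaf in `structure` is a typed
    memoryview into `buffer`; all other leaves are passed through unchanged.
    """
    leaves = []  # (leaf, byte offset) in traversal order

    def plan(node, offset):
        # First pass: assign an aligned offset to every packable leaf.
        if isinstance(node, dict):
            for value in node.values():
                offset = plan(value, offset)
        elif isinstance(node, (list, tuple)):
            for value in node:
                offset = plan(value, offset)
        elif is_packable(node):
            offset = align_up(offset, alignment)
            leaves.append((node, offset))
            offset += len(node) * node.itemsize
        return offset

    buffer = bytearray(plan(data, 0))

    # Copy every leaf into its slot; in the real setting the filled buffer
    # would then be moved to the device with a single transfer.
    for leaf, offset in leaves:
        buffer[offset:offset + len(leaf) * leaf.itemsize] = leaf.tobytes()

    # Create one typed view per leaf, in the same order as the first pass.
    views = iter([
        memoryview(buffer)[offset:offset + len(leaf) * leaf.itemsize].cast(leaf.typecode)
        for leaf, offset in leaves
    ])

    def rebuild(node):
        # Second pass: same traversal order, so views line up with leaves.
        if isinstance(node, dict):
            return {key: rebuild(value) for key, value in node.items()}
        if isinstance(node, list):
            return [rebuild(value) for value in node]
        if isinstance(node, tuple):
            return tuple(rebuild(value) for value in node)
        if is_packable(node):
            return next(views)
        return node

    return buffer, rebuild(data)
```

In this sketch the caller gets back the same container layout it passed in, while the offsets and alignment padding stay internal to pack_nested. That is the kind of bookkeeping the package is described as handling automatically.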

Integration

The copy can be started wherever the data is needed; it does not have to be started directly after the data is obtained, for instance from a PyTorch DataLoader. For example, if the GPU-resident data is only required for loss computation, start_copy() can be called at the beginning of the loss computation step, ideally with some work performed in the meantime to overlap with the asynchronous copy. This means the package can be integrated into existing training loops with only local modifications, and can also be used with data originating from sources other than a DataLoader.
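
The start-then-retrieve pattern can be sketched with the standard library alone. The snippet below is a stand-in, not the package's API: start_copy_sketch and CopyHandleSketch are hypothetical names, a ThreadPoolExecutor plays the role of the background thread, and copy.deepcopy replaces the actual device transfer. The real start_copy() signature and its options are not shown here.

```python
import copy
from concurrent.futures import ThreadPoolExecutor

# A single worker stands in for the package's background thread (assumption).
_executor = ThreadPoolExecutor(max_workers=1)


class CopyHandleSketch:
    """Hypothetical handle: wraps the pending copy until the result is needed."""

    def __init__(self, future):
        self._future = future

    def get(self):
        """Block until the background copy has finished and return its result."""
        return self._future.result()


def start_copy_sketch(data):
    """Submit a copy of `data` to the background worker and return immediately."""
    return CopyHandleSketch(_executor.submit(copy.deepcopy, data))


def loss_step(batch):
    # Start the copy at the point where the data becomes relevant ...
    handle = start_copy_sketch(batch["targets"])
    # ... do unrelated work that can overlap with the copy ...
    prediction_sum = sum(batch["predictions"])
    # ... and wait only when the copied data is actually required.
    targets = handle.get()
    return prediction_sum - sum(targets)
```

Only loss_step changes in this sketch, and the rest of a training loop would stay as it is, which mirrors the local-modification argument made above.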

Note

At the time of the start_copy() call, the active PyTorch streams on all involved CUDA devices are captured. All copy work is then enqueued on, or synchronized with, these captured streams so that transfers are correctly ordered with respect to preceding GPU operations — no manual synchronization is required. Non-default stream contexts (e.g. torch.cuda.stream()) are respected.

Supported Copy Directions

The table below summarizes which optimizations apply to each copy direction:

Optimization	CPU → GPU	GPU → CPU	GPU → GPU	CPU → CPU
Pinned staging	✓	✓	–	–
Packing	✓	–	–	–
Background thread	✓	✓	✓	✓