Introduction
The multi_tensor_copier package provides functionality for efficient copying of tensors contained in
nested structures (lists, tuples, dicts) between devices.
Its primary goal is to optimize CPU to GPU transfers, especially for many small tensors in non-pinned
memory. Other copy directions (GPU to CPU, GPU to GPU, CPU to CPU) are also supported and benefit from
some of the optimizations, but are not the main focus.
Motivation
Standard PyTorch copy operations (e.g. torch.Tensor.to()) have two properties that make them
inefficient when transferring many small tensors to the GPU (e.g. per-sample metadata):
- Non-pinned memory cannot be copied asynchronously. PyTorch's non_blocking=True only yields truly asynchronous host-to-device (H2D) transfers when the source tensor resides in pinned (page-locked) memory. In some workloads (e.g. when tensors originate from a torch.utils.data.DataLoader with pin_memory=False, or when they are obtained by reading pickled numpy arrays) this precondition is not met, and every transfer blocks the calling thread.
- Per-tensor overhead dominates for small tensors. Each call to .to() incurs overhead. For small tensors (e.g. variable-length annotations in object detection), this overhead can exceed the actual transfer time, so with many small tensors the accumulated overhead can dominate the total copy time.
Note
Apart from improving copying efficiency, the package also makes copying multiple tensors more convenient by automatically traversing the input structure and copying all contained tensors to the target device.
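The traversal idea can be sketched in plain Python. This is only an illustration, not the package's implementation: map_leaves is a hypothetical helper, and bytes objects stand in for tensors so that the sketch runs without PyTorch or a GPU.

```python
from typing import Any, Callable


def map_leaves(obj: Any, fn: Callable[[Any], Any], is_leaf: Callable[[Any], bool]) -> Any:
    """Apply fn to every leaf selected by is_leaf, preserving the container structure.

    Hypothetical helper for illustration only; not part of multi_tensor_copier.
    """
    if is_leaf(obj):
        return fn(obj)
    if isinstance(obj, dict):
        return {key: map_leaves(value, fn, is_leaf) for key, value in obj.items()}
    if isinstance(obj, list):
        return [map_leaves(value, fn, is_leaf) for value in obj]
    if isinstance(obj, tuple):
        return tuple(map_leaves(value, fn, is_leaf) for value in obj)
    # Anything else (strings, ints, ...) is passed through unchanged.
    return obj


# bytes objects stand in for tensors; bytearray() stands in for the device copy.
batch = {
    "image": b"\x00\x00\x00\x00",
    "boxes": [b"\x01\x02", b"\x03"],
    "meta": ("id-7", 3),
}
moved = map_leaves(batch, bytearray, lambda x: isinstance(x, bytes))
```

With PyTorch, the same pattern would use is_leaf=lambda x: isinstance(x, torch.Tensor) and fn=lambda t: t.to(device). That variant still issues one copy per tensor, which is the overhead the optimizations described in this document aim to avoid.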
Features
The package addresses the efficiency issues through the following optimizations, all of which are configurable (i.e. can be enabled or disabled):
- Automatic packing of small tensors (pack_cpu_tensors, default: enabled)
  Multiple small contiguous CPU tensors (up to 256 KB each, mixed dtypes supported) are automatically packed into one or more fixed-size byte buffers and transferred with one H2D copy per buffer. On the GPU side, per-tensor views into the packed allocations are created with configurable alignment (min_packed_alignment_bytes) enforced for the individual outputs. This optimization is only applicable to CPU to GPU transfers.
  Important
  This feature is a major contribution to the overall performance optimization vs. using standard PyTorch .to() calls on the individual tensors. For this optimization to be applied, the input CPU tensors must be contiguous.
- Parallel pinned memory staging (use_pinned_staging, default: enabled)
  For CPU to GPU transfers, input tensors are first copied into pinned host buffers (in parallel) so that the subsequent H2D transfer can use non_blocking=True. For GPU to CPU transfers, output is written directly into a pinned host buffer via an asynchronous D2H copy on a CUDA stream, and the pinned tensor is returned as the result.
- Background-thread scheduling (use_background_thread, default: enabled)
  The copy orchestration (buffer allocation, staging, and CUDA copy submission) runs on a C++ background thread rather than the calling Python thread. start_copy() returns a handle before the copies complete; the caller can do other work and retrieve results via get(). Note that parallel CPU staging is used regardless of this setting. The background-thread scheduling benefits all copy directions, including CPU to CPU.
- Nested structure traversal
  Input may be an arbitrarily nested combination of list, tuple, and dict containers with torch.Tensor or numpy.ndarray leaves. The output preserves the original structure. Non-tensor, non-container leaves (e.g. strings) are passed through unchanged. Numpy arrays are converted to PyTorch tensors during traversal. The automatic handling of nested structures greatly simplifies copying of nested structures of tensors while also allowing for automatic packing of small tensors (see above) without the need for manual bookkeeping.
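The packing step can be illustrated with a small pure-Python sketch. It is not the package's algorithm: plan_packing, pack and unpack_views are hypothetical names, the alignment of 16 bytes and the buffer size of 128 bytes are arbitrary values chosen for the example, and bytes objects stand in for contiguous CPU tensors. The sketch places each item at an aligned offset, starts a new fixed-size buffer when an item no longer fits, and then recovers per-item views without copying.

```python
def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def plan_packing(sizes, alignment, buffer_size):
    """Return one (buffer_index, offset) placement per item (hypothetical helper)."""
    placements = []
    buffer_index, offset = 0, 0
    for size in sizes:
        if size > buffer_size:
            raise ValueError("item does not fit into a single buffer")
        start = align_up(offset, alignment)
        if start + size > buffer_size:
            # Current buffer is full: continue at the start of a new one.
            buffer_index += 1
            start = 0
        placements.append((buffer_index, start))
        offset = start + size
    return placements


def pack(chunks, alignment, buffer_size):
    """Copy all chunks into as few fixed-size buffers as the plan requires."""
    placements = plan_packing([len(c) for c in chunks], alignment, buffer_size)
    n_buffers = placements[-1][0] + 1 if placements else 0
    buffers = [bytearray(buffer_size) for _ in range(n_buffers)]
    for chunk, (index, start) in zip(chunks, placements):
        buffers[index][start:start + len(chunk)] = chunk
    return buffers, placements


def unpack_views(buffers, placements, sizes):
    """Create zero-copy views into the packed buffers, one per original item."""
    return [
        memoryview(buffers[index])[start:start + size]
        for (index, start), size in zip(placements, sizes)
    ]


chunks = [b"a" * 10, b"b" * 20, b"c" * 100, b"d" * 50]
buffers, placements = pack(chunks, alignment=16, buffer_size=128)
views = unpack_views(buffers, placements, [len(c) for c in chunks])
```

In this stand-in, four items end up in three buffers, so three bulk copies would replace four individual ones; the saving grows with the number of small items per buffer.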
Integration
The copy can be started wherever the data is needed, not only directly after it is obtained from
a PyTorch DataLoader. For example, if the GPU-resident data is only required for loss computation,
start_copy() can be called at the beginning of the loss
computation step, ideally with some work performed in the meantime to overlap with the asynchronous
copy. This means the package can be integrated into existing training loops with only local
modifications, and can also be used with data originating from sources other than a DataLoader.
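The start-then-retrieve pattern can be sketched with the Python standard library. This is a stand-in, not the package's API: start_copy_standin and CopyHandle are hypothetical names, a ThreadPoolExecutor plays the role of the background thread, and the sleep only simulates transfer time. Refer to the API Reference for the actual signatures of start_copy() and get().

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor


class CopyHandle:
    """Hypothetical handle: wraps a future and exposes a blocking get()."""

    def __init__(self, future: Future):
        self._future = future

    def get(self):
        # Blocks until the background work has finished, then returns its result.
        return self._future.result()


def start_copy_standin(pool: ThreadPoolExecutor, data):
    """Submit the 'copy' to a background thread and return immediately."""

    def _copy():
        time.sleep(0.05)  # simulated transfer time
        return [bytes(item) for item in data]

    return CopyHandle(pool.submit(_copy))


with ThreadPoolExecutor(max_workers=1) as pool:
    # Start the copy where the data is first known to be needed ...
    handle = start_copy_standin(pool, [b"ab", b"cd"])
    # ... overlap it with unrelated work on the calling thread ...
    overlapped = sum(range(1000))
    # ... and block only at the point where the result is actually required.
    result = handle.get()
```

The later get() is called relative to the start, the more of the copy time can be hidden behind other work.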
Note
At the time of the start_copy() call, the active PyTorch
streams on all involved CUDA devices are captured. All copy work is then enqueued on, or
synchronized with, these captured streams. Transfers are therefore correctly ordered with respect
to preceding GPU operations, and no manual synchronization is required. Non-default stream
contexts (e.g. torch.cuda.stream()) are respected.
Supported Copy Directions
The table below summarizes which optimizations apply to each copy direction:
| Optimization | CPU → GPU | GPU → CPU | GPU → GPU | CPU → CPU |
|---|---|---|---|---|
| Pinned staging | ✓ | ✓ | | |
| Packing | ✓ | | | |
| Background thread | ✓ | ✓ | ✓ | ✓ |
See also
Refer to the API Reference for the full parameter reference and the Example for a usage example.