Current Release (0.2.0)

on_demand_video_decoder

New features:

  • AV1 codec support: decode AV1-encoded video alongside H.264 and HEVC.

  • Stream async decoder: asynchronous frame decoding with prefetching via DecodeN12ToRGBAsync() / DecodeN12ToRGBAsyncGetBuffer() in PyNvSampleReader. See the new SampleStreamAsyncAccess.py sample for a usage walkthrough; a brief sketch also follows this list.

  • Cached GOP decode: the new CachedGopDecoder wraps PyNvGopDecoder with transparent GOP caching. When the same GOP is requested again (e.g. for a different frame within the same GOP), the cached result is returned directly, avoiding redundant demuxing. Controlled via the useGOPCache parameter in GetGOP() / GetGOPList(); see the second sketch after this list.

  • GPU memory management: new release_device_memory() and release_decoder() methods to explicitly free GPU resources without destroying the decoder object (see the second sketch after this list).

  • Page cache drop utility: call drop_videos_cache() to evict video files from the OS page cache. This frees host memory occupied by cached pages of videos that are no longer needed, e.g. when switching to a different video dataset during training, and makes I/O benchmarks reproducible by ensuring reads hit disk rather than the page cache.
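
A minimal sketch of the async decode flow. Only the class and method names come from the notes above; the import path, constructor, and argument forms are illustrative assumptions (SampleStreamAsyncAccess.py has the real walkthrough):

import on_demand_video_decoder as ovd  # assumed import path

reader = ovd.PyNvSampleReader("clip.mp4")    # hypothetical constructor
reader.DecodeN12ToRGBAsync(0)                # kick off decoding of frame 0; prefetching runs ahead
# ... overlap other work with the decode ...
rgb = reader.DecodeN12ToRGBAsyncGetBuffer()  # block until the RGB buffer is ready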
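
A second sketch covers GOP caching and the new resource-management calls. CachedGopDecoder, GetGOP(), useGOPCache, release_device_memory(), release_decoder(), and drop_videos_cache() are taken from the notes; the constructor and argument forms are again assumptions:

import on_demand_video_decoder as ovd  # assumed import path

gop_decoder = ovd.CachedGopDecoder("clip.mp4")  # hypothetical constructor
gop = gop_decoder.GetGOP(42, useGOPCache=True)  # demuxes and caches this GOP
gop = gop_decoder.GetGOP(43, useGOPCache=True)  # same GOP: served from the cache

gop_decoder.release_device_memory()  # free GPU buffers; the decoder object stays usable
gop_decoder.release_decoder()        # free the underlying decoder as well
ovd.drop_videos_cache(["clip.mp4"])  # hypothetical argument form: evict from the OS page cache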

Improvements:

  • Concurrency: existing decode and demux bindings now release the Python GIL, improving performance in multi-threaded Python code (see the sketch after this list).

  • Various stability & error message improvements.
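
With the GIL released during decode and demux, plain Python threads can overlap decoding work. A sketch, reusing the hypothetical reader construction from above:

from concurrent.futures import ThreadPoolExecutor

import on_demand_video_decoder as ovd  # assumed import path

def first_frame(path):
    reader = ovd.PyNvSampleReader(path)  # hypothetical constructor
    reader.DecodeN12ToRGBAsync(0)
    return reader.DecodeN12ToRGBAsyncGetBuffer()

# Threads make progress in parallel because the bindings drop the GIL while decoding.
with ThreadPoolExecutor(max_workers=4) as pool:
    frames = list(pool.map(first_frame, ["a.mp4", "b.mp4", "c.mp4", "d.mp4"]))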

New Package: multi_tensor_copier – Fast & Easy Transfers of Many Small Tensors

A new package for efficiently and easily copying many (small) tensors between devices in a single call. Pass any nested structure of lists, tuples, and dicts containing tensors (or NumPy arrays) and the whole structure is transferred, optimized end-to-end with a focus on CPU -> GPU transfers:

  • Automatically packs many small CPU tensors (mixed dtypes allowed) into one buffer for a single host-to-device copy, drastically reducing per-tensor overhead.

  • Stages non-pinned memory into pinned buffers (in parallel) so that all tensor transfers can proceed asynchronously.

  • Distributes work across multiple CUDA streams.

  • Runs the orchestration on a background C++ thread, keeping the Python thread free.

  • While the optimizations focus on CPU -> GPU, the package also supports GPU -> CPU, GPU -> GPU, and CPU -> CPU transfers (with some of the optimizations also applied there).

The individual optimizations can be turned off selectively (e.g. to limit pinned memory usage); a hypothetical sketch of this follows the usage example below.

Usage is straightforward:

import multi_tensor_copier

handle = multi_tensor_copier.start_copy(nested_data, device="cuda:0")
# ... do other work while the copy runs ...
result = handle.get()   # same nested structure, now on GPU
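
The exact option names are not listed in these notes; purely as an illustration, with a hypothetical keyword argument, limiting pinned memory usage might look like:

handle = multi_tensor_copier.start_copy(
    nested_data,
    device="cuda:0",
    use_pinned_staging=False,  # hypothetical option name: skip staging into pinned buffers
)
result = handle.get()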

An example benchmark with 528 small tensors (simplified synthetic metadata for object detection, similar to StreamPETR) shows an ~8x speedup over per-tensor .to() calls. See the included examples, evaluation script, and documentation.

optim_test_tools – Enhancements

  • TensorDumper: added ranges for context/disambiguation, a PICKLE dump type, and an allow_missing_data_in_previous comparison option.

  • Numba NVTX: new numba_nvtx submodule enabling the use of NVTX ranges from Numba-compiled CPU code (a sketch follows below).
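
The submodule's actual API is not spelled out in these notes; a minimal sketch, assuming a hypothetical push_range / pop_range pair that is callable from nopython-compiled functions:

from numba import njit

from optim_test_tools import numba_nvtx  # assumed import path

@njit
def summate(n):
    numba_nvtx.push_range("summate-loop")  # hypothetical API: open an NVTX range
    total = 0
    for i in range(n):
        total += i
    numba_nvtx.pop_range()                 # hypothetical API: close the range
    return total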

Documentation and Tooling

  • Changes in package_manager.sh

    • Now uses --no-deps for wheels by default; added options to control build isolation.

    • Fixed an issue where installing a new version over an existing one would leave the old version in place alongside the new one.

  • Various documentation improvements (e.g. online documentation, references to related resources such as videos and the WeChat discussion group, and improvements across various guides).

  • Pinned the cupy version in the Dockerfile to 13.6.0 to avoid compatibility issues.