Current Release (0.2.0)

on_demand_video_decoder

New features:

  • AV1 codec support: decode AV1-encoded video alongside H.264 and HEVC.

  • Stream async decoder: asynchronous frame decoding with prefetching via DecodeN12ToRGBAsync() / DecodeN12ToRGBAsyncGetBuffer() in PyNvSampleReader. See the new SampleStreamAsyncAccess.py sample for a usage walkthrough; a brief sketch also follows this list.

  • Cached GOP decode: the new CachedGopDecoder wraps PyNvGopDecoder with transparent GOP caching. When the same GOP is requested again (e.g. for a different frame within the same GOP), the cached result is returned directly, avoiding redundant demuxing. Controlled via the useGOPCache parameter in GetGOP() / GetGOPList(); see the second sketch after this list.

  • GPU memory management: new release_device_memory() and release_decoder() methods to explicitly free GPU resources without destroying the decoder object (see the second sketch after this list).

  • Page cache drop utility: call drop_videos_cache() to evict video files from the OS page cache. This frees host memory occupied by cached pages of videos that are no longer needed, e.g. when switching to a different video dataset during training, and makes I/O benchmarks reproducible by ensuring reads hit disk rather than the page cache.
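
A minimal sketch of the async decode flow. Only the class and method names come from the notes above; the import path, constructor, and argument forms are illustrative assumptions (SampleStreamAsyncAccess.py has the real walkthrough):

import on_demand_video_decoder as ovd  # assumed import path

reader = ovd.PyNvSampleReader("clip.mp4")    # hypothetical constructor
reader.DecodeN12ToRGBAsync(0)                # kick off decoding of frame 0; prefetching runs ahead
# ... overlap other work with the decode ...
rgb = reader.DecodeN12ToRGBAsyncGetBuffer()  # block until the RGB buffer is ready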
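
A second sketch covers GOP caching and the new resource-management calls. CachedGopDecoder, GetGOP(), useGOPCache, release_device_memory(), release_decoder(), and drop_videos_cache() are taken from the notes; the constructor and argument forms are again assumptions:

import on_demand_video_decoder as ovd  # assumed import path

gop_decoder = ovd.CachedGopDecoder("clip.mp4")  # hypothetical constructor
gop = gop_decoder.GetGOP(42, useGOPCache=True)  # demuxes and caches this GOP
gop = gop_decoder.GetGOP(43, useGOPCache=True)  # same GOP: served from the cache

gop_decoder.release_device_memory()  # free GPU buffers; the decoder object stays usable
gop_decoder.release_decoder()        # free the underlying decoder as well
ovd.drop_videos_cache(["clip.mp4"])  # hypothetical argument form: evict from the OS page cache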

Improvements:

  • Concurrency: existing decode and demux bindings now release the Python GIL, improving performance in multi-threaded Python code (see the sketch after this list).

  • Various stability & error message improvements.
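
With the GIL released during decode and demux, plain Python threads can overlap decoding work. A sketch, reusing the hypothetical reader construction from above:

from concurrent.futures import ThreadPoolExecutor

import on_demand_video_decoder as ovd  # assumed import path

def first_frame(path):
    reader = ovd.PyNvSampleReader(path)  # hypothetical constructor
    reader.DecodeN12ToRGBAsync(0)
    return reader.DecodeN12ToRGBAsyncGetBuffer()

# Threads make progress in parallel because the bindings drop the GIL while decoding.
with ThreadPoolExecutor(max_workers=4) as pool:
    frames = list(pool.map(first_frame, ["a.mp4", "b.mp4", "c.mp4", "d.mp4"]))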

New Package: multi_tensor_copier – Fast & Easy Transfers of Many Small Tensors

A new package for efficiently and easily copying many (small) tensors between devices in a single call. Pass any nested structure of lists, tuples, and dicts containing tensors (or NumPy arrays) and the whole structure is transferred, optimized end-to-end with a focus on CPU -> GPU transfers:

  • Automatically packs many small CPU tensors (mixed dtypes allowed) into one buffer for a single host-to-device copy, drastically reducing per-tensor overhead.

  • Stages non-pinned memory into pinned buffers (in parallel) so that all tensor transfers can proceed asynchronously.

  • Distributes work across multiple CUDA streams.

  • Runs the orchestration on a background C++ thread, keeping the Python thread free.

  • While the optimizations focus on CPU -> GPU, the package also supports GPU -> CPU, GPU -> GPU, and CPU -> CPU transfers (with some of the optimizations also applied there).

The individual optimizations can be turned off selectively (e.g. to limit pinned memory usage); a hypothetical sketch of this follows the usage example below.

Usage is straightforward:

import multi_tensor_copier

handle = multi_tensor_copier.start_copy(nested_data, device="cuda:0")
# ... do other work while the copy runs ...
result = handle.get()   # same nested structure, now on GPU
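
The exact option names are not listed in these notes; purely as an illustration, with a hypothetical keyword argument, limiting pinned memory usage might look like:

handle = multi_tensor_copier.start_copy(
    nested_data,
    device="cuda:0",
    use_pinned_staging=False,  # hypothetical option name: skip staging into pinned buffers
)
result = handle.get()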

An example benchmark with 528 small tensors (simplified synthetic metadata for object detection, similar to StreamPETR) shows an ~8x speedup over per-tensor .to() calls. See the included examples, evaluation script, and documentation.

optim_test_tools – Enhancements

  • TensorDumper: added ranges for context/disambiguation, a PICKLE dump type, and an allow_missing_data_in_previous comparison option.

  • Numba NVTX: new numba_nvtx submodule enabling the use of NVTX ranges from Numba-compiled CPU code (a sketch follows below).
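
The submodule's actual API is not spelled out in these notes; a minimal sketch, assuming a hypothetical push_range / pop_range pair that is callable from nopython-compiled functions:

from numba import njit

from optim_test_tools import numba_nvtx  # assumed import path

@njit
def summate(n):
    numba_nvtx.push_range("summate-loop")  # hypothetical API: open an NVTX range
    total = 0
    for i in range(n):
        total += i
    numba_nvtx.pop_range()                 # hypothetical API: close the range
    return total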

Documentation and Tooling

  • Changes in package_manager.sh

    • Now uses --no-deps for wheels by default; added options to control build isolation.

    • Fixed an issue where installing a new version over an existing one would leave the old version in place alongside the new one.

  • Various documentation improvements (e.g. online documentation, references to related resources such as videos and the WeChat discussion group, and improvements across various guides).

  • Pinned the cupy version in the Dockerfile to 13.6.0 to avoid compatibility issues.