Current Release (0.2.0)
on_demand_video_decoder
New features:
- AV1 codec support: decode AV1-encoded video alongside H.264 and HEVC.
- Stream async decoder: asynchronous frame decoding with prefetching via DecodeN12ToRGBAsync()/DecodeN12ToRGBAsyncGetBuffer() in PyNvSampleReader. See the new SampleStreamAsyncAccess.py sample for a usage walkthrough.
- Cached GOP decode: the new CachedGopDecoder wraps PyNvGopDecoder with transparent GOP caching. When the same GOP is requested again (e.g. for a different frame within the same GOP), the cached result is returned directly, avoiding redundant demuxing. Controlled via the useGOPCache parameter in GetGOP()/GetGOPList().
- GPU memory management: new release_device_memory() and release_decoder() methods to explicitly free GPU resources without destroying the decoder.
- Drop page cache utility: call drop_videos_cache() to evict video files from the OS page cache. This frees host memory occupied by cached pages of videos that are no longer needed, e.g. when switching to a different video dataset during training, and enables reproducible I/O benchmarks by ensuring reads hit disk rather than the page cache.
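The GOP caching behavior can be pictured with a small stand-in. This is purely illustrative (the names fetch_gop and demux_calls are hypothetical, and the real CachedGopDecoder caches decoded GOPs rather than Python lists); it only shows the memoization pattern: the expensive demux/decode path runs once per GOP, and later frame requests within the same GOP reuse the cached result.

```python
from functools import lru_cache

demux_calls = 0  # counts how often the expensive demux/decode path actually runs

@lru_cache(maxsize=8)
def fetch_gop(gop_index):
    """Hypothetical stand-in for a cached GOP decoder: demux + decode once, then reuse."""
    global demux_calls
    demux_calls += 1
    return [f"frame-{gop_index}-{i}" for i in range(4)]  # fake decoded frames

# Two different frames from the same GOP: the second request hits the cache,
# so demuxing happens only once.
frame_a = fetch_gop(7)[0]
frame_b = fetch_gop(7)[2]
```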
Improvements:
Concurrency: existing decode and demux bindings now release the Python GIL, improving performance in multi-threaded Python code.
Various stability and error-message improvements.
New Package: multi_tensor_copier – Fast & Easy Transfers of Many Small Tensors
A new package for efficiently and easily copying many (small) tensors between devices in a single call. Pass any nested structure of lists, tuples, and dicts containing tensors (or NumPy arrays) and get them transferred, optimized end-to-end with a focus on CPU -> GPU transfers:
- Automatically packs many small CPU tensors (mixed dtypes allowed) into one buffer for a single host-to-device copy, drastically reducing per-tensor overhead.
- Stages non-pinned memory into pinned buffers (in parallel) so that all tensor transfers can proceed asynchronously.
- Distributes work across multiple CUDA streams.
- Runs the orchestration on a background C++ thread, keeping the Python thread free.
While the optimization is focused on CPU -> GPU, the package also supports GPU -> CPU, GPU -> GPU, and CPU -> CPU directions (with some of the optimizations also applied to those).
The individual optimizations may be optionally turned off (e.g. to limit pinned memory usage).
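The packing optimization described above can be sketched in plain NumPy. This is a simplified illustration, not the package's API or implementation: many small mixed-dtype arrays are flattened into one contiguous byte buffer so that a single bulk copy (here a plain .copy() standing in for the host-to-device transfer) replaces hundreds of tiny per-tensor copies.

```python
import numpy as np

def pack(arrays):
    """Pack many small arrays (mixed dtypes) into one contiguous byte buffer."""
    chunks = [np.ascontiguousarray(a).view(np.uint8).ravel() for a in arrays]
    offsets = np.cumsum([0] + [c.size for c in chunks])  # byte offsets per array
    buf = np.concatenate(chunks)                         # one staging buffer
    meta = [(a.dtype, a.shape) for a in arrays]          # needed to rebuild views
    return buf, offsets, meta

def unpack(buf, offsets, meta):
    """Rebuild the original arrays as views into the transferred buffer."""
    out = []
    for i, (dtype, shape) in enumerate(meta):
        raw = buf[offsets[i]:offsets[i + 1]]
        out.append(raw.view(dtype).reshape(shape))
    return out

arrays = [np.arange(4, dtype=np.float32),
          np.ones((2, 3), dtype=np.int64),
          np.array([1, 0, 1], dtype=np.uint8)]
buf, offsets, meta = pack(arrays)
transferred = buf.copy()   # stands in for the single host-to-device copy
restored = unpack(transferred, offsets, meta)
```

The same bookkeeping (offsets plus dtype/shape metadata) is what lets a real implementation slice per-tensor device views out of one large device allocation after a single copy.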
Usage is straightforward:
handle = multi_tensor_copier.start_copy(nested_data, device="cuda:0")
# ... do other work while the copy runs ...
result = handle.get() # same nested structure, now on GPU
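The handle follows the familiar future pattern. A minimal Python stand-in (purely illustrative; the real package runs the work on a background C++ thread and actually moves the data) shows the shape of the API:

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=1)  # stands in for the background thread

class CopyHandle:
    """Mimics the handle returned by start_copy(): get() blocks until done."""
    def __init__(self, future):
        self._future = future

    def get(self):
        return self._future.result()

def start_copy(nested_data, device="cuda:0"):
    # Toy version: the real package packs, stages, and copies asynchronously;
    # here the worker thread just returns the structure unchanged.
    def work():
        return nested_data  # placeholder for "same structure, now on `device`"
    return CopyHandle(_pool.submit(work))

handle = start_copy({"boxes": [1, 2, 3], "labels": (4, 5)})
result = handle.get()
```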
~8x speedup over per-tensor .to() calls in an example benchmark with 528
small tensors (simplified synthetic metadata for object detection, similar
to StreamPETR). See the included examples, evaluation script, and documentation.
optim_test_tools – Enhancements
- TensorDumper: added ranges for context/disambiguation, a PICKLE dump type, and an allow_missing_data_in_previous comparison option.
- Numba NVTX: new numba_nvtx submodule enabling use of NVTX ranges from Numba-compiled CPU code.
Documentation and Tooling
Changes in package_manager.sh:
- Now uses --no-deps for wheels by default; added options for build-isolation control.
- Fixed an issue where re-installing a new version over an existing one would leave the old version in place alongside the new one.
Various documentation improvements (e.g. online documentation, references to related resources such as videos and the WeChat discussion group, and improvements to various guides).
Pinned the cupy version in the Dockerfile to 13.6.0 to avoid compatibility issues.