cuda.core 1.0.0 Release Notes
Highlights
TBD
New features
Added managed-memory range operations to
cuda.core.utils: Location, advise(), prefetch(), discard(), and discard_prefetch(). Each operation accepts either a single managed Buffer or a sequence; with cuda.bindings 12.8+ the N>1 case dispatches to the corresponding cuMem*BatchAsync driver entry point, addressing the managed-memory portion of #1333. Locations are expressed via the typed Location dataclass (with classmethod constructors device, host, host_numa, and host_numa_current); Device and int values are still accepted for ergonomic compatibility.
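A minimal sketch of how the new operations might be used. The prefetch() signature (positional location, stream keyword) is an assumption inferred from the description above and the underlying cuMemPrefetchAsync driver call, not confirmed API; the sketch degrades to a message when no CUDA stack is present.

```python
# Hedged sketch of the managed-memory range operations described above.
# The prefetch() signature and its stream keyword are assumptions based on
# the release note; adjust to the published API.
try:
    from cuda.core.experimental import Device
    from cuda.core.utils import Location, prefetch
    HAVE_CUDA = True
except Exception:
    HAVE_CUDA = False

def warm_buffers(buffers, device_id=0):
    """Prefetch one managed Buffer, or a sequence of them, to a device.

    With cuda.bindings 12.8+, passing a sequence (N > 1) should dispatch
    to the batched cuMem*BatchAsync entry point per the release note.
    """
    dev = Device(device_id)
    dev.set_current()
    stream = dev.create_stream()
    # Location.device() is one of the typed classmethod constructors;
    # a plain Device or int is also accepted for compatibility.
    loc = Location.device(device_id)
    prefetch(buffers, loc, stream=stream)  # assumed keyword name
    stream.sync()

if not HAVE_CUDA:
    print("cuda.core with managed-memory support not available")
```

Passing a single Buffer and passing a sequence go through the same call; the batching decision is made inside the library based on the argument and the cuda.bindings version.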
Fixes and enhancements
StridedMemoryView now provides a fast path for torch.Tensor objects via PyTorch’s AOT Inductor (AOTI) stable C ABI. When a torch.Tensor is passed to any from_* classmethod (from_dlpack, from_cuda_array_interface, from_array_interface, or from_any_interface), tensor metadata is read directly from the underlying C struct, bypassing the DLPack and CUDA Array Interface protocol overhead. This yields ~7-20x faster StridedMemoryView construction for PyTorch tensors (depending on whether stream ordering is required). Proper CUDA stream ordering is established between PyTorch’s current stream and the consumer stream, matching the DLPack synchronization contract. Requires PyTorch >= 2.3. (#749)
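Under the stated behavior, the fast path needs no opt-in: any of the from_* constructors should take it automatically when handed a torch.Tensor. A minimal sketch, with the import path and the stream_ptr keyword assumed from the existing StridedMemoryView API, guarded so it degrades without PyTorch or a GPU:

```python
# Hedged sketch: constructing a StridedMemoryView from a torch.Tensor.
# The import path and the from_any_interface signature are assumptions
# based on the release note; PyTorch >= 2.3 is required for the AOTI
# fast path. Degrades gracefully when torch or CUDA is unavailable.
try:
    import torch
    from cuda.core.experimental.utils import StridedMemoryView
    HAVE_STACK = torch.cuda.is_available()
except Exception:
    HAVE_STACK = False

if HAVE_STACK:
    t = torch.arange(6, device="cuda", dtype=torch.float32).reshape(2, 3)
    # The fast path reads shape/strides/dtype straight from the tensor's
    # C struct instead of going through DLPack, and orders the consumer
    # stream after torch's current stream (assumed stream_ptr keyword).
    stream = torch.cuda.current_stream()
    view = StridedMemoryView.from_any_interface(
        t, stream_ptr=stream.cuda_stream
    )
    print(view.shape)
else:
    print("torch with CUDA not available; fast path not exercised")
```

No code changes are required to benefit: existing callers of the from_* classmethods pick up the faster construction automatically when the input is a torch.Tensor.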