# cuda.core 0.7.0 Release Notes

## Highlights
- Introduced support for explicit graph construction. CUDA graphs can now be built programmatically by adding nodes and edges, and their topology can be modified after construction.
- Added CUDA-OpenGL interoperability support, enabling zero-copy sharing of GPU memory between CUDA compute kernels and OpenGL renderers.
- Added `TensorMapDescriptor` for Hopper+ TMA (Tensor Memory Accelerator) bulk data movement, with automatic kernel argument integration.
- `StridedMemoryView` now supports DLPack export, so views can be consumed through the array-API `from_dlpack()`.
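The DLPack export highlight means a `StridedMemoryView` can now be handed to any array library's `from_dlpack()`. As an illustrative toy in plain Python (not cuda.core code), the exporter side of the protocol is just two dunder methods:

```python
# Toy sketch of the DLPack exporter protocol that StridedMemoryView
# now implements. The class and values below are illustrative; a real
# exporter returns a PyCapsule wrapping a DLManagedTensor.
kDLCPU = 1  # DLPack device-type enum value for a CPU buffer


class ToyExporter:
    def __dlpack_device__(self):
        # Returns (device_type, device_id). A CUDA exporter would
        # report kDLCUDA (2) or kDLCUDAManaged (13) instead of kDLCPU.
        return (kDLCPU, 0)

    def __dlpack__(self, *, stream=None):
        # A real implementation returns a capsule named "dltensor";
        # this toy has no actual buffer to hand over.
        raise NotImplementedError("toy exporter: no real buffer")


exporter = ToyExporter()
print(exporter.__dlpack_device__())  # (1, 0)
```

A consumer such as `from_dlpack()` first calls `__dlpack_device__()` to decide where the data lives, then `__dlpack__()` to take ownership of the tensor.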
## New features
- Added the `cuda.core.graph` public module containing `GraphDef` for explicit graph construction, typed node subclasses, and supporting types. `GraphBuilder` (stream capture) also moves into this module.
- Added `callback()` for CPU callbacks during stream capture, mirroring the existing `callback()` API.
- Added `GraphicsResource` for CUDA-OpenGL interoperability. Factory classmethods `from_gl_buffer()` and `from_gl_image()` register OpenGL objects for CUDA access, and mapping returns a `Buffer` for zero-copy kernel use.
- Added `TensorMapDescriptor` wrapping the CUDA driver's `CUtensorMap` for Hopper+ TMA (Tensor Memory Accelerator) bulk data movement. `StridedMemoryView` gains an `as_tensor_map()` method for convenient descriptor creation, with automatic dtype inference, stride computation, and first-class kernel argument integration.
- Added DLPack export support to `StridedMemoryView` via `__dlpack__` and `__dlpack_device__`, complementing the existing import path.
- Added the DLPack C exchange API (`__dlpack_c_exchange_api__`) to `StridedMemoryView`.
- Added NVRTC precompiled header (PCH) support (CUDA 12.8+). `ProgramOptions` gains `pch`, `create_pch`, `use_pch`, `pch_dir`, and related options. `Program.pch_status` reports the PCH creation outcome, and `compile()` automatically resizes the NVRTC PCH heap and retries when PCH creation fails due to heap exhaustion.
- Added NUMA-aware managed memory pool placement. `ManagedMemoryResourceOptions` gains a `preferred_location_type` option (`"device"`, `"host"`, or `"host_numa"`), and `ManagedMemoryResource.preferred_location` queries the resolved location. The existing `preferred_location` parameter retains full backwards compatibility.
- Added NUMA-aware pinned memory pool placement. `PinnedMemoryResourceOptions` gains a `numa_id` option, and `PinnedMemoryResource.numa_id` queries the host NUMA node ID used for pool placement. When `ipc_enabled=True` and `numa_id` is not set, the NUMA node is automatically derived from the current CUDA device.
- Added support for CUDA 13.2.
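The PCH behavior described above, where `compile()` grows the NVRTC PCH heap and retries on exhaustion, follows a generic retry-with-growth pattern. A minimal plain-Python sketch of that strategy, with a hypothetical `try_compile` callable standing in for the NVRTC call (cuda.core performs this resizing internally):

```python
# Generic retry-with-doubling sketch of the PCH heap-exhaustion
# strategy the release notes describe. `try_compile` is a hypothetical
# callable that raises MemoryError while the heap is too small.
def compile_with_pch_retry(try_compile, heap_size=1 << 20, max_attempts=5):
    for _ in range(max_attempts):
        try:
            return try_compile(heap_size)
        except MemoryError:
            heap_size *= 2  # grow the PCH heap and try again
    raise MemoryError("PCH heap still exhausted after retries")


# Demo: this fake backend only succeeds once the heap reaches 4 MiB,
# so the helper retries twice before returning.
def fake_compile(size):
    if size < (1 << 22):
        raise MemoryError("simulated PCH heap exhaustion")
    return "ok"


print(compile_with_pch_retry(fake_compile))  # ok
```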
## New examples

- `gl_interop_plasma.py`: Real-time plasma effect demonstrating CUDA-OpenGL interoperability via `GraphicsResource`.
- `tma_tensor_map.py`: TMA bulk data movement using `TensorMapDescriptor` on Hopper+ GPUs.
## Fixes and enhancements
- Fixed managed memory buffers being misclassified as `kDLCUDAHost` in DLPack device mapping. They are now correctly reported as `kDLCUDAManaged`. (#1863)
- Fixed IPC-enabled pinned memory pools using a hardcoded NUMA node ID of `0` instead of the NUMA node closest to the active CUDA device. On multi-NUMA systems where the device is attached to a non-zero host NUMA node, this could cause pool creation or allocation failures. (#1603)
- Fixed `DeviceMemoryResource.peer_accessible_by` returning stale results when wrapping a non-owned (default) memory pool. The property now always queries the CUDA driver for non-owned pools, so multiple wrappers around the same pool see consistent state. (#1720)
- Fixed a bare `except` clause in stream acceptance that silently swallowed all exceptions, including `KeyboardInterrupt` and `SystemExit`. Only the expected "protocol not supported" case is now caught. (#1631)
- `StridedMemoryView` now validates strides at construction time so unsupported layouts fail immediately instead of on first metadata access. (#1429)
- IPC file descriptor cleanup now uses a C++ `shared_ptr` with a POSIX deleter, avoiding cryptic errors when a `DeviceMemoryResource` is destroyed during Python shutdown.
- Improved error message when `ManagedMemoryResource` is called without options on platforms that lack a default managed memory pool (e.g. WSL2). (#1617)
- Handle properties on core API objects now return `None` during Python shutdown instead of crashing.
- Reduced Python overhead in `Program` and `Linker` by moving compilation and linking operations to the C level and releasing the GIL during backend calls. This benefits workloads that create many programs or linkers, and enables concurrent compilation in multithreaded applications.
- Error enum explanations are now derived from `cuda-bindings` docstrings when available (bindings 12.9.6+ or 13.2.0+), with frozen tables as fallback for older versions.
- Improved optional dependency handling for NVVM and nvJitLink imports so that only genuinely missing optional modules are treated as unavailable; unrelated import failures now surface normally, and `cuda.core` now depends directly on `cuda-pathfinder`.