# cuda.core 0.7.0 Release Notes

## Highlights
- Introduced support for explicit graph construction. CUDA graphs can now be built programmatically by adding nodes and edges, and their topology can be modified after construction.
- Added CUDA-OpenGL interoperability support, enabling zero-copy sharing of GPU memory between CUDA compute kernels and OpenGL renderers.
- Added `TensorMapDescriptor` for Hopper+ TMA (Tensor Memory Accelerator) bulk data movement, with automatic kernel argument integration.
- `StridedMemoryView` now supports DLPack export, so views can be consumed through the array-API `from_dlpack()`.
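The DLPack export highlight means a `StridedMemoryView` can now be handed to any array library's `from_dlpack()`. As an illustrative toy in plain Python (not cuda.core code), the exporter side of the protocol is just two dunder methods:

```python
# Toy sketch of the DLPack exporter protocol that StridedMemoryView
# now implements. The class and values below are illustrative; a real
# exporter returns a PyCapsule wrapping a DLManagedTensor.
kDLCPU = 1  # DLPack device-type enum value for a CPU buffer


class ToyExporter:
    def __dlpack_device__(self):
        # Returns (device_type, device_id). A CUDA exporter would
        # report kDLCUDA (2) or kDLCUDAManaged (13) instead of kDLCPU.
        return (kDLCPU, 0)

    def __dlpack__(self, *, stream=None):
        # A real implementation returns a capsule named "dltensor";
        # this toy has no actual buffer to hand over.
        raise NotImplementedError("toy exporter: no real buffer")


exporter = ToyExporter()
print(exporter.__dlpack_device__())  # (1, 0)
```

A consumer such as `from_dlpack()` first calls `__dlpack_device__()` to decide where the data lives, then `__dlpack__()` to take ownership of the tensor.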
## New features
- Added the `cuda.core.graph` public module containing `GraphDef` for explicit graph construction, typed node subclasses, and supporting types. `GraphBuilder` (stream capture) also moves into this module.
- Added `callback()` for CPU callbacks during stream capture, mirroring the existing `callback()` API.
- Added `GraphicsResource` for CUDA-OpenGL interoperability. Factory classmethods `from_gl_buffer()` and `from_gl_image()` register OpenGL objects for CUDA access, and mapping returns a `Buffer` for zero-copy kernel use.
- Added `TensorMapDescriptor` wrapping the CUDA driver's `CUtensorMap` for Hopper+ TMA (Tensor Memory Accelerator) bulk data movement. `StridedMemoryView` gains an `as_tensor_map()` method for convenient descriptor creation, with automatic dtype inference, stride computation, and first-class kernel argument integration.
- Added DLPack export support to `StridedMemoryView` via `__dlpack__` and `__dlpack_device__`, complementing the existing import path.
- Added the DLPack C exchange API (`__dlpack_c_exchange_api__`) to `StridedMemoryView`.
- Added NVRTC precompiled header (PCH) support (CUDA 12.8+). `ProgramOptions` gains `pch`, `create_pch`, `use_pch`, `pch_dir`, and related options. `Program.pch_status` reports the PCH creation outcome, and `compile()` automatically resizes the NVRTC PCH heap and retries when PCH creation fails due to heap exhaustion.
- Added NUMA-aware managed memory pool placement. `ManagedMemoryResourceOptions` gains a `preferred_location_type` option (`"device"`, `"host"`, or `"host_numa"`), and `ManagedMemoryResource.preferred_location` queries the resolved location. The existing `preferred_location` parameter retains full backwards compatibility.
- Added NUMA-aware pinned memory pool placement. `PinnedMemoryResourceOptions` gains a `numa_id` option, and `PinnedMemoryResource.numa_id` queries the host NUMA node ID used for pool placement. When `ipc_enabled=True` and `numa_id` is not set, the NUMA node is automatically derived from the current CUDA device.
- Added support for CUDA 13.2.
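The PCH behavior described above, where `compile()` grows the NVRTC PCH heap and retries on exhaustion, follows a generic retry-with-growth pattern. A minimal plain-Python sketch of that strategy, with a hypothetical `try_compile` callable standing in for the NVRTC call (cuda.core performs this resizing internally):

```python
# Generic retry-with-doubling sketch of the PCH heap-exhaustion
# strategy the release notes describe. `try_compile` is a hypothetical
# callable that raises MemoryError while the heap is too small.
def compile_with_pch_retry(try_compile, heap_size=1 << 20, max_attempts=5):
    for _ in range(max_attempts):
        try:
            return try_compile(heap_size)
        except MemoryError:
            heap_size *= 2  # grow the PCH heap and try again
    raise MemoryError("PCH heap still exhausted after retries")


# Demo: this fake backend only succeeds once the heap reaches 4 MiB,
# so the helper retries twice before returning.
def fake_compile(size):
    if size < (1 << 22):
        raise MemoryError("simulated PCH heap exhaustion")
    return "ok"


print(compile_with_pch_retry(fake_compile))  # ok
```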
## New examples

- `gl_interop_plasma.py`: Real-time plasma effect demonstrating CUDA-OpenGL interoperability via `GraphicsResource`.
- `tma_tensor_map.py`: TMA bulk data movement using `TensorMapDescriptor` on Hopper+ GPUs.
## Fixes and enhancements
- Fixed managed memory buffers being misclassified as `kDLCUDAHost` in DLPack device mapping. They are now correctly reported as `kDLCUDAManaged`. (#1863)
- Fixed IPC-enabled pinned memory pools using a hardcoded NUMA node ID of `0` instead of the NUMA node closest to the active CUDA device. On multi-NUMA systems where the device is attached to a non-zero host NUMA node, this could cause pool creation or allocation failures. (#1603)
- Fixed `DeviceMemoryResource.peer_accessible_by` returning stale results when wrapping a non-owned (default) memory pool. The property now always queries the CUDA driver for non-owned pools, so multiple wrappers around the same pool see consistent state. (#1720)
- Fixed a bare `except` clause in stream acceptance that silently swallowed all exceptions, including `KeyboardInterrupt` and `SystemExit`. Only the expected "protocol not supported" case is now caught. (#1631)
- `StridedMemoryView` now validates strides at construction time so unsupported layouts fail immediately instead of on first metadata access. (#1429)
- IPC file descriptor cleanup now uses a C++ `shared_ptr` with a POSIX deleter, avoiding cryptic errors when a `DeviceMemoryResource` is destroyed during Python shutdown.
- Improved error message when `ManagedMemoryResource` is called without options on platforms that lack a default managed memory pool (e.g. WSL2). (#1617)
- Handle properties on core API objects now return `None` during Python shutdown instead of crashing.
- Reduced Python overhead in `Program` and `Linker` by moving compilation and linking operations to the C level and releasing the GIL during backend calls. This benefits workloads that create many programs or linkers, and enables concurrent compilation in multithreaded applications.
- Error enum explanations are now derived from `cuda-bindings` docstrings when available (bindings 12.9.6+ or 13.2.0+), with frozen tables as fallback for older versions.
- Improved optional dependency handling for NVVM and nvJitLink imports so that only genuinely missing optional modules are treated as unavailable; unrelated import failures now surface normally, and `cuda.core` now depends directly on `cuda-pathfinder`.