cuda.core 0.7.0 Release Notes

Released on Apr 7, 2026

Highlights

  • Introduced support for explicit graph construction. CUDA graphs can now be built programmatically by adding nodes and edges, and their topology can be modified after construction.

  • Added CUDA-OpenGL interoperability support, enabling zero-copy sharing of GPU memory between CUDA compute kernels and OpenGL renderers.

  • Added TensorMapDescriptor for Hopper+ TMA (Tensor Memory Accelerator) bulk data movement, with automatic kernel argument integration.

  • StridedMemoryView now supports DLPack export via the __dlpack__ protocol, so consumers can ingest views through the from_dlpack() array API function.

New features

  • Added the cuda.core.graph public module containing GraphDef for explicit graph construction, typed node subclasses, and supporting types. GraphBuilder (stream capture) also moves into this module.

  • Added a callback() API for enqueuing CPU callbacks during stream capture, mirroring the equivalent stream-side callback API.

  • Added GraphicsResource for CUDA-OpenGL interoperability. Factory classmethods from_gl_buffer() and from_gl_image() register OpenGL objects for CUDA access, and mapping returns a Buffer for zero-copy kernel use.

  • Added TensorMapDescriptor wrapping the CUDA driver’s CUtensorMap for Hopper+ TMA (Tensor Memory Accelerator) bulk data movement. StridedMemoryView gains an as_tensor_map() method for convenient descriptor creation, with automatic dtype inference, stride computation, and first-class kernel argument integration.

  • Added DLPack export support to StridedMemoryView via __dlpack__ and __dlpack_device__, complementing the existing import path.

  • Added the DLPack C exchange API (__dlpack_c_exchange_api__) to StridedMemoryView.

  • Added NVRTC precompiled header (PCH) support (CUDA 12.8+). ProgramOptions gains pch, create_pch, use_pch, pch_dir, and related options. Program.pch_status reports the PCH creation outcome, and compile() automatically resizes the NVRTC PCH heap and retries when PCH creation fails due to heap exhaustion.
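The automatic resize-and-retry behavior can be sketched generically. The error type and compile hook below are hypothetical stand-ins for illustration, not cuda.core names; the real implementation drives NVRTC's PCH heap size internally.

```python
class PchHeapExhausted(Exception):
    """Stand-in for a 'PCH creation failed: heap too small' condition."""
    def __init__(self, required_size):
        super().__init__(f"PCH heap too small; need {required_size} bytes")
        self.required_size = required_size

def compile_with_pch_retry(compile_fn, heap_size, max_retries=3):
    """Call compile_fn(heap_size); on heap exhaustion, grow the heap and retry."""
    for _ in range(max_retries):
        try:
            return compile_fn(heap_size)
        except PchHeapExhausted as exc:
            # Grow to at least the reported requirement, or double.
            heap_size = max(exc.required_size, heap_size * 2)
    return compile_fn(heap_size)  # final attempt; any error surfaces normally

# Demo with a fake compiler that needs a 1 KiB heap.
attempts = []
def fake_compile(heap_size):
    attempts.append(heap_size)
    if heap_size < 1024:
        raise PchHeapExhausted(required_size=1024)
    return "cubin"

result = compile_with_pch_retry(fake_compile, heap_size=256)
```

Here the first attempt fails at 256 bytes, the heap is resized to the reported 1024, and the second attempt succeeds.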

  • Added NUMA-aware managed memory pool placement. ManagedMemoryResourceOptions gains a preferred_location_type option ("device", "host", or "host_numa"), and ManagedMemoryResource.preferred_location queries the resolved location. The existing preferred_location parameter retains full backwards compatibility.

  • Added NUMA-aware pinned memory pool placement. PinnedMemoryResourceOptions gains a numa_id option, and PinnedMemoryResource.numa_id queries the host NUMA node ID used for pool placement. When ipc_enabled=True and numa_id is not set, the NUMA node is automatically derived from the current CUDA device.

  • Added support for CUDA 13.2.

New examples

  • gl_interop_plasma.py: Real-time plasma effect demonstrating CUDA-OpenGL interoperability via GraphicsResource.

  • tma_tensor_map.py: TMA bulk data movement using TensorMapDescriptor on Hopper+ GPUs.

Fixes and enhancements

  • Fixed managed memory buffers being misclassified as kDLCUDAHost in DLPack device mapping. They are now correctly reported as kDLCUDAManaged. (#1863)
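For reference, the device types involved, with values taken from the DLPack specification:

```python
from enum import IntEnum

class DLDeviceType(IntEnum):
    """Subset of DLPack device types relevant to this fix (values per the
    DLPack specification)."""
    kDLCPU = 1
    kDLCUDA = 2
    kDLCUDAHost = 3      # pinned host memory: the previous misclassification
    kDLCUDAManaged = 13  # managed (unified) memory: the correct report

# __dlpack_device__ returns a (device_type, device_id) pair; a managed
# buffer on device 0 should now report:
managed_device = (DLDeviceType.kDLCUDAManaged, 0)
```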

  • Fixed IPC-enabled pinned memory pools using a hardcoded NUMA node ID of 0 instead of the NUMA node closest to the active CUDA device. On multi-NUMA systems where the device is attached to a non-zero host NUMA node, this could cause pool creation or allocation failures. (#1603)

  • Fixed DeviceMemoryResource.peer_accessible_by returning stale results when wrapping a non-owned (default) memory pool. The property now always queries the CUDA driver for non-owned pools, so multiple wrappers around the same pool see consistent state. (#1720)

  • Fixed a bare except clause in stream acceptance that silently swallowed all exceptions, including KeyboardInterrupt and SystemExit. Only the expected “protocol not supported” case is now caught. (#1631)
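The shape of the fix can be illustrated with a minimal probe. The helper below is a sketch, not cuda.core's actual code; it assumes the foreign-stream protocol attribute `__cuda_stream__`, with an illustrative (version, handle) value.

```python
def supports_stream_protocol(obj):
    """Probe for the stream protocol, catching only the expected failure.

    A bare `except:` here would also swallow KeyboardInterrupt and
    SystemExit; catching AttributeError alone lets those propagate.
    """
    try:
        obj.__cuda_stream__
    except AttributeError:  # the expected "protocol not supported" case
        return False
    return True

class FakeStream:
    __cuda_stream__ = (0, 1234)  # illustrative (version, handle) value

class Interrupting:
    @property
    def __cuda_stream__(self):
        raise KeyboardInterrupt  # must escape, not be treated as "unsupported"
```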

  • StridedMemoryView now validates strides at construction time so unsupported layouts fail immediately instead of on first metadata access. (#1429)

  • IPC file descriptor cleanup now uses a C++ shared_ptr with a POSIX deleter, avoiding cryptic errors when a DeviceMemoryResource is destroyed during Python shutdown.

  • Improved error message when ManagedMemoryResource is called without options on platforms that lack a default managed memory pool (e.g. WSL2). (#1617)

  • Handle properties on core API objects now return None during Python shutdown instead of crashing.
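One common pattern for this, sketched here with `sys.is_finalizing()` (a sketch, not cuda.core's actual implementation):

```python
import sys

class HandleOwner:
    """Sketch of a shutdown-safe handle property."""
    def __init__(self, handle):
        self._handle = handle

    @property
    def handle(self):
        # During interpreter shutdown, module globals and native state the
        # handle depends on may already be torn down; return None rather
        # than touch them and crash.
        if sys.is_finalizing():
            return None
        return self._handle

owner = HandleOwner(handle=42)
```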

  • Reduced Python overhead in Program and Linker by moving compilation and linking operations to the C level and releasing the GIL during backend calls. This benefits workloads that create many programs or linkers, and enables concurrent compilation in multithreaded applications.
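With the GIL released during backend calls, compilation parallelizes with ordinary threads. A sketch of the pattern, using a stand-in for the real compile call (which would be something like Program(...).compile(...)):

```python
from concurrent.futures import ThreadPoolExecutor

def compile_one(source):
    """Stand-in for a real compilation; because the GIL is released
    during the backend call, real compilations overlap across threads."""
    return f"cubin<{source}>"

sources = ["kernel_a", "kernel_b", "kernel_c"]
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() preserves input order in its results.
    cubins = list(pool.map(compile_one, sources))
```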

  • Error enum explanations are now derived from cuda-bindings docstrings when available (bindings 12.9.6+ or 13.2.0+), with frozen tables as fallback for older versions.

  • Improved optional dependency handling for NVVM and nvJitLink imports: only genuinely missing optional modules are treated as unavailable, while unrelated import failures surface normally. cuda.core now depends directly on cuda-pathfinder.
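The distinction between "genuinely missing" and "broken import" can be drawn by inspecting ImportError's name attribute; a minimal sketch of the pattern, not cuda.core's exact code:

```python
import importlib

def optional_import(module_name):
    """Return the module, or None only if it is genuinely absent.

    An ImportError whose .name matches the requested module means the
    module is simply not installed. Any other ImportError (e.g. a
    dependency of the module failing to import) is re-raised so real
    breakage is not silently masked as "unavailable".
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        if exc.name == module_name:
            return None
        raise

math_mod = optional_import("math")
missing = optional_import("definitely_not_installed_abcxyz")
```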