`cuda.core` 1.1.0 Release Notes

Highlights

cuda.core now ships with .pyi type stubs for all public APIs, giving IDEs and type checkers full autocompletion and static analysis.
New cuda.core.texture module for texture and surface memory: OpaqueArray, MipmappedArray, TextureObject, and SurfaceObject, constructed through the corresponding Device.create_* factories.
Richer managed-memory support: the new ManagedBuffer exposes a property-style advice API (read_mostly, preferred_location, accessed_by) with NUMA-aware host locations via the new Host type, plus batched range operations in cuda.core.utils for prefetching and discarding many buffers at once.
CUDA 13.3 toolkit support. (#2139)

New features

Added Host as the symmetric counterpart of Device for expressing managed-memory locations: Host() (any host), Host(numa_id=N) (specific NUMA node), and Host.numa_current() (calling thread’s NUMA node). (#1775)
Added ManagedBuffer, a Buffer subclass returned by ManagedMemoryResource.allocate() that exposes a property-style advice API:
- buf.read_mostly (bool) — driver-backed get/set.
- buf.preferred_location (Device | Host | None) — driver-backed get/set; assigning None unsets.
- buf.accessed_by — a live, set-like view; add() / discard() issue advice, iteration queries the driver.
- buf.prefetch(location, *, stream), buf.discard(*, stream), buf.discard_prefetch(location, *, stream) — instance methods that delegate to the matching free functions.
Use ManagedBuffer.from_handle() to wrap an existing managed-memory pointer. (#1775)
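As a rough illustration of the advice properties listed above, the following sketch configures a managed buffer that is mostly read from one device. It uses only the attribute and method names given in these notes. The helper name `apply_read_heavy_advice` is hypothetical, and `buf`, `device`, `host`, and `stream` are assumed to be a ManagedBuffer, a Device, a Host, and a stream created elsewhere.

```python
def apply_read_heavy_advice(buf, device, host, stream):
    """Hypothetical helper: advise the driver about a mostly-read managed buffer.

    Assumes `buf` exposes the ManagedBuffer properties described in these notes.
    """
    buf.read_mostly = True               # driver-backed property set
    buf.preferred_location = device      # assigning None would unset it
    buf.accessed_by.add(host)            # set-like view; add() issues advice
    buf.prefetch(device, stream=stream)  # stream is keyword-only
    return buf
```

Because the helper relies only on attribute access, it can be exercised with a stand-in object in tests that run without a GPU.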
Added batched managed-memory range operations to cuda.core.utils (CUDA 13+): prefetch_batch(), discard_batch(), and discard_prefetch_batch(). Each takes a sequence of managed Buffer instances and dispatches to the corresponding cuMem*BatchAsync driver entry point, addressing the managed-memory portion of #1333. Single-buffer operations are exposed as instance methods on ManagedBuffer (prefetch(), discard(), discard_prefetch()) and as property setters (read_mostly, preferred_location, accessed_by). Locations are expressed via Device or Host.
Added system.Device.get_nvlink_count() and system.Device.get_nvlinks() for device-specific NVLink enumeration. These APIs avoid relying on the static NVML NVML_NVLINK_MAX_LINKS macro when querying the links available on a particular device. (#2192)
Added the graph.GraphBuilder.graph_definition property, which exposes a captured graph as an explicit graph.GraphDefinition view sharing ownership of the same underlying graph. This enables hybrid flows that mix the capture and explicit graph-building APIs, such as inspecting or augmenting a captured graph, or populating a conditional body entirely through the explicit API. (#2026)
Added the cuda.core.texture module for texture and surface memory: OpaqueArray and MipmappedArray for hardware-laid-out array allocations, and TextureObject and SurfaceObject for bindless kernel-side sampled reads and typed load/store. Objects are constructed from a ResourceDescriptor via Device.create_opaque_array(), Device.create_mipmapped_array(), Device.create_texture_object(), and Device.create_surface_object(). (#467, #2095, #2307)
cuda.core now ships with .pyi stubs for all public APIs, enabling users’ IDEs and type checkers to provide better autocompletion and static analysis. (#2061)
ObjectCode and Program now accept path-like inputs in addition to strings and bytes. (#2123)
Exposed a Buffer.size accessor to Python. (#2068, closes #2049)
Extended WorkqueueResource and WorkqueueResourceOptions to cover the full driver-side workqueue-config surface. Added concurrency_limit to WorkqueueResourceOptions for configuring the expected maximum number of concurrent stream-ordered workloads, and added read-only sharing_scope, concurrency_limit, and device properties to WorkqueueResource for round-tripping the driver-populated values. Added the WorkqueueSharingScopeType StrEnum, which WorkqueueResourceOptions.sharing_scope accepts in addition to raw strings. (#2329, #2330)

Fixes and enhancements

On WSL, cuda.core.system.get_process_name previously raised a UnicodeDecodeError. It now returns the correct result. (#2118)
Calling cuda.core.system.get_process_name before querying any device’s compute_running_processes previously raised an NvmlNotFoundError. It now returns the process name, provided the process is using a GPU.
system.Device.get_nvlink() now validates link numbers against the device-specific NVLink count and raises ValueError for unsupported links. (#2192)
Hardened the IPC buffer import path against malformed or untrusted peer descriptors: descriptor payloads shorter than the driver struct are now rejected before import (#2223), an imported buffer’s size is validated against the mapped allocation extent before any copy (#2224), and negative allocation handles are always rejected, including under -O (#2219).
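The short-payload and negative-handle checks described above follow a common pattern for parsing untrusted input, sketched here in plain Python. The descriptor layout, the function name `parse_descriptor`, and the choice of ValueError are all assumptions made for illustration. This is not cuda.core's actual code.

```python
import struct

# Hypothetical layout: a signed 64-bit handle followed by an unsigned 64-bit size.
_DESCRIPTOR = struct.Struct("<qQ")


def parse_descriptor(payload: bytes) -> tuple[int, int]:
    """Validate an untrusted descriptor before using any field from it."""
    if len(payload) < _DESCRIPTOR.size:
        raise ValueError(
            f"descriptor too short: {len(payload)} < {_DESCRIPTOR.size} bytes"
        )
    handle, size = _DESCRIPTOR.unpack_from(payload)
    # An explicit check rather than `assert`: assert statements are removed
    # when Python runs with -O, so they cannot guard untrusted input.
    if handle < 0:
        raise ValueError(f"negative allocation handle: {handle}")
    return handle, size
```

The explicit `if` and `raise` is what lets such a check survive optimized mode, which is presumably the point of the "including under -O" remark.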
ManagedBuffer.accessed_by now validates every location before issuing any advice, so a bulk assignment containing an invalid entry can no longer leave the applied advice in a torn state. (#2222)
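The validate-then-apply ordering described here can be sketched in a few lines. The function name `replace_accessed_by`, the plain set standing in for driver state, and the `is_valid` predicate are hypothetical stand-ins, not part of cuda.core.

```python
def replace_accessed_by(applied, requested, is_valid):
    """Replace the contents of `applied` with `requested`, all or nothing.

    `applied` is a plain set standing in for driver state; `is_valid` is a
    caller-supplied predicate. Both are stand-ins for illustration.
    """
    requested = list(requested)
    invalid = [loc for loc in requested if not is_valid(loc)]
    if invalid:
        # Nothing has been mutated yet, so the previous advice stays intact.
        raise ValueError(f"invalid locations: {invalid!r}")
    applied.clear()
    applied.update(requested)
    return applied
```

Checking every entry before the first mutation is what rules out the torn state: a failed call leaves `applied` exactly as it was.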
Graph nodes now keep their Python-owned attachments (kernel-argument buffers, host-callback functions and user data, and memcpy/memset operands) alive for the lifetime of the graph. Previously, keeping these objects alive was the caller’s responsibility. (#2280)
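The lifetime guarantee described above amounts to the graph holding strong references to whatever its nodes use. A toy model of that idea, with hypothetical `Graph` and `Payload` classes that are unrelated to cuda.core's implementation, might look like this:

```python
import gc
import weakref


class Graph:
    """Toy graph that owns references to everything its nodes use."""

    def __init__(self):
        self._keepalive = []

    def add_node(self, callback, user_data):
        # Holding strong references ties the attachments' lifetime to the graph.
        self._keepalive.append((callback, user_data))


class Payload:
    """Stand-in for user data (a bare object() cannot be weakly referenced)."""


def demo():
    graph = Graph()
    payload = Payload()
    ref = weakref.ref(payload)
    graph.add_node(print, payload)
    del payload  # the caller drops its only reference
    gc.collect()
    alive_with_graph = ref() is not None
    del graph  # releasing the graph releases its attachments
    gc.collect()
    alive_after_graph = ref() is not None
    return alive_with_graph, alive_after_graph
```

In this model the payload survives the caller's `del` only because the graph still references it, which mirrors the responsibility shift the note describes.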
Hardened the graph user-object destructor against races during Python interpreter shutdown. (#2074)
Free-threading correctness fixes: buffer and memory-resource threading (#2162), critical-section guards on shared accessors (#2215), and an atomic flag guarding buffer memory-attribute initialization (#2216).
Program.compile() cache keys are now FIPS-safe. (#2087)
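The notes do not say how the cache keys are computed. As a general illustration of the FIPS-safe idea, the sketch below derives a key with SHA-256, a FIPS-approved hash, whereas algorithms such as MD5 can be unavailable on FIPS-enforcing systems. The function name `cache_key` and its inputs are assumptions.

```python
import hashlib
from collections.abc import Sequence


def cache_key(source: str, options: Sequence[str]) -> str:
    """Derive a cache key with SHA-256, a FIPS-approved hash."""
    digest = hashlib.sha256()
    digest.update(source.encode("utf-8"))
    for option in options:
        digest.update(b"\0")  # separator so ["ab", "c"] and ["a", "bc"] differ
        digest.update(option.encode("utf-8"))
    return digest.hexdigest()
```

The key is deterministic for a given source and option list, so it can serve as a dictionary key or file name for cached compilation results.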
Memory-pool driver errors are now preserved instead of being masked by out-of-memory handling. (#2084)
DLPack export now raises BufferError (the intended exception) instead of RuntimeError when a buffer cannot be exported. (#2160)
Corrected the Buffer and MemoryResource __eq__ implementations. (#2067, closes #2050)
Checkpoint restore now validates GPU UUID inputs early. (#2086)
Bumped the PyTorch tensor-bridge upper bound to 2.12. (#2099)

Documentation

Documented the IPC buffer pickle trust boundary: users of Buffer.__reduce__() and multi-process IPC should review the security note before unpickling buffer handles from untrusted sources. (#2225)

Deprecated APIs

Deprecated system.NvlinkInfo.max_links. Use system.Device.get_nvlink_count() or system.Device.get_nvlinks() to query NVLink availability for a specific device.
