cuda.core 1.0.0 Release Notes#

Highlights#

New features#

Breaking changes#

  • StridedMemoryView now provides a fast path for torch.Tensor objects via PyTorch’s AOT Inductor (AOTI) stable C ABI. When a torch.Tensor is passed to any from_* classmethod (from_dlpack, from_cuda_array_interface, from_array_interface, or from_any_interface), tensor metadata is read directly from the underlying C struct, bypassing the DLPack and CUDA Array Interface protocol overhead. This yields ~7–20x faster StridedMemoryView construction for PyTorch tensors (depending on whether stream ordering is required). Proper CUDA stream ordering is established between PyTorch’s current stream and the consumer stream, matching the DLPack synchronization contract. Requires PyTorch >= 2.3.

    This is a behavioral breaking change: because the AOTI tensor bridge reads raw metadata without re-enacting PyTorch’s export guardrails, tensors that PyTorch would reject at the DLPack boundary (notably requires_grad, conjugated, non-strided/sparse, and wrong-current-device CUDA tensors) are now accepted. This is intentional — StridedMemoryView is designed for low-level interop where those checks are not needed. (#749)
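
The fast path engages automatically for any of the from_* classmethods. A minimal sketch, assuming cuda.core >= 1.0 and PyTorch >= 2.3; the stream_ptr parameter name follows the pre-1.0 StridedMemoryView constructor and is an assumption here:

```python
# Sketch: constructing a StridedMemoryView from a torch.Tensor; the
# AOTI fast path engages automatically for any from_* classmethod.
# Guarded imports keep this snippet importable without a GPU stack.
try:
    import torch
    from cuda.core.utils import StridedMemoryView
    HAVE_CUDA_STACK = torch.cuda.is_available()
except ImportError:
    HAVE_CUDA_STACK = False

def tensor_view(t, stream_ptr=-1):
    """View `t` zero-copy. stream_ptr=-1 skips stream ordering; passing a
    real stream pointer orders the view against torch's current stream,
    per the DLPack synchronization contract."""
    return StridedMemoryView.from_any_interface(t, stream_ptr=stream_ptr)
```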

  • Removed the deprecated cuda.core.experimental namespace. All public APIs have been available under cuda.core since v0.5.0. Code that imports from cuda.core.experimental must be updated to import from cuda.core instead.

  • Graph types are no longer re-exported from the top-level cuda.core namespace; they must be imported from cuda.core.graph. The affected symbols are Graph, GraphBuilder, GraphCompleteOptions, GraphCondition, GraphDebugPrintOptions, and GraphDefinition. Update from cuda.core import GraphBuilder to from cuda.core.graph import GraphBuilder (and similarly for the other symbols).

  • Removed the GraphAllocOptions dataclass and the AllocNode.options property. The fields of GraphAllocOptions are now keyword-only parameters on graph.GraphDefinition.allocate() and graph.GraphNode.allocate().

  • Renamed GraphDef to GraphDefinition for consistency with the rest of the API, which spells words out (e.g. TensorMapDescriptor, not TensorMapDesc). (#1950)

  • Renamed cuda.core.graph.Condition to GraphCondition to follow the Graph* prefix convention used by GraphBuilder, GraphDefinition, and GraphNode. (#1945)

  • Converted no-argument deterministic getters to properties for consistency with the rest of the API (#1945):

  • Renamed boolean / non-noun properties for clearer naming (#1945):

  • Renamed graph allocation methods to match MemoryResource.allocate() / MemoryResource.deallocate() (#1945):

  • Cross-API consistency for graph builders (#1945):

  • KernelAttributes methods are now properties; per-device queries use indexing (#1945):

    • The 17 attribute methods (max_threads_per_block, num_regs, shared_size_bytes, cluster_scheduling_policy_preference, etc.) that previously took a device_id argument are now properties on the view returned by Kernel.attributes. The view is bound to the current device by default; kernel.attributes[device] returns a view bound to a specific Device or device ordinal. The cache is shared across views of the same kernel.

    • Old: kernel.attributes.num_regs() and kernel.attributes.num_regs(some_dev)

    • New: kernel.attributes.num_regs and kernel.attributes[some_dev].num_regs
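
As a sketch of the new access pattern, a hypothetical helper (the name regs_per_device is illustrative, not part of the API), assuming a compiled kernel object from cuda.core >= 1.0:

```python
def regs_per_device(kernel, device_ordinals):
    """Gather register usage per device with the new indexed-view style:
    kernel.attributes[d] returns a view bound to device ordinal d, and
    num_regs is a plain property on that view (no call, no argument)."""
    return {d: kernel.attributes[d].num_regs for d in device_ordinals}
```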

  • Renamed graph.HostCallbackNode.callback_fn to graph.HostCallbackNode.callback to drop the redundant _fn suffix (#1945).

  • Unified the conditional graph API on GraphCondition and consistent verbs (#1945):

  • Linker.which_backend() is now a classmethod, replacing the former backend instance property. Call sites must use Linker.which_backend() (with parentheses) instead of linker.backend. This allows querying the linking backend without constructing a Linker instance, for example to choose between PTX and LTOIR input before linking.
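
A hedged sketch of such backend-dependent input selection (assumes cuda.core >= 1.0; the backend names mirror the values historically reported by the old backend property and may differ):

```python
# Pick the richest input format the available linking backend supports;
# no Linker instance is needed for the query.
try:
    from cuda.core import Linker
    backend = Linker.which_backend()
except (ImportError, AttributeError):
    backend = "driver"  # illustrative fallback only

# nvJitLink can consume LTO IR; the driver linker handles PTX/cubin.
code_type = "ltoir" if backend == "nvJitLink" else "ptx"
```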

  • DeviceMemoryResource.peer_accessible_by now returns a collections.abc.MutableSet of Device objects instead of a sorted tuple[int, ...]. The property setter is unchanged. (#2018)
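
A migration sketch for code that consumed the old tuple of ordinals (the helper name is hypothetical; the Device.device_id attribute follows the pre-1.0 Device API and is an assumption here):

```python
def peer_ordinals(mr):
    """Pre-1.0 code iterated a sorted tuple[int, ...]; the getter now
    yields Device objects, so read .device_id off each member and sort
    to recover the old ordering."""
    return sorted(dev.device_id for dev in mr.peer_accessible_by)
```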

  • stream is now a required keyword-only argument on APIs that schedule work on a stream (#2001). Pass device.default_stream (or any Stream) explicitly to retain the previous behavior. Affected APIs:

    Synchronous memory resources are exempt: their allocate/deallocate methods accept an optional stream (validated when non-None) but do not use it. This applies to LegacyPinnedMemoryResource and VirtualMemoryResource.
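
A minimal migration sketch (the helper name is hypothetical; assumes the pre-1.0 Device.memory_resource and Device.default_stream accessors):

```python
def allocate_scratch(dev, nbytes):
    """stream is now a required keyword argument on allocate(); passing
    the device's default stream explicitly reproduces the pre-1.0
    default behavior."""
    return dev.memory_resource.allocate(nbytes, stream=dev.default_stream)
```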

  • Consistent naming of typing annotation helpers (#2016):

    • cuda.core.typing.DevicePointerT -> cuda.core.typing.DevicePointerType

    • cuda.core.typing.IsStreamT -> cuda.core.typing.IsStreamType

  • Renamed and converted multiple Device properties and methods for naming consistency (#1946):

    On Device:

    • is_c2c_mode_enabled -> is_c2c_enabled

    • persistence_mode_enabled -> is_persistence_mode_enabled

    • clock(clock_type) -> get_clock(clock_type)

    • get_auto_boosted_clocks_enabled() -> is_auto_boosted_clocks_enabled (method -> property)

    • get_current_clock_event_reasons() -> current_clock_event_reasons (method -> property)

    • get_supported_clock_event_reasons() -> supported_clock_event_reasons (method -> property)

    • display_mode -> is_display_connected

    • display_active -> is_display_active

    • fan(fan=0) -> get_fan(fan=0)

    • get_supported_pstates() -> supported_pstates (method -> property)

    On PciInfo:

    • get_max_pcie_link_generation() -> link_generation (method -> property)

    • get_gpu_max_pcie_link_generation() -> max_link_generation (method -> property)

    • get_max_pcie_link_width() -> max_link_width (method -> property)

    • get_current_pcie_link_generation() -> current_link_generation (method -> property)

    • get_current_pcie_link_width() -> current_link_width (method -> property)

    • get_pcie_throughput(counter) -> get_throughput(counter)

    • get_pcie_replay_counter() -> replay_counter (method -> property)

    On Temperature:

    • sensor(sensor=...) -> get_sensor(sensor=...)

    • threshold(threshold_type) -> get_threshold(threshold_type)

    • thermal_settings(sensor_index) -> get_thermal_settings(sensor_index)

    On FanInfo:

    • set_default_fan_speed() -> set_default_speed()

  • Re-wrapped NVML enums as human-readable StrEnum subclasses instead of raw integer re-exports from cuda.bindings.nvml. These are available in cuda.core.system.typing. (#2014)

  • Removed 18 helper/data-container classes from cuda.core.system.__all__: BAR1MemoryInfo, ClockInfo, ClockOffsets, CoolerInfo, DeviceAttributes, DeviceEvents, EventData, FanInfo, FieldValue, FieldValues, GpuDynamicPstatesInfo, GpuDynamicPstatesUtilization, InforomInfo, PciInfo, RepairStatus, Temperature, ThermalSensor, ThermalSettings. These classes are still returned by Device properties and methods but should not be directly instantiated by users. (#1942)

  • system.Device.uuid now returns the full NVML UUID with prefix (e.g. GPU-...). Use system.Device.uuid_without_prefix for the previous behavior. (#1916)

  • args_viewable_as_strided_memory() and StridedMemoryView were accidentally exposed at the top level of cuda.core and have been removed from it; they remain publicly available from the cuda.core.utils module. (#2028)

  • Replaced system.get_driver_version() and system.get_driver_version_full() with system.get_user_mode_driver_version() (works with or without NVML) and system.get_kernel_mode_driver_version() (requires NVML). Each returns a tuple[int, ...].
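
Because each accessor returns a tuple[int, ...], version gating becomes a plain lexicographic comparison. A sketch (the fallback tuple is an illustrative placeholder, not a real version):

```python
# Version gating with the new tuple-returning accessor.
try:
    from cuda.core import system
    umd = system.get_user_mode_driver_version()  # works with or without NVML
except (ImportError, AttributeError):
    umd = (550, 54)  # illustrative placeholder only

# tuple[int, ...] compares lexicographically, so gating is one comparison:
needs_r535_plus = umd >= (535,)
```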

Fixes and enhancements#

  • Fixed Buffer.is_managed returning False for pool-allocated managed memory (ManagedMemoryResource), which caused DLPack interop to misclassify managed buffers as kDLCUDAHost. The fix queries both the driver pointer attribute and the memory resource. (#1924)

  • system.Device.arch now returns UNKNOWN instead of raising ValueError when NVML reports an architecture not yet in the enum. (#1937)

  • system.Device.get_field_values() and system.Device.clear_field_values() with an empty list no longer raise InvalidArgumentError. (#1982)

  • Linker error and info log retrieval now properly checks return codes from nvJitLink, raising exceptions on failure instead of silently ignoring errors. (#1993)

  • Fixed a potential crash when NVML event set creation failed on Windows, due to __dealloc__ freeing an uninitialized handle. (#1992)

  • CUDA Runtime error messages are now more reliable, especially on Windows where the runtime DLL name table could disagree with the installed bindings. (#2003)

  • Graph kernel nodes now prevent Python kernel-argument objects from being garbage-collected before the graph executes. Previously, objects passed as kernel arguments (e.g. a Buffer) could be freed if the only Python reference was through the launch call, causing the graph to operate on stale device pointers. (#2041)

  • Fixed a potential crash in DeviceEvents.__dealloc__ when __init__ raised before the NVML event set was created, due to freeing an uninitialized handle. (#2047)

  • Linux release wheels are now stripped of debug symbols, significantly reducing package size. Debug builds are now supported via --config-settings=debug=true. (#1890)