# cuda.core 1.0.0 Release Notes
## Highlights
- First stable release of `cuda.core`! As of version 1.0.0, all APIs are considered stable and follow Semantic Versioning (SemVer) with appropriate deprecation periods for breaking changes. See the support policy for details.
- Added green context support (CUDA 12.4+). New types `Context`, `ContextOptions`, `SMResource`, `SMResourceOptions`, `WorkqueueResource`, and `WorkqueueResourceOptions` enable GPU SM and workqueue resource partitioning. Create green contexts via `Device.create_context()`, then use `Context.create_stream()` and `Context.resources` to work within the partitioned resources. (#1976)
- Added the `cuda.core.checkpoint` module for CUDA process checkpointing, including process state queries, lock/checkpoint/restore/unlock operations, and GPU UUID remapping support for restore. (#1343)
## New features
- `Program.compile()` now accepts an optional `cache=` keyword argument to avoid recompiling identical source + options + target combinations. Two concrete implementations of the `ProgramCacheResource` ABC are provided: `InMemoryProgramCache` (thread-safe, single-process LRU) and `FileStreamProgramCache` (disk-backed, cross-process safe, LRU-evicting). A standalone `make_program_cache_key()` function is exposed for callers who need to incorporate additional content (e.g. headers or PCH files) into the cache key. (#1912)
- Changes to the `cuda.core.system` module for NVIDIA Management Library (NVML) access:
  - `system.Device.mig` for querying and setting MIG mode, enumerating MIG device instances, and navigating parent/child relationships. (#1916)
  - `system.Device.compute_running_processes` for querying running compute processes on a device, returning `ProcessInfo` objects with PID, GPU memory usage, and MIG instance IDs. (#1917)
  - `system.Device.get_nvlink()` for querying NVLink version and state per link, and `system.Device.utilization` for current GPU and memory utilization rates. (#1918)
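The caching behavior described for `Program.compile()` above can be illustrated with a small, self-contained sketch. `TinyProgramCache`, `make_key`, and `fake_compile` below are hypothetical names for illustration only; the real `InMemoryProgramCache` is additionally thread-safe and integrates with the compiler pipeline.

```python
from collections import OrderedDict

class TinyProgramCache:
    """Illustrative single-process LRU cache keyed by (source, options, target).

    NOT the real InMemoryProgramCache; this only sketches the keying
    and eviction idea described above.
    """

    def __init__(self, maxsize=128):
        self._maxsize = maxsize
        self._entries = OrderedDict()

    @staticmethod
    def make_key(source, options, target):
        # A real cache key would also fold in headers/PCH content.
        return (source, tuple(options), target)

    def get_or_compile(self, source, options, target, compile_fn):
        key = self.make_key(source, options, target)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        result = compile_fn(source, options, target)
        self._entries[key] = result
        if len(self._entries) > self._maxsize:
            self._entries.popitem(last=False)  # evict least recently used
        return result

calls = []
def fake_compile(src, opts, tgt):
    calls.append(src)
    return f"cubin::{src}"

cache = TinyProgramCache(maxsize=2)
cache.get_or_compile("kernel.cu", ["-O3"], "sm_90", fake_compile)
cache.get_or_compile("kernel.cu", ["-O3"], "sm_90", fake_compile)
assert len(calls) == 1  # second call hit the cache
```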
- Enums are now available in places where a small number of string values are accepted or returned. You may continue to use the string values, or use the enums for better linting and type checking. (#2016)
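The string-or-enum acceptance pattern can be sketched with Python's `str`-mixin enums, whose members compare equal to their string values. `CodeKind` and `describe` below are hypothetical illustrations, not part of `cuda.core`.

```python
from enum import Enum

class CodeKind(str, Enum):
    # Hypothetical example enum: str-mixin members ARE strings,
    # so both spellings interoperate.
    PTX = "ptx"
    CUBIN = "cubin"

def describe(kind):
    # Accepts either the plain string or the enum member:
    # CodeKind("ptx") and CodeKind(CodeKind.PTX) both resolve.
    kind = CodeKind(kind)
    return f"code kind: {kind.value}"

assert CodeKind.PTX == "ptx"                      # member compares equal to its value
assert describe("cubin") == describe(CodeKind.CUBIN)
```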
## Breaking changes
- `StridedMemoryView` now provides a fast path for `torch.Tensor` objects via PyTorch's AOT Inductor (AOTI) stable C ABI. When a `torch.Tensor` is passed to any `from_*` classmethod (`from_dlpack`, `from_cuda_array_interface`, `from_array_interface`, or `from_any_interface`), tensor metadata is read directly from the underlying C struct, bypassing the DLPack and CUDA Array Interface protocol overhead. This yields ~7–20x faster `StridedMemoryView` construction for PyTorch tensors (depending on whether stream ordering is required). Proper CUDA stream ordering is established between PyTorch's current stream and the consumer stream, matching the DLPack synchronization contract. Requires PyTorch >= 2.3. This is a behavioral breaking change: because the AOTI tensor bridge reads raw metadata without re-applying PyTorch's export guardrails, tensors that PyTorch would reject at the DLPack boundary (notably `requires_grad`, conjugated, non-strided/sparse, and wrong-current-device CUDA tensors) are now accepted. This is intentional: `StridedMemoryView` is designed for low-level interop where those checks are not needed. (#749)
- Removed the deprecated `cuda.core.experimental` namespace. All public APIs have been available under `cuda.core` since v0.5.0. Code that imports from `cuda.core.experimental` must be updated to import from `cuda.core` instead.
- Graph types are no longer re-exported from the top-level `cuda.core` namespace; they must be imported from `cuda.core.graph`. The affected symbols are `Graph`, `GraphBuilder`, `GraphCompleteOptions`, `GraphCondition`, `GraphDebugPrintOptions`, and `GraphDefinition`. Update `from cuda.core import GraphBuilder` to `from cuda.core.graph import GraphBuilder` (and similarly for the other symbols).
- Removed the `GraphAllocOptions` dataclass and the `AllocNode.options` property. The dataclass's fields are now keyword-only parameters on `graph.GraphDefinition.allocate()` and `graph.GraphNode.allocate()`.
- Renamed `GraphDef` to `GraphDefinition` for consistency with the rest of the API, which spells words out (e.g. `TensorMapDescriptor`, not `TensorMapDesc`). (#1950)
- Renamed `cuda.core.graph.Condition` to `GraphCondition` to follow the `Graph*` prefix convention used by `GraphBuilder`, `GraphDefinition`, and `GraphNode`. (#1945)
- Converted no-argument deterministic getters to properties for consistency with the rest of the API (#1945):
  - `Buffer.get_ipc_descriptor()` -> `Buffer.ipc_descriptor`
  - `Event.get_ipc_descriptor()` -> `Event.ipc_descriptor`
  - `DeviceMemoryResource.get_allocation_handle()` -> `DeviceMemoryResource.allocation_handle`
  - `PinnedMemoryResource.get_allocation_handle()` -> `PinnedMemoryResource.allocation_handle`
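The getter-to-property conversion follows the standard Python `@property` pattern; a toy model of the migration (not the real `Buffer` class):

```python
class Descriptor:
    """Stand-in for an IPC descriptor object."""

class Buffer:
    """Toy model: a no-argument deterministic getter becomes a
    read-only property, so call sites drop the parentheses."""

    def __init__(self):
        self._ipc_descriptor = Descriptor()

    @property
    def ipc_descriptor(self):
        # Previously spelled: def get_ipc_descriptor(self): ...
        return self._ipc_descriptor

buf = Buffer()
assert isinstance(buf.ipc_descriptor, Descriptor)  # attribute access, no call
```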
- Renamed boolean / non-noun properties for clearer naming (#1945):
  - `LaunchConfig.cooperative_launch` -> `LaunchConfig.is_cooperative` (also renames the constructor keyword argument).
  - `Event.is_timing_disabled` -> `Event.is_timing_enabled`.
  - `Event.is_sync_busy_waited` -> `Event.is_blocking_sync`.
  - `EventOptions.enable_timing` -> `EventOptions.timing_enabled` and `EventOptions.busy_waited_sync` -> `EventOptions.blocking_sync`.
- Renamed graph allocation methods to match `MemoryResource.allocate()` / `MemoryResource.deallocate()` (#1945):
  - `GraphDefinition.alloc` -> `graph.GraphDefinition.allocate()`
  - `GraphDefinition.free` -> `graph.GraphDefinition.deallocate()`
  - `GraphNode.alloc` -> `graph.GraphNode.allocate()`
  - `GraphNode.free` -> `graph.GraphNode.deallocate()`
- Cross-API consistency for graph builders (#1945):
  - `GraphBuilder.add_child` -> `graph.GraphBuilder.embed()` (matches `graph.GraphDefinition.embed()` and `graph.GraphNode.embed()`).
  - `GraphDefinition.record_event` / `GraphDefinition.wait_event` -> `graph.GraphDefinition.record()` / `graph.GraphDefinition.wait()`, and likewise on `GraphNode`, matching `Stream.record()` / `Stream.wait()`.
- `KernelAttributes` methods are now properties, and per-device queries use indexing (#1945): the 17 attribute methods (`max_threads_per_block`, `num_regs`, `shared_size_bytes`, `cluster_scheduling_policy_preference`, etc.) that previously took a `device_id` argument are now properties on the view returned by `Kernel.attributes`. The view is bound to the current device by default; `kernel.attributes[device]` returns a view bound to a specific `Device` or device ordinal. The cache is shared across views of the same kernel.
  - Old: `kernel.attributes.num_regs()` and `kernel.attributes.num_regs(some_dev)`
  - New: `kernel.attributes.num_regs` and `kernel.attributes[some_dev].num_regs`
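The view design described above (properties bound to a device, `[device]` rebinding, one cache shared across views) can be sketched in plain Python. `_AttributesView` and `fake_query` are hypothetical stand-ins, not the real `KernelAttributes` implementation.

```python
class _AttributesView:
    """Toy sketch: attribute properties are bound to a device; indexing
    returns a rebound view; the cache dict is shared across views."""

    def __init__(self, query_fn, device_id, cache):
        self._query = query_fn
        self._device_id = device_id
        self._cache = cache  # shared across views of the same kernel

    def __getitem__(self, device_id):
        # Rebind to another device, keeping the shared cache.
        return _AttributesView(self._query, device_id, self._cache)

    @property
    def num_regs(self):
        key = ("num_regs", self._device_id)
        if key not in self._cache:
            self._cache[key] = self._query("num_regs", self._device_id)
        return self._cache[key]

queries = []
def fake_query(name, dev):
    queries.append((name, dev))
    return 32 + dev

view = _AttributesView(fake_query, device_id=0, cache={})
assert view.num_regs == 32        # bound to the "current" device 0
assert view[1].num_regs == 33     # [device] rebinds the view
assert view.num_regs == 32 and len(queries) == 2  # cache hit, no re-query
```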
- Renamed `graph.HostCallbackNode.callback_fn` to `graph.HostCallbackNode.callback` to drop the redundant `_fn` suffix (#1945).
- Unified the conditional graph API on `GraphCondition` and consistent verbs (#1945):
  - `GraphBuilder.create_conditional_handle` -> `graph.GraphBuilder.create_condition()`. The new factory returns a `GraphCondition` (matching `graph.GraphDefinition.create_condition()`) instead of a raw `CUgraphConditionalHandle`. The four conditional builder methods (`if_then()`, `if_else()`, `while_loop()`, `switch()`) now accept a `GraphCondition` instead of a raw handle.
  - `GraphBuilder.if_cond` / `GraphDefinition.if_cond` / `GraphNode.if_cond` -> `graph.GraphBuilder.if_then()` / `graph.GraphDefinition.if_then()` / `graph.GraphNode.if_then()`. The new name parallels the existing `if_else`, `while_loop`, and `switch` methods (a verb describing the control-flow construct, not an abbreviation of "condition") and matches Python's own if/then/else vocabulary.
  - A `GraphCondition` may be passed directly as a kernel argument to `launch()`; the launcher unwraps it to the underlying `CUgraphConditionalHandle` value. Previously, `.handle` had to be extracted explicitly.
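The unwrapping step the launcher performs can be sketched in plain Python; the toy `GraphCondition` and `prepare_args` below are illustrative only, not the real implementation.

```python
class GraphCondition:
    """Toy stand-in: the real class wraps a CUgraphConditionalHandle."""

    def __init__(self, handle):
        self.handle = handle

def prepare_args(args):
    # Sketch of the launcher's unwrapping: condition objects are
    # replaced by their raw handle values before the kernel launch.
    return tuple(a.handle if isinstance(a, GraphCondition) else a for a in args)

cond = GraphCondition(handle=0xBEEF)
assert prepare_args((cond, 42)) == (0xBEEF, 42)  # no explicit .handle needed
```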
- `Linker.which_backend()` is now a classmethod, replacing the former `backend` instance property. Call sites must use `Linker.which_backend()` (with parentheses) instead of `linker.backend`. This allows querying the linking backend without constructing a `Linker` instance, for example to choose between PTX and LTOIR input before linking.
- `DeviceMemoryResource.peer_accessible_by` now returns a `collections.abc.MutableSet` of `Device` objects instead of a sorted `tuple[int, ...]`. The property setter is unchanged. (#2018)
- `stream` is now a required keyword-only argument on APIs that schedule work on a stream (#2001). Pass `device.default_stream` (or any `Stream`) explicitly to retain the previous behavior. Affected APIs:
  - `MemoryResource.allocate()` / `MemoryResource.deallocate()` and the overrides on `DeviceMemoryResource`, `PinnedMemoryResource`, `ManagedMemoryResource`, and `graph.GraphMemoryResource`.
  - `KernelOccupancy.max_potential_cluster_size()` and `KernelOccupancy.max_active_clusters()`.
  - `Buffer.from_ipc_descriptor()` (no longer falls back to the default stream when `stream=None` is passed).
  Synchronous memory resources are exempt: their allocate/deallocate methods accept an optional `stream` (validated when non-`None`) but do not use it. This applies to `LegacyPinnedMemoryResource` and `VirtualMemoryResource`.
- Consistent naming of type annotation helpers (#2016):
  - `cuda.core.typing.DevicePointerT` -> `cuda.core.typing.DevicePointerType`
  - `cuda.core.typing.IsStreamT` -> `cuda.core.typing.IsStreamType`
- Renamed and converted multiple `Device` properties and methods for naming consistency (#1946):
  - On `Device`:
    - `is_c2c_mode_enabled` -> `is_c2c_enabled`
    - `persistence_mode_enabled` -> `is_persistence_mode_enabled`
    - `clock(clock_type)` -> `get_clock(clock_type)`
    - `get_auto_boosted_clocks_enabled()` -> `is_auto_boosted_clocks_enabled` (method -> property)
    - `get_current_clock_event_reasons()` -> `current_clock_event_reasons` (method -> property)
    - `get_supported_clock_event_reasons()` -> `supported_clock_event_reasons` (method -> property)
    - `display_mode` -> `is_display_connected`
    - `display_active` -> `is_display_active`
    - `fan(fan=0)` -> `get_fan(fan=0)`
    - `get_supported_pstates()` -> `supported_pstates` (method -> property)
  - On `PciInfo`:
    - `get_max_pcie_link_generation()` -> `link_generation` (method -> property)
    - `get_gpu_max_pcie_link_generation()` -> `max_link_generation` (method -> property)
    - `get_max_pcie_link_width()` -> `max_link_width` (method -> property)
    - `get_current_pcie_link_generation()` -> `current_link_generation` (method -> property)
    - `get_current_pcie_link_width()` -> `current_link_width` (method -> property)
    - `get_pcie_throughput(counter)` -> `get_throughput(counter)`
    - `get_pcie_replay_counter()` -> `replay_counter` (method -> property)
  - On `Temperature`:
    - `sensor(sensor=...)` -> `get_sensor(sensor=...)`
    - `threshold(threshold_type)` -> `get_threshold(threshold_type)`
    - `thermal_settings(sensor_index)` -> `get_thermal_settings(sensor_index)`
  - On `FanInfo`:
    - `set_default_fan_speed()` -> `set_default_speed()`
- Re-wrapped NVML enums as human-readable `StrEnum` subclasses instead of raw integer re-exports from `cuda.bindings.nvml`. These are available in `cuda.core.system.typing`. (#2014)
- Removed 18 helper/data-container classes from `cuda.core.system.__all__`: `BAR1MemoryInfo`, `ClockInfo`, `ClockOffsets`, `CoolerInfo`, `DeviceAttributes`, `DeviceEvents`, `EventData`, `FanInfo`, `FieldValue`, `FieldValues`, `GpuDynamicPstatesInfo`, `GpuDynamicPstatesUtilization`, `InforomInfo`, `PciInfo`, `RepairStatus`, `Temperature`, `ThermalSensor`, and `ThermalSettings`. These classes are still returned by `Device` properties and methods, but should not be instantiated directly by users. (#1942)
- `system.Device.uuid` now returns the full NVML UUID with its prefix (e.g. `GPU-...`). Use `system.Device.uuid_without_prefix` for the previous behavior. (#1916)
- `args_viewable_as_strided_memory()` and `StridedMemoryView` were accidentally exposed at the top level in `cuda.core`. They remain publicly available from the `cuda.core.utils` module. (#2028)
- Replaced `system.get_driver_version()` and `system.get_driver_version_full()` with `system.get_user_mode_driver_version()` (works with or without NVML) and `system.get_kernel_mode_driver_version()` (requires NVML). Each returns a `tuple[int, ...]`.
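Returning `tuple[int, ...]` rather than a version string makes driver versions compare correctly with ordinary operators. A small illustration, using a hypothetical `parse_version` helper (the `cuda.core` APIs return tuples directly):

```python
def parse_version(text):
    # Hypothetical helper: split a dotted version string into an int tuple,
    # which Python compares element-wise (numerically), not lexically.
    return tuple(int(part) for part in text.split("."))

assert parse_version("570.86.10") > parse_version("535.129.03")
# Comparing the raw strings would order lexically, where "535..." > "1000...".
assert not ("535.129.03" < "1000.0")
```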
## Fixes and enhancements
- Fixed `Buffer.is_managed` returning `False` for pool-allocated managed memory (`ManagedMemoryResource`), which caused DLPack interop to misclassify managed buffers as `kDLCUDAHost`. The fix queries both the driver pointer attribute and the memory resource. (#1924)
- `system.Device.arch` now returns `UNKNOWN` instead of raising `ValueError` when NVML reports an architecture not yet in the enum. (#1937)
- `system.Device.get_field_values()` and `system.Device.clear_field_values()` no longer raise `InvalidArgumentError` when called with an empty list. (#1982)
- `Linker` error and info log retrieval now properly checks return codes from nvJitLink, raising exceptions on failure instead of silently ignoring errors. (#1993)
- Fixed a potential crash when NVML event set creation failed on Windows, due to `__dealloc__` freeing an uninitialized handle. (#1992)
- CUDA Runtime error messages are now more reliable, especially on Windows, where the runtime DLL name table could disagree with the installed bindings. (#2003)
- Graph kernel nodes now prevent Python kernel-argument objects from being garbage collected before the graph executes. Previously, objects passed as kernel arguments (e.g. a `Buffer`) could be freed if the only Python reference was through the launch call, causing the graph to operate on stale device pointers. (#2041)
- Fixed a potential crash in `DeviceEvents.__dealloc__` when `__init__` raised before the NVML event set was created, due to freeing an uninitialized handle. (#2047)
- Linux release wheels are now stripped of debug symbols, significantly reducing package size. Debug builds are supported via `--config-settings=debug=true`. (#1890)
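The keep-alive fix for graph kernel arguments (#2041) boils down to the node holding strong references to its argument objects for the graph's lifetime. A toy sketch of the idea, with hypothetical `_GraphKernelNode` and `FakeBuffer` names:

```python
import gc
import weakref

class _GraphKernelNode:
    """Toy sketch: the node pins its kernel-argument objects by holding
    strong references until the graph is done with them."""

    def __init__(self, args):
        self._keepalive = list(args)  # strong references pin the arguments

class FakeBuffer:
    """Stand-in for a device buffer passed as a kernel argument."""

buf = FakeBuffer()
ref = weakref.ref(buf)
node = _GraphKernelNode([buf])
del buf        # drop the caller's only reference
gc.collect()
assert ref() is not None  # still alive: the node keeps it reachable
```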