cuda.core 1.0.0 Release Notes#

Highlights#

New features#

Breaking changes#

  • StridedMemoryView now provides a fast path for torch.Tensor objects via PyTorch’s AOT Inductor (AOTI) stable C ABI. When a torch.Tensor is passed to any from_* classmethod (from_dlpack, from_cuda_array_interface, from_array_interface, or from_any_interface), tensor metadata is read directly from the underlying C struct, bypassing the DLPack and CUDA Array Interface protocol overhead. This yields ~7–20x faster StridedMemoryView construction for PyTorch tensors (depending on whether stream ordering is required). Proper CUDA stream ordering is established between PyTorch’s current stream and the consumer stream, matching the DLPack synchronization contract. Requires PyTorch >= 2.3.

    This is a behavioral breaking change: because the AOTI tensor bridge reads raw metadata without re-enacting PyTorch’s export guardrails, tensors that PyTorch would reject at the DLPack boundary (notably requires_grad, conjugated, non-strided/sparse, and wrong-current-device CUDA tensors) are now accepted. This is intentional — StridedMemoryView is designed for low-level interop where those checks are not needed. (#749)
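
The fast path engages automatically for any of the from_* classmethods. A minimal sketch, assuming cuda.core >= 1.0 and PyTorch >= 2.3; the stream_ptr parameter name follows the pre-1.0 StridedMemoryView constructor and is an assumption here:

```python
# Sketch: constructing a StridedMemoryView from a torch.Tensor; the
# AOTI fast path engages automatically for any from_* classmethod.
# Guarded imports keep this snippet importable without a GPU stack.
try:
    import torch
    from cuda.core.utils import StridedMemoryView
    HAVE_CUDA_STACK = torch.cuda.is_available()
except ImportError:
    HAVE_CUDA_STACK = False

def tensor_view(t, stream_ptr=-1):
    """View `t` zero-copy. stream_ptr=-1 skips stream ordering; passing a
    real stream pointer orders the view against torch's current stream,
    per the DLPack synchronization contract."""
    return StridedMemoryView.from_any_interface(t, stream_ptr=stream_ptr)
```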

  • Removed the deprecated cuda.core.experimental namespace. All public APIs have been available under cuda.core since v0.5.0. Code that imports from cuda.core.experimental must be updated to import from cuda.core instead.

  • Graph types are no longer re-exported from the top-level cuda.core namespace; they must be imported from cuda.core.graph. The affected symbols are Graph, GraphBuilder, GraphCompleteOptions, GraphCondition, GraphDebugPrintOptions, and GraphDefinition. Update from cuda.core import GraphBuilder to from cuda.core.graph import GraphBuilder (and similarly for the other symbols).

  • Removed the GraphAllocOptions dataclass and the AllocNode.options property. The fields of GraphAllocOptions are now keyword-only parameters on graph.GraphDefinition.allocate() and graph.GraphNode.allocate().

  • Renamed GraphDef to GraphDefinition for consistency with the rest of the API, which spells words out (e.g. TensorMapDescriptor, not TensorMapDesc). (#1950)

  • Renamed cuda.core.graph.Condition to GraphCondition to follow the Graph* prefix convention used by GraphBuilder, GraphDefinition, and GraphNode. (#1945)

  • Converted no-argument deterministic getters to properties for consistency with the rest of the API (#1945):

  • Renamed boolean / non-noun properties for clearer naming (#1945):

  • Renamed graph allocation methods to match MemoryResource.allocate() / MemoryResource.deallocate() (#1945):

  • Cross-API consistency for graph builders (#1945):

  • KernelAttributes methods are now properties; per-device queries use indexing (#1945):

    • The 17 attribute methods (max_threads_per_block, num_regs, shared_size_bytes, cluster_scheduling_policy_preference, etc.) that previously took a device_id argument are now properties on the view returned by Kernel.attributes. The view is bound to the current device by default; kernel.attributes[device] returns a view bound to a specific Device or device ordinal. The cache is shared across views of the same kernel.

    • Old: kernel.attributes.num_regs() and kernel.attributes.num_regs(some_dev)

    • New: kernel.attributes.num_regs and kernel.attributes[some_dev].num_regs
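
As a sketch of the new access pattern, a hypothetical helper (the name regs_per_device is illustrative, not part of the API), assuming a compiled kernel object from cuda.core >= 1.0:

```python
def regs_per_device(kernel, device_ordinals):
    """Gather register usage per device with the new indexed-view style:
    kernel.attributes[d] returns a view bound to device ordinal d, and
    num_regs is a plain property on that view (no call, no argument)."""
    return {d: kernel.attributes[d].num_regs for d in device_ordinals}
```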

  • Renamed graph.HostCallbackNode.callback_fn to graph.HostCallbackNode.callback to drop the redundant _fn suffix (#1945).

  • Unified the conditional graph API on GraphCondition and consistent verbs (#1945):

  • Linker.which_backend() is now a classmethod, replacing the former backend instance property. Call sites must use Linker.which_backend() (with parentheses) instead of linker.backend. This allows querying the linking backend without constructing a Linker instance, for example to choose between PTX and LTOIR input before linking.
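
A hedged sketch of such backend-dependent input selection (assumes cuda.core >= 1.0; the backend names mirror the values historically reported by the old backend property and may differ):

```python
# Pick the richest input format the available linking backend supports;
# no Linker instance is needed for the query.
try:
    from cuda.core import Linker
    backend = Linker.which_backend()
except (ImportError, AttributeError):
    backend = "driver"  # illustrative fallback only

# nvJitLink can consume LTO IR; the driver linker handles PTX/cubin.
code_type = "ltoir" if backend == "nvJitLink" else "ptx"
```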

  • DeviceMemoryResource.peer_accessible_by now returns a collections.abc.MutableSet of Device objects instead of a sorted tuple[int, ...]. The property setter is unchanged. (#2018)
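
A migration sketch for code that consumed the old tuple of ordinals (the helper name is hypothetical; the Device.device_id attribute follows the pre-1.0 Device API and is an assumption here):

```python
def peer_ordinals(mr):
    """Pre-1.0 code iterated a sorted tuple[int, ...]; the getter now
    yields Device objects, so read .device_id off each member and sort
    to recover the old ordering."""
    return sorted(dev.device_id for dev in mr.peer_accessible_by)
```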

  • stream is now a required keyword-only argument on APIs that schedule work on a stream (#2001). Pass device.default_stream (or any Stream) explicitly to retain the previous behavior. Affected APIs:

    Synchronous memory resources are exempt: their allocate/deallocate methods accept an optional stream (validated when non-None) but do not use it. This applies to LegacyPinnedMemoryResource and VirtualMemoryResource.
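
A minimal migration sketch (the helper name is hypothetical; assumes the pre-1.0 Device.memory_resource and Device.default_stream accessors):

```python
def allocate_scratch(dev, nbytes):
    """stream is now a required keyword argument on allocate(); passing
    the device's default stream explicitly reproduces the pre-1.0
    default behavior."""
    return dev.memory_resource.allocate(nbytes, stream=dev.default_stream)
```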

  • Consistent naming of typing annotation helpers (#2016):

    • cuda.core.typing.DevicePointerT -> cuda.core.typing.DevicePointerType

    • cuda.core.typing.IsStreamT -> cuda.core.typing.IsStreamType

  • Renamed and converted multiple Device properties and methods for naming consistency (#1946):

    On Device:

    • is_c2c_mode_enabled -> is_c2c_enabled

    • persistence_mode_enabled -> is_persistence_mode_enabled

    • clock(clock_type) -> get_clock(clock_type)

    • get_auto_boosted_clocks_enabled() -> is_auto_boosted_clocks_enabled (method -> property)

    • get_current_clock_event_reasons() -> current_clock_event_reasons (method -> property)

    • get_supported_clock_event_reasons() -> supported_clock_event_reasons (method -> property)

    • display_mode -> is_display_connected

    • display_active -> is_display_active

    • fan(fan=0) -> get_fan(fan=0)

    • get_supported_pstates() -> supported_pstates (method -> property)

    On PciInfo:

    • get_max_pcie_link_generation() -> link_generation (method -> property)

    • get_gpu_max_pcie_link_generation() -> max_link_generation (method -> property)

    • get_max_pcie_link_width() -> max_link_width (method -> property)

    • get_current_pcie_link_generation() -> current_link_generation (method -> property)

    • get_current_pcie_link_width() -> current_link_width (method -> property)

    • get_pcie_throughput(counter) -> get_throughput(counter)

    • get_pcie_replay_counter() -> replay_counter (method -> property)

    On Temperature:

    • sensor(sensor=...) -> get_sensor(sensor=...)

    • threshold(threshold_type) -> get_threshold(threshold_type)

    • thermal_settings(sensor_index) -> get_thermal_settings(sensor_index)

    On FanInfo:

    • set_default_fan_speed() -> set_default_speed()

  • Re-wrapped NVML enums as human-readable StrEnum subclasses instead of raw integer re-exports from cuda.bindings.nvml. These are available in cuda.core.system.typing. (#2014)

  • Removed 18 helper/data-container classes from cuda.core.system.__all__: BAR1MemoryInfo, ClockInfo, ClockOffsets, CoolerInfo, DeviceAttributes, DeviceEvents, EventData, FanInfo, FieldValue, FieldValues, GpuDynamicPstatesInfo, GpuDynamicPstatesUtilization, InforomInfo, PciInfo, RepairStatus, Temperature, ThermalSensor, ThermalSettings. These classes are still returned by Device properties and methods but should not be directly instantiated by users. (#1942)

  • system.Device.uuid now returns the full NVML UUID with prefix (e.g. GPU-...). Use system.Device.uuid_without_prefix for the previous behavior. (#1916)

  • args_viewable_as_strided_memory() and StridedMemoryView were accidentally exposed at the top level of cuda.core and have been removed from it; they remain publicly available from the cuda.core.utils module. (#2028)

  • Replaced system.get_driver_version() and system.get_driver_version_full() with system.get_user_mode_driver_version() (works with or without NVML) and system.get_kernel_mode_driver_version() (requires NVML). Each returns a tuple[int, ...].
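
Because each accessor returns a tuple[int, ...], version gating becomes a plain lexicographic comparison. A sketch (the fallback tuple is an illustrative placeholder, not a real version):

```python
# Version gating with the new tuple-returning accessor.
try:
    from cuda.core import system
    umd = system.get_user_mode_driver_version()  # works with or without NVML
except (ImportError, AttributeError):
    umd = (550, 54)  # illustrative placeholder only

# tuple[int, ...] compares lexicographically, so gating is one comparison:
needs_r535_plus = umd >= (535,)
```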

Fixes and enhancements#

  • Fixed Buffer.is_managed returning False for pool-allocated managed memory (ManagedMemoryResource), which caused DLPack interop to misclassify managed buffers as kDLCUDAHost. The fix queries both the driver pointer attribute and the memory resource. (#1924)

  • system.Device.arch now returns UNKNOWN instead of raising ValueError when NVML reports an architecture not yet in the enum. (#1937)

  • system.Device.get_field_values() and system.Device.clear_field_values() with an empty list no longer raise InvalidArgumentError. (#1982)

  • Linker error and info log retrieval now properly checks return codes from nvJitLink, raising exceptions on failure instead of silently ignoring errors. (#1993)

  • Fixed a potential crash when NVML event set creation failed on Windows, due to __dealloc__ freeing an uninitialized handle. (#1992)

  • CUDA Runtime error messages are now more reliable, especially on Windows where the runtime DLL name table could disagree with the installed bindings. (#2003)

  • Graph kernel nodes now prevent Python kernel-argument objects from being garbage-collected before the graph executes. Previously, objects passed as kernel arguments (e.g. a Buffer) could be freed if the only Python reference was through the launch call, causing the graph to operate on stale device pointers. (#2041)

  • Fixed a potential crash in DeviceEvents.__dealloc__ when __init__ raised before the NVML event set was created, due to freeing an uninitialized handle. (#2047)

  • Linux release wheels are now stripped of debug symbols, significantly reducing package size. Debug builds are now supported via --config-settings=debug=true. (#1890)