# cuda.core 1.0.0 Release Notes

## Highlights
TBD
## New features
TBD
## Breaking changes
- Renamed `GraphDef` to `GraphDefinition` for consistency with the rest of the API, which spells words out (e.g. `TensorMapDescriptor`, not `TensorMapDesc`). (#1950)
- Renamed `cuda.core.graph.Condition` to `GraphCondition` to follow the `Graph*` prefix convention used by `GraphBuilder`, `GraphDefinition`, and `GraphNode`. (#1945)
- Converted no-argument deterministic getters to properties for consistency with the rest of the API (#1945):
  - `Buffer.get_ipc_descriptor()` -> `Buffer.ipc_descriptor`
  - `Event.get_ipc_descriptor()` -> `Event.ipc_descriptor`
  - `DeviceMemoryResource.get_allocation_handle()` -> `DeviceMemoryResource.allocation_handle`
  - `PinnedMemoryResource.get_allocation_handle()` -> `PinnedMemoryResource.allocation_handle`
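The getter-to-property change can be illustrated with a minimal sketch; `BufferV0` and `BufferV1` below are hypothetical stand-ins, not the real cuda.core `Buffer`:

```python
# Illustrative mock of the getter-to-property rename; these classes are
# stand-ins for documentation purposes, not the real cuda.core types.
class BufferV0:
    """Pre-1.0 style: a no-argument deterministic getter method."""
    def get_ipc_descriptor(self):
        return "ipc-descriptor"

class BufferV1:
    """1.0.0 style: the same value exposed as a read-only property."""
    @property
    def ipc_descriptor(self):
        return "ipc-descriptor"

old = BufferV0().get_ipc_descriptor()  # call syntax
new = BufferV1().ipc_descriptor        # attribute access, no parentheses
assert old == new
```

Migrating code only needs to drop the `get_` prefix and the trailing parentheses.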
- Renamed boolean / non-noun properties for clearer naming (#1945):
  - `LaunchConfig.cooperative_launch` -> `LaunchConfig.is_cooperative` (also renames the constructor keyword argument).
  - `Event.is_timing_disabled` -> `Event.is_timing_enabled`.
  - `Event.is_sync_busy_waited` -> `Event.is_blocking_sync`.
  - `EventOptions.enable_timing` -> `EventOptions.timing_enabled` and `EventOptions.busy_waited_sync` -> `EventOptions.blocking_sync`.
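Note that the timing rename flips the flag's polarity. Assuming the new property reports the same underlying state with opposite sense (an inference from the names, not stated in the source), migrating code must negate the old value, as this mock sketch shows:

```python
# Mock sketch of the polarity inversion implied by the rename
# Event.is_timing_disabled -> Event.is_timing_enabled; plain booleans
# stand in for the real cuda.core Event properties (an assumption).
is_timing_disabled = True                   # pre-1.0 reading of an event

# 1.0.0 equivalent: same underlying state, opposite polarity,
# so a straight rename without negation would silently invert logic.
is_timing_enabled = not is_timing_disabled

assert is_timing_enabled is False
```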
- Renamed graph allocation methods to match `MemoryResource.allocate()`/`MemoryResource.deallocate()` (#1945):
  - `GraphDefinition.alloc` -> `GraphDefinition.allocate()`
  - `GraphDefinition.free` -> `GraphDefinition.deallocate()`
  - `GraphNode.alloc` -> `GraphNode.allocate()`
  - `GraphNode.free` -> `GraphNode.deallocate()`
- Cross-API consistency for graph builders (#1945):
  - `GraphBuilder.add_child` -> `GraphBuilder.embed()` (matches `GraphDefinition.embed()` and `GraphNode.embed()`).
  - `GraphDefinition.record_event`/`wait_event` -> `GraphDefinition.record()`/`GraphDefinition.wait()`, and the same on `GraphNode`, matching `Stream.record()`/`Stream.wait()`.
- `KernelAttributes` methods are now properties; per-device queries use indexing (#1945). The 17 attribute methods (`max_threads_per_block`, `num_regs`, `shared_size_bytes`, `cluster_scheduling_policy_preference`, etc.) that previously took a `device_id` argument are now properties on the view returned by `Kernel.attributes`. The view is bound to the current device by default; `kernel.attributes[device]` returns a view bound to a specific `Device` or device ordinal. The cache is shared across views of the same kernel.
  - Old: `kernel.attributes.num_regs()` and `kernel.attributes.num_regs(some_dev)`
  - New: `kernel.attributes.num_regs` and `kernel.attributes[some_dev].num_regs`
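The view-with-indexing pattern can be sketched with a minimal mock (the `_AttrView` and `MockKernel` classes below are illustrative, not the real cuda.core implementation, and the cached value is made up):

```python
# Minimal mock of the 1.0.0 attributes-view pattern: attributes are
# properties on a view, indexing binds the view to a device, and the
# cache is shared across views of the same kernel.
class _AttrView:
    def __init__(self, cache, device=0):
        self._cache = cache      # shared across all views of one kernel
        self._device = device    # device this view is bound to

    def __getitem__(self, device):
        # kernel.attributes[dev] -> a new view bound to `dev`
        return _AttrView(self._cache, device)

    @property
    def num_regs(self):
        # property access now; result cached per (attribute, device) pair
        return self._cache.setdefault(("num_regs", self._device), 32)

class MockKernel:
    def __init__(self):
        self._attr_cache = {}

    @property
    def attributes(self):
        # bound to the "current" device (0 in this mock) by default
        return _AttrView(self._attr_cache)

k = MockKernel()
default_regs = k.attributes.num_regs      # current device, no parentheses
dev1_regs = k.attributes[1].num_regs      # view bound to device ordinal 1
```

Because the cache lives on the kernel rather than the view, repeated `kernel.attributes[...]` lookups reuse earlier query results.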
- Unified the conditional graph API on `GraphCondition` with consistent verbs (#1945):
  - `GraphBuilder.create_conditional_handle` -> `GraphBuilder.create_condition()`. The new factory returns a `GraphCondition` (matching `GraphDefinition.create_condition()`) instead of a raw `CUgraphConditionalHandle`. The four conditional builder methods (`if_then()`, `if_else()`, `while_loop()`, `switch()`) now accept a `GraphCondition` instead of a raw handle.
  - `GraphBuilder.if_cond`/`GraphDefinition.if_cond`/`GraphNode.if_cond` -> `GraphBuilder.if_then()`/`GraphDefinition.if_then()`/`GraphNode.if_then()`. The new name parallels the existing `if_else`, `while_loop`, and `switch` methods (a verb describing the control-flow construct, not an abbreviation of "condition") and matches Python's own if/then/else vocabulary.
  - A `GraphCondition` may be passed directly as a kernel argument to `launch()`; the launcher unwraps it to the underlying `CUgraphConditionalHandle` value. Previously, `.handle` had to be extracted explicitly.
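The unwrap-on-launch behavior can be sketched with mock types; `MockGraphCondition` and `mock_launch` are stand-ins that only model the argument unwrapping, not real cuda.core or driver calls:

```python
# Sketch of launch() unwrapping a condition object to its raw handle;
# mock types only, no real CUDA involved.
class MockGraphCondition:
    def __init__(self, handle):
        self.handle = handle  # wraps the raw conditional-handle value

def mock_launch(*kernel_args):
    # The launcher unwraps condition objects to their raw handle values,
    # so callers no longer need to extract .handle explicitly.
    return tuple(
        a.handle if isinstance(a, MockGraphCondition) else a
        for a in kernel_args
    )

cond = MockGraphCondition(handle=42)
unwrapped = mock_launch(cond, 3.14)   # condition passed directly
assert unwrapped == (42, 3.14)
```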
## Fixes and enhancements
- `StridedMemoryView` now provides a fast path for `torch.Tensor` objects via PyTorch's AOT Inductor (AOTI) stable C ABI. When a `torch.Tensor` is passed to any `from_*` classmethod (`from_dlpack`, `from_cuda_array_interface`, `from_array_interface`, or `from_any_interface`), tensor metadata is read directly from the underlying C struct, bypassing the DLPack and CUDA Array Interface protocol overhead. This yields roughly 7-20x faster `StridedMemoryView` construction for PyTorch tensors (depending on whether stream ordering is required). Proper CUDA stream ordering is established between PyTorch's current stream and the consumer stream, matching the DLPack synchronization contract. Requires PyTorch >= 2.3. (#749)
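The dispatch idea behind the fast path can be sketched with mocks; `MockTensor` and `view_metadata` below are hypothetical and make no real ABI calls, whereas the actual implementation reads the tensor's C struct through PyTorch's AOTI stable ABI:

```python
# Mock sketch of type-based fast-path dispatch: recognize the tensor type
# and read its metadata directly, instead of negotiating a protocol.
class MockTensor:
    """Stand-in for torch.Tensor with precomputed metadata."""
    shape = (2, 3)
    strides = (3, 1)

def view_metadata(obj):
    if isinstance(obj, MockTensor):
        # Fast path: metadata read directly from the object (the real
        # code reads the underlying C struct via the AOTI stable ABI).
        return ("fast", obj.shape, obj.strides)
    # Generic path: fall back to the DLPack / array-interface protocols.
    return ("protocol", getattr(obj, "shape", None), None)

kind, shape, strides = view_metadata(MockTensor())
assert kind == "fast" and shape == (2, 3)
```

The point of the design is that the common producer type skips all protocol negotiation while every other producer still goes through the standard interfaces.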