# cuda.core 1.0.0 Release Notes

## Highlights
TBD
## New features
TBD
## Breaking changes
- Renamed `GraphDef` to `GraphDefinition` for consistency with the rest of the API, which spells words out (e.g. `TensorMapDescriptor`, not `TensorMapDesc`). (#1950)
- Renamed `cuda.core.graph.Condition` to `GraphCondition` to follow the `Graph*` prefix convention used by `GraphBuilder`, `GraphDefinition`, and `GraphNode`. (#1945)
- Converted no-argument deterministic getters to properties for consistency with the rest of the API (#1945):
  - `Buffer.get_ipc_descriptor()` -> `Buffer.ipc_descriptor`
  - `Event.get_ipc_descriptor()` -> `Event.ipc_descriptor`
  - `DeviceMemoryResource.get_allocation_handle()` -> `DeviceMemoryResource.allocation_handle`
  - `PinnedMemoryResource.get_allocation_handle()` -> `PinnedMemoryResource.allocation_handle`
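The getter-to-property change can be illustrated with a minimal sketch; `BufferV0` and `BufferV1` below are hypothetical stand-ins, not the real cuda.core `Buffer`:

```python
# Illustrative mock of the getter-to-property rename; these classes are
# stand-ins for documentation purposes, not the real cuda.core types.
class BufferV0:
    """Pre-1.0 style: a no-argument deterministic getter method."""
    def get_ipc_descriptor(self):
        return "ipc-descriptor"

class BufferV1:
    """1.0.0 style: the same value exposed as a read-only property."""
    @property
    def ipc_descriptor(self):
        return "ipc-descriptor"

old = BufferV0().get_ipc_descriptor()  # call syntax
new = BufferV1().ipc_descriptor        # attribute access, no parentheses
assert old == new
```

Migrating code only needs to drop the `get_` prefix and the trailing parentheses.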
- Renamed boolean / non-noun properties for clearer naming (#1945):
  - `LaunchConfig.cooperative_launch` -> `LaunchConfig.is_cooperative` (also renames the constructor keyword argument).
  - `Event.is_timing_disabled` -> `Event.is_timing_enabled`.
  - `Event.is_sync_busy_waited` -> `Event.is_blocking_sync`.
  - `EventOptions.enable_timing` -> `EventOptions.timing_enabled` and `EventOptions.busy_waited_sync` -> `EventOptions.blocking_sync`.
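Note that the timing rename flips the flag's polarity. Assuming the new property reports the same underlying state with opposite sense (an inference from the names, not stated in the source), migrating code must negate the old value, as this mock sketch shows:

```python
# Mock sketch of the polarity inversion implied by the rename
# Event.is_timing_disabled -> Event.is_timing_enabled; plain booleans
# stand in for the real cuda.core Event properties (an assumption).
is_timing_disabled = True                   # pre-1.0 reading of an event

# 1.0.0 equivalent: same underlying state, opposite polarity,
# so a straight rename without negation would silently invert logic.
is_timing_enabled = not is_timing_disabled

assert is_timing_enabled is False
```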
- Renamed graph allocation methods to match `MemoryResource.allocate()`/`MemoryResource.deallocate()` (#1945):
  - `GraphDefinition.alloc` -> `GraphDefinition.allocate()`
  - `GraphDefinition.free` -> `GraphDefinition.deallocate()`
  - `GraphNode.alloc` -> `GraphNode.allocate()`
  - `GraphNode.free` -> `GraphNode.deallocate()`
- Cross-API consistency for graph builders (#1945):
  - `GraphBuilder.add_child` -> `GraphBuilder.embed()` (matches `GraphDefinition.embed()` and `GraphNode.embed()`).
  - `GraphDefinition.record_event`/`wait_event` -> `GraphDefinition.record()`/`GraphDefinition.wait()`, and the same on `GraphNode`, matching `Stream.record()`/`Stream.wait()`.
- `KernelAttributes` methods are now properties; per-device queries use indexing (#1945). The 17 attribute methods (`max_threads_per_block`, `num_regs`, `shared_size_bytes`, `cluster_scheduling_policy_preference`, etc.) that previously took a `device_id` argument are now properties on the view returned by `Kernel.attributes`. The view is bound to the current device by default; `kernel.attributes[device]` returns a view bound to a specific `Device` or device ordinal. The cache is shared across views of the same kernel.
  - Old: `kernel.attributes.num_regs()` and `kernel.attributes.num_regs(some_dev)`
  - New: `kernel.attributes.num_regs` and `kernel.attributes[some_dev].num_regs`
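The view-with-indexing pattern can be sketched with a minimal mock (the `_AttrView` and `MockKernel` classes below are illustrative, not the real cuda.core implementation, and the cached value is made up):

```python
# Minimal mock of the 1.0.0 attributes-view pattern: attributes are
# properties on a view, indexing binds the view to a device, and the
# cache is shared across views of the same kernel.
class _AttrView:
    def __init__(self, cache, device=0):
        self._cache = cache      # shared across all views of one kernel
        self._device = device    # device this view is bound to

    def __getitem__(self, device):
        # kernel.attributes[dev] -> a new view bound to `dev`
        return _AttrView(self._cache, device)

    @property
    def num_regs(self):
        # property access now; result cached per (attribute, device) pair
        return self._cache.setdefault(("num_regs", self._device), 32)

class MockKernel:
    def __init__(self):
        self._attr_cache = {}

    @property
    def attributes(self):
        # bound to the "current" device (0 in this mock) by default
        return _AttrView(self._attr_cache)

k = MockKernel()
default_regs = k.attributes.num_regs      # current device, no parentheses
dev1_regs = k.attributes[1].num_regs      # view bound to device ordinal 1
```

Because the cache lives on the kernel rather than the view, repeated `kernel.attributes[...]` lookups reuse earlier query results.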
- Unified the conditional graph API on `GraphCondition` with consistent verbs (#1945):
  - `GraphBuilder.create_conditional_handle` -> `GraphBuilder.create_condition()`. The new factory returns a `GraphCondition` (matching `GraphDefinition.create_condition()`) instead of a raw `CUgraphConditionalHandle`. The four conditional builder methods (`if_then()`, `if_else()`, `while_loop()`, `switch()`) now accept a `GraphCondition` instead of a raw handle.
  - `GraphBuilder.if_cond`/`GraphDefinition.if_cond`/`GraphNode.if_cond` -> `GraphBuilder.if_then()`/`GraphDefinition.if_then()`/`GraphNode.if_then()`. The new name parallels the existing `if_else`, `while_loop`, and `switch` methods (a verb describing the control-flow construct, not an abbreviation of "condition") and matches Python's own if/then/else vocabulary.
  - A `GraphCondition` may be passed directly as a kernel argument to `launch()`; the launcher unwraps it to the underlying `CUgraphConditionalHandle` value. Previously, `.handle` had to be extracted explicitly.
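The unwrap-on-launch behavior can be sketched with mock types; `MockGraphCondition` and `mock_launch` are stand-ins that only model the argument unwrapping, not real cuda.core or driver calls:

```python
# Sketch of launch() unwrapping a condition object to its raw handle;
# mock types only, no real CUDA involved.
class MockGraphCondition:
    def __init__(self, handle):
        self.handle = handle  # wraps the raw conditional-handle value

def mock_launch(*kernel_args):
    # The launcher unwraps condition objects to their raw handle values,
    # so callers no longer need to extract .handle explicitly.
    return tuple(
        a.handle if isinstance(a, MockGraphCondition) else a
        for a in kernel_args
    )

cond = MockGraphCondition(handle=42)
unwrapped = mock_launch(cond, 3.14)   # condition passed directly
assert unwrapped == (42, 3.14)
```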
## Fixes and enhancements
- `StridedMemoryView` now provides a fast path for `torch.Tensor` objects via PyTorch's AOT Inductor (AOTI) stable C ABI. When a `torch.Tensor` is passed to any `from_*` classmethod (`from_dlpack`, `from_cuda_array_interface`, `from_array_interface`, or `from_any_interface`), tensor metadata is read directly from the underlying C struct, bypassing the DLPack and CUDA Array Interface protocol overhead. This yields roughly 7-20x faster `StridedMemoryView` construction for PyTorch tensors (depending on whether stream ordering is required). Proper CUDA stream ordering is established between PyTorch's current stream and the consumer stream, matching the DLPack synchronization contract. Requires PyTorch >= 2.3. (#749)
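The dispatch idea behind the fast path can be sketched with mocks; `MockTensor` and `view_metadata` below are hypothetical and make no real ABI calls, whereas the actual implementation reads the tensor's C struct through PyTorch's AOTI stable ABI:

```python
# Mock sketch of type-based fast-path dispatch: recognize the tensor type
# and read its metadata directly, instead of negotiating a protocol.
class MockTensor:
    """Stand-in for torch.Tensor with precomputed metadata."""
    shape = (2, 3)
    strides = (3, 1)

def view_metadata(obj):
    if isinstance(obj, MockTensor):
        # Fast path: metadata read directly from the object (the real
        # code reads the underlying C struct via the AOTI stable ABI).
        return ("fast", obj.shape, obj.strides)
    # Generic path: fall back to the DLPack / array-interface protocols.
    return ("protocol", getattr(obj, "shape", None), None)

kind, shape, strides = view_metadata(MockTensor())
assert kind == "fast" and shape == (2, 3)
```

The point of the design is that the common producer type skips all protocol negotiation while every other producer still goes through the standard interfaces.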