cuda.core API Reference#

This is the main API reference for cuda.core. The package has not yet reached version 1.0.0, and APIs may change between minor versions, possibly without deprecation warnings. Once version 1.0.0 is released, APIs will be considered stable and will follow semantic versioning with appropriate deprecation periods for breaking changes.

Devices and execution#

Device([device_id])

Represent a GPU and act as an entry point for cuda.core features.

launch(stream, LaunchConfig config, ...)

Launches a Kernel object with launch-time configuration.

Stream(*args, **kwargs)

Represent a queue of GPU operations that are executed in a specific order.

Event(*args, **kwargs)

Represent a record at a specific point of execution within a CUDA stream.

Context(*args, **kwargs)

CUDA context wrapper.

SMResource(*args, **kwargs)

Represent an SM (streaming multiprocessor) resource partition.

WorkqueueResource(*args, **kwargs)

Represent a workqueue resource for a device or green context.

StreamOptions([nonblocking, priority])

Customizable Stream options.

EventOptions([timing_enabled, ...])

Customizable Event options.

LaunchConfig([grid, cluster, block, ...])

Customizable launch options.

ContextOptions(resources)

Options for context creation.

SMResourceOptions([count, ...])

Customizable SMResource.split options.

WorkqueueResourceOptions([sharing_scope])

Customizable WorkqueueResource.configure options.

cuda.core.LEGACY_DEFAULT_STREAM#

The legacy default CUDA stream. All devices share the same legacy default stream, and work launched on it is not concurrent with work on any other stream.

cuda.core.PER_THREAD_DEFAULT_STREAM#

The per-thread default CUDA stream. Each host thread has its own per-thread default stream, and work launched on it can execute concurrently with work on other non-blocking streams.
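The two sentinels can be passed wherever a stream is expected, for example as the stream argument to launch(). A minimal sketch (requires cuda.core and a GPU to actually execute; `kernel` stands in for a compiled Kernel object):

```python
# Sketch: launching on the built-in default streams. Requires cuda.core
# and a GPU to execute; `kernel` stands in for a compiled Kernel object.
try:
    from cuda.core import (
        LaunchConfig,
        launch,
        LEGACY_DEFAULT_STREAM,
        PER_THREAD_DEFAULT_STREAM,
    )
except ImportError:  # cuda.core not installed in this environment
    LaunchConfig = launch = None
    LEGACY_DEFAULT_STREAM = PER_THREAD_DEFAULT_STREAM = None

def launch_on_default_streams(kernel):
    config = LaunchConfig(grid=1, block=128)
    # Serializes with work submitted to every other stream on the device:
    launch(LEGACY_DEFAULT_STREAM, config, kernel)
    # Uses this host thread's own default stream; may overlap with work
    # on other non-blocking streams:
    launch(PER_THREAD_DEFAULT_STREAM, config, kernel)
```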

Memory management#

Buffer(*args, **kwargs)

Represent a handle to allocated memory.

MemoryResource

Abstract base class for memory resources that manage allocation and deallocation of buffers.

DeviceMemoryResource(device_id[, options])

A device memory resource managing a stream-ordered memory pool.

GraphMemoryResource(device_id)

A memory resource for memory related to graphs.

PinnedMemoryResource([options])

A host-pinned memory resource managing a stream-ordered memory pool.

ManagedMemoryResource([options])

A managed memory resource managing a stream-ordered memory pool.

LegacyPinnedMemoryResource

Create a pinned memory resource that uses the legacy cuMemAllocHost/cudaMallocHost APIs.

VirtualMemoryResource(device_id[, config])

Create a device memory resource that uses the CUDA VMM APIs to allocate memory.

DeviceMemoryResourceOptions([ipc_enabled, ...])

Customizable DeviceMemoryResource options.

PinnedMemoryResourceOptions([ipc_enabled, ...])

Customizable PinnedMemoryResource options.

ManagedMemoryResourceOptions([...])

Customizable ManagedMemoryResource options.

VirtualMemoryResourceOptions(...)

A configuration object for the VirtualMemoryResource.

CUDA graphs#

A CUDA graph captures a set of GPU operations and their dependencies, allowing them to be defined once and launched repeatedly with minimal CPU overhead. Graphs can be constructed in two ways: GraphBuilder captures operations from a stream, while GraphDefinition builds a graph explicitly by adding nodes and edges. Both produce an executable Graph that can be launched on a Stream.
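The capture-and-replay workflow above can be sketched as follows. This requires cuda.core and a GPU to run, and the method names used to obtain, finish, and launch the graph (create_graph_builder, begin_building, end_building, complete, Graph.launch) are assumptions based on the class summaries here; see GraphBuilder and Graph for the real interface.

```python
# Sketch of stream-capture graph construction; requires cuda.core and a
# GPU to run. Builder/launch method names are assumptions -- consult the
# GraphBuilder and Graph references for the actual API.
try:
    from cuda.core import Device
except ImportError:  # cuda.core not installed in this environment
    Device = None

def capture_and_replay(enqueue_work, n_replays):
    """Capture GPU work once, then relaunch it n_replays times."""
    dev = Device()
    dev.set_current()
    stream = dev.create_stream()
    builder = dev.create_graph_builder()  # assumed accessor
    builder.begin_building()              # start stream capture
    enqueue_work(builder)                 # enqueue GPU work for capture
    builder.end_building()                # stop capture
    graph = builder.complete()            # instantiate an executable Graph
    for _ in range(n_replays):
        graph.launch(stream)              # replay with minimal CPU overhead
    stream.sync()
```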

graph.Graph()

An executable graph.

graph.GraphBuilder()

A graph under construction by stream capture.

graph.GraphDefinition()

A graph definition.

graph.GraphNode

A node in a graph definition.

graph.GraphCondition

A condition variable for conditional graph nodes.

graph.GraphAllocOptions([device, ...])

Options for graph memory allocation nodes.

graph.GraphCompleteOptions([...])

Options for graph instantiation.

graph.GraphDebugPrintOptions([verbose, ...])

Options for debug_dot_print().

Node types#

Every graph node is a subclass of GraphNode, which provides the common interface (dependencies, successors, destruction). Each subclass exposes attributes unique to its operation type.

graph.EmptyNode

An empty (synchronization) node.

graph.KernelNode

A kernel launch node.

graph.AllocNode

A memory allocation node.

graph.FreeNode

A memory deallocation node.

graph.MemsetNode

A memset node.

graph.MemcpyNode

A memcpy node.

graph.ChildGraphNode

A child graph node.

graph.EventRecordNode

An event record node.

graph.EventWaitNode

An event wait node.

graph.HostCallbackNode

A host callback node.

graph.ConditionalNode

Base class for conditional nodes.

graph.IfNode

An if-conditional node.

graph.IfElseNode

An if-else conditional node.

graph.WhileNode

A while-loop conditional node.

graph.SwitchNode

A switch conditional node.

Graphics interoperability#

GraphicsResource()

RAII wrapper for a CUDA graphics resource (CUgraphicsResource).

Tensor Memory Accelerator (TMA)#

TensorMapDescriptor()

Describes a TMA (Tensor Memory Accelerator) tensor map for Hopper+ GPUs.

TensorMapDescriptorOptions(box_dim[, ...])

Options for cuda.core.StridedMemoryView.as_tensor_map().

CUDA compilation toolchain#

Program(code, code_type[, options])

Represent a compilation machinery to process programs into ObjectCode.

Linker([options])

Represent a linking machinery to link one or more object codes into ObjectCode.

ObjectCode(*args, **kwargs)

Represent a compiled program to be loaded onto the device.

Kernel(*args, **kwargs)

Represent a compiled kernel that has been loaded onto the device.

ProgramOptions([name, arch, ...])

Customizable options for configuring Program.

LinkerOptions([name, arch, ...])

Customizable options for configuring Linker.

CUDA process checkpointing#

The cuda.core.checkpoint module wraps the CUDA driver process checkpoint APIs. These APIs are intended for Linux process checkpoint and restore workflows, and require a CUDA driver with checkpoint API support and a cuda-bindings version that exposes those driver entry points.

Checkpointing is typically driven by a coordinator process acting on a target CUDA process, similar to attaching a debugger or sending a signal. The target process is identified by process ID. Linux and the CUDA driver enforce process permissions; checkpointing another user’s process may require elevated permissions such as CAP_SYS_PTRACE or administrator privileges.

The CUDA checkpoint APIs prepare CUDA-managed GPU state for process-level checkpoint and restore. They do not capture the CPU process image by themselves; full process checkpoint workflows still need a CPU-side process checkpointing tool such as CRIU. A minimal coordinator-side sequence looks like this:

import os

from cuda.core import checkpoint

target_pid = os.getpid()  # or the PID of another CUDA process
process = checkpoint.Process(target_pid)
process.lock(timeout_ms=5000)
process.checkpoint()

# Capture or restore the CPU process image outside cuda.core.

process.restore()
process.unlock()

Process.state returns one of "running", "locked", "checkpointed", or "failed".

Restore may optionally remap GPUs by passing gpu_mapping from each checkpointed GPU UUID to the GPU UUID that should be used during restore. For migration workflows, provide mappings for every GPU visible to the NVIDIA kernel-mode driver at checkpoint time. User-space masking such as CUDA_VISIBLE_DEVICES does not reduce this mapping requirement, so applications that rely on user-space GPU masking may not be valid migration targets. The mapping may use CUuuid objects or the UUID strings returned by Device.uuid.

A successful restore returns the process to the locked state; call Process.unlock after restore to allow CUDA API calls to resume.
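As an illustration, a single-GPU gpu_mapping can be built from UUID strings. The UUIDs below are placeholders; real values come from Device.uuid at checkpoint time (keys) and on the restore target (values).

```python
# Build a gpu_mapping for restore-time GPU remapping. The UUID strings
# are placeholders; real values come from Device.uuid at checkpoint time
# (keys) and on the restore target (values).
checkpoint_uuids = ["GPU-00000000-0000-0000-0000-000000000000"]
restore_uuids = ["GPU-11111111-1111-1111-1111-111111111111"]

gpu_mapping = dict(zip(checkpoint_uuids, restore_uuids))

# On the restore thread of a checkpointed process:
#     process.restore(gpu_mapping=gpu_mapping)
```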

The CUDA driver requires restore to run from the process restore thread. Use Process.restore_thread_id to discover that thread before calling Process.restore from a checkpoint coordinator. Restore also requires persistence mode to be enabled or cuInit to have been called before execution.

checkpoint.Process(pid)

CUDA process that can be locked, checkpointed, restored, and unlocked.

CUDA system information and NVIDIA Management Library (NVML)#

Basic functions#

system.get_driver_version(bool kernel_mode)

Get the driver version.

system.get_driver_version_full(bool kernel_mode)

Get the full driver version.

system.get_driver_branch()

Retrieves the driver branch of the NVIDIA driver installed on the system.

system.get_num_devices()

Return the number of devices in the system.

system.get_nvml_version()

The version of the NVML library.

system.get_process_name(int pid)

The name of the process with the given PID.

system.get_topology_common_ancestor(...)

Retrieve the common ancestor for two devices.

system.get_p2p_status(Device device1, ...)

Retrieve the P2P status between two devices.

Events#

system.register_events(events)

Starts recording the given events on the system.

Types#

system.Device(int index, *, uuid, pci_bus_id)

Representation of a device.

system.NvlinkInfo(Device device, int link)

NVLink information for a device.

Utility functions#

args_viewable_as_strided_memory(...)

Decorator to create proxy objects to StridedMemoryView for the specified positional arguments.

StridedMemoryView(obj, int stream_ptr)

A class holding metadata of a strided dense array/tensor.
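A small sketch of how a StridedMemoryView might be used to inspect an array-like object. This requires cuda.core to execute, and the attribute names (shape, strides, is_device_accessible) and the stream_ptr=-1 "no synchronization" convention are assumptions; check the StridedMemoryView reference.

```python
# Sketch: inspecting an object that supports DLPack or the CUDA Array
# Interface via StridedMemoryView. Requires cuda.core to execute; the
# attribute names used below are assumptions -- see the class docs.
try:
    from cuda.core import StridedMemoryView
except ImportError:  # cuda.core not installed in this environment
    StridedMemoryView = None

def describe_tensor(obj):
    """Return basic layout metadata for an array/tensor-like object.

    stream_ptr=-1 is assumed to request no stream synchronization
    (i.e. the producer has already ordered its work).
    """
    view = StridedMemoryView(obj, stream_ptr=-1)
    return view.shape, view.strides, view.is_device_accessible
```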