cuda.core API Reference#
This is the main API reference for cuda.core. The package has not yet
reached version 1.0.0, and APIs may change between minor versions, possibly
without deprecation warnings. Once version 1.0.0 is released, APIs will
be considered stable and will follow semantic versioning with appropriate
deprecation periods for breaking changes.
Devices and execution#
- Represent a GPU and act as an entry point for cuda.core features.
- Launches a …
- Represent a queue of GPU operations that are executed in a specific order.
- Represent a record at a specific point of execution within a CUDA stream.
- CUDA context wrapper.
- Represent an SM (streaming multiprocessor) resource partition.
- Represent a workqueue resource for a device or green context.
- Customizable …
- Customizable …
- Customizable launch options.
- Options for context creation.
- Customizable …
- Customizable …
- cuda.core.LEGACY_DEFAULT_STREAM#
The legacy default CUDA stream. All devices share the same legacy default stream, and work launched on it is not concurrent with work on any other stream.
- cuda.core.PER_THREAD_DEFAULT_STREAM#
The per-thread default CUDA stream. Each host thread has its own per-thread default stream, and work launched on it can execute concurrently with work on other non-blocking streams.
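A minimal sketch of how these pieces fit together, assuming a CUDA-capable GPU is present; the method and option names (`create_stream`, `create_event`, `record`, `EventOptions(enable_timing=...)`) follow the classes listed above, but check the individual class pages for exact signatures:

```python
# Sketch: select a device, create a stream, and bracket work on that
# stream with two timing-enabled events. Requires a CUDA-capable GPU.
from cuda.core import Device, EventOptions

dev = Device()        # device 0 by default
dev.set_current()     # make it the current device for this host thread

stream = dev.create_stream()
start = dev.create_event(options=EventOptions(enable_timing=True))
stop = dev.create_event(options=EventOptions(enable_timing=True))

stream.record(start)
# ... enqueue kernels or copies on `stream` here ...
stream.record(stop)
stop.sync()           # block the host until everything before `stop` is done
```

Work launched on `stream` (a non-default, non-blocking stream) is ordered only with respect to other work on that same stream, which is why the events are needed to observe completion from the host.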
Memory management#
- Represent a handle to allocated memory.
- Abstract base class for memory resources that manage allocation and deallocation of buffers.
- A device memory resource managing a stream-ordered memory pool.
- A memory resource for memory related to graphs.
- A host-pinned memory resource managing a stream-ordered memory pool.
- A managed memory resource managing a stream-ordered memory pool.
- Create a pinned memory resource that uses the legacy cuMemAllocHost/cudaMallocHost APIs.
- Create a device memory resource that uses the CUDA VMM APIs to allocate memory.
- Customizable …
- Customizable …
- Customizable …
- A configuration object for the VirtualMemoryResource.
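A minimal sketch of stream-ordered allocation through a device's memory resource, assuming a CUDA-capable GPU; `memory_resource`, `allocate`, and `close` follow the Buffer/DeviceMemoryResource classes listed above, but treat the exact names as assumptions to verify against the class pages:

```python
# Sketch: allocate a buffer from the device's default memory resource in
# stream order, use it, and free it in stream order.
from cuda.core import Device

dev = Device()
dev.set_current()
stream = dev.create_stream()

mr = dev.memory_resource                   # default device memory resource
buf = mr.allocate(1 << 20, stream=stream)  # 1 MiB, ordered on `stream`
# ... pass `buf` to kernels launched on `stream` ...
buf.close(stream)                          # stream-ordered deallocation
stream.sync()
```

Because both the allocation and the deallocation are ordered on `stream`, no host synchronization is needed between them; the pool reuses the memory once the stream reaches the deallocation point.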
CUDA graphs#
A CUDA graph captures a set of GPU operations and their dependencies,
allowing them to be defined once and launched repeatedly with minimal
CPU overhead. Graphs can be constructed in two ways:
GraphBuilder captures operations from a stream, while
GraphDefinition builds a graph explicitly by adding nodes and
edges. Both produce an executable Graph that can be
launched on a Stream.
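The capture path described above can be sketched as follows, assuming a CUDA-capable GPU; the method names here (`create_graph_builder`, `begin_building`, `end_building`, `complete`) are assumptions based on the classes listed below, so confirm them against the GraphBuilder and Graph pages:

```python
# Sketch: record work into a GraphBuilder by stream capture, finalize it
# into an executable Graph, then replay it on a stream.
from cuda.core import Device

dev = Device()
dev.set_current()
stream = dev.create_stream()

gb = dev.create_graph_builder()
gb.begin_building()
# ... launch kernels / copies onto `gb` as if it were a stream ...
gb.end_building()

graph = gb.complete()   # finalize the capture into an executable Graph
graph.launch(stream)    # replay the whole captured sequence
stream.sync()
```

The payoff is in the replay: `graph.launch` resubmits every captured operation with a single call, which is where the "minimal CPU overhead" above comes from.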
- An executable graph.
- A graph under construction by stream capture.
- A graph definition.
- A node in a graph definition.
- A condition variable for conditional graph nodes.
- Options for graph memory allocation nodes.
- Options for graph instantiation.
- Options for debug_dot_print().
Node types#
Every graph node is a subclass of GraphNode, which
provides the common interface (dependencies, successors, destruction).
Each subclass exposes attributes unique to its operation type.
- An empty (synchronization) node.
- A kernel launch node.
- A memory allocation node.
- A memory deallocation node.
- A memset node.
- A memcpy node.
- A child graph node.
- An event record node.
- An event wait node.
- A host callback node.
- Base class for conditional nodes.
- An if-conditional node.
- An if-else conditional node.
- A while-loop conditional node.
- A switch conditional node.
Graphics interoperability#
- RAII wrapper for a CUDA graphics resource (…).
Tensor Memory Accelerator (TMA)#
- Describes a TMA (Tensor Memory Accelerator) tensor map for Hopper+ GPUs.
- Options for …
CUDA compilation toolchain#
- Represent a compilation machinery to process programs into …
- Represent a linking machinery to link one or more object codes into …
- Represent a compiled program to be loaded onto the device.
- Represent a compiled kernel that has been loaded onto the device.
- Customizable options for configuring …
- Customizable options for configuring …
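A minimal sketch of the compile-and-load flow, assuming a CUDA-capable GPU with NVRTC available: a Program compiles CUDA C++ source into an ObjectCode (here a cubin), from which a Kernel is looked up by name. The `arch` option and the `compute_capability` fields are assumptions based on the classes listed above:

```python
# Sketch: compile a small CUDA C++ kernel for the current device and
# retrieve the loaded Kernel object.
from cuda.core import Device, Program, ProgramOptions

code = r"""
extern "C" __global__ void scale(float *x, float a, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}
"""

dev = Device()
dev.set_current()

cc = dev.compute_capability
prog = Program(code, code_type="c++",
               options=ProgramOptions(arch=f"sm_{cc.major}{cc.minor}"))
module = prog.compile("cubin")        # -> ObjectCode
kernel = module.get_kernel("scale")   # -> Kernel, ready for launch()
```

A LaunchConfig and the launch() function from the "Devices and execution" section would then run `scale` on a stream with device arguments.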
CUDA process checkpointing#
The cuda.core.checkpoint module wraps the CUDA driver process
checkpoint APIs. These APIs are intended for Linux process checkpoint and
restore workflows, and require a CUDA driver with checkpoint API support and
a cuda-bindings version that exposes those driver entry points.
Checkpointing is typically driven by a coordinator process acting on a target
CUDA process, similar to attaching a debugger or sending a signal. The target
process is identified by process ID. Linux and the CUDA driver enforce process
permissions; checkpointing another user’s process may require elevated
permissions such as CAP_SYS_PTRACE or administrator privileges.
The CUDA checkpoint APIs prepare CUDA-managed GPU state for process-level checkpoint and restore. They do not capture the CPU process image by themselves; full process checkpoint workflows still need a CPU-side process checkpointing tool such as CRIU. A minimal coordinator-side sequence looks like this:
```python
import os
from cuda.core import checkpoint

target_pid = os.getpid()  # or the PID of another CUDA process
process = checkpoint.Process(target_pid)

process.lock(timeout_ms=5000)
process.checkpoint()
# Capture or restore the CPU process image outside cuda.core.
process.restore()
process.unlock()
```
Process.state returns one of "running", "locked",
"checkpointed", or "failed". Restore may optionally remap GPUs by
passing gpu_mapping from each checkpointed GPU UUID to the GPU UUID that
should be used during restore. For migration workflows, provide mappings for
every GPU visible to the NVIDIA kernel-mode driver at checkpoint time.
User-space masking such as CUDA_VISIBLE_DEVICES does not reduce this
mapping requirement, so applications that rely on user-space GPU masking may
not be valid migration targets. The mapping may use CUuuid objects or the
UUID strings returned by Device.uuid. A successful restore returns the
process to the locked state; call Process.unlock after restore to allow
CUDA API calls to resume.
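A sketch of a restore that remaps GPUs, following the `gpu_mapping` behavior described above. The names `coordinator_target_pid`, `checkpointed_uuid`, and `replacement_uuid` are hypothetical placeholders; the UUIDs would come from Device.uuid values recorded at checkpoint time:

```python
# Sketch: restore a checkpointed process onto a different GPU by mapping
# the checkpointed GPU's UUID to the replacement GPU's UUID.
from cuda.core import checkpoint

process = checkpoint.Process(coordinator_target_pid)  # hypothetical PID
# ... lock, checkpoint, and CPU-side capture/restore happen before this ...
process.restore(gpu_mapping={checkpointed_uuid: replacement_uuid})
process.unlock()   # return to "running"; CUDA calls in the target resume
```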
The CUDA driver requires restore to run from the process restore thread.
Use Process.restore_thread_id to discover that thread before calling
Process.restore from a checkpoint coordinator. Restore also requires
persistence mode to be enabled or cuInit to have been called before
execution.
- CUDA process that can be locked, checkpointed, restored, and unlocked.
CUDA system information and NVIDIA Management Library (NVML)#
Basic functions#
- Get the driver version.
- Get the full driver version.
- Retrieve the driver branch of the NVIDIA driver installed on the system.
- Return the number of devices in the system.
- The version of the NVML library.
- The name of the process with the given PID.
- Retrieve the common ancestor for two devices.
- Retrieve the P2P status between two devices.
Events#
- Start recording of events on the system.
Types#
- Representation of a device.
- NVLink information for a device.
Utility functions#
- Decorator to create proxy objects to …
- A class holding metadata of a strided dense array/tensor.