# Runtime
The Runtime API provides higher-level building blocks for core CUDA functionality. It builds on the existing CUDA Runtime API, removing or replacing problematic patterns such as implicit state. It is designed to make common operations like resource management, work submission, and memory allocation easier to express and safer to compose. These APIs lower to the CUDA Driver API under the hood, but reuse CUDA Runtime handle types (such as `cudaStream_t`) in their interfaces. The result is an interface that applies RAII for lifetime management while remaining composable with existing CUDA C++ code that manages resources explicitly.
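For instance, an owning stream can be created with RAII and its raw handle handed to code that still calls the CUDA Runtime directly. The following is a minimal sketch, assuming the owning `cuda::stream` exposes its underlying `cudaStream_t` via `get()`, mirroring `cuda::stream_ref::get()`:

```cpp
#include <cuda/devices>
#include <cuda/stream>
#include <cuda_runtime_api.h>

// Existing CUDA C++ code that works with raw handles.
void legacy_work(float* dev_ptr, size_t bytes, cudaStream_t s) {
  cudaMemsetAsync(dev_ptr, 0, bytes, s); // any CUDA Runtime call can be used here
}

int main() {
  cuda::stream stream{cuda::devices[0]}; // RAII: created here, destroyed at scope exit

  float* dev_ptr = nullptr;
  cudaMallocAsync(reinterpret_cast<void**>(&dev_ptr), 1024 * sizeof(float), stream.get());

  // Assumption: get() exposes the underlying cudaStream_t, as cuda::stream_ref::get() does.
  legacy_work(dev_ptr, 1024 * sizeof(float), stream.get());

  cudaFreeAsync(dev_ptr, stream.get());
  cudaStreamSynchronize(stream.get()); // drain the stream before the owning wrapper is destroyed
}
```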
At a glance, the runtime layer includes:
- Streams and events for work submission and synchronization.
- Buffers as typed, stream-ordered, property-checked memory containers.
- Memory pools to allocate device, managed, and pinned memory, either directly or through buffers.
- Launch API to configure and launch kernels.
- Runtime algorithms like `copy_bytes` and `fill_bytes` for basic data movement (a short sketch follows below).
- Legacy memory resources as synchronous compatibility fallbacks for older toolkits.
See CUDA Runtime interactions if you are interested in CUDA Runtime interop.
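As a quick taste of the runtime algorithms, the sketch below zeroes a device buffer with `fill_bytes` and copies it back to host memory with `copy_bytes`. Treat it as a sketch: the `<cuda/algorithm>` header name and the exact argument orders noted in the comments are assumptions; consult the reference table below for the authoritative signatures.

```cpp
#include <cuda/algorithm>   // assumed header for copy_bytes / fill_bytes
#include <cuda/buffer>
#include <cuda/devices>
#include <cuda/memory_pool>
#include <cuda/stream>
#include <cuda_runtime_api.h>
#include <vector>

int main() {
  cuda::stream stream{cuda::devices[0]};
  auto pool = cuda::device_default_memory_pool(cuda::devices[0]);
  auto buf  = cuda::make_buffer<int>(stream, pool, 256, cuda::no_init);

  cuda::fill_bytes(stream, buf, 0);    // assumed order: stream, destination, byte value

  std::vector<int> host(256);
  cuda::copy_bytes(stream, buf, host); // assumed order: stream, source, destination

  cudaStreamSynchronize(stream.get()); // host data is valid only after the stream syncs
}
```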
## Example: vector add with buffers, pools, and launch
```cpp
#include <cuda/devices>
#include <cuda/stream>
#include <cuda/std/span>
#include <cuda/buffer>
#include <cuda/memory_pool>
#include <cuda/launch>

struct kernel {
  template <typename Config>
  __device__ void operator()(Config config,
                             cuda::std::span<const float> A,
                             cuda::std::span<const float> B,
                             cuda::std::span<float> C) {
    // Rank of this thread within the whole grid for the given configuration.
    auto tid = cuda::gpu_thread.rank(cuda::grid, config);
    if (tid < A.size()) {
      C[tid] = A[tid] + B[tid];
    }
  }
};

int main() {
  cuda::device_ref device = cuda::devices[0];
  cuda::stream stream{device};
  auto pool = cuda::device_default_memory_pool(device);

  int num_elements = 1000;
  // Stream-ordered allocations from the default device pool; A and B are
  // filled with 1.0f and 2.0f, C is left uninitialized.
  auto A = cuda::make_buffer<float>(stream, pool, num_elements, 1.0f);
  auto B = cuda::make_buffer<float>(stream, pool, num_elements, 2.0f);
  auto C = cuda::make_buffer<float>(stream, pool, num_elements, cuda::no_init);

  // Pick a grid size that covers num_elements with 256 threads per block.
  constexpr int threads_per_block = 256;
  auto config = cuda::distribute<threads_per_block>(num_elements);

  cuda::launch(stream, config, kernel{}, A, B, C);
}
```
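Everything in this example is stream-ordered: the buffers are allocated from the pool on `stream`, `cuda::distribute` derives grid dimensions large enough to cover `num_elements` at 256 threads per block (the kernel still guards against the overshoot with the `tid < A.size()` check), and `cuda::launch` enqueues the kernel on the same stream, so no explicit synchronization is needed between these steps. A program that reads `C` back on the host would synchronize the stream first; the buffers are then deallocated in stream order as they go out of scope.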
| API | Content | CCCL Availability | CUDA Toolkit Availability |
|---|---|---|---|
| `cuda::devices` | A range of all available CUDA devices | CCCL 3.1.0 | CUDA 13.1 |
| `cuda::device_ref` | A non-owning representation of a CUDA device | CCCL 3.1.0 | CUDA 13.1 |
|  | Per-architecture trait accessors | CCCL 3.1.0 | CUDA 13.1 |
| `cuda::stream_ref` | A non-owning wrapper around a `cudaStream_t` | CCCL 2.2.0 | CUDA 12.3 |
| `cuda::stream` | An owning wrapper around a `cudaStream_t` | CCCL 3.1.0 | CUDA 13.1 |
| `cuda::event_ref` | A non-owning wrapper around a `cudaEvent_t` | CCCL 3.1.0 | CUDA 13.1 |
| `cuda::event` | An owning wrapper around a `cudaEvent_t` | CCCL 3.1.0 | CUDA 13.1 |
| `cuda::timed_event` | An owning wrapper around a timing-enabled `cudaEvent_t` | CCCL 3.1.0 | CUDA 13.1 |
| `cuda::copy_bytes` | Byte-wise copy into a `buffer` | CCCL 3.1.0 | CUDA 13.1 |
| `cuda::fill_bytes` | Byte-wise fill into a `buffer` | CCCL 3.1.0 | CUDA 13.1 |
| `cuda::grid`, `cuda::cluster`, `cuda::block`, `cuda::warp`, `cuda::gpu_thread` | Representation of CUDA thread hierarchies (grid, cluster, block, warp, thread) | CCCL 3.2.0 | CUDA 13.2 |
| `cuda::launch` | Kernel launch with configuration and options | CCCL 3.2.0 | CUDA 13.2 |
| `cuda::kernel_config` | Kernel launch configuration combining hierarchy dimensions and launch options | CCCL 3.2.0 | CUDA 13.2 |
| `cuda::make_config` | Factory function to create kernel configurations from hierarchy dimensions and launch options | CCCL 3.2.0 | CUDA 13.2 |
| `cuda::device_memory_pool` | Stream-ordered device memory pool using the CUDA memory pool API | CCCL 3.2.0 | CUDA 13.2 |
| `cuda::managed_memory_pool` | Stream-ordered managed (unified) memory pool | CCCL 3.2.0 | CUDA 13.2 |
| `cuda::pinned_memory_pool` | Stream-ordered pinned (page-locked) host memory pool | CCCL 3.2.0 | CUDA 13.2 |
| `cuda::device_default_memory_pool` | Get the default device memory pool for a device | CCCL 3.2.0 | CUDA 13.2 |
| `cuda::managed_default_memory_pool` | Get the default managed (unified) memory pool | CCCL 3.2.0 | CUDA 13.2 |
| `cuda::pinned_default_memory_pool` | Get the default pinned (page-locked) host memory pool | CCCL 3.2.0 | CUDA 13.2 |
| `cuda::buffer` | Typed data container allocated from memory resources; handles stream-ordered allocation, initialization, and deallocation of memory | CCCL 3.2.0 | CUDA 13.2 |
|  | Synchronous compatibility resources backed by legacy CUDA allocation APIs | CCCL 3.2.0 | CUDA 13.2 |