# Runtime
The Runtime API provides higher-level building blocks for core CUDA functionality. It builds on the existing CUDA Runtime API, removing or replacing problematic patterns such as implicit state. It is designed to make common operations like resource management, work submission, and memory allocation easier to express and safer to compose. These APIs lower to the CUDA Driver API under the hood, but reuse CUDA Runtime handle types (such as `cudaStream_t`) in their interfaces. The result is an interface that applies RAII for lifetime management while remaining composable with existing CUDA C++ code that manages resources explicitly.
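For instance, an owning stream can be created with RAII and its raw handle handed to code that still calls the CUDA Runtime directly. The following is a minimal sketch, assuming the owning `cuda::stream` exposes its underlying `cudaStream_t` via `get()`, mirroring `cuda::stream_ref::get()`:

```cpp
#include <cuda/devices>
#include <cuda/stream>
#include <cuda_runtime_api.h>

// Existing CUDA C++ code that works with raw handles.
void legacy_work(float* dev_ptr, size_t bytes, cudaStream_t s) {
  cudaMemsetAsync(dev_ptr, 0, bytes, s); // any CUDA Runtime call can be used here
}

int main() {
  cuda::stream stream{cuda::devices[0]}; // RAII: created here, destroyed at scope exit

  float* dev_ptr = nullptr;
  cudaMallocAsync(reinterpret_cast<void**>(&dev_ptr), 1024 * sizeof(float), stream.get());

  // Assumption: get() exposes the underlying cudaStream_t, as cuda::stream_ref::get() does.
  legacy_work(dev_ptr, 1024 * sizeof(float), stream.get());

  cudaFreeAsync(dev_ptr, stream.get());
  cudaStreamSynchronize(stream.get()); // drain the stream before the owning wrapper is destroyed
}
```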
At a glance, the runtime layer includes:
- Streams and events for work submission and synchronization.
- Buffers as typed, stream-ordered, property-checked memory containers.
- Memory pools to allocate device, managed, and pinned memory, either directly or through buffers.
- Launch API to configure and launch kernels.
- Runtime algorithms like `copy_bytes` and `fill_bytes` for basic data movement (a short sketch follows below).
- Legacy memory resources as synchronous compatibility fallbacks for older toolkits.
See CUDA Runtime interactions if you are interested in CUDA Runtime interop.
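As a quick taste of the runtime algorithms, the sketch below zeroes a device buffer with `fill_bytes` and copies it back to host memory with `copy_bytes`. Treat it as a sketch: the `<cuda/algorithm>` header name and the exact argument orders noted in the comments are assumptions; consult the reference table below for the authoritative signatures.

```cpp
#include <cuda/algorithm>   // assumed header for copy_bytes / fill_bytes
#include <cuda/buffer>
#include <cuda/devices>
#include <cuda/memory_pool>
#include <cuda/stream>
#include <cuda_runtime_api.h>
#include <vector>

int main() {
  cuda::stream stream{cuda::devices[0]};
  auto pool = cuda::device_default_memory_pool(cuda::devices[0]);
  auto buf  = cuda::make_buffer<int>(stream, pool, 256, cuda::no_init);

  cuda::fill_bytes(stream, buf, 0);    // assumed order: stream, destination, byte value

  std::vector<int> host(256);
  cuda::copy_bytes(stream, buf, host); // assumed order: stream, source, destination

  cudaStreamSynchronize(stream.get()); // host data is valid only after the stream syncs
}
```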
## Example: vector add with buffers, pools, and launch
```cpp
#include <cuda/devices>
#include <cuda/stream>
#include <cuda/std/span>
#include <cuda/buffer>
#include <cuda/memory_pool>
#include <cuda/launch>

struct kernel {
  template <typename Config>
  __device__ void operator()(Config config,
                             cuda::std::span<const float> A,
                             cuda::std::span<const float> B,
                             cuda::std::span<float> C) {
    // Rank of this thread within the whole grid for the given configuration.
    auto tid = cuda::gpu_thread.rank(cuda::grid, config);
    if (tid < A.size()) {
      C[tid] = A[tid] + B[tid];
    }
  }
};

int main() {
  cuda::device_ref device = cuda::devices[0];
  cuda::stream stream{device};
  auto pool = cuda::device_default_memory_pool(device);

  int num_elements = 1000;
  // Stream-ordered allocations from the default device pool; A and B are
  // filled with 1.0f and 2.0f, C is left uninitialized.
  auto A = cuda::make_buffer<float>(stream, pool, num_elements, 1.0f);
  auto B = cuda::make_buffer<float>(stream, pool, num_elements, 2.0f);
  auto C = cuda::make_buffer<float>(stream, pool, num_elements, cuda::no_init);

  // Pick a grid size that covers num_elements with 256 threads per block.
  constexpr int threads_per_block = 256;
  auto config = cuda::distribute<threads_per_block>(num_elements);

  cuda::launch(stream, config, kernel{}, A, B, C);
}
```
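Everything in this example is stream-ordered: the buffers are allocated from the pool on `stream`, `cuda::distribute` derives grid dimensions large enough to cover `num_elements` at 256 threads per block (the kernel still guards against the overshoot with the `tid < A.size()` check), and `cuda::launch` enqueues the kernel on the same stream, so no explicit synchronization is needed between these steps. A program that reads `C` back on the host would synchronize the stream first; the buffers are then deallocated in stream order as they go out of scope.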
| API | Content | CCCL Availability | CUDA Toolkit Availability |
|---|---|---|---|
| `cuda::devices` | A range of all available CUDA devices | CCCL 3.1.0 | CUDA 13.1 |
| `cuda::device_ref` | A non-owning representation of a CUDA device | CCCL 3.1.0 | CUDA 13.1 |
|  | Per-architecture trait accessors | CCCL 3.1.0 | CUDA 13.1 |
| `cuda::stream_ref` | A non-owning wrapper around a `cudaStream_t` | CCCL 2.2.0 | CUDA 12.3 |
| `cuda::stream` | An owning wrapper around a `cudaStream_t` | CCCL 3.1.0 | CUDA 13.1 |
| `cuda::event_ref` | A non-owning wrapper around a `cudaEvent_t` | CCCL 3.1.0 | CUDA 13.1 |
| `cuda::event` | An owning wrapper around a `cudaEvent_t` | CCCL 3.1.0 | CUDA 13.1 |
| `cuda::timed_event` | An owning wrapper around a timing-enabled `cudaEvent_t` | CCCL 3.1.0 | CUDA 13.1 |
| `cuda::copy_bytes` | Byte-wise copy into a `buffer` | CCCL 3.1.0 | CUDA 13.1 |
| `cuda::fill_bytes` | Byte-wise fill into a `buffer` | CCCL 3.1.0 | CUDA 13.1 |
| `cuda::grid`, `cuda::cluster`, `cuda::block`, `cuda::warp`, `cuda::gpu_thread` | Representation of CUDA thread hierarchies (grid, cluster, block, warp, thread) | CCCL 3.2.0 | CUDA 13.2 |
| `cuda::launch` | Kernel launch with configuration and options | CCCL 3.2.0 | CUDA 13.2 |
| `cuda::kernel_config` | Kernel launch configuration combining hierarchy dimensions and launch options | CCCL 3.2.0 | CUDA 13.2 |
| `cuda::make_config` | Factory function to create kernel configurations from hierarchy dimensions and launch options | CCCL 3.2.0 | CUDA 13.2 |
| `cuda::device_memory_pool` | Stream-ordered device memory pool using the CUDA memory pool API | CCCL 3.2.0 | CUDA 13.2 |
| `cuda::managed_memory_pool` | Stream-ordered managed (unified) memory pool | CCCL 3.2.0 | CUDA 13.2 |
| `cuda::pinned_memory_pool` | Stream-ordered pinned (page-locked) host memory pool | CCCL 3.2.0 | CUDA 13.2 |
| `cuda::device_default_memory_pool` | Get the default device memory pool for a device | CCCL 3.2.0 | CUDA 13.2 |
| `cuda::managed_default_memory_pool` | Get the default managed (unified) memory pool | CCCL 3.2.0 | CUDA 13.2 |
| `cuda::pinned_default_memory_pool` | Get the default pinned (page-locked) host memory pool | CCCL 3.2.0 | CUDA 13.2 |
| `cuda::buffer` | Typed data container allocated from memory resources; handles stream-ordered allocation, initialization, and deallocation of memory | CCCL 3.2.0 | CUDA 13.2 |
|  | Synchronous compatibility resources backed by legacy CUDA allocation APIs | CCCL 3.2.0 | CUDA 13.2 |