Runtime#

The Runtime API provides higher-level building blocks for core CUDA functionality. It takes the existing CUDA Runtime API set and removes or replaces problematic patterns such as implicit state. It is designed to make common operations like resource management, work submission, and memory allocation easier to express and safer to compose. These APIs lower to the CUDA Driver API under the hood, but remain composable with the CUDA Runtime API by reusing runtime handle types (such as cudaStream_t) in their interfaces. The result is an interface that applies RAII for lifetime management while remaining composable with existing CUDA C++ code that manages resources explicitly.

At a glance, the runtime layer includes:

  • Streams and events for work submission and synchronization.

  • Buffers as typed, stream-ordered, property-checked memory containers.

  • Memory pools to allocate device, managed, and pinned memory, either directly or through buffers.

  • Launch API to configure and launch kernels.

  • Runtime algorithms like copy_bytes and fill_bytes for basic data movement.

  • Legacy memory resources as synchronous compatibility fallbacks for older toolkits.

See CUDA Runtime interactions if you are interested in CUDA Runtime interop.
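
As a minimal sketch of that interop (assuming the stream wrappers expose their underlying handle via get(), as cuda::stream_ref does), a stream managed by this layer can be handed to existing CUDA Runtime calls:

```cuda
#include <cuda/stream>
#include <cuda_runtime_api.h>

void interop_example(cuda::stream_ref stream, float* dst, const float* src, std::size_t n) {
  // stream_ref wraps a plain cudaStream_t, so the same stream can be passed
  // to existing CUDA Runtime API calls alongside the higher-level APIs.
  cudaMemcpyAsync(dst, src, n * sizeof(float), cudaMemcpyDefault, stream.get());
}
```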

Example: vector add with buffers, pools, and launch#

#include <cuda/devices>
#include <cuda/stream>
#include <cuda/std/span>
#include <cuda/buffer>
#include <cuda/memory_pool>
#include <cuda/launch>

struct kernel {
  template <typename Config>
  __device__ void operator()(Config config,
                             cuda::std::span<const float> A,
                             cuda::std::span<const float> B,
                             cuda::std::span<float> C) {
    // Global index of this thread within the whole grid, derived from the
    // launch configuration passed by cuda::launch.
    auto tid = cuda::gpu_thread.rank(cuda::grid, config);
    if (tid < A.size())
      C[tid] = A[tid] + B[tid];
  }
};

int main() {
  cuda::device_ref device = cuda::devices[0];
  cuda::stream stream{device};
  auto pool = cuda::device_default_memory_pool(device);

  int num_elements = 1000;
  // Stream-ordered allocations: A and B are filled with 1.0 and 2.0;
  // C is left uninitialized because the kernel overwrites every element.
  auto A = cuda::make_buffer<float>(stream, pool, num_elements, 1.0);
  auto B = cuda::make_buffer<float>(stream, pool, num_elements, 2.0);
  auto C = cuda::make_buffer<float>(stream, pool, num_elements, cuda::no_init);

  // Compute a grid configuration that covers num_elements with 256-thread blocks.
  constexpr int threads_per_block = 256;
  auto config = cuda::distribute<threads_per_block>(num_elements);

  cuda::launch(stream, config, kernel{}, A, B, C);
}
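
The example above ends without reading back the result. A hedged continuation of main() is sketched below, assuming copy_bytes accepts a buffer source and a host cuda::std::span destination, and that stream.sync() blocks until previously submitted work completes:

```cuda
  // Hypothetical continuation of main(): copy C back to the host and check it.
  // (Add <vector> and <cassert> to the includes above.)
  std::vector<float> host_C(num_elements);
  cuda::copy_bytes(stream, C, cuda::std::span<float>(host_C));  // stream-ordered copy
  stream.sync();                     // wait for the kernel and the copy to finish
  assert(host_C[0] == 1.0f + 2.0f);  // each element should be A[i] + B[i] = 3.0f
```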

| API | Content | CCCL Availability | CUDA Toolkit Availability |
| --- | --- | --- | --- |
| devices | A range of all available CUDA devices | CCCL 3.1.0 | CUDA 13.1 |
| device_ref | A non-owning representation of a CUDA device | CCCL 3.1.0 | CUDA 13.1 |
| arch_traits | Per-architecture trait accessors | CCCL 3.1.0 | CUDA 13.1 |
| stream_ref | A non-owning wrapper around a cudaStream_t | CCCL 2.2.0 | CUDA 12.3 |
| stream | An owning wrapper around a cudaStream_t | CCCL 3.1.0 | CUDA 13.1 |
| event_ref | A non-owning wrapper around a cudaEvent_t | CCCL 3.1.0 | CUDA 13.1 |
| event | An owning wrapper around a cudaEvent_t (timing disabled) | CCCL 3.1.0 | CUDA 13.1 |
| timed_event | An owning wrapper around a cudaEvent_t with timing enabled and elapsed-time queries | CCCL 3.1.0 | CUDA 13.1 |
| copy_bytes | Stream-ordered byte-wise copy on a cuda::stream_ref for cuda::std::span/cuda::std::mdspan sources and destinations | CCCL 3.1.0 | CUDA 13.1 |
| fill_bytes | Stream-ordered byte-wise fill on a cuda::stream_ref for cuda::std::span/cuda::std::mdspan destinations | CCCL 3.1.0 | CUDA 13.1 |
| hierarchy | Representation of CUDA thread hierarchies (grid, cluster, block, warp, thread) | CCCL 3.2.0 | CUDA 13.2 |
| launch | Kernel launch with configuration and options | CCCL 3.2.0 | CUDA 13.2 |
| kernel_config | Kernel launch configuration combining hierarchy dimensions and launch options | CCCL 3.2.0 | CUDA 13.2 |
| make_config | Factory function to create kernel configurations from hierarchy dimensions and launch options | CCCL 3.2.0 | CUDA 13.2 |
| device_memory_pool | Stream-ordered device memory pool using the CUDA memory pool API | CCCL 3.2.0 | CUDA 13.2 |
| managed_memory_pool | Stream-ordered managed (unified) memory pool | CCCL 3.2.0 | CUDA 13.2 |
| pinned_memory_pool | Stream-ordered pinned (page-locked) host memory pool | CCCL 3.2.0 | CUDA 13.2 |
| device_default_memory_pool | Get the default device memory pool for a device | CCCL 3.2.0 | CUDA 13.2 |
| managed_default_memory_pool | Get the default managed (unified) memory pool | CCCL 3.2.0 | CUDA 13.2 |
| pinned_default_memory_pool | Get the default pinned (page-locked) host memory pool | CCCL 3.2.0 | CUDA 13.2 |
| buffer | Typed data container allocated from memory resources; handles stream-ordered allocation, initialization, and deallocation | CCCL 3.2.0 | CUDA 13.2 |
| legacy resources | Synchronous compatibility resources backed by legacy CUDA allocation APIs | CCCL 3.2.0 | CUDA 13.2 |
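
The event wrappers compose with streams for timing GPU work. A minimal sketch is shown below, assuming timed_event records into the stream on construction and that subtracting two events yields the elapsed time as a chrono duration (the header name and exact operators are assumptions; see the timed_event entry above):

```cuda
#include <cuda/devices>
#include <cuda/stream>
#include <cuda/event>
#include <cstdio>

int main() {
  cuda::stream stream{cuda::devices[0]};

  cuda::timed_event start{stream};   // recorded before the work
  // ... enqueue kernels or copies on `stream` here ...
  cuda::timed_event end{stream};     // recorded after the work

  stream.sync();                     // ensure both events have completed
  auto elapsed = end - start;        // GPU time between the two events
  std::printf("elapsed: %lld ns\n", static_cast<long long>(elapsed.count()));
}
```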