10 minutes to `cuda.core`#

Why `cuda.core`?#

cuda.core gives you a Pythonic interface to the CUDA runtime: you can compile a CUDA C++ kernel at runtime, launch it, move memory, time it, and capture it into a CUDA graph, all without writing a single raw CUDA driver or runtime call. It is part of CUDA Python, providing high-level, Pythonic access to the CUDA runtime on top of the low-level cuda.bindings, so you get clean, idiomatic Python:

Write GPU kernels from Python. Compile CUDA C++ to a runnable kernel in a few lines, with no separate build step, no Makefile, and no nvcc invocation.
Stay in the Python GPU ecosystem. It shares CUDA context, streams, and memory with CuPy and PyTorch, so you can launch your own kernels directly on their arrays, with zero copies.
Safe by design. Resources are real Python objects that you release with close(), and errors raise Python exceptions instead of return codes you have to check.

If you have ever wanted to drop a custom CUDA kernel into a NumPy/CuPy/PyTorch workflow without leaving Python, this is the fastest way to do it.

Before you begin#

This is a short, hands-on introduction to cuda.core, geared mainly toward new users. It walks from “talk to a GPU” to “compile, launch, time, and capture a kernel” using small, runnable snippets. For the full reference, see the API reference; for more complete programs, see the examples page.

Note

These snippets target CUDA 13 and need a working CUDA 13 installation plus cuda.core and a matching cuda.bindings (13.x). The quickest way to get both from PyPI is:

$ pip install cuda-core[cu13]

The install page covers driver requirements, conda, CUDA 12, and installing from source.

Customarily, we import as follows:

from cuda.core import Device, LaunchConfig, Program, ProgramOptions, launch

Selecting a device#

Device is your entry point. Creating one does not initialize the GPU; calling Device.set_current() is what sets up a CUDA context on the current host thread. Always call it before doing GPU work.

dev = Device()        # device 0 by default; Device(1) selects another GPU
dev.set_current()     # initialize CUDA and make this device current

print(dev.name)                 # e.g. 'NVIDIA GB10'
print(dev.compute_capability)   # ComputeCapability(major=12, minor=1)
print(dev.arch)                 # '121'  (handy for building kernels below)

Device objects are thread-local singletons: Device(0) always hands you back the same object for device 0 on that thread, so libraries sharing your process see and use the same GPU. Rich device attributes live under Device.properties (for example dev.properties.multiprocessor_count).

Compiling a kernel#

CUDA C++ source is compiled at runtime with Program. You describe the compile with ProgramOptions (here: the C++ standard and the target architecture, taken from dev.arch), call Program.compile() to get an ObjectCode, then pull out a callable Kernel by name.

code = r"""
extern "C" __global__
void scale(float* data, float factor, size_t n) {
    size_t i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        data[i] *= factor;
}
"""

opts = ProgramOptions(std="c++17", arch=f"sm_{dev.arch}")
prog = Program(code, code_type="c++", options=opts)
mod = prog.compile("cubin")          # "cubin" | "ptx" | "ltoir"
scale = mod.get_kernel("scale")

We wrap the kernel in extern "C" so its name is exactly "scale", which is the name we pass to get_kernel().

When something goes wrong, cuda.core raises a regular Python exception rather than returning an error code. A compilation failure is especially friendly: the exception carries the compiler’s log, so you can see exactly what went wrong (other errors raise an exception with the CUDA error name and description).

bad = r'extern "C" __global__ void k() { not_a_real_symbol; }'
try:
    Program(bad, code_type="c++", options=opts).compile("cubin")
except Exception as e:
    print(e)        # includes: error: identifier "not_a_real_symbol" is undefined

Streams#

A Stream is an ordered queue of GPU work. Operations on the same stream run in order; separate streams may overlap. Most cuda.core operations that touch the GPU take a stream so you stay in control of ordering. Create one from the device:

stream = dev.create_stream()

We create it first because the steps that follow, allocating, copying, and launching, are all issued on a stream. We will pass this stream to each of them in the next two sections.

Allocating memory#

Memory comes from a MemoryResource. Two you’ll use early:

Device.allocate() hands you a device Buffer from the device’s default memory resource.
PinnedMemoryResource allocates host memory that is page-locked (and host-accessible), which is convenient for staging data to and from the GPU.

Like Device.allocate(), PinnedMemoryResource is stream-ordered, so its allocate() also takes a stream.

From here on we use NumPy for host data and to check results, but only as a convenience: cuda.core works directly on raw memory buffers and typed pointers, and does not depend on NumPy or any other array library.

import numpy as np

from cuda.core import PinnedMemoryResource

n = 1024
nbytes = n * np.dtype(np.float32).itemsize

pinned = PinnedMemoryResource()
host = pinned.allocate(nbytes, stream=stream)   # host-accessible buffer
dbuf = dev.allocate(nbytes, stream=stream)      # device buffer

print(host.size, host.is_host_accessible)    # 4096 True
print(dbuf.is_device_accessible)             # True

A Buffer is an owning handle to an allocation. Because every Buffer implements __dlpack__, we can get a NumPy view of the host buffer with numpy.from_dlpack() (no copy, no raw pointers) and fill it in place. (from_dlpack hands back raw bytes, so we .view it as float32.)

host_np = np.from_dlpack(host).view(np.float32)   # writable view, zero-copy
host_np[:] = np.arange(n, dtype=np.float32)

The same trick works on the device side with CuPy: cp.from_dlpack(dbuf) gives a CuPy array backed by the device buffer.

Copying and launching#

Copy host → device with Buffer.copy_from(), run the kernel with launch(), then copy device → host with Buffer.copy_to(). These are all stream-ordered and asynchronous; Stream.sync() blocks until the queued work finishes.

LaunchConfig describes the grid and block; launch() takes the stream, the config, the kernel, and then the kernel arguments. A Buffer can be passed straight through as a pointer argument. Scalars, however, must carry an explicit C type, so we use NumPy scalars (np.float32, np.uint64) to match the kernel signature.

dbuf.copy_from(host, stream=stream)          # H2D

block = 256
grid = (n + block - 1) // block
config = LaunchConfig(grid=grid, block=block)
launch(stream, config, scale, dbuf, np.float32(3.0), np.uint64(n))

dbuf.copy_to(host, stream=stream)            # D2H
stream.sync()                                # wait for all of the above

assert np.array_equal(host_np, np.arange(n, dtype=np.float32) * 3.0)

That is the core cuda.core workflow: select a device, compile, allocate, copy, launch, sync. Everything below builds on it.

Timing with events#

An Event marks a point in a stream. Create timing-enabled events, record them around some work, synchronize, and subtract to get elapsed GPU time in milliseconds.

start = dev.create_event({"timing_enabled": True})
end = dev.create_event({"timing_enabled": True})

stream.record(start)
for _ in range(100):
    launch(stream, config, scale, dbuf, np.float32(1.0), np.uint64(n))
stream.record(end)
end.sync()

print(f"100 launches took {end - start:.4f} ms")

Events are also how you build cross-stream dependencies without stalling the host: Stream.wait() makes one stream wait on an event (or another stream).

Working with multiple streams#

A single stream is enough for ordered work, but multiple streams let independent work proceed in parallel. The two launches below touch different buffers, so the GPU is free to run them at the same time. When a later step does depend on another stream’s result, Stream.wait() joins them: it makes one stream wait for another’s work to finish before continuing.

stream_a = dev.create_stream()
stream_b = dev.create_stream()

buf_a = dev.allocate(nbytes, stream=stream_a)
buf_b = dev.allocate(nbytes, stream=stream_b)

host_np[:] = np.arange(n, dtype=np.float32)   # known input
buf_a.copy_from(host, stream=stream_a)
buf_b.copy_from(host, stream=stream_b)

# Independent work on each stream: free to run concurrently.
launch(stream_a, config, scale, buf_a, np.float32(2.0), np.uint64(n))
launch(stream_b, config, scale, buf_b, np.float32(5.0), np.uint64(n))

# stream_b now needs stream_a's result, so it waits before reading it.
stream_b.wait(stream_a)
buf_b.copy_from(buf_a, stream=stream_b)   # safe: buf_a is ready

buf_b.copy_to(host, stream=stream_b)
stream_b.sync()
assert np.array_equal(host_np, np.arange(n, dtype=np.float32) * 2.0)

Without the Stream.wait(), the copy on stream_b could race ahead of the kernel on stream_a. Whether the two independent launches actually overlap is up to the GPU scheduler, which can run them together only when each leaves the device underutilized; using separate streams expresses that the work is independent and lets the runtime overlap it when it can. The examples show this scaled up across multiple GPUs.

Capturing work in a CUDA graph#

When you launch the same sequence of operations repeatedly, per-launch CPU overhead adds up. A CUDA graph lets you record that sequence once and replay it with a single launch. Use the stream’s graph builder: begin building, issue the launches into the builder instead of the stream, then complete the graph.

gb = stream.create_graph_builder()
gb.begin_building()

launch(gb, config, scale, dbuf, np.float32(1.0), np.uint64(n))
launch(gb, config, scale, dbuf, np.float32(1.0), np.uint64(n))

graph = gb.end_building().complete()

graph.upload(stream)
graph.launch(stream)     # replay the whole sequence in one shot
stream.sync()

See cuda_graphs.py for a complete capture-and-replay example with a measured speedup.

Working with CuPy and PyTorch#

cuda.core is designed to interoperate with the rest of the Python GPU ecosystem, so in real workflows you often skip manual host buffers entirely and operate directly on CuPy or PyTorch arrays.

Current device/context. Because Device.set_current() sets a normal CUDA context (the standard primary context), other CUDA-runtime libraries pick it up automatically: if CuPy or PyTorch has already selected a device, Device() shares it, and vice versa.

Passing array data to a kernel. Both CuPy and PyTorch expose their device pointer, which you pass to launch() like any other buffer pointer. We reuse the scale kernel and config from above, applying a factor of 2.0 to an array of ones:

# CuPy exposes its device pointer as .data.ptr
import cupy as cp

a = cp.ones(n, dtype=cp.float32)
dev.sync()  # CuPy fills "a" on its own stream; sync before our stream reads it
launch(stream, config, scale, a.data.ptr, np.float32(2.0), np.uint64(a.size))
stream.sync()
assert bool((a == 2).all())

# PyTorch: wrap torch's stream via the __cuda_stream__ protocol
import torch

class PyTorchStreamWrapper:
    def __init__(self, pt_stream):
        self.pt_stream = pt_stream
    def __cuda_stream__(self):
        return (0, self.pt_stream.cuda_stream)

t = torch.ones(n, dtype=torch.float32, device="cuda")
ts = dev.create_stream(PyTorchStreamWrapper(torch.cuda.current_stream()))
launch(ts, config, scale, t.data_ptr(), np.float32(2.0), np.uint64(t.numel()))
ts.sync()
assert torch.allclose(t, torch.full_like(t, 2.0))

The __cuda_stream__ protocol is how any object advertises that it represents a CUDA stream; Device.create_stream() accepts such objects so you can drive cuda.core work on another library’s stream. See the interoperability guide for details.

Array-library-agnostic views. To accept any CuPy/PyTorch/NumPy-like array that supports DLPack or the CUDA Array Interface, decorate a function with args_viewable_as_strided_memory(). The chosen arguments become StridedMemoryView objects exposing ptr, shape, dtype, and is_device_accessible:

from cuda.core.utils import StridedMemoryView, args_viewable_as_strided_memory

@args_viewable_as_strided_memory((0,))
def scale_array(arr, work_stream, kern, factor):
    view = arr.view(work_stream.handle)
    assert isinstance(view, StridedMemoryView)
    assert view.is_device_accessible
    size = view.shape[0]
    cfg = LaunchConfig(grid=(size + 255) // 256, block=256)
    launch(work_stream, cfg, kern, view.ptr, np.float32(factor), np.uint64(size))
    work_stream.sync()

# Works on a CuPy array, a PyTorch tensor, or anything else with DLPack/CAI:
buf = cp.ones(n, dtype=cp.float32)
dev.sync()
scale_array(buf, stream, scale, 2.0)
assert bool((buf == 2).all())

cuda.core buffers also implement __dlpack__, so a device Buffer can be handed to any DLPack importer for zero-copy exchange.

Cleaning up#

Buffers, streams, events, graphs, and graph builders hold CUDA resources. They are released when garbage-collected, but you can release them explicitly with close(), which the cuda.core examples do in a finally block. Buffers are also context managers.

graph.close()
gb.close()
dbuf.close(stream=stream)
host.close(stream=stream)
stream.close()

# or scope a buffer with a with-statement:
with dev.allocate(nbytes, stream=stream) as tmp:
    ...  # tmp is freed at the end of the block

Where to go next#

You now know the essential cuda.core workflow and how to plug it into the Python GPU ecosystem. From here:

Examples: runnable programs for templated kernels (name_expressions), multi-GPU, graphs, JIT link-time optimization, TMA, and interop.
Interoperability: the __cuda_stream__ protocol, DLPack/CAI, and StridedMemoryView in depth.
API reference: every public class and function, including Linker (runtime linking/LTO), the memory-resource family (DeviceMemoryResource, ManagedMemoryResource, VirtualMemoryResource, …), and the graph node types.
Environment variables: runtime knobs such as the per-thread default stream.
Prefer writing kernels in Python instead of CUDA C++? Numba CUDA compiles a subset of Python into CUDA kernels, and numba-cuda-mlir is its next-generation evolution.