10 minutes to cuda.core#
Why cuda.core?#
cuda.core gives you a Pythonic interface to the CUDA runtime: you can
compile a CUDA C++ kernel at runtime, launch it, move memory, time it, and
capture it into a CUDA graph, all without writing a single raw CUDA driver or
runtime call. It is part of CUDA Python, providing high-level, Pythonic access
to the CUDA runtime on top of the low-level
cuda.bindings, so you get clean, idiomatic Python:
Write GPU kernels from Python. Compile CUDA C++ to a runnable kernel in a few lines, with no separate build step, no
Makefile, and nonvccinvocation.Stay in the Python GPU ecosystem. It shares CUDA context, streams, and memory with CuPy and PyTorch, so you can launch your own kernels directly on their arrays, with zero copies.
Safe by design. Resources are real Python objects that you release with
close(), and errors raise Python exceptions instead of return codes you have to check.
If you have ever wanted to drop a custom CUDA kernel into a NumPy/CuPy/PyTorch workflow without leaving Python, this is the fastest way to do it.
Before you begin#
This is a short, hands-on introduction to cuda.core, geared mainly toward
new users. It walks from “talk to a GPU” to “compile, launch, time, and capture
a kernel” using small, runnable snippets. For the full reference, see the
API reference; for more complete programs, see the
examples page.
Note
These snippets target CUDA 13 and need a working CUDA 13 installation plus
cuda.core and a matching cuda.bindings (13.x). The quickest way to get
both from PyPI is:
$ pip install cuda-core[cu13]
The install page covers driver requirements, conda, CUDA 12, and installing from source.
Customarily, we import as follows:
from cuda.core import Device, LaunchConfig, Program, ProgramOptions, launch
Selecting a device#
Device is your entry point. Creating one does not initialize the
GPU; calling Device.set_current() is what sets up a CUDA context
on the current host thread. Always call it before doing GPU work.
dev = Device() # device 0 by default; Device(1) selects another GPU
dev.set_current() # initialize CUDA and make this device current
print(dev.name) # e.g. 'NVIDIA GB10'
print(dev.compute_capability) # ComputeCapability(major=12, minor=1)
print(dev.arch) # '121' (handy for building kernels below)
Device objects are thread-local singletons: Device(0) always hands
you back the same object for device 0 on that thread, so libraries sharing your
process see and use the same GPU. Rich device attributes live under
Device.properties (for example
dev.properties.multiprocessor_count).
Compiling a kernel#
CUDA C++ source is compiled at runtime with Program. You describe the
compile with ProgramOptions (here: the C++ standard and the target
architecture, taken from dev.arch), call Program.compile() to get an
ObjectCode, then pull out a callable Kernel by name.
code = r"""
extern "C" __global__
void scale(float* data, float factor, size_t n) {
size_t i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < n)
data[i] *= factor;
}
"""
opts = ProgramOptions(std="c++17", arch=f"sm_{dev.arch}")
prog = Program(code, code_type="c++", options=opts)
mod = prog.compile("cubin") # "cubin" | "ptx" | "ltoir"
scale = mod.get_kernel("scale")
We wrap the kernel in extern "C" so its name is exactly "scale", which is
the name we pass to get_kernel().
When something goes wrong, cuda.core raises a regular Python exception rather
than returning an error code. A compilation failure is especially friendly: the
exception carries the compiler’s log, so you can see exactly what went wrong
(other errors raise an exception with the CUDA error name and description).
bad = r'extern "C" __global__ void k() { not_a_real_symbol; }'
try:
Program(bad, code_type="c++", options=opts).compile("cubin")
except Exception as e:
print(e) # includes: error: identifier "not_a_real_symbol" is undefined
Streams#
A Stream is an ordered queue of GPU work. Operations on the same stream
run in order; separate streams may overlap. Most cuda.core operations that
touch the GPU take a stream so you stay in control of ordering. Create one from
the device:
stream = dev.create_stream()
We create it first because the steps that follow, allocating, copying, and
launching, are all issued on a stream. We will pass this stream to each of
them in the next two sections.
Allocating memory#
Memory comes from a MemoryResource. Two you’ll use early:
Device.allocate()hands you a deviceBufferfrom the device’s default memory resource.PinnedMemoryResourceallocates host memory that is page-locked (and host-accessible), which is convenient for staging data to and from the GPU.
Like Device.allocate(), PinnedMemoryResource is stream-ordered, so
its allocate() also takes a stream.
From here on we use NumPy for host data and to check
results, but only as a convenience: cuda.core works directly on raw memory
buffers and typed pointers, and does not depend on NumPy or any other array
library.
import numpy as np
from cuda.core import PinnedMemoryResource
n = 1024
nbytes = n * np.dtype(np.float32).itemsize
pinned = PinnedMemoryResource()
host = pinned.allocate(nbytes, stream=stream) # host-accessible buffer
dbuf = dev.allocate(nbytes, stream=stream) # device buffer
print(host.size, host.is_host_accessible) # 4096 True
print(dbuf.is_device_accessible) # True
A Buffer is an owning handle to an allocation. Because every
Buffer implements __dlpack__, we can get a NumPy view of the
host buffer with numpy.from_dlpack() (no copy, no raw pointers) and
fill it in place. (from_dlpack hands back raw bytes, so we .view it as
float32.)
host_np = np.from_dlpack(host).view(np.float32) # writable view, zero-copy
host_np[:] = np.arange(n, dtype=np.float32)
The same trick works on the device side with CuPy: cp.from_dlpack(dbuf)
gives a CuPy array backed by the device buffer.
Copying and launching#
Copy host → device with Buffer.copy_from(), run the kernel with
launch(), then copy device → host with Buffer.copy_to(). These are
all stream-ordered and asynchronous; Stream.sync() blocks until the queued
work finishes.
LaunchConfig describes the grid and block; launch() takes the
stream, the config, the kernel, and then the kernel arguments. A Buffer
can be passed straight through as a pointer argument. Scalars, however, must
carry an explicit C type, so we use NumPy scalars (np.float32,
np.uint64) to match the kernel signature.
dbuf.copy_from(host, stream=stream) # H2D
block = 256
grid = (n + block - 1) // block
config = LaunchConfig(grid=grid, block=block)
launch(stream, config, scale, dbuf, np.float32(3.0), np.uint64(n))
dbuf.copy_to(host, stream=stream) # D2H
stream.sync() # wait for all of the above
assert np.array_equal(host_np, np.arange(n, dtype=np.float32) * 3.0)
That is the core cuda.core workflow: select a device, compile, allocate,
copy, launch, sync. Everything below builds on it.
Timing with events#
An Event marks a point in a stream. Create timing-enabled events,
record them around some work, synchronize, and subtract
to get elapsed GPU time in milliseconds.
start = dev.create_event({"timing_enabled": True})
end = dev.create_event({"timing_enabled": True})
stream.record(start)
for _ in range(100):
launch(stream, config, scale, dbuf, np.float32(1.0), np.uint64(n))
stream.record(end)
end.sync()
print(f"100 launches took {end - start:.4f} ms")
Events are also how you build cross-stream dependencies without stalling the
host: Stream.wait() makes one stream wait on an event (or another stream).
Working with multiple streams#
A single stream is enough for ordered work, but multiple streams let independent
work proceed in parallel. The two launches below touch different buffers, so the
GPU is free to run them at the same time. When a later step does depend on
another stream’s result, Stream.wait() joins them: it makes one stream wait
for another’s work to finish before continuing.
stream_a = dev.create_stream()
stream_b = dev.create_stream()
buf_a = dev.allocate(nbytes, stream=stream_a)
buf_b = dev.allocate(nbytes, stream=stream_b)
host_np[:] = np.arange(n, dtype=np.float32) # known input
buf_a.copy_from(host, stream=stream_a)
buf_b.copy_from(host, stream=stream_b)
# Independent work on each stream: free to run concurrently.
launch(stream_a, config, scale, buf_a, np.float32(2.0), np.uint64(n))
launch(stream_b, config, scale, buf_b, np.float32(5.0), np.uint64(n))
# stream_b now needs stream_a's result, so it waits before reading it.
stream_b.wait(stream_a)
buf_b.copy_from(buf_a, stream=stream_b) # safe: buf_a is ready
buf_b.copy_to(host, stream=stream_b)
stream_b.sync()
assert np.array_equal(host_np, np.arange(n, dtype=np.float32) * 2.0)
Without the Stream.wait(), the copy on stream_b could race ahead of the
kernel on stream_a. Whether the two independent launches actually overlap is
up to the GPU scheduler, which can run them together only when each leaves the
device underutilized; using separate streams expresses that the work is
independent and lets the runtime overlap it when it can. The
examples show this scaled up across multiple GPUs.
Capturing work in a CUDA graph#
When you launch the same sequence of operations repeatedly, per-launch CPU overhead adds up. A CUDA graph lets you record that sequence once and replay it with a single launch. Use the stream’s graph builder: begin building, issue the launches into the builder instead of the stream, then complete the graph.
gb = stream.create_graph_builder()
gb.begin_building()
launch(gb, config, scale, dbuf, np.float32(1.0), np.uint64(n))
launch(gb, config, scale, dbuf, np.float32(1.0), np.uint64(n))
graph = gb.end_building().complete()
graph.upload(stream)
graph.launch(stream) # replay the whole sequence in one shot
stream.sync()
See cuda_graphs.py for a complete capture-and-replay example with a measured speedup.
Working with CuPy and PyTorch#
cuda.core is designed to interoperate with the rest of the Python GPU
ecosystem, so in real workflows you often skip manual host buffers entirely and
operate directly on CuPy or PyTorch arrays.
Current device/context. Because Device.set_current() sets a normal
CUDA context (the standard primary context), other CUDA-runtime libraries pick
it up automatically: if CuPy or PyTorch has already selected a device,
Device() shares it, and vice versa.
Passing array data to a kernel. Both CuPy and PyTorch expose their device
pointer, which you pass to launch() like any other buffer pointer. We reuse
the scale kernel and config from above, applying a factor of 2.0 to an
array of ones:
# CuPy exposes its device pointer as .data.ptr
import cupy as cp
a = cp.ones(n, dtype=cp.float32)
dev.sync() # CuPy fills "a" on its own stream; sync before our stream reads it
launch(stream, config, scale, a.data.ptr, np.float32(2.0), np.uint64(a.size))
stream.sync()
assert bool((a == 2).all())
# PyTorch: wrap torch's stream via the __cuda_stream__ protocol
import torch
class PyTorchStreamWrapper:
def __init__(self, pt_stream):
self.pt_stream = pt_stream
def __cuda_stream__(self):
return (0, self.pt_stream.cuda_stream)
t = torch.ones(n, dtype=torch.float32, device="cuda")
ts = dev.create_stream(PyTorchStreamWrapper(torch.cuda.current_stream()))
launch(ts, config, scale, t.data_ptr(), np.float32(2.0), np.uint64(t.numel()))
ts.sync()
assert torch.allclose(t, torch.full_like(t, 2.0))
The __cuda_stream__ protocol is how any object advertises that it represents
a CUDA stream; Device.create_stream() accepts such objects so you can drive
cuda.core work on another library’s stream. See the
interoperability guide for details.
Array-library-agnostic views. To accept any CuPy/PyTorch/NumPy-like array
that supports DLPack or the CUDA Array Interface, decorate a function with
args_viewable_as_strided_memory(). The chosen arguments
become StridedMemoryView objects exposing ptr,
shape, dtype, and is_device_accessible:
from cuda.core.utils import StridedMemoryView, args_viewable_as_strided_memory
@args_viewable_as_strided_memory((0,))
def scale_array(arr, work_stream, kern, factor):
view = arr.view(work_stream.handle)
assert isinstance(view, StridedMemoryView)
assert view.is_device_accessible
size = view.shape[0]
cfg = LaunchConfig(grid=(size + 255) // 256, block=256)
launch(work_stream, cfg, kern, view.ptr, np.float32(factor), np.uint64(size))
work_stream.sync()
# Works on a CuPy array, a PyTorch tensor, or anything else with DLPack/CAI:
buf = cp.ones(n, dtype=cp.float32)
dev.sync()
scale_array(buf, stream, scale, 2.0)
assert bool((buf == 2).all())
cuda.core buffers also implement __dlpack__, so a device Buffer
can be handed to any DLPack importer for zero-copy exchange.
Cleaning up#
Buffers, streams, events, graphs, and graph builders hold CUDA resources. They
are released when garbage-collected, but you can release them explicitly with
close(), which the cuda.core examples do in a
finally block. Buffers are also context managers.
graph.close()
gb.close()
dbuf.close(stream=stream)
host.close(stream=stream)
stream.close()
# or scope a buffer with a with-statement:
with dev.allocate(nbytes, stream=stream) as tmp:
... # tmp is freed at the end of the block
Where to go next#
You now know the essential cuda.core workflow and how to plug it into the
Python GPU ecosystem. From here:
Examples: runnable programs for templated kernels (
name_expressions), multi-GPU, graphs, JIT link-time optimization, TMA, and interop.Interoperability: the
__cuda_stream__protocol, DLPack/CAI, andStridedMemoryViewin depth.API reference: every public class and function, including
Linker(runtime linking/LTO), the memory-resource family (DeviceMemoryResource,ManagedMemoryResource,VirtualMemoryResource, …), and thegraphnode types.Environment variables: runtime knobs such as the per-thread default stream.
Prefer writing kernels in Python instead of CUDA C++? Numba CUDA compiles a subset of Python into CUDA kernels, and numba-cuda-mlir is its next-generation evolution.