Examples#
This page links to the cuda.core examples shipped in the
cuda-python repository.
Use it as a quick index when you want a runnable starting point for a specific
workflow.
Compilation and kernel launch#
vector_add.py compiles and launches a simple vector-add kernel with CuPy arrays.
saxpy.py JIT-compiles a templated SAXPY kernel and launches both float and double instantiations.
pytorch_example.py launches a CUDA kernel with PyTorch tensors and a wrapped PyTorch stream.
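The compile-and-launch flow these examples share can be sketched as follows. This is a minimal sketch, not the shipped example: the cuda.core calls (Device, Program, LaunchConfig, launch) follow the experimental API, and a CUDA-capable GPU plus CuPy are assumed, so the GPU path is kept behind a main() guard.

```python
import math

# CUDA C++ source for an elementwise add kernel.
source = """
extern "C" __global__
void vector_add(const float* a, const float* b, float* out, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}
"""

def grid_size(n, block=256):
    # Enough blocks of `block` threads to cover n elements.
    return math.ceil(n / block)

def main():
    # Assumes a CUDA GPU, CuPy, and cuda.core's experimental API.
    import cupy as cp
    from cuda.core.experimental import Device, LaunchConfig, Program, launch

    dev = Device()
    dev.set_current()
    stream = dev.create_stream()

    # JIT-compile the kernel source to a cubin and pull out the kernel.
    mod = Program(source, code_type="c++").compile("cubin")
    kernel = mod.get_kernel("vector_add")

    n = 1 << 20
    a = cp.random.random(n, dtype=cp.float32)
    b = cp.random.random(n, dtype=cp.float32)
    out = cp.empty_like(a)

    config = LaunchConfig(grid=grid_size(n), block=256)
    launch(stream, config, kernel, a.data.ptr, b.data.ptr, out.data.ptr,
           cp.uint64(n))
    stream.sync()

if __name__ == "__main__":
    main()
```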
Multi-device and advanced launch configuration#
simple_multi_gpu_example.py compiles and launches kernels across multiple GPUs.
thread_block_cluster.py demonstrates thread block cluster launch configuration on Hopper-class GPUs.
tma_tensor_map.py demonstrates Tensor Memory Accelerator descriptors and TMA-based bulk copies.
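A recurring pattern in the multi-GPU example is partitioning the work across devices and launching on each device's own stream. The partitioning itself is plain Python; the device loop below is a sketch that assumes cuda.core's experimental Device and system APIs and at least one GPU, so it sits behind a main() guard.

```python
def partition(n, num_devices):
    # Split n elements into contiguous, nearly equal per-device ranges.
    base, rem = divmod(n, num_devices)
    ranges, start = [], 0
    for i in range(num_devices):
        end = start + base + (1 if i < rem else 0)
        ranges.append((start, end))
        start = end
    return ranges

def main():
    # Assumes cuda.core's experimental API and at least one CUDA GPU.
    from cuda.core.experimental import Device, system

    n = 1 << 20
    for dev_id, (start, end) in enumerate(partition(n, system.num_devices)):
        dev = Device(dev_id)
        dev.set_current()            # make this GPU current for the thread
        stream = dev.create_stream()
        # ... compile once per device, launch over [start, end) on `stream` ...
        stream.sync()

if __name__ == "__main__":
    main()
```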
Linking and graphs#
jit_lto_fractal.py uses JIT link-time optimization to link user-provided device code into a fractal workflow at runtime.
cuda_graphs.py captures and replays a multi-kernel CUDA graph to reduce launch overhead.
Interoperability and memory access#
memory_ops.py covers memory resources, pinned memory, device transfers, and DLPack interop.
strided_memory_view_cpu.py uses StridedMemoryView with JIT-compiled CPU code via cffi.
strided_memory_view_gpu.py uses StridedMemoryView with JIT-compiled GPU code and foreign GPU buffers.
gl_interop_plasma.py renders a CUDA-generated plasma effect through OpenGL interop without CPU copies.
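The interop examples in this group rest on the DLPack protocol, which hands a buffer from one library to another without copying. The idea can be shown on the CPU with NumPy alone, no GPU required; the device-memory examples exchange GPU buffers through the same protocol.

```python
import numpy as np

# Export an array through the DLPack protocol and re-import it.
# DLPack passes a pointer plus shape/stride metadata, so no data is copied.
a = np.arange(6, dtype=np.float32)
b = np.from_dlpack(a)

# Both names refer to the same underlying buffer.
print(np.shares_memory(a, b))  # True
```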
System inspection#
show_device_properties.py prints a detailed report of the CUDA devices available on the system.