CPU/GPU cross-device memory access#
Warp arrays are associated with an allocation Device such as
"cpu" or "cuda:0", and kernels run on a launch device. The portable
default is to launch kernels on the same device as their array arguments. On
systems with hardware-supported CPU/GPU memory access, some cross-device
patterns can also be valid: a GPU may be able to read or write unpinned CPU
memory directly, and some systems can let CPU code directly access GPU-resident
CUDA managed memory.
This page describes how Warp exposes those hardware capabilities and how to use them when writing mixed CPU/GPU code.
Launching with arrays on the same device#
The launch device determines where a kernel runs, and the array device describes where the array allocation lives:
cpu_array = wp.zeros(1024, dtype=float, device="cpu")
gpu_array = wp.zeros(1024, dtype=float, device="cuda:0")
wp.launch(kernel, dim=cpu_array.size, inputs=[cpu_array], device="cpu")
wp.launch(kernel, dim=gpu_array.size, inputs=[gpu_array], device="cuda:0")
The same-device pattern works on all supported systems. Passing an array from one device to a kernel running on another device depends on the capabilities of the device that performs the access.
Device capability properties#
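Each device exposes three CPU/GPU memory access properties: is_cpu_memory_access_from_gpu_supported, is_gpu_memory_access_from_cpu_supported, and is_cpu_gpu_atomic_supported.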
This deep dive focuses on how those capabilities affect cross-device launches, managed memory, atomics, and diagnostics.
On CPU devices, these properties are always False. On GPU devices, each
property describes a specific access path or operation; support for one does not
imply support for another. For example, a system can allow GPU access to CPU
memory without allowing CPU access to GPU-resident managed memory.
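As a rough sketch, the three flags can be gathered into a plain dictionary for logging or branching. The helper name `memory_access_flags` is hypothetical and not part of Warp. It reads the property names used on this page with `getattr` and treats a missing attribute as `False`, which is an assumption for builds that do not expose these properties. The example runs against a stand-in object rather than a real device.

```python
from types import SimpleNamespace

# Property names as they appear on this page.
CAPABILITY_NAMES = (
    "is_cpu_memory_access_from_gpu_supported",
    "is_gpu_memory_access_from_cpu_supported",
    "is_cpu_gpu_atomic_supported",
)


def memory_access_flags(device):
    """Return the three CPU/GPU memory access flags of `device` as a dict.

    Hypothetical helper, not a Warp API. A missing attribute is reported as
    False (an assumption, not documented Warp behavior).
    """
    return {name: bool(getattr(device, name, False)) for name in CAPABILITY_NAMES}


# Stand-in object for illustration; with Warp, pass wp.get_device("cuda:0").
example_device = SimpleNamespace(
    is_cpu_memory_access_from_gpu_supported=True,
    is_gpu_memory_access_from_cpu_supported=False,
    is_cpu_gpu_atomic_supported=False,
)
flags = memory_access_flags(example_device)
```

Each key is reported independently, which mirrors the point above: one flag being true says nothing about the others.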
Common CPU/GPU memory models#
The exact values are reported by the CUDA driver and may vary by platform, driver, kernel, and GPU generation. The following table summarizes the models advanced users commonly need to reason about:
| System model | GPU access to CPU arrays | CPU access to GPU-resident managed memory | CPU/GPU atomics |
|---|---|---|---|
| Discrete GPU without HMM | Usually no | Usually no | Usually no |
| Discrete GPU with Linux HMM | Yes | Usually no | Usually no |
| Jetson Thor-style ATS | Yes | Platform-dependent for managed memory | Yes, when reported by the driver |
| Host-page-table ATS with distinct CPU/GPU physical memory | Yes | Only when reported by the driver | Yes, when reported by the driver |
HMM stands for Heterogeneous Memory Management; for background, see NVIDIA’s
HMM overview.
ATS stands for Address Translation Services. Warp does not require users to
classify the platform manually. Query the Device properties and branch
on the behavior your program needs.
For the CUDA-level model behind these categories, see the CUDA Programming Guide’s Unified and System Memory chapter, especially its Unified Memory paradigms table.
Do not infer CPU access to GPU-resident CUDA managed memory from ATS, C2C, or a
product family name. For example, a DGX Spark-class GB10 system can report ATS
and GPU access to CPU memory while
device.is_gpu_memory_access_from_cpu_supported is False. Query the
property directly before CPU code reads or writes GPU-resident managed memory.
Launching GPU kernels with CPU arrays#
When device.is_cpu_memory_access_from_gpu_supported is true, a GPU kernel can
directly read or write a CPU array:
device = wp.get_device("cuda:0")
a = wp.zeros(1024, dtype=float, device="cpu")
if device.is_cpu_memory_access_from_gpu_supported:
    wp.launch(kernel, dim=a.size, inputs=[a], device=device)
else:
    a_gpu = a.to(device)
    wp.launch(kernel, dim=a_gpu.size, inputs=[a_gpu], device=device)
This can avoid explicit copies on HMM and coherent CPU/GPU systems. If the capability is false and the kernel actually dereferences the CPU pointer, CUDA will report a runtime error such as an illegal memory access.
Accessing GPU data from CPU code#
CPU access to GPU-resident managed memory is a separate capability:
device = wp.get_device("cuda:0")
if device.is_gpu_memory_access_from_cpu_supported:
    ...
Important
device.is_gpu_memory_access_from_cpu_supported reports a hardware
capability for CUDA managed memory. Warp exposes the property today, but
standard Warp CUDA arrays are not managed-memory allocations. Until Warp
provides managed-memory allocation APIs, copy CUDA arrays to "cpu" before
CPU code reads or writes them.
CUDA arrays created by standard Warp array constructors, such as
zeros(), empty(), and ones(), are not CUDA managed-memory
allocations. This is true whether the array comes from Warp’s mempool
allocator or the built-in default CUDA allocator. For
those arrays, use an explicit copy before CPU code reads or writes the data:
a = wp.zeros(1024, dtype=float, device=device)
a_cpu = a.to("cpu")
wp.launch(cpu_kernel, dim=a_cpu.size, inputs=[a_cpu], device="cpu")
Do not assume that GPU access to CPU memory implies CPU access to GPU-resident memory. Some systems support the former but not the latter.
Checking access for a specific array with wp.can_access()#
The function warp.can_access() answers whether code running on one device
can directly access a specific Warp array:
launch_device = wp.get_device("cuda:0")
data = wp.empty(1024, dtype=float, device="cpu")
if wp.can_access(launch_device, data):
    ...
For CPU arrays passed to CUDA kernels, pinned CPU arrays are accepted on CUDA
devices with unified virtual addressing, and unpinned CPU arrays require
is_cpu_memory_access_from_gpu_supported. For CUDA arrays, default CUDA
allocations use CUDA peer-access state, while memory pool allocations use
memory-pool access state. See Memory Pool Access for the distinction between
peer access for default CUDA allocations and memory-pool access for mempool
allocations.
wp.can_access(device, array) returns False when Warp cannot verify that
the array is directly accessible. This includes cross-device arrays backed by
custom allocators or externally wrapped allocations whose allocation kind is not
known to Warp. A False result means “not verified accessible”; it does not
prove that the hardware could never access the pointer.
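Because a False result only means "not verified", one conservative response is to fall back to an explicit copy. The sketch below assumes that behavior is acceptable for your program. `array_for_launch` is a hypothetical helper, not a Warp function, and the access check is passed in as a callable so the logic stays independent of any particular Warp version.

```python
def array_for_launch(array, launch_device, can_access):
    """Return an array that `launch_device` is verified to access.

    Hypothetical helper. `can_access` is a callable taking (device, array);
    with Warp this would be wp.can_access as described on this page.
    `array.to(device)` is the copy method used elsewhere on this page.
    """
    if can_access(launch_device, array):
        # Verified accessible: launch on the original allocation (zero-copy).
        return array
    # Not verified: launch on a copy that lives on the launch device.
    return array.to(launch_device)
```

With Warp, the call would look like `array_for_launch(data, launch_device, wp.can_access)`. When the fallback copy is taken, kernel writes land in the copy, so results have to be copied back if the original array must see them.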
wp.can_access() is a resource-oriented API. In this release, the second
argument must be a concrete Warp array instance. Annotation-only arrays such as
wp.array(dtype=float) or wp.array[float] and device objects are not
supported.
Checking coarse device access with Device.can_access()#
The method Device.can_access() is a coarse device-level query for cases
where no concrete array is available:
launch_device = wp.get_device("cuda:0")
array_device = wp.get_device("cpu")
if launch_device.can_access(array_device):
    ...
For GPU kernels accessing CPU arrays, this method uses
is_cpu_memory_access_from_gpu_supported because standard Warp CPU arrays use
unpinned CPU memory. For CPU code accessing CUDA arrays, it returns False for
Warp CUDA arrays because the built-in CUDA allocators do not create CUDA
managed-memory allocations. For GPU/GPU pairs, it reflects the target device’s
current built-in allocator mode: memory-pool access when memory pools are
enabled on the target device, and peer access otherwise.
Device.can_access() is not authoritative for existing arrays. An array may
have been allocated before memory-pool settings changed, may use a custom
allocator, or may wrap external memory. Code that has an actual array should use
wp.can_access(device, array) instead.
Checking array access before launch#
By default, Warp launches kernels after type, dtype, and dimension validation without checking array accessibility. This keeps the launch path lightweight and allows hardware-supported mixed CPU/GPU launches to work.
If you want a clear Python error before the kernel runs, set
warp.config.launch_array_access_mode:
wp.config.launch_array_access_mode = wp.config.LaunchArrayAccessMode.CHECKED
- wp.config.LaunchArrayAccessMode.RELAXED is the default and performs no pre-launch array access checks beyond type, dtype, and dimension validation.
- wp.config.LaunchArrayAccessMode.STRICT restores Warp’s original same-device rule and requires every Warp array argument to be allocated on the launch device.
- wp.config.LaunchArrayAccessMode.CHECKED raises an error before launch when Warp can determine that a cross-device Warp array argument is not accessible from the launch device. This is useful when debugging mixed-device launches on systems that do not support direct CPU/GPU memory access or on multi-GPU systems where peer and memory-pool access are configured separately.
Arrays backed by custom or externally wrapped allocators are a limitation of this
diagnostic. Warp does not know the allocation kind for those arrays, so
wp.config.LaunchArrayAccessMode.CHECKED emits a UserWarning once per
(kernel, argument name, source device, launch device) pattern and allows the
launch to proceed. Use wp.config.LaunchArrayAccessMode.STRICT if unknown allocation
provenance should be rejected, or wp.config.LaunchArrayAccessMode.RELAXED to suppress
the diagnostic.
Objects exposing __array_interface__ are accepted only for CPU launches.
Warp treats that protocol as a CPU-addressable pointer and does not infer CUDA
allocation provenance from it, so wp.config.LaunchArrayAccessMode.CHECKED has no
cross-device access decision to make for that protocol.
Directly passing an object that exposes __cuda_array_interface__ is
different from passing a Warp array. The protocol lets Warp construct the kernel
argument at launch time, but it does not identify the allocation device or
allocation kind. In this phase, wp.config.LaunchArrayAccessMode.CHECKED does not fully
verify directly passed objects exposing this protocol. Advanced users who know
such an allocation is valid are responsible for ensuring that the launch device
can legally access the pointer.
wp.config.launch_array_access_mode = wp.config.LaunchArrayAccessMode.CHECKED
wp.launch(kernel, dim=a.size, inputs=[a], device="cuda:0")
warp.config.launch_array_access_mode can add launch overhead in
wp.config.LaunchArrayAccessMode.STRICT and wp.config.LaunchArrayAccessMode.CHECKED modes.
Use wp.config.LaunchArrayAccessMode.RELAXED in performance-sensitive code that has
already validated its launch accessibility assumptions.
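One way to limit that overhead to a debugging session is to switch the mode only around the launches under investigation and restore the previous value afterwards. The following is a minimal sketch using a generic context manager. `temporary_setting` is a hypothetical name, not something Warp provides, and it works on any object with settable attributes.

```python
from contextlib import contextmanager


@contextmanager
def temporary_setting(config, name, value):
    """Set `config.<name>` to `value` inside the block, then restore it.

    Hypothetical helper, not a Warp API. The previous value is restored
    even if the body raises.
    """
    previous = getattr(config, name)
    setattr(config, name, value)
    try:
        yield
    finally:
        setattr(config, name, previous)


# Intended use with Warp, using the setting and enum named on this page:
#
#     mode = wp.config.LaunchArrayAccessMode.CHECKED
#     with temporary_setting(wp.config, "launch_array_access_mode", mode):
#         wp.launch(kernel, dim=a.size, inputs=[a], device="cuda:0")
```

Restoring in a `finally` clause matters here because a CHECKED launch is expected to raise when an argument is not accessible.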
Unlike warp.config.verify_cuda,
warp.config.launch_array_access_mode can be used during CUDA graph
capture because wp.config.LaunchArrayAccessMode.CHECKED checks run before each launch
is recorded. For cross-GPU graph capture, enable peer access or memory-pool
access with Warp APIs before capture begins so verification can use the recorded
access state during capture. When a CUDA graph captures a launch with CPU array
arguments, replay uses the same captured CPU pointers. If the arrays remain
alive, CPU updates made between replays are visible to kernels on devices that
can access CPU memory.
Checking CPU/GPU atomic support#
Direct loads and stores do not imply atomic safety. Code that uses atomics from
both CPU and GPU code paths on the same allocation should also check
Device.is_cpu_gpu_atomic_supported:
device = wp.get_device("cuda:0")
if not device.is_cpu_gpu_atomic_supported:
    raise RuntimeError("This algorithm requires CPU/GPU atomic support")
Device.is_cpu_gpu_atomic_supported
answers only whether CPU/GPU atomic operations are supported for an otherwise
accessible allocation. The allocation must still be accessible from both the CPU
and GPU, and the program must provide any required synchronization.
For example, GPU atomics into a CPU allocation require both GPU access to CPU memory and CPU/GPU atomic support:
device = wp.get_device("cuda:0")
counters = wp.zeros(1, dtype=wp.int32, device="cpu")
if (
    device.is_cpu_memory_access_from_gpu_supported
    and device.is_cpu_gpu_atomic_supported
):
    wp.launch(update_counters, dim=n, inputs=[counters], device=device)
    wp.synchronize_device(device)
    print(counters.numpy()[0])
The same requirements apply when CPU and GPU work overlap. If a CPU kernel and a GPU kernel both write the same allocation concurrently, all conflicting accesses must use atomic operations, and the device must report CPU/GPU atomic support. Atomicity prevents lost updates, but it does not provide a deterministic ordering for non-commutative operations or floating-point accumulation:
# Assume both kernels call wp.atomic_add(counters, 0, 1) once per thread.
counters = wp.zeros(1, dtype=wp.int32, device="cpu")
if (
    device.is_cpu_memory_access_from_gpu_supported
    and device.is_cpu_gpu_atomic_supported
):
    wp.launch(gpu_increment, dim=num_gpu_threads, inputs=[counters], device=device)
    wp.launch(cpu_increment, dim=num_cpu_threads, inputs=[counters], device="cpu")
    wp.synchronize_device(device)
    assert counters.numpy()[0] == num_gpu_threads + num_cpu_threads
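The integer counter above has an order-independent result, which is why the final assertion is safe. Floating-point accumulation does not share that property. The plain-Python sketch below does not use Warp or atomics at all. It only illustrates that adding the same three values in a different order can change the rounded result, which is the effect an unordered sequence of atomic adds is exposed to.

```python
def accumulate(values):
    """Add values one at a time, in the order given."""
    total = 0.0
    for value in values:
        total += value
    return total


contributions = [0.1, 0.2, 0.3]

# Same contributions, two arrival orders.
forward = accumulate(contributions)
backward = accumulate(reversed(contributions))

# Rounding happens after every addition, so the totals differ slightly.
orders_agree = forward == backward  # False for these values

# Integer addition is exact, so arrival order does not matter.
int_orders_agree = accumulate([1, 2, 3]) == accumulate([3, 2, 1])  # True
```

If bitwise-reproducible results are required, accumulate into integers or reduce in a fixed order instead of relying on concurrent floating-point atomics.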
If Device.is_cpu_gpu_atomic_supported
is False, do not rely on concurrent CPU/GPU atomics, even on systems where
the GPU can directly load and store CPU memory.
CPU/GPU atomic support does not make non-managed CUDA allocations CPU-accessible. CPU code should still copy arrays backed by those allocations before reading or writing them:
values = wp.zeros(1024, dtype=float, device=device)
values_cpu = values.to("cpu")
Choosing a memory access pattern#
Use the same-device pattern unless you need zero-copy CPU/GPU access. When you
already have an array, use wp.can_access(device, array)
to decide whether a specific launch device can directly access that allocation.
Capability flags are most useful before allocation, when deciding what kind of
allocation or access pattern to create:
- GPU kernel reads or writes unpinned CPU arrays: check device.is_cpu_memory_access_from_gpu_supported.
- GPU kernel reads or writes pinned CPU arrays: use pinned=True and check device.is_uva.
- CPU code reads or writes arrays backed by non-managed CUDA allocations: copy the data to "cpu" first.
- CPU code accesses externally provided GPU-resident CUDA managed memory: check device.is_gpu_memory_access_from_cpu_supported.
- CPU and GPU both use atomics on the same allocation: make sure the allocation is accessible from both the CPU and GPU, and check device.is_cpu_gpu_atomic_supported.
- GPU kernels use arrays from another GPU: enable peer access for default CUDA allocations, or memory-pool access for CUDA memory-pool allocations, then check the concrete array with wp.can_access(device, array).
- Debugging mixed-device launch failures: temporarily set warp.config.launch_array_access_mode to wp.config.LaunchArrayAccessMode.CHECKED.
Prefer capability checks over platform-name checks. They make code portable across discrete GPUs, HMM-enabled systems, Jetson, Grace, and future coherent CPU/GPU platforms.