CPU/GPU cross-device memory access#

Warp arrays are associated with an allocation Device such as "cpu" or "cuda:0", and kernels run on a launch device. The portable default is to launch kernels on the same device as their array arguments. On systems with hardware-supported CPU/GPU memory access, some cross-device patterns can also be valid: a GPU may be able to read or write unpinned CPU memory directly, and some systems can let CPU code directly access GPU-resident CUDA managed memory.

This page describes how Warp exposes those hardware capabilities and how to use them when writing mixed CPU/GPU code.

Launching with arrays on the same device#

The launch device determines where a kernel runs, and the array device describes where the array allocation lives:

cpu_array = wp.zeros(1024, dtype=float, device="cpu")
gpu_array = wp.zeros(1024, dtype=float, device="cuda:0")

wp.launch(kernel, dim=cpu_array.size, inputs=[cpu_array], device="cpu")
wp.launch(kernel, dim=gpu_array.size, inputs=[gpu_array], device="cuda:0")

The same-device pattern works on all supported systems. Passing an array from one device to a kernel running on another device depends on the capabilities of the device that performs the access.

Device capability properties#

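Each device exposes three CPU/GPU memory access properties:

  • Device.is_cpu_memory_access_from_gpu_supported: whether a GPU kernel can directly read or write unpinned CPU memory.

  • Device.is_gpu_memory_access_from_cpu_supported: whether CPU code can directly access GPU-resident CUDA managed memory.

  • Device.is_cpu_gpu_atomic_supported: whether CPU/GPU atomic operations are supported on an allocation that is otherwise accessible from both.

This deep dive focuses on how those capabilities affect cross-device launches, managed memory, atomics, and diagnostics.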

On CPU devices, these properties are always False. On GPU devices, each property describes a specific access path or operation; support for one does not imply support for another. For example, a system can allow GPU access to CPU memory without allowing CPU access to GPU-resident managed memory.
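As a minimal sketch, the three flags can be gathered into a dictionary for logging or branching. The helper name memory_access_flags is hypothetical and not part of Warp. It reads the properties with getattr and treats a missing attribute as False, which is an assumption made here so that the same code also runs against objects that lack these properties. The stand-in device exists only so the sketch runs without Warp or a GPU.

```python
from types import SimpleNamespace

# Property names as documented on this page.
ACCESS_PROPERTIES = (
    "is_cpu_memory_access_from_gpu_supported",
    "is_gpu_memory_access_from_cpu_supported",
    "is_cpu_gpu_atomic_supported",
)


def memory_access_flags(device):
    """Return the three CPU/GPU memory access flags of `device` as a dict.

    Hypothetical helper, not a Warp API. A missing attribute counts as False.
    """
    return {name: bool(getattr(device, name, False)) for name in ACCESS_PROPERTIES}


# Stand-in for a GPU that can read CPU memory but reports nothing else.
stub_gpu = SimpleNamespace(
    is_cpu_memory_access_from_gpu_supported=True,
    is_gpu_memory_access_from_cpu_supported=False,
    is_cpu_gpu_atomic_supported=False,
)
flags = memory_access_flags(stub_gpu)
```

With Warp imported as wp, memory_access_flags(wp.get_device("cuda:0")) would return the values reported for that GPU.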

Common CPU/GPU memory models#

The exact values are reported by the CUDA driver and may vary by platform, driver, kernel, and GPU generation. The following table summarizes the models advanced users commonly need to reason about:

| System model | GPU access to CPU arrays | CPU access to GPU-resident managed memory | CPU/GPU atomics |
| --- | --- | --- | --- |
| Discrete GPU without HMM | Usually no | Usually no | Usually no |
| Discrete GPU with Linux HMM | Yes | Usually no | Usually no |
| Jetson Thor-style ATS | Yes | Platform-dependent for managed memory | Yes, when reported by the driver |
| Host-page-table ATS with distinct CPU/GPU physical memory | Yes | Only when reported by the driver | Yes, when reported by the driver |

HMM stands for Heterogeneous Memory Management; for background, see NVIDIA’s HMM overview. ATS stands for Address Translation Services. Warp does not require users to classify the platform manually. Query the Device properties and branch on the behavior your program needs.

For the CUDA-level model behind these categories, see the CUDA Programming Guide’s Unified and System Memory chapter, especially its Unified Memory paradigms table.

Do not infer CPU access to GPU-resident CUDA managed memory from ATS, C2C, or a product family name. For example, a DGX Spark-class GB10 system can report ATS and GPU access to CPU memory while device.is_gpu_memory_access_from_cpu_supported is False. Query the property directly before CPU code reads or writes GPU-resident managed memory.

Launching GPU kernels with CPU arrays#

When device.is_cpu_memory_access_from_gpu_supported is true, a GPU kernel can directly read or write a CPU array:

device = wp.get_device("cuda:0")
a = wp.zeros(1024, dtype=float, device="cpu")

if device.is_cpu_memory_access_from_gpu_supported:
    wp.launch(kernel, dim=a.size, inputs=[a], device=device)
else:
    a_gpu = a.to(device)
    wp.launch(kernel, dim=a_gpu.size, inputs=[a_gpu], device=device)

This can avoid explicit copies on HMM and coherent CPU/GPU systems. If the capability is false and the kernel actually dereferences the CPU pointer, CUDA will report a runtime error such as an illegal memory access.

Accessing GPU data from CPU code#

CPU access to GPU-resident managed memory is a separate capability:

device = wp.get_device("cuda:0")
if device.is_gpu_memory_access_from_cpu_supported:
    ...

Important

device.is_gpu_memory_access_from_cpu_supported reports a hardware capability for CUDA managed memory. Warp exposes the property today, but standard Warp CUDA arrays are not managed-memory allocations. Until Warp provides managed-memory allocation APIs, copy CUDA arrays to "cpu" before CPU code reads or writes them.

CUDA arrays created by standard Warp array constructors, such as zeros(), empty(), and ones(), are not CUDA managed-memory allocations. This is true whether the array comes from Warp’s mempool allocator or the built-in default CUDA allocator. For those arrays, use an explicit copy before CPU code reads or writes the data:

a = wp.zeros(1024, dtype=float, device=device)
a_cpu = a.to("cpu")
wp.launch(cpu_kernel, dim=a_cpu.size, inputs=[a_cpu], device="cpu")

Do not assume that GPU access to CPU memory implies CPU access to GPU-resident memory. Some systems support the former but not the latter.

Checking access for a specific array with wp.can_access()#

The function warp.can_access() answers whether code running on one device can directly access a specific Warp array:

launch_device = wp.get_device("cuda:0")
data = wp.empty(1024, dtype=float, device="cpu")

if wp.can_access(launch_device, data):
    ...

The result depends on how the array was allocated. For CPU arrays passed to CUDA kernels, pinned CPU arrays are accessible from CUDA devices with unified virtual addressing (UVA), and unpinned CPU arrays require is_cpu_memory_access_from_gpu_supported. For CUDA arrays, default CUDA allocations use CUDA peer-access state, while memory-pool allocations use memory-pool access state. See Memory Pool Access for that distinction.

wp.can_access(device, array) returns False when Warp cannot verify that the array is directly accessible. This includes cross-device arrays backed by custom allocators or externally wrapped allocations whose allocation kind is not known to Warp. A False result means “not verified accessible”; it does not prove that the hardware could never access the pointer.
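Because False only means "not verified accessible", one conservative policy is to copy whenever access cannot be verified. The sketch below assumes that policy. The helper name array_for_launch is hypothetical, and the access query is passed in as a callable with the same argument order as wp.can_access(device, array) so that the example runs without Warp. The stub array and stub query are stand-ins, not Warp types.

```python
def array_for_launch(array, launch_device, can_access):
    """Return an array that `launch_device` is verified to access.

    `can_access` is a callable taking (device, array), like wp.can_access.
    When access cannot be verified, fall back to an explicit copy.
    """
    if can_access(launch_device, array):
        return array
    return array.to(launch_device)


class _StubArray:
    """Stand-in for a Warp array so the sketch runs without Warp installed."""

    def __init__(self, device):
        self.device = device

    def to(self, device):
        # Model a copy as a new array on the target device.
        return _StubArray(device)


def _stub_can_access(device, array):
    # Pretend only same-device access can be verified.
    return array.device == device


cpu_data = _StubArray("cpu")
staged = array_for_launch(cpu_data, "cuda:0", _stub_can_access)  # copied
direct = array_for_launch(cpu_data, "cpu", _stub_can_access)     # used as-is
```

With Warp, the call would look like array_for_launch(data, launch_device, wp.can_access). Code that knows an externally wrapped pointer is accessible can skip the helper and pass the array directly.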

wp.can_access() is a resource-oriented API. In this release, the second argument must be a concrete Warp array instance. Annotation-only arrays, such as wp.array(dtype=float) or wp.array[float], and device objects are not supported as the second argument.

Checking coarse device access with Device.can_access()#

The method Device.can_access() is a coarse device-level query for cases where no concrete array is available:

launch_device = wp.get_device("cuda:0")
array_device = wp.get_device("cpu")

if launch_device.can_access(array_device):
    ...

For GPU kernels accessing CPU arrays, this method uses is_cpu_memory_access_from_gpu_supported because standard Warp CPU arrays use unpinned CPU memory. For CPU code accessing CUDA arrays, it returns False for Warp CUDA arrays because the built-in CUDA allocators do not create CUDA managed-memory allocations. For GPU/GPU pairs, it reflects the target device’s current built-in allocator mode: memory-pool access when memory pools are enabled on the target device, and peer access otherwise.

Device.can_access() is not authoritative for existing arrays. An array may have been allocated before memory-pool settings changed, may use a custom allocator, or may wrap external memory. Code that has an actual array should use wp.can_access(device, array) instead.
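The coarse query is most useful before any array exists, for example when deciding where to allocate. The following sketch assumes a simple policy: allocate on the preferred device if the launch device reports access to it, otherwise allocate on the launch device. The name choose_allocation_device is hypothetical, and StubDevice is a stand-in that mimics only the can_access() method described above.

```python
class StubDevice:
    """Stand-in for a Warp Device so the sketch runs without Warp installed."""

    def __init__(self, alias, accessible=()):
        self.alias = alias
        self._accessible = set(accessible)

    def can_access(self, other):
        # A device can always access its own allocations.
        return other is self or other.alias in self._accessible


def choose_allocation_device(launch_device, preferred_device):
    """Pick an allocation device using the coarse device-level query.

    This is a planning decision only. Once a concrete array exists,
    recheck it with the array-level query.
    """
    if launch_device.can_access(preferred_device):
        return preferred_device
    return launch_device


cpu = StubDevice("cpu")
coherent_gpu = StubDevice("cuda:0", accessible={"cpu"})
discrete_gpu = StubDevice("cuda:1")

on_coherent = choose_allocation_device(coherent_gpu, cpu)  # stays on the CPU
on_discrete = choose_allocation_device(discrete_gpu, cpu)  # falls back to the GPU
```

With Warp, the arguments would be objects returned by wp.get_device().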

Checking array access before launch#

By default, Warp launches kernels after type, dtype, and dimension validation without checking array accessibility. This keeps the launch path lightweight and allows hardware-supported mixed CPU/GPU launches to work.

If you want a clear Python error before the kernel runs, set warp.config.launch_array_access_mode:

wp.config.launch_array_access_mode = wp.config.LaunchArrayAccessMode.CHECKED

  • wp.config.LaunchArrayAccessMode.RELAXED is the default and performs no pre-launch array access checks beyond type, dtype, and dimension validation.

  • wp.config.LaunchArrayAccessMode.STRICT restores Warp’s original same-device rule and requires every Warp array argument to be allocated on the launch device.

  • wp.config.LaunchArrayAccessMode.CHECKED raises an error before launch when Warp can determine that a cross-device Warp array argument is not accessible from the launch device. This is useful when debugging mixed-device launches on systems that do not support direct CPU/GPU memory access or on multi-GPU systems where peer and memory-pool access are configured separately.

Arrays backed by custom or externally wrapped allocators are a limitation of this diagnostic. Warp does not know the allocation kind for those arrays, so wp.config.LaunchArrayAccessMode.CHECKED emits a UserWarning once per (kernel, argument name, source device, launch device) pattern and allows the launch to proceed. Use wp.config.LaunchArrayAccessMode.STRICT if unknown allocation provenance should be rejected, or wp.config.LaunchArrayAccessMode.RELAXED to suppress the diagnostic.
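If a test suite should fail on that warning while keeping the other CHECKED behavior, Python's standard warnings filters can escalate UserWarning to an exception around a call. This is a sketch under assumptions: the helper name call_with_warnings_as_errors is hypothetical, the filter affects every UserWarning raised during the call, and because the diagnostic is emitted once per pattern, later launches with the same pattern may not raise again. The stub launch function stands in for a launch that triggers the diagnostic.

```python
import warnings


def call_with_warnings_as_errors(fn, *args, **kwargs):
    """Call `fn` and turn any UserWarning it emits into an exception."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", category=UserWarning)
        return fn(*args, **kwargs)


def _stub_launch():
    # Stand-in for a launch that emits the unknown-allocation diagnostic.
    warnings.warn("allocation kind is not known", UserWarning)
    return "launched"


try:
    outcome = call_with_warnings_as_errors(_stub_launch)
except UserWarning:
    outcome = "rejected"
```

With Warp, fn would be wp.launch and the remaining arguments would be the usual launch arguments.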

Objects exposing __array_interface__ are accepted only for CPU launches. Warp treats that protocol as a CPU-addressable pointer and does not infer CUDA allocation provenance from it, so wp.config.LaunchArrayAccessMode.CHECKED has no cross-device access decision to make for that protocol.

Directly passing an object that exposes __cuda_array_interface__ is different from passing a Warp array. The protocol lets Warp construct the kernel argument at launch time, but it does not identify the allocation device or allocation kind. In this release, wp.config.LaunchArrayAccessMode.CHECKED does not fully verify directly passed objects exposing this protocol. Advanced users who know such an allocation is valid are responsible for ensuring that the launch device can legally access the pointer.

wp.config.launch_array_access_mode = wp.config.LaunchArrayAccessMode.CHECKED
wp.launch(kernel, dim=a.size, inputs=[a], device="cuda:0")

Setting warp.config.launch_array_access_mode to wp.config.LaunchArrayAccessMode.STRICT or wp.config.LaunchArrayAccessMode.CHECKED can add launch overhead. Use wp.config.LaunchArrayAccessMode.RELAXED in performance-sensitive code that has already validated its launch accessibility assumptions.
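To limit that overhead to the code being debugged, the mode can be changed for one block and then restored. The context manager below is a generic sketch, not a Warp API. The name temporary_setting is hypothetical, and the stub configuration object with string values stands in for warp.config so the example runs without Warp.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_setting(config, name, value):
    """Set config.<name> to `value` inside the with-block, then restore it."""
    previous = getattr(config, name)
    setattr(config, name, value)
    try:
        yield
    finally:
        # Restore the old value even if the block raises.
        setattr(config, name, previous)


# Stand-in for warp.config; the string values are placeholders.
stub_config = SimpleNamespace(launch_array_access_mode="RELAXED")

with temporary_setting(stub_config, "launch_array_access_mode", "CHECKED"):
    mode_inside = stub_config.launch_array_access_mode

mode_after = stub_config.launch_array_access_mode
```

With Warp, the same pattern would wrap the launches under investigation: pass wp.config as the first argument and wp.config.LaunchArrayAccessMode.CHECKED as the value.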

Unlike warp.config.verify_cuda, warp.config.launch_array_access_mode can be used during CUDA graph capture because wp.config.LaunchArrayAccessMode.CHECKED checks run before each launch is recorded. For cross-GPU graph capture, enable peer access or memory-pool access with Warp APIs before capture begins so verification can use the recorded access state during capture. When a CUDA graph captures a launch with CPU array arguments, replay uses the same captured CPU pointers. If the arrays remain alive, CPU updates made between replays are visible to kernels on devices that can access CPU memory.

Checking CPU/GPU atomic support#

Direct loads and stores do not imply atomic safety. Code that uses atomics from both CPU and GPU code paths on the same allocation should also check Device.is_cpu_gpu_atomic_supported:

device = wp.get_device("cuda:0")

if not device.is_cpu_gpu_atomic_supported:
    raise RuntimeError("This algorithm requires CPU/GPU atomic support")

Device.is_cpu_gpu_atomic_supported answers only whether CPU/GPU atomic operations are supported for an otherwise accessible allocation. The allocation must still be accessible from both the CPU and GPU, and the program must provide any required synchronization.

For example, GPU atomics into a CPU allocation require both GPU access to CPU memory and CPU/GPU atomic support:

device = wp.get_device("cuda:0")
counters = wp.zeros(1, dtype=wp.int32, device="cpu")

if (
    device.is_cpu_memory_access_from_gpu_supported
    and device.is_cpu_gpu_atomic_supported
):
    wp.launch(update_counters, dim=n, inputs=[counters], device=device)
    wp.synchronize_device(device)
    print(counters.numpy()[0])

The same requirements apply when CPU and GPU work overlap. If a CPU kernel and a GPU kernel both write the same allocation concurrently, all conflicting accesses must use atomic operations, and the device must report CPU/GPU atomic support. Atomicity prevents lost updates, but it does not provide a deterministic ordering for non-commutative operations or floating-point accumulation:

# Assume both kernels call wp.atomic_add(counters, 0, 1) once per thread.
device = wp.get_device("cuda:0")
counters = wp.zeros(1, dtype=wp.int32, device="cpu")

if (
    device.is_cpu_memory_access_from_gpu_supported
    and device.is_cpu_gpu_atomic_supported
):
    wp.launch(gpu_increment, dim=num_gpu_threads, inputs=[counters], device=device)
    wp.launch(cpu_increment, dim=num_cpu_threads, inputs=[counters], device="cpu")

    wp.synchronize_device(device)
    assert counters.numpy()[0] == num_gpu_threads + num_cpu_threads

If Device.is_cpu_gpu_atomic_supported is False, do not rely on concurrent CPU/GPU atomics, even on systems where the GPU can directly load and store CPU memory.

CPU/GPU atomic support does not make non-managed CUDA allocations CPU-accessible. CPU code should still copy arrays backed by those allocations before reading or writing them:

values = wp.zeros(1024, dtype=float, device=device)
values_cpu = values.to("cpu")

Choosing a memory access pattern#

Use the same-device pattern unless you need zero-copy CPU/GPU access. When you already have an array, use wp.can_access(device, array) to decide whether a specific launch device can directly access that allocation. Capability flags are most useful before allocation, when deciding what kind of allocation or access pattern to create:

  • GPU kernel reads or writes unpinned CPU arrays: check device.is_cpu_memory_access_from_gpu_supported.

  • GPU kernel reads or writes pinned CPU arrays: use pinned=True and check device.is_uva.

  • CPU code reads or writes arrays backed by non-managed CUDA allocations: copy the data to "cpu" first.

  • CPU code accesses externally provided GPU-resident CUDA managed memory: check device.is_gpu_memory_access_from_cpu_supported.

  • CPU and GPU both use atomics on the same allocation: make sure the allocation is accessible from both the CPU and GPU, and check device.is_cpu_gpu_atomic_supported.

  • GPU kernels use arrays from another GPU: enable peer access for default CUDA allocations, or memory-pool access for CUDA memory-pool allocations, then check the concrete array with wp.can_access(device, array).

  • Debugging mixed-device launch failures: temporarily set warp.config.launch_array_access_mode to wp.config.LaunchArrayAccessMode.CHECKED.

Prefer capability checks over platform-name checks. They make code portable across discrete GPUs, HMM-enabled systems, Jetson, Grace, and future coherent CPU/GPU platforms.
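The first two entries in the list above can be sketched as a single pre-allocation check. The helper name gpu_can_use_cpu_array is hypothetical. It relies only on the two properties named in those entries, device.is_uva for pinned memory and device.is_cpu_memory_access_from_gpu_supported for unpinned memory, and treats a missing attribute as False. The stand-in device models a discrete GPU without HMM so the sketch runs without Warp.

```python
from types import SimpleNamespace


def gpu_can_use_cpu_array(device, pinned):
    """Decide, before allocation, whether a GPU kernel may use a CPU array directly.

    pinned=True  relies on unified virtual addressing (device.is_uva).
    pinned=False relies on is_cpu_memory_access_from_gpu_supported.
    """
    if pinned:
        return bool(getattr(device, "is_uva", False))
    return bool(getattr(device, "is_cpu_memory_access_from_gpu_supported", False))


# Stand-in for a discrete GPU without HMM: UVA is available, but
# direct access to unpinned CPU memory is not.
discrete_gpu = SimpleNamespace(
    is_uva=True,
    is_cpu_memory_access_from_gpu_supported=False,
)

pinned_ok = gpu_can_use_cpu_array(discrete_gpu, pinned=True)
unpinned_ok = gpu_can_use_cpu_array(discrete_gpu, pinned=False)
```

On such a device, the result suggests allocating the CPU array with pinned=True or copying it to the GPU before launch. Once the array exists, wp.can_access(device, array) remains the authoritative check.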