CPU/GPU cross-device memory access ================================== .. currentmodule:: warp Warp arrays are associated with an allocation :class:`Device` such as ``"cpu"`` or ``"cuda:0"``, and kernels run on a launch device. The portable default is to launch kernels on the same device as their array arguments. On systems with hardware-supported CPU/GPU memory access, some cross-device patterns can also be valid: a GPU may be able to read or write unpinned CPU memory directly, and some systems can let CPU code directly access GPU-resident CUDA managed memory. This page describes how Warp exposes those hardware capabilities and how to use them when writing mixed CPU/GPU code. Launching with arrays on the same device ---------------------------------------- The launch device determines where a kernel runs, and the array device describes where the array allocation lives: .. code:: python cpu_array = wp.zeros(1024, dtype=float, device="cpu") gpu_array = wp.zeros(1024, dtype=float, device="cuda:0") wp.launch(kernel, dim=cpu_array.size, inputs=[cpu_array], device="cpu") wp.launch(kernel, dim=gpu_array.size, inputs=[gpu_array], device="cuda:0") The same-device pattern works on all supported systems. Passing an array from one device to a kernel running on another device depends on the capabilities of the device that performs the access. Device capability properties ---------------------------- Each device exposes three CPU/GPU memory access properties: - :attr:`Device.is_cpu_memory_access_from_gpu_supported ` - :attr:`Device.is_gpu_memory_access_from_cpu_supported ` - :attr:`Device.is_cpu_gpu_atomic_supported ` This deep dive focuses on how those capabilities affect cross-device launches, managed memory, atomics, and diagnostics. On CPU devices, these properties are always ``False``. On GPU devices, each property describes a specific access path or operation; support for one does not imply support for another. For example, a system can allow GPU access to CPU memory without allowing CPU access to GPU-resident managed memory. Common CPU/GPU memory models ---------------------------- The exact values are reported by the CUDA driver and may vary by platform, driver, kernel, and GPU generation. The following table summarizes the models advanced users commonly need to reason about: .. list-table:: :header-rows: 1 :widths: 24 25 25 26 * - System model - GPU access to CPU arrays - CPU access to GPU-resident managed memory - CPU/GPU atomics * - Discrete GPU without HMM - Usually no - Usually no - Usually no * - Discrete GPU with Linux HMM - Yes - Usually no - Usually no * - Jetson Thor-style ATS - Yes - Platform-dependent for managed memory - Yes, when reported by the driver * - Host-page-table ATS with distinct CPU/GPU physical memory - Yes - Only when reported by the driver - Yes, when reported by the driver HMM stands for Heterogeneous Memory Management; for background, see NVIDIA's `HMM overview `__. ATS stands for Address Translation Services. Warp does not require users to classify the platform manually. Query the :class:`Device` properties and branch on the behavior your program needs. For the CUDA-level model behind these categories, see the CUDA Programming Guide's `Unified and System Memory chapter `__, especially its `Unified Memory paradigms table `__. Do not infer CPU access to GPU-resident CUDA managed memory from ATS, C2C, or a product family name. For example, a DGX Spark-class GB10 system can report ATS and GPU access to CPU memory while ``device.is_gpu_memory_access_from_cpu_supported`` is ``False``. Query the property directly before CPU code reads or writes GPU-resident managed memory. Launching GPU kernels with CPU arrays ------------------------------------- When ``device.is_cpu_memory_access_from_gpu_supported`` is true, a GPU kernel can directly read or write a CPU array: .. code:: python device = wp.get_device("cuda:0") a = wp.zeros(1024, dtype=float, device="cpu") if device.is_cpu_memory_access_from_gpu_supported: wp.launch(kernel, dim=a.size, inputs=[a], device=device) else: a_gpu = a.to(device) wp.launch(kernel, dim=a_gpu.size, inputs=[a_gpu], device=device) This can avoid explicit copies on HMM and coherent CPU/GPU systems. If the capability is false and the kernel actually dereferences the CPU pointer, CUDA will report a runtime error such as an illegal memory access. Accessing GPU data from CPU code -------------------------------- CPU access to GPU-resident managed memory is a separate capability: .. code:: python device = wp.get_device("cuda:0") if device.is_gpu_memory_access_from_cpu_supported: ... .. important:: ``device.is_gpu_memory_access_from_cpu_supported`` reports a hardware capability for CUDA managed memory. Warp exposes the property today, but standard Warp CUDA arrays are not managed-memory allocations. Until Warp provides managed-memory allocation APIs, copy CUDA arrays to ``"cpu"`` before CPU code reads or writes them. CUDA arrays created by standard Warp array constructors, such as :func:`zeros`, :func:`empty`, and :func:`ones`, are not CUDA managed-memory allocations. This is true whether the array comes from Warp's :ref:`mempool allocator ` or the built-in default CUDA allocator. For those arrays, use an explicit copy before CPU code reads or writes the data: .. code:: python a = wp.zeros(1024, dtype=float, device=device) a_cpu = a.to("cpu") wp.launch(cpu_kernel, dim=a_cpu.size, inputs=[a_cpu], device="cpu") Do not assume that GPU access to CPU memory implies CPU access to GPU-resident memory. Some systems support the former but not the latter. Checking access for a specific array with ``wp.can_access()`` ------------------------------------------------------------- The function :func:`warp.can_access` answers whether code running on one device can directly access a specific Warp array: .. code:: python launch_device = wp.get_device("cuda:0") data = wp.empty(1024, dtype=float, device="cpu") if wp.can_access(launch_device, data): ... For CPU arrays passed to CUDA kernels, pinned CPU arrays are accepted on CUDA devices with unified virtual addressing, and unpinned CPU arrays require ``is_cpu_memory_access_from_gpu_supported``. For CUDA arrays, default CUDA allocations use CUDA peer-access state, while memory pool allocations use memory-pool access state. See :ref:`mempool_access` for the distinction between peer access for default CUDA allocations and memory-pool access for mempool allocations. ``wp.can_access(device, array)`` returns ``False`` when Warp cannot verify that the array is directly accessible. This includes cross-device arrays backed by custom allocators or externally wrapped allocations whose allocation kind is not known to Warp. A ``False`` result means "not verified accessible"; it does not prove that the hardware could never access the pointer. ``wp.can_access()`` is a resource-oriented API. In this release, the second argument must be a concrete Warp array instance. Annotation-only arrays such as ``wp.array(dtype=float)`` or ``wp.array[float]`` and device objects are not supported. Checking coarse device access with ``Device.can_access()`` ---------------------------------------------------------- The method :meth:`Device.can_access` is a coarse device-level query for cases where no concrete array is available: .. code:: python launch_device = wp.get_device("cuda:0") array_device = wp.get_device("cpu") if launch_device.can_access(array_device): ... For GPU kernels accessing CPU arrays, this method uses ``is_cpu_memory_access_from_gpu_supported`` because standard Warp CPU arrays use unpinned CPU memory. For CPU code accessing CUDA arrays, it returns ``False`` for Warp CUDA arrays because the built-in CUDA allocators do not create CUDA managed-memory allocations. For GPU/GPU pairs, it reflects the target device's current built-in allocator mode: memory-pool access when memory pools are enabled on the target device, and peer access otherwise. ``Device.can_access()`` is not authoritative for existing arrays. An array may have been allocated before memory-pool settings changed, may use a custom allocator, or may wrap external memory. Code that has an actual array should use ``wp.can_access(device, array)`` instead. .. _launch_array_access_checks: Checking array access before launch ----------------------------------- By default, Warp launches kernels after type, dtype, and dimension validation without checking array accessibility. This keeps the launch path lightweight and allows hardware-supported mixed CPU/GPU launches to work. If you want a clear Python error before the kernel runs, set :attr:`warp.config.launch_array_access_mode`: .. code:: python wp.config.launch_array_access_mode = wp.config.LaunchArrayAccessMode.CHECKED - ``wp.config.LaunchArrayAccessMode.RELAXED`` is the default and performs no pre-launch array access checks beyond type, dtype, and dimension validation. - ``wp.config.LaunchArrayAccessMode.STRICT`` restores Warp's original same-device rule and requires every Warp array argument to be allocated on the launch device. - ``wp.config.LaunchArrayAccessMode.CHECKED`` raises an error before launch when Warp can determine that a cross-device Warp array argument is not accessible from the launch device. This is useful when debugging mixed-device launches on systems that do not support direct CPU/GPU memory access or on multi-GPU systems where peer and memory-pool access are configured separately. Arrays backed by custom or externally wrapped allocators are a limitation of this diagnostic. Warp does not know the allocation kind for those arrays, so ``wp.config.LaunchArrayAccessMode.CHECKED`` emits a ``UserWarning`` once per ``(kernel, argument name, source device, launch device)`` pattern and allows the launch to proceed. Use ``wp.config.LaunchArrayAccessMode.STRICT`` if unknown allocation provenance should be rejected, or ``wp.config.LaunchArrayAccessMode.RELAXED`` to suppress the diagnostic. Objects exposing ``__array_interface__`` are accepted only for CPU launches. Warp treats that protocol as a CPU-addressable pointer and does not infer CUDA allocation provenance from it, so ``wp.config.LaunchArrayAccessMode.CHECKED`` has no cross-device access decision to make for that protocol. Directly passing an object that exposes ``__cuda_array_interface__`` is different from passing a Warp array. The protocol lets Warp construct the kernel argument at launch time, but it does not identify the allocation device or allocation kind. In this phase, ``wp.config.LaunchArrayAccessMode.CHECKED`` does not fully verify directly passed objects exposing this protocol. Advanced users who know such an allocation is valid are responsible for ensuring that the launch device can legally access the pointer. .. code:: python wp.config.launch_array_access_mode = wp.config.LaunchArrayAccessMode.CHECKED wp.launch(kernel, dim=a.size, inputs=[a], device="cuda:0") :attr:`warp.config.launch_array_access_mode` can add launch overhead in ``wp.config.LaunchArrayAccessMode.STRICT`` and ``wp.config.LaunchArrayAccessMode.CHECKED`` modes. Use ``wp.config.LaunchArrayAccessMode.RELAXED`` in performance-sensitive code that has already validated its launch accessibility assumptions. Unlike :attr:`warp.config.verify_cuda`, :attr:`warp.config.launch_array_access_mode` can be used during CUDA graph capture because ``wp.config.LaunchArrayAccessMode.CHECKED`` checks run before each launch is recorded. For cross-GPU graph capture, enable peer access or memory-pool access with Warp APIs before capture begins so verification can use the recorded access state during capture. When a CUDA graph captures a launch with CPU array arguments, replay uses the same captured CPU pointers. If the arrays remain alive, CPU updates made between replays are visible to kernels on devices that can access CPU memory. Checking CPU/GPU atomic support ------------------------------- Direct loads and stores do not imply atomic safety. Code that uses atomics from both CPU and GPU code paths on the same allocation should also check :attr:`Device.is_cpu_gpu_atomic_supported `: .. code:: python device = wp.get_device("cuda:0") if not device.is_cpu_gpu_atomic_supported: raise RuntimeError("This algorithm requires CPU/GPU atomic support") :attr:`Device.is_cpu_gpu_atomic_supported ` answers only whether CPU/GPU atomic operations are supported for an otherwise accessible allocation. The allocation must still be accessible from both the CPU and GPU, and the program must provide any required synchronization. For example, GPU atomics into a CPU allocation require both GPU access to CPU memory and CPU/GPU atomic support: .. code:: python device = wp.get_device("cuda:0") counters = wp.zeros(1, dtype=wp.int32, device="cpu") if ( device.is_cpu_memory_access_from_gpu_supported and device.is_cpu_gpu_atomic_supported ): wp.launch(update_counters, dim=n, inputs=[counters], device=device) wp.synchronize_device(device) print(counters.numpy()[0]) The same requirements apply when CPU and GPU work overlap. If a CPU kernel and a GPU kernel both write the same allocation concurrently, all conflicting accesses must use atomic operations, and the device must report CPU/GPU atomic support. Atomicity prevents lost updates, but it does not provide a deterministic ordering for non-commutative operations or floating-point accumulation: .. code:: python # Assume both kernels call wp.atomic_add(counters, 0, 1) once per thread. counters = wp.zeros(1, dtype=wp.int32, device="cpu") if ( device.is_cpu_memory_access_from_gpu_supported and device.is_cpu_gpu_atomic_supported ): wp.launch(gpu_increment, dim=num_gpu_threads, inputs=[counters], device=device) wp.launch(cpu_increment, dim=num_cpu_threads, inputs=[counters], device="cpu") wp.synchronize_device(device) assert counters.numpy()[0] == num_gpu_threads + num_cpu_threads If :attr:`Device.is_cpu_gpu_atomic_supported ` is ``False``, do not rely on concurrent CPU/GPU atomics, even on systems where the GPU can directly load and store CPU memory. That does not make non-managed CUDA allocations CPU-accessible. CPU code should still copy arrays backed by those allocations before reading or writing them: .. code:: python values = wp.zeros(1024, dtype=float, device=device) values_cpu = values.to("cpu") Choosing a memory access pattern -------------------------------- Use the same-device pattern unless you need zero-copy CPU/GPU access. When you already have an array, use :func:`wp.can_access(device, array) ` to decide whether a specific launch device can directly access that allocation. Capability flags are most useful before allocation, when deciding what kind of allocation or access pattern to create: - GPU kernel reads or writes unpinned CPU arrays: check ``device.is_cpu_memory_access_from_gpu_supported``. - GPU kernel reads or writes pinned CPU arrays: use ``pinned=True`` and check ``device.is_uva``. - CPU code reads or writes arrays backed by non-managed CUDA allocations: copy the data to ``"cpu"`` first. - CPU code accesses externally provided GPU-resident CUDA managed memory: check ``device.is_gpu_memory_access_from_cpu_supported``. - CPU and GPU both use atomics on the same allocation: make sure the allocation is accessible from both the CPU and GPU, and check ``device.is_cpu_gpu_atomic_supported``. - GPU kernels use arrays from another GPU: enable peer access for default CUDA allocations, or :ref:`memory-pool access ` for CUDA memory-pool allocations, then check the concrete array with :func:`wp.can_access(device, array) `. - Debugging mixed-device launch failures: temporarily set :attr:`warp.config.launch_array_access_mode` to ``wp.config.LaunchArrayAccessMode.CHECKED``. Prefer capability checks over platform-name checks. They make code portable across discrete GPUs, HMM-enabled systems, Jetson, Grace, and future coherent CPU/GPU platforms.