Python Reference#

This section describes the Warp Python runtime API, how to manage memory, launch kernels, and high-level functionality for dealing with objects such as meshes and volumes. The APIs described in this section are intended to be used at the Python Scope and run inside the CPython interpreter. For a comprehensive list of functions available at the Kernel Scope, please see the Built-Ins Reference section.

Kernels#

Kernels are defined via Python functions that are annotated with the @wp.kernel decorator. All arguments of the Python function must be annotated with their respective type. The following example shows a simple kernel that adds two arrays together:

import warp as wp

@wp.kernel
def add_kernel(a: wp.array(dtype=float), b: wp.array(dtype=float), c: wp.array(dtype=float)):
    tid = wp.tid()
    c[tid] = a[tid] + b[tid]

Kernels are launched with the wp.launch() function on a specific device (CPU/GPU):

wp.launch(add_kernel, dim=1024, inputs=[a, b], outputs=[c], device="cuda")

Note that all the kernel inputs and outputs must live on the target device or a runtime exception will be raised.

Unless you are using the Graph visualization tool, the outputs argument is optional – all kernel arguments may be passed as inputs, but for readability it is sometimes useful to distinguish between the kernel arguments that are read from (inputs) and the kernel arguments that are written to (outputs). So in the above example, it would be equally valid to write inputs=[a, b, c] but since we are writing to c, we list it in the outputs argument. Note that the combined inputs followed by outputs list should match the ordering of the kernel arguments.

Kernels may be launched with multi-dimensional grid bounds. In this case, threads are not assigned a single index, but a coordinate in an n-dimensional grid, e.g.:

wp.launch(complex_kernel, dim=(128, 128, 3), ...)

Launches a 3D grid of threads with dimension 128 x 128 x 3. To retrieve the 3D index for each thread, use the following syntax:

i,j,k = wp.tid()

Note

Currently, kernels launched on CPU devices will be executed in serial. Kernels launched on CUDA devices will be launched in parallel with a fixed block-size.

In the Warp Compilation Model, kernels are just-in-time compiled into dynamic libraries and PTX using C++/CUDA as an intermediate representation. To avoid excessive runtime recompilation of kernel code, these files are stored in a cache directory named with a module-dependent hash to allow for the reuse of previously compiled modules. The location of the kernel cache is printed when Warp is initialized. wp.clear_kernel_cache() can be used to clear the kernel cache of previously generated compilation artifacts as Warp does not automatically try to keep the cache below a certain size.

warp.kernel(f=None, *, enable_backward=None, module=None)[source]#

Decorator to register a Warp kernel from a Python function. The function must be defined with type annotations for all arguments. The function must not return anything.

Example:

@wp.kernel
def my_kernel(a: wp.array(dtype=float), b: wp.array(dtype=float)):
    tid = wp.tid()
    b[tid] = a[tid] + 1.0


@wp.kernel(enable_backward=False)
def my_kernel_no_backward(a: wp.array(dtype=float, ndim=2), x: float):
    # the backward pass will not be generated
    i, j = wp.tid()
    a[i, j] = x


@wp.kernel(module="unique")
def my_kernel_unique_module(a: wp.array(dtype=float), b: wp.array(dtype=float)):
    # the kernel will be registered in new unique module created just for this
    # kernel and its dependent functions and structs
    tid = wp.tid()
    b[tid] = a[tid] + 1.0

Parameters:

f (Callable | None) – The function to be registered as a kernel.
enable_backward (bool | None) – If False, the backward pass will not be generated.
module (Module | Literal['unique'] | None) – The warp.context.Module to which the kernel belongs. Alternatively, if a string “unique” is provided, the kernel is assigned to a new module named after the kernel name and hash. If None, the module is inferred from the function’s module.

Returns:

The registered kernel.

warp.launch( kernel, dim, inputs=[], outputs=[], adj_inputs=[], adj_outputs=[], device=None, stream=None, adjoint=False, record_tape=True, record_cmd=False, max_blocks=0, block_dim=256, )[source]#

Launch a Warp kernel on the target device

Kernel launches are asynchronous with respect to the calling Python thread.

Parameters:

kernel – The name of a Warp kernel function, decorated with the @wp.kernel decorator
dim (int | Sequence[int]) – The number of threads to launch the kernel, can be an integer or a sequence of integers with a maximum of 4 dimensions.
inputs (Sequence) – The input parameters to the kernel (optional)
outputs (Sequence) – The output parameters (optional)
adj_inputs (Sequence) – The adjoint inputs (optional)
adj_outputs (Sequence) – The adjoint outputs (optional)
device (Device | str | None) – The device to launch on.
stream (Stream | None) – The stream to launch on.
adjoint (bool) – Whether to run forward or backward pass (typically use False).
record_tape (bool) – When True, the launch will be recorded the global wp.Tape() object when present.
record_cmd (bool) – When True, the launch will return a Launch object. The launch will not occur until the user calls Launch.launch().
max_blocks (int) – The maximum number of CUDA thread blocks to use. Only has an effect for CUDA kernel launches. If negative or zero, the maximum hardware value will be used.
block_dim (int) – The number of threads per block (always 1 for “cpu” devices).

warp.launch_tiled(*args, **kwargs)[source]#

A helper method for launching a grid with an extra trailing dimension equal to the block size.

For example, to launch a 2D grid, where each element has 64 threads assigned you would use the following:

wp.launch_tiled(kernel, [M, N], inputs=[...], block_dim=64)

Which is equivalent to the following:

wp.launch(kernel, [M, N, 64], inputs=[...], block_dim=64)

Inside your kernel code you can retrieve the first two indices of the thread as usual, ignoring the implicit third dimension if desired:

@wp.kernel
def compute()

    i, j = wp.tid()

    ...

warp.clear_kernel_cache()[source]#

Clear the kernel cache directory of previously generated source code and compiler artifacts.

Only directories beginning with wp_ will be deleted. This function only clears the cache for the current Warp version. LTO artifacts are not affected.

Return type:: None

Runtime Kernel Creation#

Warp allows generating kernels on-the-fly with various customizations, including closure support. Refer to the Code Generation section for the latest features.

Launch Objects#

Launch objects are one way to reduce the overhead of launching a kernel multiple times. Launch objects are returned from calling wp.launch() with record_cmd=True. This stores the results of various overhead operations that are needed to launch a kernel but defers the actual kernel launch until the Launch.launch() method is called.

In contrast to Graphs, Launch objects only record the launch of a single kernel and do not reduce the driver overhead of preparing the kernel for execution on a GPU. On the other hand, Launch objects do not have the storage and initialization overheads of CUDA graphs and also allow for the modification of launch dimensions with Launch.set_dim() and kernel parameters with functions such as Launch.set_params() and Launch.set_param_by_name(). Additionally, Launch objects can also be used to reduce the overhead of launching kernels running on the CPU.

Note

Kernels launched via Launch objects currently do not get recorded onto the Tape.

class warp.Launch( kernel, device, hooks=None, params=None, params_addr=None, bounds=None, max_blocks=0, block_dim=256, adjoint=False, )[source]#

Represents all data required for a kernel launch so that launches can be replayed quickly.

Users should not directly instantiate this class, instead use wp.launch(..., record_cmd=True) to record a launch.

Parameters:

device (Device)
hooks (KernelHooks | None)
params (Sequence[Any] | None)
params_addr (Sequence[ctypes.c_void_p] | None)
bounds (launch_bounds_t | None)
max_blocks (int)
block_dim (int)
adjoint (bool)

device: Device#: The device to launch on. This should not be changed after the launch object is created.

bounds: launch_bounds_t#: The launch bounds. Update with set_dim().

max_blocks: int#: The maximum number of CUDA thread blocks to use.

block_dim: int#: The number of threads per block.

adjoint: bool#: Whether to run the adjoint kernel instead of the forward kernel.

set_dim(dim)[source]#

Set the launch dimensions.

Parameters:: dim (int | list[int] | tuple[int, ...]) – The dimensions of the launch.

set_param_at_index(index, value, adjoint=False)[source]#

Set a kernel parameter at an index.

Parameters:

index (int) – The index of the param to set.
value (Any) – The value to set the param to.
adjoint (bool)

set_param_at_index_from_ctype(index, value)[source]#

Set a kernel parameter at an index without any type conversion.

Parameters:

index (int) – The index of the param to set.
value (Structure | int | float) – The value to set the param to.

set_param_by_name(name, value, adjoint=False)[source]#

Set a kernel parameter by argument name.

Parameters:

name (str) – The name of the argument to set.
value (Any) – The value to set the argument to.
adjoint (bool) – If True, set the adjoint of this parameter instead of the forward parameter.

set_param_by_name_from_ctype(name, value)[source]#

Set a kernel parameter by argument name with no type conversions.

Parameters:

name (str) – The name of the argument to set.
value (Structure) – The value to set the argument to.

set_params(values)[source]#

Set all parameters.

Parameters:: values (Sequence[Any]) – A list of values to set the params to.

set_params_from_ctypes(values)[source]#

Set all parameters without performing type-conversions.

Parameters:: values (Sequence[Structure]) – A list of ctypes or basic int / float types.

launch(stream=None)[source]#

Launch the kernel.

Parameters:: stream (Stream | None) – The stream to launch on.
Return type:: None

Arrays#

Arrays are the fundamental memory abstraction in Warp. They can be created through the following global constructor:

wp.empty(shape=1024, dtype=wp.vec3, device="cpu")
wp.zeros(shape=1024, dtype=float, device="cuda")
wp.full(shape=1024, value=10, dtype=int, device="cuda")

Arrays can also be constructed directly from NumPy ndarrays as follows:

r = np.random.rand(1024)

# copy to Warp owned array
a = wp.array(r, dtype=float, device="cpu")

# return a Warp array wrapper around the NumPy data (zero-copy)
a = wp.array(r, dtype=float, copy=False, device="cpu")

# return a Warp copy of the array data on the GPU
a = wp.array(r, dtype=float, device="cuda")

Note that for multi-dimensional data, the dtype parameter must be specified explicitly, e.g.:

r = np.random.rand((1024, 3))

# initialize as an array of vec3 objects
a = wp.array(r, dtype=wp.vec3, device="cuda")

If the shapes are incompatible, an error will be raised.

Warp arrays can also be constructed from objects that define the __cuda_array_interface__ attribute. For example:

import cupy
import warp as wp

device = wp.get_cuda_device()

r = cupy.arange(10)

# return a Warp array wrapper around the cupy data (zero-copy)
a = wp.array(r, device=device)

Arrays can be moved between devices using array.to():

host_array = wp.array(a, dtype=float, device="cpu")

# allocate and copy to GPU
device_array = host_array.to("cuda")

Additionally, data can be copied between arrays in different memory spaces using wp.copy():

src_array = wp.array(a, dtype=float, device="cpu")
dest_array = wp.empty_like(host_array)

# copy from source CPU buffer to GPU
wp.copy(dest_array, src_array)

When indexing an array with an array of integers, the result is an indexed array:

import warp as wp

arr = wp.array((1, 2, 3, 4, 5, 6))
sub = arr[wp.array((0, 2, 4), dtype=wp.int32)] # advanced indexing -> wp.indexedarray

print(type(arr), arr.shape)
print(type(sub), sub.shape)
print(sub)

<class 'warp.types.array'> (6,)
<class 'warp.types.indexedarray'> (3,)
[1 3 5]

class warp.array(*args, **kwargs)[source]#

A fixed-size multi-dimensional array containing values of the same type.

dtype#

The data type of the array.

Type:: DType

ndim#

The number of array dimensions.

Type:: int

size#

The number of items in the array.

Type:: int

capacity#

The amount of memory in bytes allocated for this array.

Type:: int

shape#

Dimensions of the array.

Type:: tuple[int]

strides#

Number of bytes in each dimension between successive elements of the array.

Type:: tuple[int]

ptr#

Pointer to underlying memory allocation backing the array.

Type:: int

device#

The device where the array’s memory allocation resides.

Type:: Device

pinned#

Indicates whether the array was allocated in pinned host memory.

Type:: bool

is_contiguous#

Indicates whether this array has a contiguous memory layout.

Type:: bool

deleter#

A function to be called when the array is deleted, taking two arguments: pointer and size. If None, then no function is called.

Type:: Callable[[int, int], None]

__init__( data=None, dtype=Any, shape=None, strides=None, ptr=None, capacity=None, device=None, pinned=False, copy=True, deleter=None, ndim=None, grad=None, requires_grad=False, )[source]#

Constructs a new Warp array object

When the data argument is a valid list, tuple, or ndarray the array will be constructed from this object’s data. For objects that are not stored sequentially in memory (e.g.: a list), then the data will first be flattened before being transferred to the memory space given by device.

The second construction path occurs when the ptr argument is a non-zero uint64 value representing the start address in memory where existing array data resides, e.g.: from an external or C-library. The memory allocation should reside on the same device given by the device argument, and the user should set the length and dtype parameter appropriately.

If neither data nor ptr are specified, the shape argument is checked next. This construction path can be used to create new uninitialized arrays, but users are encouraged to call wp.empty(), wp.zeros(), or wp.full() instead to create new arrays.

If none of the above arguments are specified, a simple type annotation is constructed. This is used when annotating kernel arguments or struct members (e.g.,``arr: wp.array(dtype=float)``). In this case, only dtype and ndim are taken into account and no memory is allocated for the array.

Parameters:

data (list | tuple | ndarray[tuple[Any, ...], dtype[_ScalarT]] | None) – An object to construct the array from, can be a Tuple, List, or generally any type convertible to an np.array
dtype (Any) – One of the available data types, such as warp.float32, warp.mat33, or a custom struct. If dtype is Any and data is an ndarray, then it will be inferred from the array data type
shape (int | tuple[int, ...] | list[int] | None) – Dimensions of the array
strides (tuple[int, ...] | None) – Number of bytes in each dimension between successive elements of the array
ptr (int | None) – Address of an external memory address to alias (data should be None)
capacity (int | None) – Maximum size in bytes of the ptr allocation (data should be None)
device (Devicelike) – Device the array lives on
copy (bool) – Whether the incoming data will be copied or aliased. Aliasing requires that the incoming data already lives on the device specified and the data types match.
deleter (Callable[[int, int], None] | None) – Function to be called when the array is deleted, taking two arguments: pointer and size
requires_grad (bool) – Whether or not gradients will be tracked for this array, see warp.Tape for details
grad (array | None) – The array in which to accumulate gradients in the backward pass. If None and requires_grad is True, then a gradient array will be allocated automatically.
pinned (bool) – Whether to allocate pinned host memory, which allows asynchronous host–device transfers (only applicable with device="cpu")
ndim (int | None)

property grad[source]#

property requires_grad[source]#

mark_init()[source]#: Resets this array’s read flag

mark_read()[source]#: Marks this array as having been read from in a kernel or recorded function on the tape.

mark_write(**kwargs)[source]#: Detect if we are writing to an array that has already been read from

zero_()[source]#: Zeroes-out the array entries.

fill_(value)[source]#

Set all array entries to value

Parameters:: value – The value to set every array entry to. Must be convertible to the array’s dtype.
Raises:: ValueError – If value cannot be converted to the array’s dtype.

Examples

fill_() can take lists or other sequences when filling arrays of vectors or matrices.

>>> arr = wp.zeros(2, dtype=wp.mat22)
>>> arr.numpy()
array([[[0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.]]], dtype=float32)
>>> arr.fill_([[1, 2], [3, 4]])
>>> arr.numpy()
array([[[1., 2.],
        [3., 4.]],

       [[1., 2.],
        [3., 4.]]], dtype=float32)

assign(src)[source]#: Wraps src in an warp.array if it is not already one and copies the contents to self.

numpy()[source]#: Converts the array to a numpy.ndarray (aliasing memory through the array interface protocol) If the array is on the GPU, a synchronous device-to-host copy (on the CUDA default stream) will be automatically performed to ensure that any outstanding work is completed.

cptr()[source]#

Return a ctypes cast of the array address.

Notes:

Only CPU arrays support this method.
The array must be contiguous.
Accesses to this object are not bounds checked.
For float16 types, a pointer to the internal uint16 representation is returned.

list()[source]#: Returns a flattened list of items in the array as a Python list.

to(device, requires_grad=None)[source]#: Returns a Warp array with this array’s data moved to the specified device, no-op if already on device.

flatten()[source]#: Returns a zero-copy view of the array collapsed to 1-D. Only supported for contiguous arrays.

reshape(shape)[source]#

Returns a reshaped array. Only supported for contiguous arrays.

Parameters:: shape – An int or tuple of ints specifying the shape of the returned array.

view(dtype)[source]#: Returns a zero-copy view of this array’s memory with a different data type. dtype must have the same byte size of the array’s native dtype.

contiguous()[source]#: Returns a contiguous array with this array’s data. No-op if array is already contiguous.

transpose(axes=None)[source]#

Returns an zero-copy view of the array with axes transposed.

Note: The transpose operation will return an array with a non-contiguous access pattern.

Parameters:: axes (optional) – Specifies the how the axes are permuted. If not specified, the axes order will be reversed.

ipc_handle()[source]#

Return an IPC handle of the array as a 64-byte bytes object

from_ipc_handle() can be used with this handle in another process to obtain a array that shares the same underlying memory allocation.

IPC is currently only supported on Linux. Additionally, IPC is only supported for arrays allocated using the default memory allocator.

Event objects created with the interprocess=True argument may similarly be shared between processes to synchronize GPU work.

Example

Temporarily using the default memory allocator to allocate an array and get its IPC handle:

with wp.ScopedMempool("cuda:0", False):
    test_array = wp.full(1024, value=42.0, dtype=wp.float32, device="cuda:0")
    ipc_handle = test_array.ipc_handle()

Raises:

RuntimeError – The array is not associated with a CUDA device.
RuntimeError – The CUDA device does not appear to support IPC.
RuntimeError – The array was allocated using the mempool memory allocator.

Return type:

bytes

Multi-dimensional Arrays#

Multi-dimensional arrays up to four dimensions can be constructed by passing a tuple of sizes for each dimension.

The following constructs a 2D array of size 1024 x 16:

wp.zeros(shape=(1024, 16), dtype=float, device="cuda")

When passing multi-dimensional arrays to kernels users must specify the expected array dimension inside the kernel signature, e.g. to pass a 2D array to a kernel the number of dims is specified using the ndim=2 parameter:

@wp.kernel
def test(input: wp.array(dtype=float, ndim=2)):

Type-hint helpers are provided for common array sizes, e.g.: array2d(), array3d(), which are equivalent to calling array(..., ndim=2)`, etc. To index a multi-dimensional array, use the following kernel syntax:

# returns a float from the 2d array
value = input[i,j]

To create an array slice, use the following syntax, where the number of indices is less than the array dimensions:

# returns an 1d array slice representing a row of the 2d array
row = input[i]

Slice operators can be concatenated, e.g.: s = array[i][j][k]. Slices can be passed to wp.func user functions provided the function also declares the expected array dimension. Currently, only single-index slicing is supported.

The following construction methods are provided for allocating zero-initialized and empty (non-initialized) arrays:

warp.zeros(

shape=None,

dtype=float,

device=None,

requires_grad=False,

pinned=False,

**kwargs,

)[source]#

Return a zero-initialized array

Parameters:

shape (int | tuple[int, ...] | list[int] | None) – Array dimensions
dtype (type) – Type of each element, e.g.: warp.vec3, warp.mat33, etc
device (Device | str | None) – Device that array will live on
requires_grad (bool) – Whether the array will be tracked for back propagation
pinned (bool) – Whether the array uses pinned host memory (only applicable to CPU arrays)

Returns:

A warp.array object representing the allocation

Return type:

array

warp.zeros_like(src, device=None, requires_grad=None, pinned=None)[source]#

Return a zero-initialized array with the same type and dimension of another array

Parameters:

src (Array) – The template array to use for shape, data type, and device
device (Device | str | None) – The device where the new array will be created (defaults to src.device)
requires_grad (bool | None) – Whether the array will be tracked for back propagation
pinned (bool | None) – Whether the array uses pinned host memory (only applicable to CPU arrays)

Returns:

A warp.array object representing the allocation

Return type:

array

warp.ones(

shape=None,

dtype=float,

device=None,

requires_grad=False,

pinned=False,

**kwargs,

)[source]#

Return a one-initialized array

Parameters:

shape (int | tuple[int, ...] | list[int] | None) – Array dimensions
dtype (type) – Type of each element, e.g.: warp.vec3, warp.mat33, etc
device (Device | str | None) – Device that array will live on
requires_grad (bool) – Whether the array will be tracked for back propagation
pinned (bool) – Whether the array uses pinned host memory (only applicable to CPU arrays)

Returns:

A warp.array object representing the allocation

Return type:

array

warp.ones_like(src, device=None, requires_grad=None, pinned=None)[source]#

Return a one-initialized array with the same type and dimension of another array

Parameters:

src (Array) – The template array to use for shape, data type, and device
device (Device | str | None) – The device where the new array will be created (defaults to src.device)
requires_grad (bool | None) – Whether the array will be tracked for back propagation
pinned (bool | None) – Whether the array uses pinned host memory (only applicable to CPU arrays)

Returns:

A warp.array object representing the allocation

Return type:

array

warp.full(

shape=None,

value=0,

dtype=Any,

device=None,

requires_grad=False,

pinned=False,

**kwargs,

)[source]#

Return an array with all elements initialized to the given value

Parameters:

shape (int | tuple[int, ...] | list[int] | None) – Array dimensions
value – Element value
dtype – Type of each element, e.g.: float, warp.vec3, warp.mat33, etc
device (Device | str | None) – Device that array will live on
requires_grad (bool) – Whether the array will be tracked for back propagation
pinned (bool) – Whether the array uses pinned host memory (only applicable to CPU arrays)

Returns:

A warp.array object representing the allocation

Return type:

array

warp.full_like(src, value, device=None, requires_grad=None, pinned=None)[source]#

Return an array with all elements initialized to the given value with the same type and dimension of another array

Parameters:

src (Array) – The template array to use for shape, data type, and device
value (Any) – Element value
device (Device | str | None) – The device where the new array will be created (defaults to src.device)
requires_grad (bool | None) – Whether the array will be tracked for back propagation
pinned (bool | None) – Whether the array uses pinned host memory (only applicable to CPU arrays)

Returns:

A warp.array object representing the allocation

Return type:

array

warp.empty(

shape=None,

dtype=float,

device=None,

requires_grad=False,

pinned=False,

**kwargs,

)[source]#

Returns an uninitialized array

Parameters:

shape (int | tuple[int, ...] | list[int] | None) – Array dimensions
dtype – Type of each element, e.g.: warp.vec3, warp.mat33, etc
device (Device | str | None) – Device that array will live on
requires_grad (bool) – Whether the array will be tracked for back propagation
pinned (bool) – Whether the array uses pinned host memory (only applicable to CPU arrays)

Returns:

A warp.array object representing the allocation

Return type:

array

warp.empty_like(src, device=None, requires_grad=None, pinned=None)[source]#

Return an uninitialized array with the same type and dimension of another array

Parameters:

src (Array) – The template array to use for shape, data type, and device
device (Device | str | None) – The device where the new array will be created (defaults to src.device)
requires_grad (bool | None) – Whether the array will be tracked for back propagation
pinned (bool | None) – Whether the array uses pinned host memory (only applicable to CPU arrays)

Returns:

A warp.array object representing the allocation

Return type:

array

warp.copy(dest, src, dest_offset=0, src_offset=0, count=0, stream=None)[source]#

Copy array contents from src to dest.

Parameters:

dest (array) – Destination array, must be at least as large as source buffer
src (array) – Source array
dest_offset (int) – Element offset in the destination array
src_offset (int) – Element offset in the source array
count (int) – Number of array elements to copy (will copy all elements if set to 0)
stream (Stream | None) – The stream on which to perform the copy

The stream, if specified, can be from any device. If the stream is omitted, then Warp selects a stream based on the following rules: (1) If the destination array is on a CUDA device, use the current stream on the destination device. (2) Otherwise, if the source array is on a CUDA device, use the current stream on the source device.

If neither source nor destination are on a CUDA device, no stream is used for the copy.

warp.clone(src, device=None, requires_grad=None, pinned=None)[source]#

Clone an existing array, allocates a copy of the src memory

Parameters:

src (array) – The source array to copy
device (Device | str | None) – The device where the new array will be created (defaults to src.device)
requires_grad (bool | None) – Whether the array will be tracked for back propagation
pinned (bool | None) – Whether the array uses pinned host memory (only applicable to CPU arrays)

Returns:

A warp.array object representing the allocation

Return type:

array

Indexed Arrays#

An indexed array is a lightweight view into an existing warp.array instance that references elements through an explicit integer index list, thus allowing to run kernels on an arbitrary subset of data without any copy.

class warp.indexedarray(data=None, indices=None, dtype=None, ndim=None)[source]#

Parameters:

data (array | None)
indices (array | list[array] | None)
dtype (type)
ndim (int | None)

__init__(data=None, indices=None, dtype=None, ndim=None)[source]#

Parameters:

data (array | None)
indices (array | list[array] | None)
ndim (int | None)

Creating an Indexed Array#

Pass the data array together with a list of wp.int32 index arrays, one for each dimension:

import warp as wp

# Base data.
arr = wp.array((1.23, 2.34, 3.45, 4.56, 5.67, 6.78), device="cuda")

# Only view elements at odd indices.
idx = wp.array((1, 3, 5), dtype=wp.int32, device="cuda")
sub = wp.indexedarray(arr, [idx])  # Same as wp.indexedarray1d(...)
print(sub)

[2.34 4.56 6.78]

Additionally, None can be passed to select all elements for any given dimension.

import numpy as np
import warp as wp

mat = wp.array(np.arange(25, dtype=np.float32).reshape((5, 5)))
rows = wp.array((1, 3), dtype=wp.int32)

block = wp.indexedarray2d(mat, (rows, None))  # shape == (2, 5)
print(block)

[[ 5.  6.  7.  8.  9.]
 [15. 16. 17. 18. 19.]]

The resulting view keeps the dtype of the source and has a shape given by the lengths of the supplied index arrays.

Alternative constructors are available for convenience:

warp.indexedarray1d(*args, **kwargs)[source]#

warp.indexedarray2d(*args, **kwargs)[source]#

warp.indexedarray3d(*args, **kwargs)[source]#

warp.indexedarray4d(*args, **kwargs)[source]#

Interoperability With Other Frameworks#

Frameworks such as PyTorch or JAX do not have a concept equivalent to Warp’s indexed arrays. Converting an wp.indexedarray directly therefore raises an exception. Two common workarounds are:

Make a contiguous copy and share that:

import warp as wp

arr = wp.array((1.0, 2.0, 3.0, 4.0), device="cuda")
idx = wp.array((0, 3), dtype=int, device="cuda")
sub = wp.indexedarray1d(arr, idx)
t = wp.to_torch(sub.contiguous())

Share the underlying data and index buffers independently (zero-copy):

import warp as wp

arr = wp.array((1.0, 2.0, 3.0, 4.0), device="cuda")
idx = wp.array((0, 3), dtype=int, device="cuda")
sub = wp.indexedarray1d(arr, idx)
t_data = wp.to_torch(sub.data)
t_ind = wp.to_torch(sub.indices[0])

PyTorch can index with integer tensors, but doing so always copies the data.

Structured Arrays#

Structured arrays in Warp allow you to work with arrays of user-defined structs, enabling efficient, named access to heterogeneous data fields across the CPU and GPU.

Creating and Viewing Struct Arrays#

When you define a Warp struct, you can allocate a Warp array of that type on the CPU and convert it to a NumPy structured array view (zero-copy):

import warp as wp
import numpy as np

@wp.struct
class Foo:
    i: int
    f: float

# allocate a Warp array on the CPU
a = wp.zeros(5, dtype=Foo, device="cpu")

# view it in NumPy without copying
na = a.numpy()

# modify via NumPy
na["i"][0] = 42
na["f"][2] = 13.37

print(a)

[(42,  0.  ) ( 0,  0.  ) ( 0, 13.37) ( 0,  0.  ) ( 0,  0.  )]

Initializing via NumPy and Converting to a Warp Array#

You can also create a NumPy structured array first, then convert it to a Warp array, which works well for batch initialization:

import warp as wp
import numpy as np
import math

rng = np.random.default_rng()

@wp.struct
class Boid:
    vel: wp.vec3f
    wander_angles: wp.vec2f
    mass: float
    group: int

num_boids = 3
npboids = np.zeros(num_boids, dtype=Boid.numpy_dtype())

angles = math.pi - 2 * math.pi * rng.random(num_boids)
npboids["vel"][:, 0] = 20 * np.sin(angles)
npboids["vel"][:, 2] = 20 * np.cos(angles)

npboids["wander_angles"][:, 0] = math.pi * rng.random(num_boids)
npboids["wander_angles"][:, 1] = 2 * math.pi * rng.random(num_boids)

npboids["mass"][:] = 0.5 + 0.5 * rng.random(num_boids)

# create Warp array from prepared NumPy array
boids = wp.array(npboids, dtype=Boid)

This approach leverages NumPy’s vectorized operations to initialize all array elements efficiently, avoiding Python loops.

Nested Structs and Vector Types#

Structured arrays fully support nested structs and Warp vector (and matrix) types:

import warp as wp
import numpy as np

@wp.struct
class Bar:
    x: wp.vec3

@wp.struct
class Foo:
    i: int
    f: float
    bar: Bar

na = np.zeros(5, dtype=Foo.numpy_dtype())

na["i"][0] = 42
na["f"][2] = 13.37
na["bar"]["x"][4] = wp.vec3(1.0)

a = wp.array(na, dtype=Foo, device="cuda:0")

print(a.numpy())

[(42,  0.  , ([0., 0., 0.],)) ( 0,  0.  , ([0., 0., 0.],))
 ( 0, 13.37, ([0., 0., 0.],)) ( 0,  0.  , ([0., 0., 0.],))
 ( 0,  0.  , ([1., 1., 1.],))]

Data Types#

Scalar Types#

The following scalar storage types are supported for array structures:

bool	boolean
int8	signed byte
uint8	unsigned byte
int16	signed short
uint16	unsigned short
int32	signed integer
uint32	unsigned integer
int64	signed long integer
uint64	unsigned long integer
float16	half-precision float
float32	single-precision float
float64	double-precision float

Warp supports float and int as aliases for wp.float32 and wp.int32 respectively.

Vectors#

Warp provides built-in math and geometry types for common simulation and graphics problems. A full reference for operators and functions for these types is available in the Built-Ins Reference.

Warp supports vectors of numbers with an arbitrary length/numeric type. The built-in concrete types are as follows:

vec2 vec3 vec4	2D, 3D, 4D vector of single-precision floats
vec2b vec3b vec4b	2D, 3D, 4D vector of signed bytes
vec2ub vec3ub vec4ub	2D, 3D, 4D vector of unsigned bytes
vec2s vec3s vec4s	2D, 3D, 4D vector of signed shorts
vec2us vec3us vec4us	2D, 3D, 4D vector of unsigned shorts
vec2i vec3i vec4i	2D, 3D, 4D vector of signed integers
vec2ui vec3ui vec4ui	2D, 3D, 4D vector of unsigned integers
vec2l vec3l vec4l	2D, 3D, 4D vector of signed long integers
vec2ul vec3ul vec4ul	2D, 3D, 4D vector of unsigned long integers
vec2h vec3h vec4h	2D, 3D, 4D vector of half-precision floats
vec2f vec3f vec4f	2D, 3D, 4D vector of single-precision floats
vec2d vec3d vec4d	2D, 3D, 4D vector of double-precision floats
spatial_vector	6D vector of single-precision floats
spatial_vectorf	6D vector of single-precision floats
spatial_vectord	6D vector of double-precision floats
spatial_vectorh	6D vector of half-precision floats

Vectors support most standard linear algebra operations, e.g.:

@wp.kernel
def compute( ... ):

    # basis vectors
    a = wp.vec3(1.0, 0.0, 0.0)
    b = wp.vec3(0.0, 1.0, 0.0)

    # take the cross product
    c = wp.cross(a, b)

    # compute
    r = wp.dot(c, c)

    ...

It’s possible to declare additional vector types with different lengths and data types. This is done in outside of kernels in Python scope using warp.types.vector(), for example:

# declare a new vector type for holding 5 double precision floats:
vec5d = wp.types.vector(length=5, dtype=wp.float64)

Once declared, the new type can be used when allocating arrays or inside kernels:

# create an array of vec5d
arr = wp.zeros(10, dtype=vec5d)

# use inside a kernel
@wp.kernel
def compute( ... ):

    # zero initialize a custom named vector type
    v = vec5d()
    ...

    # component-wise initialize a named vector type
    v = vec5d(wp.float64(1.0),
              wp.float64(2.0),
              wp.float64(3.0),
              wp.float64(4.0),
              wp.float64(5.0))
  ...

In addition, it’s possible to directly create anonymously typed instances of these vectors without declaring their type in advance. In this case the type will be inferred by the constructor arguments. For example:

@wp.kernel
def compute( ... ):

    # zero initialize vector of 5 doubles:
    v = wp.vector(dtype=wp.float64, length=5)

    # scalar initialize a vector of 5 doubles to the same value:
    v = wp.vector(wp.float64(1.0), length=5)

    # component-wise initialize a vector of 5 doubles
    v = wp.vector(wp.float64(1.0),
                  wp.float64(2.0),
                  wp.float64(3.0),
                  wp.float64(4.0),
                  wp.float64(5.0))

These can be used with all the standard vector arithmetic operators, e.g.: +, -, scalar multiplication, and can also be transformed using matrices with compatible dimensions, potentially returning vectors with a different length.

Matrices#

Matrices with arbitrary shapes/numeric types are also supported. The built-in concrete matrix types are as follows:

mat22 mat33 mat44	2x2, 3x3, 4x4 matrix of single-precision floats
mat22f mat33f mat44f	2x2, 3x3, 4x4 matrix of single-precision floats
mat22d mat33d mat44d	2x2, 3x3, 4x4 matrix of double-precision floats
mat22h mat33h mat44h	2x2, 3x3, 4x4 matrix of half-precision floats
spatial_matrix	6x6 matrix of single-precision floats
spatial_matrixf	6x6 matrix of single-precision floats
spatial_matrixd	6x6 matrix of double-precision floats
spatial_matrixh	6x6 matrix of half-precision floats

Matrices are stored in row-major format and support most standard linear algebra operations:

@wp.kernel
def compute( ... ):

    # initialize matrix
    m = wp.mat22(1.0, 2.0,
                 3.0, 4.0)

    # compute inverse
    minv = wp.inverse(m)

    # transform vector
    v = minv * wp.vec2(0.5, 0.3)

    ...

In a similar manner to vectors, it’s possible to declare new matrix types with arbitrary shapes and data types using wp.types.matrix(), for example:

# declare a new 3x2 half precision float matrix type:
mat32h = wp.types.matrix(shape=(3,2), dtype=wp.float64)

# create an array of this type
a = wp.zeros(10, dtype=mat32h)

These can be used inside a kernel:

@wp.kernel
def compute( ... ):
    ...

    # initialize a mat32h matrix
    m = mat32h(wp.float16(1.0), wp.float16(2.0),
               wp.float16(3.0), wp.float16(4.0),
               wp.float16(5.0), wp.float16(6.0))

    # declare a 2 component half precision vector
    v2 = wp.vec2h(wp.float16(1.0), wp.float16(1.0))

    # multiply by the matrix, returning a 3 component vector:
    v3 = m * v2
    ...

It’s also possible to directly create anonymously typed instances inside kernels where the type is inferred from constructor arguments as follows:

@wp.kernel
def compute( ... ):
    ...

    # create a 3x2 half precision matrix from components (row major ordering):
    m = wp.matrix(
        wp.float16(1.0), wp.float16(2.0),
        wp.float16(1.0), wp.float16(2.0),
        wp.float16(1.0), wp.float16(2.0),
        shape=(3,2))

    # zero initialize a 3x2 half precision matrix:
    m = wp.matrix(wp.float16(0.0),shape=(3,2))

    # create a 5x5 double precision identity matrix:
    m = wp.identity(n=5, dtype=wp.float64)

As with vectors, you can do standard matrix arithmetic with these variables, along with multiplying matrices with compatible shapes and potentially returning a matrix with a new shape.

Quaternions#

Warp supports quaternions with the layout i, j, k, w where w is the real part. Here are the built-in concrete quaternion types:

quat	Single-precision floating point quaternion
quatf	Single-precision floating point quaternion
quatd	Double-precision floating point quaternion
quath	Half-precision floating point quaternion

Quaternions can be used to transform vectors as follows:

@wp.kernel
def compute( ... ):
    ...

    # construct a 30 degree rotation around the x-axis
    q = wp.quat_from_axis_angle(wp.vec3(1.0, 0.0, 0.0), wp.degrees(30.0))

    # rotate an axis by this quaternion
    v = wp.quat_rotate(q, wp.vec3(0.0, 1.0, 0.0))

As with vectors and matrices, you can declare quaternion types with an arbitrary numeric type like so:

quatd = wp.types.quaternion(dtype=wp.float64)

You can also create identity quaternion and anonymously typed instances inside a kernel like so:

@wp.kernel
def compute( ... ):
    ...

    # create a double precision identity quaternion:
    qd = wp.quat_identity(dtype=wp.float64)

    # precision defaults to wp.float32 so this creates a single precision identity quaternion:
    qf = wp.quat_identity()

    # create a half precision quaternion from components, or a vector/scalar:
    qh = wp.quaternion(wp.float16(0.0),
                       wp.float16(0.0),
                       wp.float16(0.0),
                       wp.float16(1.0))


    qh = wp.quaternion(
        wp.vector(wp.float16(0.0),wp.float16(0.0),wp.float16(0.0)),
        wp.float16(1.0))

Transforms#

Transforms are 7D vectors of floats representing a spatial rigid body transformation in format (p, q) where p is a 3D vector, and q is a quaternion.

transform	Single-precision floating point transform
transformf	Single-precision floating point transform
transformd	Double-precision floating point transform
transformh	Half-precision floating point transform

Transforms can be constructed inside kernels from translation and rotation parts:

@wp.kernel
def compute( ... ):
    ...

    # create a transform from a vector/quaternion:
    t = wp.transform(
            wp.vec3(1.0, 2.0, 3.0),
            wp.quat_from_axis_angle(wp.vec3(0.0, 1.0, 0.0), wp.degrees(30.0)))

    # transform a point
    p = wp.transform_point(t, wp.vec3(10.0, 0.5, 1.0))

    # transform a vector (ignore translation)
    p = wp.transform_vector(t, wp.vec3(10.0, 0.5, 1.0))

As with vectors and matrices, you can declare transform types with an arbitrary numeric type using wp.types.transformation(), for example:

transformd = wp.types.transformation(dtype=wp.float64)

You can also create identity transforms and anonymously typed instances inside a kernel like so:

@wp.kernel
def compute( ... ):

    # create double precision identity transform:
    qd = wp.transform_identity(dtype=wp.float64)

Structs#

Users can define custom structure types using the @wp.struct decorator as follows:

@wp.struct
class MyStruct:

    param1: int
    param2: float
    param3: wp.array(dtype=wp.vec3)

Struct attributes must be annotated with their respective type. They can be constructed in Python scope and then passed to kernels as arguments:

@wp.kernel
def compute(args: MyStruct):

    tid = wp.tid()

    print(args.param1)
    print(args.param2)
    print(args.param3[tid])

# construct an instance of the struct in Python
s = MyStruct()
s.param1 = 10
s.param2 = 2.5
s.param3 = wp.zeros(shape=10, dtype=wp.vec3)

# pass to our compute kernel
wp.launch(compute, dim=10, inputs=[s])

An array of structs can be zero-initialized as follows:

a = wp.zeros(shape=10, dtype=MyStruct)

An array of structs can also be initialized from a list of struct objects:

a = wp.array([MyStruct(), MyStruct(), MyStruct()], dtype=MyStruct)

Example: Using a struct in gradient computation#

import numpy as np

import warp as wp


@wp.struct
class TestStruct:
    x: wp.vec3
    a: wp.array(dtype=wp.vec3)
    b: wp.array(dtype=wp.vec3)


@wp.kernel
def test_kernel(s: TestStruct):
    tid = wp.tid()

    s.b[tid] = s.a[tid] + s.x


@wp.kernel
def loss_kernel(s: TestStruct, loss: wp.array(dtype=float)):
    tid = wp.tid()

    v = s.b[tid]
    wp.atomic_add(loss, 0, float(tid + 1) * (v[0] + 2.0 * v[1] + 3.0 * v[2]))


# create struct
ts = TestStruct()

# set members
ts.x = wp.vec3(1.0, 2.0, 3.0)
ts.a = wp.array(np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]), dtype=wp.vec3, requires_grad=True)
ts.b = wp.zeros(2, dtype=wp.vec3, requires_grad=True)

loss = wp.zeros(1, dtype=float, requires_grad=True)

tape = wp.Tape()
with tape:
    wp.launch(test_kernel, dim=2, inputs=[ts])
    wp.launch(loss_kernel, dim=2, inputs=[ts, loss])

tape.backward(loss)

print(loss)
print(ts.a)

[120.]
[[1. 2. 3.]
 [4. 5. 6.]]

Example: Defining Operator Overloads#

@wp.struct
class Complex:
    real: float
    imag: float

@wp.func
def add(
    a: Complex,
    b: Complex,
) -> Complex:
    return Complex(a.real + b.real, a.imag + b.imag)

@wp.func
def mul(
    a: Complex,
    b: Complex,
) -> Complex:
    return Complex(
        a.real * b.real - a.imag * b.imag,
        a.real * b.imag + a.imag * b.real,
    )

@wp.kernel
def kernel():
    a = Complex(1.0, 2.0)
    b = Complex(3.0, 4.0)

    c = a + b
    wp.printf("%.0f %+.0fi\n", c.real, c.imag)

    d = a * b
    wp.printf("%.0f %+.0fi\n", d.real, d.imag)

wp.launch(kernel, dim=(1,))
wp.synchronize()

Type Conversions#

Warp is particularly strict regarding type conversions and does not perform any implicit conversion between numeric types. The user is responsible for ensuring types for most arithmetic operators match, e.g.: x = float(0.0) + int(4) will result in an error. This can be surprising for users that are accustomed to C-style conversions but avoids a class of common bugs that result from implicit conversions.

Users should explicitly cast variables to compatible types using constructors like int(), float(), wp.float16(), wp.uint8(), etc.

Note

For performance reasons, Warp relies on native compilers to perform numeric conversions (e.g., LLVM for CPU and NVRTC for CUDA). This is generally not a problem, but in some cases the results may vary on different devices. For example, the conversion wp.uint8(-1.0) results in undefined behavior, since the floating point value -1.0 is out of range for unsigned integer types.

C++ compilers are free to handle such cases as they see fit. Numeric conversions are only guaranteed to produce correct results when the value being converted is in the range supported by the target data type.

Constants#

A Warp kernel can access Python variables defined outside of the kernel, which are treated as compile-time constants inside of the kernel.

TYPE_SPHERE = wp.constant(0)
TYPE_CUBE = wp.constant(1)
TYPE_CAPSULE = wp.constant(2)

@wp.kernel
def collide(geometry: wp.array(dtype=int)):

    t = geometry[wp.tid()]

    if t == TYPE_SPHERE:
        print("sphere")
    elif t == TYPE_CUBE:
        print("cube")
    elif t == TYPE_CAPSULE:
        print("capsule")

Note that using wp.constant() is no longer required, but it performs some type checking and can serve as a reminder that the variables are meant to be used as Warp constants.

The behavior is simple and intuitive when the referenced Python variables never change. For details and more complex scenarios, refer to External References and Constants. The Code Generation section contains additional information and tips for advanced usage.

Predefined Constants#

For convenience, Warp has a number of predefined mathematical constants that may be used both inside and outside Warp kernels. The constants in the following table also have lowercase versions defined, e.g. wp.E and wp.e are equivalent.

Name	Value
wp.E	2.71828182845904523536
wp.LOG2E	1.44269504088896340736
wp.LOG10E	0.43429448190325182765
wp.LN2	0.69314718055994530942
wp.LN10	2.30258509299404568402
wp.PHI	1.61803398874989484820
wp.PI	3.14159265358979323846
wp.HALF_PI	1.57079632679489661923
wp.TAU	6.28318530717958647692
wp.INF	math.inf
wp.NAN	float(‘nan’)

The wp.NAN constant may only be used with floating-point types. Comparisons involving wp.NAN follow the IEEE 754 standard, e.g. wp.float32(wp.NAN) == wp.float32(wp.NAN) returns False. The wp.isnan() built-in function can be used to determine whether a value is a NaN (or if a vector, matrix, or quaternion contains a NaN entry).

The following example shows how positive and negative infinity can be used with floating-point types in Warp using the wp.inf constant:

@wp.kernel
def test_infinity(outputs: wp.array(dtype=wp.float32)):
    outputs[0] = wp.float32(wp.inf)        # inf
    outputs[1] = wp.float32(-wp.inf)       # -inf
    outputs[2] = wp.float32(2.0 * wp.inf)  # inf
    outputs[3] = wp.float32(-2.0 * wp.inf) # -inf
    outputs[4] = wp.float32(2.0 / 0.0)     # inf
    outputs[5] = wp.float32(-2.0 / 0.0)    # -inf

Operators#

Boolean Operators#

a and b	True if a and b are True
a or b	True if a or b is True
not a	True if a is False, otherwise False

Note

Expressions such as if (a and b): currently do not perform short-circuit evaluation. In this case b will also be evaluated even when a is False. Users should take care to ensure that secondary conditions are safe to evaluate (e.g.: do not index out of bounds) in all cases.

Comparison Operators#

a > b	True if a strictly greater than b
a < b	True if a strictly less than b
a >= b	True if a greater than or equal to b
a <= b	True if a less than or equal to b
a == b	True if a equals b
a != b	True if a not equal to b

Arithmetic Operators#

a + b	Addition
a - b	Subtraction
a * b	Multiplication
a @ b	Matrix multiplication
a / b	Floating point division
a // b	Floored division
a ** b	Exponentiation
a % b	Modulus

Since Warp does not perform implicit type conversions, operands should have compatible data types. Users should use type constructors such as float(), int(), wp.int64(), etc. to cast variables to the correct type.

The multiplication expression a * b can also be used to perform matrix multiplication between matrix types.

Mapping Functions#

The wp.map() function can be used to apply a function to each element of an array.

warp.utils.map( func, *inputs, out=None, return_kernel=False, block_dim=256, device=None, )[source]#

Map a function over the elements of one or more arrays.

You can use a Warp function, a regular Python function, or a lambda expression to map it to a set of arrays.

a = wp.array([1, 2, 3], dtype=wp.float32)
b = wp.array([4, 5, 6], dtype=wp.float32)
c = wp.array([7, 8, 9], dtype=wp.float32)
result = wp.map(lambda x, y, z: x + 2.0 * y - z, a, b, c)
print(result)

[2. 4. 6.]

Clamp values in an array in place:

xs = wp.array([-1.0, 0.0, 1.0], dtype=wp.float32)
wp.map(wp.clamp, xs, -0.5, 0.5, out=xs)
print(xs)

[-0.5  0.   0.5]

Note that only one of the inputs must be a Warp array. For example, it is possible vectorize the function warp.transform_point() over a collection of points with a given input transform as follows:

tf = wp.transform((1.0, 2.0, 3.0), wp.quat_rpy(0.2, -0.6, 0.1))
points = wp.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], dtype=wp.vec3)
transformed = wp.map(wp.transform_point, tf, points)

Besides regular Warp arrays, other array types, such as the indexedarray, are supported as well:

arr = wp.array(data=np.arange(10, dtype=np.float32))
indices = wp.array([1, 3, 5, 7, 9], dtype=int)
iarr = wp.indexedarray1d(arr, [indices])
out = wp.map(lambda x: x * 10.0, iarr)
print(out)

[10. 30. 50. 70. 90.]

If multiple arrays are provided, the NumPy broadcasting rules are applied to determine the shape of the output array. Two shapes are compatible when: starting from the trailing dimension,

the two dimensions are equal, or
one of the dimensions is 1.

For example, given arrays of shapes (3, 1, 4) and (5, 4), the broadcasted shape is (3, 5, 4).

If no array(s) are provided to the out argument, the output array(s) are created automatically. The data type(s) of the output array(s) are determined by the type of the return value(s) of the function. The requires_grad flag for an automatically created output array is set to True if any of the input arrays have it set to True and the respective output array’s dtype is a type that supports differentiation.

Parameters:

func (Callable | Function) – The function to map over the arrays.
*inputs (array | Any) – The input arrays or values to pass to the function.
out (array | list[array] | None) – Optional output array(s) to store the result(s). If None, the output array(s) will be created automatically.
return_kernel (bool) – If True, only return the generated kernel without performing the mapping operation.
block_dim (int) – The block dimension for the kernel launch.
device (Devicelike) – The device on which to run the kernel.

Returns:

The resulting array(s) of the mapping. If return_kernel is True, only returns the kernel used for mapping.

Return type:

array | list[array] | Kernel

Streams#

A CUDA stream is a sequence of operations that execute in order on the GPU. Operations from different streams may run concurrently and may be interleaved by the device scheduler. See the Streams documentation for more information on using streams.

class warp.Stream(*args, **kwargs)[source]#

__init__(device=None, priority=0, **kwargs)[source]#

Initialize the stream on a device with an optional specified priority.

Parameters:

device (Device | str | None) – The CUDA device on which this stream will be created.
priority (int) – An optional integer specifying the requested stream priority. Can be -1 (high priority) or 0 (low/default priority). Values outside this range will be clamped.
cuda_stream (int) – A optional external stream handle passed as an integer. The caller is responsible for ensuring that the external stream does not get destroyed while it is referenced by this object.

Raises:

RuntimeError – If function is called before Warp has completed initialization with a device that is not an instance of Device.
RuntimeError – device is not a CUDA Device.
RuntimeError – The stream could not be created on the device.
TypeError – The requested stream priority is not an integer.

record_event(event=None)[source]#

Record an event onto the stream.

Parameters:: event (Event | None) – A warp.Event instance to be recorded onto the stream. If not provided, an Event on the same device will be created.
Raises:: RuntimeError – The provided Event is from a different device than the recording stream.
Return type:: Event

wait_event(event)[source]#

Makes all future work in this stream wait until event has completed.

This function does not block the host thread.

Parameters:: event (Event)

wait_stream(other_stream, event=None)[source]#

Records an event on other_stream and makes this stream wait on it.

All work added to this stream after this function has been called will delay their execution until all preceding commands in other_stream have completed.

This function does not block the host thread.

Parameters:

other_stream (Stream) – The stream on which the calling stream will wait for previously issued commands to complete before executing subsequent commands.
event (Event | None) – An optional Event instance that will be used to record an event onto other_stream. If None, an internally managed Event instance will be used.

property is_complete: bool[source]#

A boolean indicating whether all work on the stream has completed.

This property may not be accessed during a graph capture on any stream.

property is_capturing: bool[source]#: A boolean indicating whether a graph capture is currently ongoing on this stream.

property priority: int[source]#: An integer representing the priority of the stream.

warp.get_stream(device=None)[source]#

Return the stream currently used by the given device.

Parameters:: device (Device | str | None) – An optional Device instance or device alias (e.g. “cuda:0”) for which the current stream will be returned. If None, the default device will be used.
Raises:: RuntimeError – The device is not a CUDA device.
Return type:: Stream

warp.set_stream(stream, device=None, sync=False)[source]#

Convenience function for calling Device.set_stream() on the given device.

Parameters:

device (Device | str | None) – An optional Device instance or device alias (e.g. “cuda:0”) for which the current stream is to be replaced with stream. If None, the default device will be used.
stream (Stream) – The stream to set as this device’s current stream.
sync (bool) – If True, then stream will perform a device-side synchronization with the device’s previous current stream.

Return type:

None

warp.wait_stream(other_stream, event=None)[source]#

Convenience function for calling Stream.wait_stream() on the current stream.

Parameters:

other_stream (Stream) – The stream on which the calling stream will wait for previously issued commands to complete before executing subsequent commands.
event (Event | None) – An optional Event instance that will be used to record an event onto other_stream. If None, an internally managed Event instance will be used.

warp.synchronize_stream(stream_or_device=None)[source]#

Synchronize the calling CPU thread with any outstanding CUDA work on the specified stream.

This function allows the host application code to ensure that all kernel launches and memory copies have completed on the stream.

Parameters:: stream_or_device (Stream | Device | str | None) – wp.Stream or a device. If the argument is a device, synchronize the device’s current stream.

class warp.ScopedStream(stream, sync_enter=True, sync_exit=False)[source]#

A context manager to temporarily change the current stream on a device.

Parameters:

stream (wp.Stream | None)
sync_enter (bool)
sync_exit (bool)

stream#

The stream that will temporarily become the device’s default stream within the context.

Type:: Stream or None

saved_stream#

The device’s previous current stream. This is restored as the device’s current stream on exiting the context.

Type:: Stream

sync_enter#

Whether to synchronize this context’s stream with the device’s previous current stream on entering the context.

Type:: bool

sync_exit#

Whether to synchronize the device’s previous current with this context’s stream on exiting the context.

Type:: bool

device#

The device associated with the stream.

Type:: Device

__init__(stream, sync_enter=True, sync_exit=False)[source]#

Initializes the context manager with a stream and synchronization options.

Parameters:

stream (Stream | None) – The stream that will temporarily become the device’s default stream within the context.
sync_enter (bool) – Whether to synchronize this context’s stream with the device’s previous current stream on entering the context.
sync_exit (bool) – Whether to synchronize the device’s previous current with this context’s stream on exiting the context.

Events#

Events can be inserted into streams and used to synchronize a stream with a different one. See the Events documentation for information on how to use events for cross-stream synchronization or the CUDA Events Timing documentation for information on how to use events for measuring GPU performance.

class warp.Event(*args, **kwargs)[source]#

A CUDA event that can be recorded onto a stream.

Events can be used for device-side synchronization, which do not block the host thread.

__init__( device=None, cuda_event=None, enable_timing=False, interprocess=False, )[source]#

Initializes the event on a CUDA device.

Parameters:

device (Device | str | None) – The CUDA device whose streams this event may be recorded onto. If None, then the current default device will be used.
cuda_event – A pointer to a previously allocated CUDA event. If None, then a new event will be allocated on the associated device.
enable_timing (bool) – If True this event will record timing data. get_event_elapsed_time() can be used to measure the time between two events created with enable_timing=True and recorded onto streams.
interprocess (bool) – If True this event may be used as an interprocess event.

Raises:

RuntimeError – The event could not be created.
ValueError – The combination of enable_timing=True and interprocess=True is not allowed.

ipc_handle()[source]#

Return a CUDA IPC handle of the event as a 64-byte bytes object.

The event must have been created with interprocess=True in order to obtain a valid interprocess handle.

IPC is currently only supported on Linux.

Example

Create an event and get its IPC handle:

e1 = wp.Event(interprocess=True)
event_handle = e1.ipc_handle()

Raises:: RuntimeError – Device does not support IPC.
Return type:: bytes

property is_complete: bool[source]#

A boolean indicating whether all work on the stream when the event was recorded has completed.

This property may not be accessed during a graph capture on any stream.

warp.record_event(event=None)[source]#

Convenience function for calling Stream.record_event() on the current stream.

Parameters:: event (Event | None) – Event instance to record. If None, a new Event instance will be created.
Returns:: The recorded event.

warp.wait_event(event)[source]#

Convenience function for calling Stream.wait_event() on the current stream.

Parameters:: event (Event) – Event instance to wait for.

warp.synchronize_event(event)[source]#

Synchronize the calling CPU thread with an event recorded on a CUDA stream.

This function allows the host application code to ensure that a specific synchronization point was reached.

Parameters:: event (Event) – Event to wait for.

warp.get_event_elapsed_time(start_event, end_event, synchronize=True)[source]#

Get the elapsed time between two recorded events.

Both events must have been previously recorded with record_event() or warp.Stream.record_event().

If synchronize is False, the caller must ensure that device execution has reached end_event prior to calling get_event_elapsed_time().

Parameters:

start_event (Event) – The start event.
end_event (Event) – The end event.
synchronize (bool) – Whether Warp should synchronize on the end_event.

Returns:

The elapsed time in milliseconds with a resolution about 0.5 ms.

Graphs#

Launching kernels from Python introduces significant additional overhead compared to C++ or native programs. To address this, Warp exposes the concept of CUDA graphs to allow recording large batches of kernels and replaying them with very little CPU overhead.

To record a series of kernel launches use the wp.capture_begin() and wp.capture_end() API as follows:

# begin capture
wp.capture_begin(device="cuda")

try:
    # record launches
    for i in range(100):
        wp.launch(kernel=compute1, inputs=[a, b], device="cuda")
finally:
    # end capture and return a graph object
    graph = wp.capture_end(device="cuda")

We strongly recommend the use of the try-finally pattern when capturing graphs because the finally statement will ensure wp.capture_end gets called, even if an exception occurs during capture, which would otherwise trap the stream in a capturing state.

Once a graph has been constructed it can be executed:

wp.capture_launch(graph)

The wp.ScopedCapture context manager can be used to simplify the code and ensure that wp.capture_end is called regardless of exceptions:

with wp.ScopedCapture(device="cuda") as capture:
    # record launches
    for i in range(100):
        wp.launch(kernel=compute1, inputs=[a, b], device="cuda")

wp.capture_launch(capture.graph)

Note that only launch calls are recorded in the graph; any Python executed outside of the kernel code will not be recorded. Typically it is only beneficial to use CUDA graphs when the graph will be reused or launched multiple times, as there is a graph-creation overhead.

Conditional Execution#

CUDA 12.4+ supports conditional graph nodes that enable dynamic control flow in CUDA graphs.

wp.capture_if creates a dynamic branch based on a condition. The condition value is read from a single-element int array, where a non-zero value means that the condition is True.

# create condition
cond = wp.zeros(1, dtype=int)

with wp.ScopedCapture() as capture:
    wp.launch(foo, ...)

    # execute a branch based on the condition value
    wp.capture_if(cond,
                  on_true=...,
                  on_false=...)

    wp.launch(bar, ...)

The condition value can be updated by kernels launched prior to capture_if() in the same graph (e.g. kernel foo above) or it can be updated by other means before the graph is launched. Note that during graph capture, the value of the condition is ignored. It is only used when the graph is launched, making dynamic control flow possible.

# this will execute the `on_true` branch
cond.fill_(1)
wp.capture_launch(capture.graph)

# this will execute the `on_false` branch
cond.fill_(0)
wp.capture_launch(capture.graph)

The on_true and on_false callbacks can be previously captured Graph objects or Python callback functions. These callbacks are captured as child graphs of the enclosing graph. It’s possible to specify only one or both callbacks, as needed. When the parent graph is launched, the correct child graph is executed based on the value of the condition. This is done efficiently on the device without involving the CPU.

Here is an example that uses previously captured graphs:

@wp.kernel
def hello_kernel():
    print("Hello")

@wp.kernel
def goodbye_kernel():
    print("Goodbye")

@wp.kernel
def yes_kernel():
    print("Yes!")

@wp.kernel
def no_kernel():
    print("No!")


# create condition
cond = wp.zeros(1, dtype=int)

# capture the on_true branch
with wp.ScopedCapture() as yes_capture:
    wp.launch(yes_kernel, dim=1)

# capture the on_false branch
with wp.ScopedCapture() as no_capture:
    wp.launch(no_kernel, dim=1)

# capture the main graph
with wp.ScopedCapture() as capture:
    wp.launch(hello_kernel, dim=1)

    # specify branches using subgraphs
    wp.capture_if(cond,
                  on_true=yes_capture.graph,
                  on_false=no_capture.graph)

    wp.launch(goodbye_kernel, dim=1)

# execute on_true branch
cond.fill_(1)
wp.capture_launch(capture.graph)

# execute on_false branch
cond.fill_(0)
wp.capture_launch(capture.graph)

wp.synchronize_device()

Here is an example that uses Python callback functions. These callbacks will be captured as child graphs of the main graph:

@wp.kernel
def hello_kernel():
    print("Hello")

@wp.kernel
def goodbye_kernel():
    print("Goodbye")

@wp.kernel
def yes_kernel():
    print("Yes!")

@wp.kernel
def no_kernel():
    print("No!")


# create condition
cond = wp.zeros(1, dtype=int)

# Python callback for the on_true branch
def yes_callback():
    wp.launch(yes_kernel, dim=1)

# Python callback for the on_false branch
def no_callback():
    wp.launch(no_kernel, dim=1)

# capture the main graph
with wp.ScopedCapture() as capture:
    wp.launch(hello_kernel, dim=1)

    # specify branches using Python callback functions
    wp.capture_if(cond,
                  on_true=yes_callback,
                  on_false=no_callback)

    wp.launch(goodbye_kernel, dim=1)

# execute on_true branch
cond.fill_(1)
wp.capture_launch(capture.graph)

# execute on_false branch
cond.fill_(0)
wp.capture_launch(capture.graph)

wp.synchronize_device()

When using Python callback functions, any extra keyword arguments to wp.capture_if are forwarded to the callbacks.

wp.capture_while creates a dynamic loop based on a condition. Similarly to wp.capture_if, the condition value is read from a single-element int array, where a non-zero value means that the condition is True.

# create condition
cond = wp.zeros(1, dtype=int)

with wp.ScopedCapture() as capture:
    wp.launch(foo, ...)

    # execute the while_body while the condition is true
    wp.capture_while(cond, while_body=...)

    wp.launch(bar, ...)

The while_body callback will be executed as long as the condition is non-zero. The callback is responsible for updating the condition value so that the loop eventually terminates. The while_body argument can be a previously captured graph or a Python callback function. Here is an example that will run some number of iterations, using the condition value as a counter:

@wp.kernel
def hello_kernel():
    print("Hello")

@wp.kernel
def goodbye_kernel():
    print("Goodbye")

@wp.kernel
def body_kernel(cond: wp.array(dtype=int)):
    tid = wp.tid()
    print(cond[0])
    # decrement the condition counter
    if tid == 0:
        cond[0] -= 1


# create condition
cond = wp.zeros(1, dtype=int)

# capture the while_body
with wp.ScopedCapture() as body_capture:
    wp.launch(body_kernel, dim=1, inputs=[cond])

# capture the main graph
with wp.ScopedCapture() as capture:
    wp.launch(hello_kernel, dim=1)

    # dynamic loop
    wp.capture_while(cond, while_body=body_capture.graph)

    wp.launch(goodbye_kernel, dim=1)

# loop 5 times
cond.fill_(5)
wp.capture_launch(capture.graph)

# loop 2 times
cond.fill_(2)
wp.capture_launch(capture.graph)

wp.synchronize_device()

Note

Conditional graph node support is only available if Warp is built using CUDA Toolkit 12.4+ and the NVIDIA driver supports CUDA 12.4+.

Note

Due to a current CUDA limitation, graphs with conditional nodes cannot be used as child graphs. It means that it’s not possible to create nested conditional constructs using previously captured graphs. If nesting is required, using Python callback functions is the way to go.

Note

wp.capture_if and wp.capture_while will work even without graph capture on any device. If there is no active capture, the condition will be evaluated on the CPU and the correct branch will be executed immediately. This makes it possible to write code that works similarly with and without graph capture.

Graph API Reference#

warp.capture_begin( device=None, stream=None, force_module_load=None, external=False, )[source]#

Begin capture of a CUDA graph

Captures all subsequent kernel launches and memory operations on CUDA devices. This can be used to record large numbers of kernels and replay them with low overhead.

If device is specified, the capture will begin on the CUDA stream currently associated with the device. If stream is specified, the capture will begin on the given stream. If both are omitted, the capture will begin on the current stream of the current device.

Parameters:

device (Device | str | None) – The CUDA device to capture on
stream (Stream | None) – The CUDA stream to capture on
force_module_load (bool | None) – Whether to force loading of all kernels before capture. In general it is better to use load_module() to selectively load kernels. When running with CUDA drivers that support CUDA 12.3 or newer, this option is not recommended to be set to True because kernels can be loaded during graph capture on more recent drivers. If this argument is None, then the behavior inherits from wp.config.enable_graph_capture_module_load_by_default if the driver is older than CUDA 12.3.
external (bool) – Whether the capture was already started externally

warp.capture_end(device=None, stream=None)[source]#

End the capture of a CUDA graph.

Parameters:

device (Device | str | None) – The CUDA device where capture began
stream (Stream | None) – The CUDA stream where capture began

Returns:

A Graph object that can be launched with capture_launch()

Return type:

Graph

warp.capture_launch(graph, stream=None)[source]#

Launch a previously captured CUDA graph

Parameters:

graph (Graph) – A Graph as returned by capture_end()
stream (Stream | None) – A Stream to launch the graph on

warp.capture_if(

condition,

on_true=None,

on_false=None,

stream=None,

**kwargs,

)[source]#

Create a dynamic branch based on a condition.

The condition value is retrieved from the first element of the condition array.

This function is particularly useful with CUDA graphs, but can be used without graph capture as well. CUDA 12.4+ is required to take advantage of conditional graph nodes for dynamic control flow.

Parameters:

condition (array(ndim=1, dtype=int32)) – Warp array holding the condition value.
on_true (Callable | Graph | None) – A callback function or Graph to execute if the condition is True.
on_false (Callable | Graph | None) – A callback function or Graph to execute if the condition is False.
stream (Stream) – The CUDA stream where the condition was written. If None, use the current stream on the device where condition resides.

Any additional keyword arguments are forwarded to the callback functions.

warp.capture_while(condition, while_body, stream=None, **kwargs)[source]#

Create a dynamic loop based on a condition.

The condition value is retrieved from the first element of the condition array.

The while_body callback is responsible for updating the condition value so the loop can terminate.

This function is particularly useful with CUDA graphs, but can be used without graph capture as well. CUDA 12.4+ is required to take advantage of conditional graph nodes for dynamic control flow.

Parameters:

condition (array(ndim=1, dtype=int32)) – Warp array holding the condition value.
while_body (Callable | Graph) – A callback function or Graph to execute while the loop condition is True.
stream (Stream) – The CUDA stream where the condition was written. If None, use the current stream on the device where condition resides.

Any additional keyword arguments are forwarded to the callback function.

warp.capture_debug_dot_print(graph, path, verbose=False)[source]#

Export a CUDA graph to a DOT file for visualization

Parameters:

graph (Graph) – A Graph as returned by capture_end()
path (str) – Path to save the DOT file
verbose (bool) – Whether to include additional debug information in the output

class warp.ScopedCapture( device=None, stream=None, force_module_load=None, external=False, )[source]#

Parameters:: device (Devicelike)

__init__( device=None, stream=None, force_module_load=None, external=False, )[source]#

Parameters:: device (Device | str | None)

Meshes#

Warp provides a wp.Mesh class to manage triangle mesh data. To create a mesh, users provide a points, indices and optionally a velocity array:

mesh = wp.Mesh(points, indices, velocities)

Note

Mesh objects maintain references to their input geometry buffers. All buffers should live on the same device.

Meshes can be passed to kernels using their id attribute, which is uint64 value that uniquely identifies the mesh. Once inside a kernel, you can perform geometric queries against the mesh such as ray-casts or closest-point lookups:

@wp.kernel
def raycast(mesh: wp.uint64,
            ray_origin: wp.array(dtype=wp.vec3),
            ray_dir: wp.array(dtype=wp.vec3),
            ray_hit: wp.array(dtype=wp.vec3)):

    tid = wp.tid()

    t = float(0.0)      # hit distance along ray
    u = float(0.0)      # hit face barycentric u
    v = float(0.0)      # hit face barycentric v
    sign = float(0.0)   # hit face sign
    n = wp.vec3()       # hit face normal
    f = int(0)          # hit face index

    color = wp.vec3()

    # ray cast against the mesh
    if wp.mesh_query_ray(mesh, ray_origin[tid], ray_dir[tid], 1.e+6, t, u, v, sign, n, f):

        # if we got a hit then set color to the face normal
        color = n*0.5 + wp.vec3(0.5, 0.5, 0.5)

    ray_hit[tid] = color

Users may update mesh vertex positions at runtime simply by modifying the points buffer. After modifying point locations users should call Mesh.refit() to rebuild the bounding volume hierarchy (BVH) structure and ensure that queries work correctly.

Note

Updating Mesh topology (indices) at runtime is not currently supported. Users should instead recreate a new Mesh object.

class warp.Mesh(*args, **kwargs)[source]#

__init__( points, indices, velocities=None, support_winding_number=False, bvh_constructor=None, )[source]#

Class representing a triangle mesh.

id#: Unique identifier for this mesh object, can be passed to kernels.

device#: Device this object lives on, all buffers must live on the same device.

Parameters:

points (array) – Array of vertex positions of data type warp.vec3.
indices (array) – Array of triangle indices of data type warp.int32. Should be a 1D array with shape (num_tris * 3).
velocities (array | None) – Optional array of vertex velocities of data type warp.vec3.
support_winding_number (bool) – If True, the mesh will build additional data structures to support wp.mesh_query_point_sign_winding_number() queries.
bvh_constructor (str | None) – The construction algorithm for the underlying BVH (see the docstring of Bvh for explanation). Valid choices are "sah", "median", "lbvh", or None.

refit()[source]#

Refit the BVH to points.

This should be called after users modify the points data.

property points[source]#

The array of mesh’s vertex positions of type warp.vec3.

The Mesh.points property has a custom setter method. Users can modify the vertex positions in-place, but refit() must be called manually after such modifications. Alternatively, assigning a new array to this property is also supported. The new array must have the same shape as the original, and once assigned, The Mesh will automatically perform a refit operation based on the new vertex positions.

property velocities[source]#

The array of mesh’s velocities of type warp.vec3.

This is a property with a custom setter method. Users can modify the velocities in-place, or assign a new array to this property. No refitting is needed for changing velocities.

Hash Grids#

Many particle-based simulation methods such as the Discrete Element Method (DEM), or Smoothed Particle Hydrodynamics (SPH), involve iterating over spatial neighbors to compute force interactions. Hash grids are a well-established data structure to accelerate these nearest neighbor queries, and particularly well-suited to the GPU.

To support spatial neighbor queries Warp provides a HashGrid object that may be created as follows:

grid = wp.HashGrid(dim_x=128, dim_y=128, dim_z=128, device="cuda")

grid.build(points=p, radius=r)

p is an array of wp.vec3 point positions, and r is the radius to use when building the grid. Neighbors can then be iterated over inside the kernel code using wp.hash_grid_query() and wp.hash_grid_query_next() as follows:

@wp.kernel
def sum(grid : wp.uint64,
        points: wp.array(dtype=wp.vec3),
        output: wp.array(dtype=wp.vec3),
        radius: float):

    tid = wp.tid()

    # query point
    p = points[tid]

    # create grid query around point
    query = wp.hash_grid_query(grid, p, radius)
    index = int(0)

    sum = wp.vec3()

    while(wp.hash_grid_query_next(query, index)):

        neighbor = points[index]

        # compute distance to neighbor point
        dist = wp.length(p-neighbor)
        if (dist <= radius):
            sum += neighbor

    output[tid] = sum

Note

The HashGrid query will give back all points in cells that fall inside the query radius. When there are hash conflicts it means that some points outside of query radius will be returned, and users should check the distance themselves inside their kernels. The reason the query doesn’t do the check itself for each returned point is because it’s common for kernels to compute the distance themselves, so it would redundant to check/compute the distance twice.

class warp.HashGrid(*args, **kwargs)[source]#

__init__(dim_x, dim_y, dim_z, device=None)[source]#

Class representing a hash grid object for accelerated point queries.

id#: Unique identifier for this mesh object, can be passed to kernels.

device#: Device this object lives on, all buffers must live on the same device.

Parameters:

dim_x (int) – Number of cells in x-axis
dim_y (int) – Number of cells in y-axis
dim_z (int) – Number of cells in z-axis

build(points, radius)[source]#

Updates the hash grid data structure.

This method rebuilds the underlying datastructure and should be called any time the set of points changes.

Parameters:

points (warp.array) – Array of points of type warp.vec3
radius (float) – The cell size to use for bucketing points, cells are cubes with edges of this width. For best performance the radius used to construct the grid should match closely to the radius used when performing queries.

reserve(num_points)[source]#

Volumes#

Sparse volumes are incredibly useful for representing grid data over large domains, such as signed distance fields (SDFs) for complex objects, or velocities for large-scale fluid flow. Warp supports reading sparse volumetric grids stored using the NanoVDB standard. Users can access voxels directly or use built-in closest-point or trilinear interpolation to sample grid data from world or local space.

Volume objects can be created directly from Warp arrays containing a NanoVDB grid, from the contents of a standard .nvdb file using load_from_nvdb(), from an uncompressed in-memory buffer using load_from_address(), or from a dense 3D NumPy array using load_from_numpy().

Volumes can also be created using allocate(), allocate_by_tiles() or allocate_by_voxels(). The values for a Volume object can be modified in a Warp kernel using wp.volume_store().

Note

Warp does not currently support modifying the topology of sparse volumes at runtime.

Below we give an example of creating a Volume object from an existing NanoVDB file:

# open NanoVDB file on disk
file = open("mygrid.nvdb", "rb")

# create Volume object
volume = wp.Volume.load_from_nvdb(file, device="cpu")

Note

Files written by the NanoVDB library, commonly marked by the .nvdb extension, can contain multiple grids with various compression methods, but a Volume object represents a single NanoVDB grid. The first grid is loaded by default, then Warp volumes corresponding to the other grids in the file can be created using repeated calls to load_next_grid(). NanoVDB’s uncompressed and zip-compressed file formats are supported out-of-the-box, blosc compressed files require the blosc Python package to be installed.

To sample the volume inside a kernel we pass a reference to it by ID, and use the built-in sampling modes:

@wp.kernel
def sample_grid(volume: wp.uint64,
                points: wp.array(dtype=wp.vec3),
                samples: wp.array(dtype=float)):

    tid = wp.tid()

    # load sample point in world-space
    p = points[tid]

    # transform position to the volume's local-space
    q = wp.volume_world_to_index(volume, p)

    # sample volume with trilinear interpolation
    f = wp.volume_sample(volume, q, wp.Volume.LINEAR, dtype=float)

    # write result
    samples[tid] = f

Warp also supports NanoVDB index grids, which provide a memory-efficient linearization of voxel indices that can refer to values in arbitrarily shaped arrays:

@wp.kernel
def sample_index_grid(volume: wp.uint64,
                     points: wp.array(dtype=wp.vec3),
                     voxel_values: wp.array(dtype=Any)):

    tid = wp.tid()

    # load sample point in world-space
    p = points[tid]

    # transform position to the volume's local-space
    q = wp.volume_world_to_index(volume, p)

    # sample volume with trilinear interpolation
    background_value = voxel_values.dtype(0.0)
    f = wp.volume_sample_index(volume, q, wp.Volume.LINEAR, voxel_values, background_value)

The coordinates of all indexable voxels can be recovered using get_voxels(). NanoVDB grids may also contain embedded blind data arrays; those can be accessed with the feature_array() function.

class warp.Volume(*args, **kwargs)[source]#

CLOSEST = 0#: Enum value to specify nearest-neighbor interpolation during sampling

LINEAR = 1#: Enum value to specify trilinear interpolation during sampling

__init__(data, copy=True)[source]#

Class representing a sparse grid.

Parameters:

data (array) – Array of bytes representing the volume in NanoVDB format.
copy (bool) – Whether the incoming data will be copied or aliased.

array()[source]#

Return the raw memory buffer of the Volume as an array.

Return type:: array

get_tile_count()[source]#

Return the number of tiles (NanoVDB leaf nodes) of the volume.

Return type:: int

get_tiles(out=None)[source]#

Return the integer coordinates of all allocated tiles for this volume.

Parameters:: out (array | None) – If provided, use the out array to store the tile coordinates, otherwise a new array will be allocated. out must be a contiguous array of tile_count vec3i or tile_count x 3 int32 on the same device as this volume.
Return type:: array

get_voxel_count()[source]#

Return the total number of allocated voxels for this volume

Return type:: int

get_voxels(out=None)[source]#

Return the integer coordinates of all allocated voxels for this volume.

Parameters:: out (array | None) – If provided, use the out array to store the voxel coordinates, otherwise a new array will be allocated. out must be a contiguous array of voxel_count vec3i or voxel_count x 3 int32 on the same device as this volume.
Return type:: array

get_voxel_size()[source]#

Return the voxel size, i.e, world coordinates of voxel’s diagonal vector

Return type:: tuple[float, float, float]

class GridInfo( name, size_in_bytes, grid_index, grid_count, type_str, translation, transform_matrix, )[source]#

Grid metadata

Parameters:

name (str)
size_in_bytes (int)
grid_index (int)
grid_count (int)
type_str (str)
translation (vec3f)
transform_matrix (mat33f)

name: str#: Grid name

size_in_bytes: int#: Size of this grid’s data, in bytes

grid_index: int#: Index of this grid in the data buffer

grid_count: int#: Total number of grids in the data buffer

type_str: str#: String describing the type of the grid values

translation: vec3f#: Index-to-world translation

transform_matrix: mat33f#: Linear part of the index-to-world transform

get_grid_info()[source]#

Returns the metadata associated with this Volume

Return type:: GridInfo

property dtype: type[source]#

Type of the Volume’s values as a Warp type.

If the grid does not contain values (e.g. index grids) or if the NanoVDB type is not representable as a Warp type, returns None.

property is_index: bool[source]#: Whether this Volume contains an index grid, that is, a type of grid that does not explicitly store values but associates each voxel to linearized index.

get_feature_array_count()[source]#

Return the number of supplemental data arrays stored alongside the grid

Return type:: int

class FeatureArrayInfo(name, ptr, value_size, value_count, type_str)[source]#

Metadata for a supplemental data array

Parameters:

name (str)
ptr (int)
value_size (int)
value_count (int)
type_str (str)

name: str#: Name of the data array

ptr: int#: Memory address of the start of the array

value_size: int#: Size in bytes of the array values

value_count: int#: Number of values in the array

type_str: str#: String describing the type of the array values

get_feature_array_info(feature_index)[source]#

Return the metadata associated to the feature array at feature_index.

Parameters:: feature_index (int)
Return type:: FeatureArrayInfo

feature_array(feature_index, dtype=None)[source]#

Return one the grid’s feature data arrays as a Warp array.

Parameters:

feature_index (int) – Index of the supplemental data array in the grid
dtype – Data type for the returned Warp array. If not provided, will be deduced from the array metadata.

Return type:

array

classmethod load_from_nvdb(file_or_buffer, device=None)[source]#

Create a Volume object from a serialized NanoVDB file or in-memory buffer.

Returns:: A warp.Volume object.
Return type:: Volume

save_to_nvdb(path, codec='none')[source]#

Serialize the Volume into a NanoVDB (.nvdb) file.

Parameters:

path – File path to save.
codec (Literal['none', 'zip', 'blosc']) – Compression codec used “none” - no compression “zip” - ZIP compression “blosc” - BLOSC compression, requires the blosc module to be installed

classmethod load_from_address(grid_ptr, buffer_size=0, device=None)[source]#

Creates a new Volume aliasing an in-memory grid buffer.

In contrast to load_from_nvdb() which should be used to load serialized NanoVDB grids, here the buffer must be uncompressed and must not contain file header information. If the passed address does not contain a NanoVDB grid, the behavior of this function is undefined.

Parameters:

grid_ptr (int) – Integer address of the start of the grid buffer
buffer_size (int) – Size of the buffer, in bytes. If not provided, the size will be assumed to be that of the single grid starting at grid_ptr.
device – Device of the buffer, and of the returned Volume. If not provided, the current Warp device is assumed.

Return type:

Volume

Returns the newly created Volume.

load_next_grid()[source]#

Tries to create a new warp Volume for the next grid that is linked to by this Volume.

The existence of a next grid is deduced from the grid_index and grid_count metadata as well as the size of this Volume’s in-memory buffer.

Returns the newly created Volume, or None if there is no next grid.

Return type:: Volume

classmethod load_from_numpy( ndarray, min_world=(0.0, 0.0, 0.0), voxel_size=1.0, bg_value=0.0, device=None, )[source]#

Creates a Volume object from a dense 3D NumPy array.

This function is only supported for CUDA devices.

Parameters:

min_world – The 3D coordinate of the lower corner of the volume.
voxel_size – The size of each voxel in spatial coordinates.
bg_value – Background value
device – The CUDA device to create the volume on, e.g.: “cuda” or “cuda:0”.
ndarray (array)

Returns:

A warp.Volume object.

Return type:

Volume

classmethod allocate( min, max, voxel_size, bg_value=0.0, translation=(0.0, 0.0, 0.0), points_in_world_space=False, device=None, )[source]#

Allocate a new Volume based on the bounding box defined by min and max.

This function is only supported for CUDA devices.

Allocate a volume that is large enough to contain voxels [min[0], min[1], min[2]] - [max[0], max[1], max[2]], inclusive. If points_in_world_space is true, then min and max are first converted to index space with the given voxel size and translation, and the volume is allocated with those.

The smallest unit of allocation is a dense tile of 8x8x8 voxels, the requested bounding box is rounded up to tiles, and the resulting tiles will be available in the new volume.

Parameters:

min (array-like) – Lower 3D coordinates of the bounding box in index space or world space, inclusive.
max (array-like) – Upper 3D coordinates of the bounding box in index space or world space, inclusive.
voxel_size (float) – Voxel size of the new volume.
bg_value (float or array-like) – Value of unallocated voxels of the volume, also defines the volume’s type, a warp.vec3 volume is created if this is array-like, otherwise a float volume is created
translation (array-like) – translation between the index and world spaces.
device (Devicelike) – The CUDA device to create the volume on, e.g.: “cuda” or “cuda:0”.

Return type:

Volume

classmethod allocate_by_tiles( tile_points, voxel_size=None, bg_value=0.0, translation=(0.0, 0.0, 0.0), device=None, transform=None, )[source]#

Allocate a new Volume with active tiles for each point tile_points.

This function is only supported for CUDA devices.

The smallest unit of allocation is a dense tile of 8x8x8 voxels. This is the primary method for allocating sparse volumes. It uses an array of points indicating the tiles that must be allocated.

Example use cases:

tile_points can mark tiles directly in index space as in the case this method is called by allocate.
tile_points can be a list of points used in a simulation that needs to transfer data to a volume.

Parameters:

tile_points (warp.array) – Array of positions that define the tiles to be allocated. The array may use an integer scalar type (2D N-by-3 array of warp.int32 or 1D array of warp.vec3i values), indicating index space positions, or a floating point scalar type (2D N-by-3 array of warp.float32 or 1D array of warp.vec3f values), indicating world space positions. Repeated points per tile are allowed and will be efficiently deduplicated.
voxel_size (float or array-like) – Voxel size(s) of the new volume. Ignored if transform is given.
bg_value (array-like, scalar or None) – Value of unallocated voxels of the volume, also defines the volume’s type. An index volume will be created if bg_value is None. Other supported grid types are int, float, vec3f, and vec4f.
translation (array-like) – Translation between the index and world spaces.
transform (array-like) – Linear transform between the index and world spaces. If None, deduced from voxel_size.
device (Devicelike) – The CUDA device to create the volume on, e.g.: “cuda” or “cuda:0”.

Return type:

Volume

classmethod allocate_by_voxels( voxel_points, voxel_size=None, translation=(0.0, 0.0, 0.0), device=None, transform=None, )[source]#

Allocate a new Volume with active voxel for each point voxel_points.

This function creates an index Volume, a special kind of volume that does not any store any explicit payload but encodes a linearized index for each active voxel, allowing to lookup and sample data from arbitrary external arrays.

This function is only supported for CUDA devices.

Parameters:

voxel_points (warp.array) – Array of positions that define the voxels to be allocated. The array may use an integer scalar type (2D N-by-3 array of warp.int32 or 1D array of warp.vec3i values), indicating index space positions, or a floating point scalar type (2D N-by-3 array of warp.float32 or 1D array of warp.vec3f values), indicating world space positions. Repeated points per tile are allowed and will be efficiently deduplicated.
voxel_size (float or array-like) – Voxel size(s) of the new volume. Ignored if transform is given.
translation (array-like) – Translation between the index and world spaces.
transform (array-like) – Linear transform between the index and world spaces. If None, deduced from voxel_size.
device (Devicelike) – The CUDA device to create the volume on, e.g.: “cuda” or “cuda:0”.

Return type:

Volume

Bounding Value Hierarchies (BVH)#

The wp.Bvh class can be used to create a BVH for a group of bounding volumes. This object can then be traversed to determine which parts are intersected by a ray using bvh_query_ray() and which parts overlap with a certain bounding volume using bvh_query_aabb().

The following snippet demonstrates how to create a wp.Bvh object from 100 random bounding volumes:

rng = np.random.default_rng(123)

num_bounds = 100
lowers = rng.random(size=(num_bounds, 3)) * 5.0
uppers = lowers + rng.random(size=(num_bounds, 3)) * 5.0

device_lowers = wp.array(lowers, dtype=wp.vec3, device="cuda:0")
device_uppers = wp.array(uppers, dtype=wp.vec3, device="cuda:0")

bvh = wp.Bvh(device_lowers, device_uppers)

class warp.Bvh(*args, **kwargs)[source]#

__init__(lowers, uppers, constructor=None)[source]#

Class representing a bounding volume hierarchy.

Depending on which device the input bounds live, it can be either a CPU tree or a GPU tree.

id#: Unique identifier for this BVH object, can be passed to kernels.

device#: Device this object lives on, all buffers must live on the same device.

Parameters:

lowers (array) – Array of lower bounds of data type warp.vec3.
uppers (array) – Array of upper bounds of data type warp.vec3. lowers and uppers must live on the same device.
constructor (str | None) – The construction algorithm used to build the tree. Valid choices are "sah", "median", "lbvh", or None. When None, the default constructor will be used (see the note).

Note

Explanation of BVH constructors:

"sah": A CPU-based top-down constructor where the AABBs are split based on Surface Area Heuristics (SAH). Construction takes slightly longer than others but has the best query performance.
"median": A CPU-based top-down constructor where the AABBs are split based on the median of centroids of primitives in an AABB. This constructor is faster than SAH but offers inferior query performance.
"lbvh": A GPU-based bottom-up constructor which maximizes parallelism. Construction is very fast, especially for large models. Query performance is slightly slower than "sah".
None: The constructor will be automatically chosen based on the device where the tree lives. For a GPU tree, the "lbvh" constructor will be selected; for a CPU tree, the "sah" constructor will be selected.

All three constructors are supported for GPU trees. When a CPU-based constructor is selected for a GPU tree, bounds will be copied back to the CPU to run the CPU-based constructor. After construction, the CPU tree will be copied to the GPU.

Only "sah" and "median" are supported for CPU trees. If "lbvh" is selected for a CPU tree, a warning message will be issued, and the constructor will automatically fall back to "sah".

refit()[source]#

Refit the BVH.

This should be called after users modify the lowers or uppers arrays.

Example: BVH Ray Traversal#

An example of performing a ray traversal on the data structure is as follows:

@wp.kernel
def bvh_query_ray(
    bvh_id: wp.uint64,
    start: wp.vec3,
    dir: wp.vec3,
    bounds_intersected: wp.array(dtype=wp.bool),
):
    query = wp.bvh_query_ray(bvh_id, start, dir)
    bounds_nr = wp.int32(0)

    while wp.bvh_query_next(query, bounds_nr):
        # The ray intersects the volume with index bounds_nr
        bounds_intersected[bounds_nr] = True


bounds_intersected = wp.zeros(shape=(num_bounds), dtype=wp.bool, device="cuda:0")
query_start = wp.vec3(0.0, 0.0, 0.0)
query_dir = wp.normalize(wp.vec3(1.0, 1.0, 1.0))

wp.launch(
    kernel=bvh_query_ray,
    dim=1,
    inputs=[bvh.id, query_start, query_dir, bounds_intersected],
    device="cuda:0",
)

The Warp kernel bvh_query_ray is launched with a single thread, provided the unique uint64 identifier of the wp.Bvh object, parameters describing the ray, and an array to store the results. In bvh_query_ray, wp.bvh_query_ray() is called once to obtain an object that is stored in the variable query. An integer is also allocated as bounds_nr to store the volume index of the traversal. A while statement is used for the actual traversal using wp.bvh_query_next(), which returns True as long as there are intersecting bounds.

Example: BVH Volume Traversal#

Similar to the ray-traversal example, we can perform volume traversal to find the volumes that are fully contained within a specified bounding box.

@wp.kernel
def bvh_query_aabb(
    bvh_id: wp.uint64,
    lower: wp.vec3,
    upper: wp.vec3,
    bounds_intersected: wp.array(dtype=wp.bool),
):
    query = wp.bvh_query_aabb(bvh_id, lower, upper)
    bounds_nr = wp.int32(0)

    while wp.bvh_query_next(query, bounds_nr):
        # The volume with index bounds_nr is fully contained
        # in the (lower,upper) bounding box
        bounds_intersected[bounds_nr] = True


bounds_intersected = wp.zeros(shape=(num_bounds), dtype=wp.bool, device="cuda:0")
query_lower = wp.vec3(4.0, 4.0, 4.0)
query_upper = wp.vec3(6.0, 6.0, 6.0)

wp.launch(
    kernel=bvh_query_aabb,
    dim=1,
    inputs=[bvh.id, query_lower, query_upper, bounds_intersected],
    device="cuda:0",
)

The kernel is nearly identical to the ray-traversal example, except we obtain query using wp.bvh_query_aabb().

Marching Cubes#

The wp.MarchingCubes class can be used to extract a 2-D mesh approximating an isosurface of a 3-D scalar field. The resulting triangle mesh can be saved to a USD file using the warp.renderer.UsdRenderer.

See warp/examples/core/example_marching_cubes.py for a usage example.

class warp.MarchingCubes(*args, **kwargs)[source]#

__init__(nx, ny, nz, max_verts, max_tris, device=None)[source]#

CUDA-based Marching Cubes algorithm to extract a 2D surface mesh from a 3D volume.

id#: Unique identifier for this object.

verts#

Array of vertex positions of type warp.vec3f for the output surface mesh. This is populated after running surface().

Type:: warp.array

indices#

Array containing indices of type warp.int32 defining triangles for the output surface mesh. This is populated after running surface().

Each set of three consecutive integers in the array represents a single triangle, in which each integer is an index referring to a vertex in the verts array.

Type:: warp.array

Parameters:

nx (int) – Number of cubes in the x-direction.
ny (int) – Number of cubes in the y-direction.
nz (int) – Number of cubes in the z-direction.
max_verts (int) – Maximum expected number of vertices (used for array preallocation).
max_tris (int) – Maximum expected number of triangles (used for array preallocation).
device (Devicelike) – CUDA device on which to run marching cubes and allocate memory.

Raises:

RuntimeError – device not a CUDA device.

Note

The shape of the marching cubes should match the shape of the scalar field being surfaced.

resize(nx, ny, nz, max_verts, max_tris)[source]#

Update the expected input and maximum output sizes for the marching cubes calculation.

This function has no immediate effect on the underlying buffers. The new values take effect on the next surface() call.

Parameters:

nx (int) – Number of cubes in the x-direction.
ny (int) – Number of cubes in the y-direction.
nz (int) – Number of cubes in the z-direction.
max_verts (int) – Maximum expected number of vertices (used for array preallocation).
max_tris (int) – Maximum expected number of triangles (used for array preallocation).

Return type:

None

surface(field, threshold)[source]#

Compute a 2D surface mesh of a given isosurface from a 3D scalar field.

The triangles and vertices defining the output mesh are written to the indices and verts arrays.

Parameters:

field (array(ndim=3, dtype=float32)) – Scalar field from which to generate a mesh.
threshold (float) – Target isosurface value.

Raises:

ValueError – field is not a 3D array.
ValueError – Marching cubes shape does not match the shape of field.
RuntimeError – max_verts and/or max_tris might be too small to hold the surface mesh.

Return type:

None

Profiling#

wp.ScopedTimer objects can be used to gain some basic insight into the performance of Warp applications:

with wp.ScopedTimer("grid build"):
    self.grid.build(self.x, self.point_radius)

This results in a printout at runtime to the standard output stream like:

grid build took 0.06 ms

See Profiling documentation for more information.

class warp.ScopedTimer( name, active=True, print=True, detailed=False, dict=None, use_nvtx=False, color='rapids', synchronize=False, cuda_filter=0, report_func=None, skip_tape=False, )[source]

Parameters:

name (str)
active (bool)
print (bool)
detailed (bool)
dict (dict[str, list[float]] | None)
use_nvtx (bool)
color (int | str)
synchronize (bool)
cuda_filter (int)
report_func (Callable[[list[TimingResult], str], None] | None)
skip_tape (bool)

indent = -1

enabled = True

__init__( name, active=True, print=True, detailed=False, dict=None, use_nvtx=False, color='rapids', synchronize=False, cuda_filter=0, report_func=None, skip_tape=False, )[source]

Context manager object for a timer

Parameters:

name (str) – Name of timer
active (bool) – Enables this timer
print (bool) – At context manager exit, print elapsed time to sys.stdout
detailed (bool) – Collects additional profiling data using cProfile and calls print_stats() at context exit
dict (dict[str, list[float]] | None) – A dictionary of lists to which the elapsed time will be appended using name as a key
use_nvtx (bool) – If true, timing functionality is replaced by an NVTX range
color (int | str) – ARGB value (e.g. 0x00FFFF) or color name (e.g. ‘cyan’) associated with the NVTX range
synchronize (bool) – Synchronize the CPU thread with any outstanding CUDA work to return accurate GPU timings
cuda_filter (int) – Filter flags for CUDA activity timing, e.g. warp.TIMING_KERNEL or warp.TIMING_ALL
report_func (Callable[[list[TimingResult], str], None] | None) – A callback function to print the activity report. If None, wp.timing_print() will be used.
skip_tape (bool) – If true, the timer will not be recorded in the tape

extra_msg

Can be set to a string that will be added to the printout at context exit.

Type:: str

elapsed

The duration of the with block used with this object

Type:: float

timing_results

The list of activity timing results, if collection was requested using cuda_filter

Type:: list[TimingResult]

Interprocess Communication (IPC)#

Interprocess communication can be used to share Warp arrays and events across processes without creating copies of the underlying data.

Some basic requirements for using IPC include:

Linux operating system (note however that integrated devices like NVIDIA Jetson do not support CUDA IPC)
The array must be allocated on a GPU device using the default memory allocator (see Allocators)

The wp.ScopedMempool context manager is useful for temporarily disabling memory pools for the purpose of allocating arrays that can be shared using IPC.

Support for IPC on a device is indicated by the is_ipc_supported attribute of the Device. If the Warp library has been compiled with CUDA 11, this device attribute will be None to indicate that IPC support could not be determined using the CUDA API.

To share a Warp array between processes, use array.ipc_handle() in the originating process to obtain an IPC handle for the array’s memory allocation. The handle is a bytes object with a length of 64. The IPC handle along with information about the array (data type, shape, and optionally strides) should be shared with another process, e.g. via shared memory or files. Another process can use this information to import the original array by calling from_ipc_handle().

Events can be shared in a similar manner, but they must be constructed with interprocess=True. Additionally, events cannot be created with both interprocess=True and enable_timing=True. Use Event.ipc_handle() in the originating process to obtain an IPC handle for the event. Another process can use this information to import the original event by calling event_from_ipc_handle().

warp.from_ipc_handle(handle, dtype, shape, strides=None, device=None)[source]#

Create an array from an IPC handle.

The dtype, shape, and optional strides arguments should match the values from the array from which handle was created.

Parameters:

handle (bytes) – The interprocess memory handle for an existing device memory allocation.
dtype –
One of the available data types, such as warp.float32, warp.mat33, or a custom struct.
shape (tuple[int, ...]) – Dimensions of the array.
strides (tuple[int, ...] | None) – Number of bytes in each dimension between successive elements of the array.
device (Devicelike) – Device to associate with the array.

Returns:

An array created from the existing memory allocation described by the interprocess memory handle handle.

A copy of the underlying data is not made. Modifications to the array’s data will be reflected in the original process from which handle was exported.

Raises:

RuntimeError – IPC is not supported on device.

Return type:

array

warp.event_from_ipc_handle(handle, device=None)[source]#

Create an event from an IPC handle.

Parameters:

handle – The interprocess event handle for an existing CUDA event.
device (Devicelike) – Device to associate with the array.

Returns:

An event created from the interprocess event handle handle.

Raises:

RuntimeError – IPC is not supported on device.

Return type:

Event

LTO Cache#

MathDx generates Link-Time Optimization (LTO) files for GEMM, Cholesky, and FFT tile operations. Warp caches these to speed up kernel compilation. Each LTO file maps to a specific Linear Algebra solver configuration, and is otherwise independent of the kernel in which its corresponding routine is called. Therefore, LTOs are stored in a cache that is independent of a given module’s kernel cache, and will remain cached even if wp.clear_kernel_cache() is called. wp.clear_lto_cache() can be used to clear the LTO cache.

warp.clear_lto_cache()[source]#

Clear the LTO cache directory of previously generated LTO code.

The LTO cache is stored within a subdirectory of the kernel cache directory. This function only clears the cache for the current Warp version.

Return type:: None