Runtime Reference
This section describes the Warp Python runtime API: how to manage memory, how to launch kernels, and the high-level functionality for working with objects such as meshes and volumes. The APIs described in this section are intended to be used at the Python Scope and run inside the CPython interpreter. For a comprehensive list of functions available at the Kernel Scope, see the Kernel Reference section.
Kernels
Kernels are launched on a specific device (CPU or GPU) with the wp.launch() function:
wp.launch(simple_kernel, dim=1024, inputs=[a, b, c], device="cuda")
Kernels may be launched with multi-dimensional grid bounds. In this case, threads are not assigned a single index but a coordinate in an n-dimensional grid, e.g.:
wp.launch(complex_kernel, dim=(128, 128, 3), ...)
This launches a 3D grid of threads with dimensions 128 x 128 x 3. To retrieve the 3D index of each thread, use the following syntax inside the kernel:
i, j, k = wp.tid()
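The plain-Python sketch below makes the coordinate mapping concrete. It shows one way a flat thread index can be decomposed into an (i, j, k) coordinate for dim=(128, 128, 3). It assumes a row-major (C-order) linearization in which the last dimension varies fastest. Treat that ordering as an illustrative assumption: kernels should rely only on the coordinate that wp.tid() returns. The helper name flat_to_coord is hypothetical.

```python
def flat_to_coord(index, shape):
    """Decompose a flat thread index into an n-dimensional coordinate.

    Hypothetical helper: assumes row-major (C-order) linearization,
    where the last dimension varies fastest.
    """
    coord = []
    for extent in reversed(shape):
        coord.append(index % extent)
        index //= extent
    return tuple(reversed(coord))


shape = (128, 128, 3)
total = 128 * 128 * 3  # number of threads launched

# every flat index maps to a unique, in-bounds coordinate
coords = {flat_to_coord(t, shape) for t in range(total)}
print(len(coords) == total)             # True
print(flat_to_coord(0, shape))          # (0, 0, 0)
print(flat_to_coord(total - 1, shape))  # (127, 127, 2)
```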
Note
Currently, kernels launched on CPU devices are executed serially. Kernels launched on CUDA devices are executed in parallel with a fixed block size.
Note
All kernel inputs must live on the target device; otherwise, a runtime exception will be raised.
- warp.launch(kernel, dim, inputs=[], outputs=[], adj_inputs=[], adj_outputs=[], device=None, stream=None, adjoint=False, record_tape=True, record_cmd=False, max_blocks=0)
Launch a Warp kernel on the target device.
Kernel launches are asynchronous with respect to the calling Python thread.
- Parameters:
kernel – The Warp kernel function to launch, decorated with the @wp.kernel decorator
dim (Tuple[int]) – The number of threads to launch the kernel with; can be an integer or a tuple of ints with a maximum of 4 dimensions
inputs (Sequence) – The input parameters to the kernel (optional)
outputs (Sequence) – The output parameters (optional)
adj_inputs (Sequence) – The adjoint inputs (optional)
adj_outputs (Sequence) – The adjoint outputs (optional)
device (Device | str | None) – The device to launch on (optional)
stream (Stream | None) – The stream to launch on (optional)
adjoint – Whether to run the forward or backward pass (typically False)
record_tape – When True, the launch will be recorded on the global wp.Tape() object, if present
record_cmd – When True, the launch will be returned as a Launch command object, and the launch will not occur until the user calls cmd.launch()
max_blocks – The maximum number of CUDA thread blocks to use. Only has an effect for CUDA kernel launches. If negative or zero, the maximum hardware value will be used.
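When max_blocks caps the number of thread blocks, the kernel must still cover every element. A common CUDA pattern for this is a grid-stride loop, in which each thread handles indices tid, tid + stride, tid + 2*stride, and so on. The plain-Python simulation below assumes that pattern and a block size of 256 threads. Both are assumptions made for illustration, not a description of Warp's generated code. The name simulate_grid_stride is hypothetical.

```python
def simulate_grid_stride(n, max_blocks, block_size=256):
    """Return, for each simulated thread, the list of indices it processes.

    Assumes a grid-stride loop: each thread starts at its global id and
    advances by the total number of launched threads.
    """
    blocks_needed = (n + block_size - 1) // block_size
    # max_blocks <= 0: no cap in this sketch (stands in for the hardware maximum)
    num_blocks = blocks_needed if max_blocks <= 0 else min(max_blocks, blocks_needed)
    stride = num_blocks * block_size
    return [list(range(thread_id, n, stride)) for thread_id in range(stride)]


work = simulate_grid_stride(n=10_000, max_blocks=4)
covered = sorted(i for items in work for i in items)
print(covered == list(range(10_000)))     # True: each element visited exactly once
print(max(len(items) for items in work))  # 10 elements for the busiest thread
```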
Runtime Kernel Creation
It is often desirable to specialize kernels for different types, constants, or functions at runtime. This can be achieved through runtime kernel specialization using Python closures.
For example, we might require a variety of kernels that execute particular functions for each item in an array. We might also want this function call to be valid for a variety of data types. Using closures and generics, we can generate these kernels from a single kernel definition:
import warp as wp
from typing import Any
def make_kernel(func, dtype):
    def closure_kernel_fn(data: wp.array(dtype=dtype), out: wp.array(dtype=dtype)):
        tid = wp.tid()
        out[tid] = func(data[tid])
    return wp.Kernel(closure_kernel_fn)
In practice, we might use our kernel generator, make_kernel(), as follows:
@wp.func
def sqr(x: Any) -> Any:
    return x * x
@wp.func
def cube(x: Any) -> Any:
    return sqr(x) * x
sqr_float = make_kernel(sqr, wp.float32)
cube_double = make_kernel(cube, wp.float64)
# run on the GPU if one is available, otherwise on the CPU
device = "cuda" if wp.is_cuda_available() else "cpu"
arr = [1.0, 2.0, 3.0]
N = len(arr)
data_float = wp.array(arr, dtype=wp.float32, device=device)
data_double = wp.array(arr, dtype=wp.float64, device=device)
out_float = wp.zeros(N, dtype=wp.float32, device=device)
out_double = wp.zeros(N, dtype=wp.float64, device=device)
wp.launch(sqr_float, dim=N, inputs=[data_float], outputs=[out_float], device=device)
wp.launch(cube_double, dim=N, inputs=[data_double], outputs=[out_double], device=device)
Kernel definitions can be specialized over Warp constants in the same way. The following generates kernels that add a specified constant to each value of a generic-typed array:
def make_add_kernel(key, constant):
    def closure_kernel_fn(data: wp.array(dtype=Any), out: wp.array(dtype=Any)):
        tid = wp.tid()
        out[tid] = data[tid] + constant
    return wp.Kernel(closure_kernel_fn, key=key)
add_ones_int = make_add_kernel("add_ones_int", wp.constant(1))
add_ones_vec3 = make_add_kernel("add_ones_vec3", wp.constant(wp.vec3(1.0, 1.0, 1.0)))
# allocate on the same device the kernels are launched on
a = wp.zeros(2, dtype=int, device=device)
b = wp.zeros(2, dtype=wp.vec3, device=device)
a_out = wp.zeros_like(a)
b_out = wp.zeros_like(b)
wp.launch(add_ones_int, dim=a.size, inputs=[a], outputs=[a_out], device=device)
wp.launch(add_ones_vec3, dim=b.size, inputs=[b], outputs=[b_out], device=device)
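Factories such as make_kernel() and make_add_kernel() build a new wp.Kernel object on every call. Requesting the same specialization repeatedly may therefore create redundant kernels. One way to avoid this is to memoize the factory. The sketch below illustrates the idea in plain Python with functools.lru_cache. Here the elementwise "kernel" is an ordinary Python function standing in for a Warp kernel, and make_elementwise is a hypothetical name. The sketch does not assume whether Warp deduplicates identical kernels on its own.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def make_elementwise(func, dtype):
    """Return a function specialized on (func, dtype) via a closure.

    Stand-in for a kernel factory: the captured func and dtype play the
    role of the values baked into a specialized kernel.
    """
    def kernel(data):
        # apply the captured function and cast to the captured type
        return [dtype(func(x)) for x in data]

    kernel.__name__ = f"{func.__name__}_{dtype.__name__}"
    return kernel


def sqr(x):
    return x * x


def cube(x):
    return sqr(x) * x


sqr_float = make_elementwise(sqr, float)
cube_int = make_elementwise(cube, int)
print(sqr_float([1.0, 2.0, 3.0]))                 # [1.0, 4.0, 9.0]
print(cube_int([1, 2, 3]))                        # [1, 8, 27]
print(make_elementwise(sqr, float) is sqr_float)  # True: cached, not rebuilt
```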
Arrays
Arrays are the fundamental memory abstraction in Warp; they are created through the following global constructors:
wp.empty(shape=1024, dtype=wp.vec3, device="cpu")
wp.zeros(shape=1024, dtype=float, device="cuda")
wp.full(shape=1024, value=10, dtype=int, device="cuda")
Arrays can also be constructed directly from NumPy ndarrays as follows:
import numpy as np
import warp as wp
# float32 data, so the types match for the zero-copy case below
r = np.random.rand(1024).astype(np.float32)
# copy into a Warp-owned array
a = wp.array(r, dtype=float, device="cpu")
# return a Warp array wrapper around the NumPy data (zero-copy)
a = wp.array(r, dtype=float, copy=False, device="cpu")
# return a Warp copy of the array data on the GPU
a = wp.array(r, dtype=float, device="cuda")
Note that for multi-dimensional data, the dtype parameter must be specified explicitly, e.g.:
r = np.random.rand(1024, 3)
# initialize as an array of vec3 objects
a = wp.array(r, dtype=wp.vec3, device="cuda")
If the shapes are incompatible, an error will be raised.
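The compatibility rule can be sketched as follows. The sketch assumes that the trailing dimensions of the input data must equal the shape of the element type (length 3 for vec3, (3, 3) for mat33), and that the remaining leading dimensions become the Warp array's shape. The helper warp_array_shape is hypothetical. It models only the shape bookkeeping, not Warp's actual validation code.

```python
def warp_array_shape(data_shape, dtype_shape=()):
    """Hypothetical model of how a data shape maps to a Warp array shape.

    Assumes the trailing dimensions of the data must equal the element
    type's shape, e.g. (3,) for vec3 or (3, 3) for mat33, and that the
    leading dimensions become the shape of the Warp array.
    """
    data_shape, dtype_shape = tuple(data_shape), tuple(dtype_shape)
    k = len(dtype_shape)
    if k == 0:
        return data_shape  # scalar dtype: one array element per value
    if data_shape[-k:] != dtype_shape:
        raise ValueError(
            f"data shape {data_shape} is incompatible with element shape {dtype_shape}"
        )
    shape = data_shape[:-k]
    if not shape:
        raise ValueError("data must have at least one leading dimension")
    return shape


print(warp_array_shape((1024, 3), (3,)))     # (1024,): 1024 vec3 elements
print(warp_array_shape((10, 3, 3), (3, 3)))  # (10,): 10 mat33 elements
try:
    warp_array_shape((1024, 4), (3,))
except ValueError as err:
    print(err)  # a 4-wide row cannot become a vec3
```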
Warp arrays can also be constructed from objects that define the __cuda_array_interface__ attribute. For example:
import cupy
import warp as wp
wp.init()
device = wp.get_cuda_device()
r = cupy.arange(10)
# return a Warp array wrapper around the cupy data (zero-copy)
a = wp.array(r, device=device)
Arrays can be moved between devices using the array.to() method:
host_array = wp.array(a, dtype=float, device="cpu")
# allocate and copy to GPU
device_array = host_array.to("cuda")
Additionally, arrays can be copied directly between memory spaces:
src_array = wp.array(a, dtype=float, device="cpu")
# allocate the destination on the GPU with the same shape and type as the source
dest_array = wp.empty_like(src_array, device="cuda")
# copy from source CPU buffer to GPU
wp.copy(dest_array, src_array)
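wp.copy() also accepts dest_offset, src_offset, and count arguments, which are listed in its reference entry. Their element-wise semantics can be modeled on plain Python lists, as in this sketch. copy_elements is a hypothetical name. The bounds checks reflect the documented requirement that the ranges fit in both arrays, not Warp's exact error messages.

```python
def copy_elements(dest, src, dest_offset=0, src_offset=0, count=0):
    """Model of element-wise copy semantics on Python lists (hypothetical).

    count == 0 copies all elements of src; the ranges must fit in both arrays.
    """
    if count == 0:
        count = len(src)
    if src_offset + count > len(src):
        raise ValueError("source range out of bounds")
    if dest_offset + count > len(dest):
        raise ValueError("destination range out of bounds")
    dest[dest_offset:dest_offset + count] = src[src_offset:src_offset + count]
    return dest


print(copy_elements([0] * 6, [1, 2, 3]))  # [1, 2, 3, 0, 0, 0]
# copy src[1:3] into dest[4:6]
print(copy_elements([0] * 6, [1, 2, 3, 4], dest_offset=4, src_offset=1, count=2))  # [0, 0, 0, 0, 2, 3]
```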
- class warp.array(*args, **kwargs)
Constructs a new Warp array object.
When the data argument is a valid list, tuple, or ndarray, the array will be constructed from this object's data. For objects that are not stored sequentially in memory (e.g. a list), the data will first be flattened before being transferred to the memory space given by device.
The second construction path occurs when the ptr argument is a non-zero uint64 value representing the start address in memory where existing array data resides, e.g. from an external or C library. The memory allocation should reside on the same device given by the device argument, and the user should set the length and dtype parameters appropriately.
If neither data nor ptr is specified, the shape or length arguments are checked next. This construction path can be used to create new uninitialized arrays, but users are encouraged to call wp.empty(), wp.zeros(), or wp.full() instead to create new arrays.
If none of the above arguments are specified, a simple type annotation is constructed. This is used when annotating kernel arguments or struct members (e.g. arr: wp.array(dtype=float)). In this case, only dtype and ndim are taken into account and no memory is allocated for the array.
- Parameters:
data (Union[list, tuple, ndarray]) – An object to construct the array from; can be a tuple, list, or generally any type convertible to an np.array
dtype (Union) – One of the built-in types, e.g. warp.mat33; if dtype is Any and data is an ndarray, the dtype will be inferred from the array data type
shape (tuple) – Dimensions of the array
strides (tuple) – Number of bytes in each dimension between successive elements of the array
length (int) – Number of elements of the data type (deprecated, users should use the shape argument)
ptr (uint64) – Address of an external memory address to alias (data should be None)
capacity (int) – Maximum size in bytes of the ptr allocation (data should be None)
device (Devicelike) – Device the array lives on
copy (bool) – Whether the incoming data will be copied or aliased; aliasing is only possible when the incoming data already lives on the specified device and the types match
owner (bool) – Whether the array object should try to deallocate memory when it is deleted (deprecated; pass deleter if you wish to transfer ownership to Warp)
deleter (Callable) – Function to be called when deallocating the array, taking two arguments: pointer and size
requires_grad (bool) – Whether or not gradients will be tracked for this array; see warp.Tape for details
grad (array) – The gradient array to use
pinned (bool) – Whether to allocate pinned host memory, which allows asynchronous host-device transfers (only applicable with device="cpu")
- property grad
- property requires_grad
- zero_()
Zeroes out the array entries.
- fill_(value)
Sets all array entries to value.
- Parameters:
value – The value to set every array entry to. Must be convertible to the array's dtype.
- Raises:
ValueError – If value cannot be converted to the array's dtype.
Examples
fill_() can take lists or other sequences when filling arrays of vectors or matrices.
>>> arr = wp.zeros(2, dtype=wp.mat22)
>>> arr.numpy()
array([[[0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.]]], dtype=float32)
>>> arr.fill_([[1, 2], [3, 4]])
>>> arr.numpy()
array([[[1., 2.],
        [3., 4.]],

       [[1., 2.],
        [3., 4.]]], dtype=float32)
- assign(src)
Wraps src in a warp.array if it is not already one and copies the contents to self.
- numpy()
Converts the array to a numpy.ndarray (aliasing memory through the array interface protocol). If the array is on the GPU, a synchronous device-to-host copy (on the CUDA default stream) will be automatically performed to ensure that any outstanding work is completed.
- cptr()
Return a ctypes cast of the array address.
Notes:
Only CPU arrays support this method.
The array must be contiguous.
Accesses to this object are not bounds checked.
For float16 types, a pointer to the internal uint16 representation is returned.
- list()
Returns a flattened list of items in the array as a Python list.
- to(device, requires_grad=None)
Returns a Warp array with this array's data moved to the specified device; no-op if the array is already on that device.
- flatten()
Returns a zero-copy view of the array collapsed to 1-D. Only supported for contiguous arrays.
- reshape(shape)
Returns a reshaped array. Only supported for contiguous arrays.
- Parameters:
shape – An int or tuple of ints specifying the shape of the returned array.
- view(dtype)
Returns a zero-copy view of this array's memory with a different data type. dtype must have the same byte size as the array's native dtype.
- contiguous()
Returns a contiguous array with this array's data. No-op if the array is already contiguous.
- transpose(axes=None)
Returns a zero-copy view of the array with axes transposed.
Note: The transpose operation returns an array with a non-contiguous access pattern.
- Parameters:
axes (optional) – Specifies how the axes are permuted. If not specified, the axes order will be reversed.
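To see why transpose() produces a non-contiguous view, compare strides. The sketch below computes row-major byte strides for a shape and element size. It then permutes them the way a zero-copy transposed view would, and checks contiguity against freshly computed strides. It assumes C-order layout for newly allocated arrays and ignores the special case of size-1 dimensions. The helper names are hypothetical. This is a model, not Warp's implementation.

```python
def row_major_strides(shape, itemsize):
    """Byte strides of a C-order (row-major) array."""
    strides = []
    step = itemsize
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return tuple(reversed(strides))


def transposed(shape, strides, axes=None):
    """Shape and strides of a zero-copy transposed view (no data moves)."""
    if axes is None:
        axes = tuple(reversed(range(len(shape))))  # default: reverse the axes
    return tuple(shape[a] for a in axes), tuple(strides[a] for a in axes)


def is_contiguous(shape, strides, itemsize):
    """Simplified check: strides must match freshly computed row-major strides."""
    return tuple(strides) == row_major_strides(shape, itemsize)


shape, itemsize = (1024, 16), 4  # a 2D float32 array
strides = row_major_strides(shape, itemsize)
print(strides)                                      # (64, 4)
t_shape, t_strides = transposed(shape, strides)
print(t_shape, t_strides)                           # (16, 1024) (4, 64)
print(is_contiguous(t_shape, t_strides, itemsize))  # False: contiguous() would copy
```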
Multi-dimensional Arrays
Multi-dimensional arrays can be constructed by passing a tuple of sizes for each dimension, e.g.: the following constructs a 2d array of size 1024x16:
wp.zeros(shape=(1024, 16), dtype=float, device="cuda")
When passing multi-dimensional arrays to kernels, users must specify the expected array dimension inside the kernel signature. For example, to pass a 2D array to a kernel, the number of dimensions is specified using the ndim=2 parameter:
@wp.kernel
def test(input: wp.array(dtype=float, ndim=2)):
    i, j = wp.tid()
    value = input[i, j]
Type-hint helpers are provided for common array dimensionalities, e.g. array2d() and array3d(), which are equivalent to calling array(..., ndim=2), array(..., ndim=3), etc. To index a multi-dimensional array, use the following kernel syntax:
# returns a float from the 2d array
value = input[i, j]
To create an array slice, use the following syntax, where the number of indices is less than the number of array dimensions:
# returns a 1d array slice representing a row of the 2d array
row = input[i]
Slice operators can be chained, e.g. s = array[i][j][k]. Slices can be passed to wp.func user functions, provided the function also declares the expected array dimension. Currently, only single-index slicing is supported.
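Indexing and single-index slicing can both be read as byte-offset arithmetic. In the sketch below, input[i, j] reads the element at offset i*strides[0] + j*strides[1]. input[i] yields a view one dimension smaller that starts at offset i*strides[0] and keeps the remaining shape and strides. The View tuple and the helper functions are hypothetical and only illustrate the arithmetic, assuming a row-major float32 array.

```python
from collections import namedtuple

# Hypothetical description of an array view: byte offset, shape, strides.
View = namedtuple("View", ["offset", "shape", "strides"])


def element_offset(view, *index):
    """Byte offset of a fully indexed element, e.g. input[i, j]."""
    assert len(index) == len(view.shape), "need one index per dimension"
    return view.offset + sum(i * s for i, s in zip(index, view.strides))


def slice_first(view, i):
    """Single-index slice input[i]: drops the leading dimension."""
    assert 0 <= i < view.shape[0]
    return View(view.offset + i * view.strides[0], view.shape[1:], view.strides[1:])


arr = View(offset=0, shape=(1024, 16), strides=(64, 4))  # 2D float32 array
row = slice_first(arr, 3)
print(row)                        # View(offset=192, shape=(16,), strides=(4,))
print(element_offset(row, 5))     # 212
print(element_offset(arr, 3, 5))  # 212: row[5] and arr[3, 5] are the same element
```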
Note
Currently, Warp limits arrays to a maximum of 4 dimensions. This is in addition to the contained data type, which may be 1- or 2-dimensional for vector and matrix types such as vec3 and mat33.
The following construction methods are provided for allocating zero-initialized and empty (non-initialized) arrays:
- warp.zeros(shape=None, dtype=<class 'float'>, device=None, requires_grad=False, pinned=False, **kwargs)
Return a zero-initialized array
- Parameters:
shape (Tuple | None) – Array dimensions
dtype – Type of each element, e.g.: warp.vec3, warp.mat33, etc
device (Device | str | None) – Device that array will live on
requires_grad (bool) – Whether the array will be tracked for back propagation
pinned (bool) – Whether the array uses pinned host memory (only applicable to CPU arrays)
- Returns:
A warp.array object representing the allocation
- Return type:
array
- warp.zeros_like(src, device=None, requires_grad=None, pinned=None)
Return a zero-initialized array with the same type and dimension of another array
- Parameters:
src (array) – The template array to use for shape, data type, and device
device (Device | str | None) – The device where the new array will be created (defaults to src.device)
requires_grad (bool | None) – Whether the array will be tracked for back propagation
pinned (bool | None) – Whether the array uses pinned host memory (only applicable to CPU arrays)
- Returns:
A warp.array object representing the allocation
- Return type:
array
- warp.ones(shape=None, dtype=<class 'float'>, device=None, requires_grad=False, pinned=False, **kwargs)
Return a one-initialized array
- Parameters:
shape (Tuple | None) – Array dimensions
dtype – Type of each element, e.g.: warp.vec3, warp.mat33, etc
device (Device | str | None) – Device that array will live on
requires_grad (bool) – Whether the array will be tracked for back propagation
pinned (bool) – Whether the array uses pinned host memory (only applicable to CPU arrays)
- Returns:
A warp.array object representing the allocation
- Return type:
array
- warp.ones_like(src, device=None, requires_grad=None, pinned=None)
Return a one-initialized array with the same type and dimension of another array
- Parameters:
src (array) – The template array to use for shape, data type, and device
device (Device | str | None) – The device where the new array will be created (defaults to src.device)
requires_grad (bool | None) – Whether the array will be tracked for back propagation
pinned (bool | None) – Whether the array uses pinned host memory (only applicable to CPU arrays)
- Returns:
A warp.array object representing the allocation
- Return type:
array
- warp.full(shape=None, value=0, dtype=typing.Any, device=None, requires_grad=False, pinned=False, **kwargs)
Return an array with all elements initialized to the given value
- Parameters:
shape (Tuple | None) – Array dimensions
value – Element value
dtype – Type of each element, e.g.: float, warp.vec3, warp.mat33, etc
device (Device | str | None) – Device that array will live on
requires_grad (bool) – Whether the array will be tracked for back propagation
pinned (bool) – Whether the array uses pinned host memory (only applicable to CPU arrays)
- Returns:
A warp.array object representing the allocation
- Return type:
array
- warp.full_like(src, value, device=None, requires_grad=None, pinned=None)
Return an array with all elements initialized to the given value with the same type and dimension of another array
- Parameters:
src (array) – The template array to use for shape, data type, and device
value (Any) – Element value
device (Device | str | None) – The device where the new array will be created (defaults to src.device)
requires_grad (bool | None) – Whether the array will be tracked for back propagation
pinned (bool | None) – Whether the array uses pinned host memory (only applicable to CPU arrays)
- Returns:
A warp.array object representing the allocation
- Return type:
array
- warp.empty(shape=None, dtype=<class 'float'>, device=None, requires_grad=False, pinned=False, **kwargs)
Returns an uninitialized array
- Parameters:
shape (Tuple | None) – Array dimensions
dtype – Type of each element, e.g.: warp.vec3, warp.mat33, etc
device (Device | str | None) – Device that array will live on
requires_grad (bool) – Whether the array will be tracked for back propagation
pinned (bool) – Whether the array uses pinned host memory (only applicable to CPU arrays)
- Returns:
A warp.array object representing the allocation
- Return type:
array
- warp.empty_like(src, device=None, requires_grad=None, pinned=None)
Return an uninitialized array with the same type and dimension of another array
- Parameters:
src (array) – The template array to use for shape, data type, and device
device (Device | str | None) – The device where the new array will be created (defaults to src.device)
requires_grad (bool | None) – Whether the array will be tracked for back propagation
pinned (bool | None) – Whether the array uses pinned host memory (only applicable to CPU arrays)
- Returns:
A warp.array object representing the allocation
- Return type:
array
- warp.copy(dest, src, dest_offset=0, src_offset=0, count=0, stream=None)
Copy array contents from src to dest.
- Parameters:
dest (array) – Destination array, must be at least as big as source buffer
src (array) – Source array
dest_offset (int) – Element offset in the destination array
src_offset (int) – Element offset in the source array
count (int) – Number of array elements to copy (will copy all elements if set to 0)
stream (Stream | None) – The stream on which to perform the copy (optional)
The stream, if specified, can be from any device. If the stream is omitted, Warp selects a stream based on the following rules:
(1) If the destination array is on a CUDA device, use the current stream on the destination device.
(2) Otherwise, if the source array is on a CUDA device, use the current stream on the source device.
If neither the source nor the destination is on a CUDA device, no stream is used for the copy.
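The stream-selection rules for wp.copy() can be written as a small decision function. This is a plain-Python model. Devices are represented by strings such as "cpu" or "cuda:0", and current_stream is a hypothetical callback standing in for looking up a device's current stream.

```python
def select_copy_stream(dest_device, src_device, current_stream, stream=None):
    """Model of how a copy stream is chosen (hypothetical helper).

    Devices are strings like "cpu" or "cuda:0"; current_stream(device)
    stands in for looking up that device's current stream.
    """
    if stream is not None:
        return stream  # an explicit stream may come from any device
    if dest_device.startswith("cuda"):
        return current_stream(dest_device)  # rule (1): destination device
    if src_device.startswith("cuda"):
        return current_stream(src_device)   # rule (2): source device
    return None  # CPU-to-CPU copy: no stream is used


def fake_current_stream(device):
    return f"current stream on {device}"


print(select_copy_stream("cuda:1", "cuda:0", fake_current_stream))  # current stream on cuda:1
print(select_copy_stream("cpu", "cuda:0", fake_current_stream))     # current stream on cuda:0
print(select_copy_stream("cpu", "cpu", fake_current_stream))        # None
```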
- warp.clone(src, device=None, requires_grad=None, pinned=None)
Clone an existing array by allocating a copy of the src memory.
- Parameters:
src (array) – The source array to copy
device (Device | str | None) – The device where the new array will be created (defaults to src.device)
requires_grad (bool | None) – Whether the array will be tracked for back propagation
pinned (bool | None) – Whether the array uses pinned host memory (only applicable to CPU arrays)
- Returns:
A warp.array object representing the allocation
- Return type:
array
Matrix Multiplication
Warp 2D array multiplication is built on NVIDIA’s CUTLASS library, which enables fast matrix multiplication of large arrays on the GPU.
If no GPU is detected, matrix multiplication falls back to NumPy's implementation on the CPU.
Matrix multiplication is fully differentiable, and can be recorded on the tape like so:
tape = wp.Tape()
with tape:
    # wp.matmul takes no device argument; it runs where its arrays live
    wp.matmul(A, B, C, D)
    wp.launch(loss_kernel, dim=(m, n), inputs=[D, loss], device=device)
tape.backward(loss=loss)
A_grad = A.grad.numpy()
Using the @ operator (D = A @ B) will default to the same CUTLASS algorithm used in wp.matmul.
- warp.matmul(a, b, c, d, alpha=1.0, beta=0.0, allow_tf32x3_arith=False)
Computes a generic matrix-matrix multiplication (GEMM) of the form: d = alpha * (a @ b) + beta * c.
- Parameters:
a (array2d) – two-dimensional array containing matrix A
b (array2d) – two-dimensional array containing matrix B
c (array2d) – two-dimensional array containing matrix C
d (array2d) – two-dimensional array to which output D is written
alpha (float) – parameter alpha of GEMM
beta (float) – parameter beta of GEMM
allow_tf32x3_arith (bool) – whether to use CUTLASS’s 3xTF32 GEMMs, which enable accuracy similar to FP32 while using Tensor Cores
- warp.batched_matmul(a, b, c, d, alpha=1.0, beta=0.0, allow_tf32x3_arith=False)
Computes a batched generic matrix-matrix multiplication (GEMM) of the form: d = alpha * (a @ b) + beta * c.
- Parameters:
a (array3d) – three-dimensional array containing A matrices. Overall array dimension is {batch_count, M, K}
b (array3d) – three-dimensional array containing B matrices. Overall array dimension is {batch_count, K, N}
c (array3d) – three-dimensional array containing C matrices. Overall array dimension is {batch_count, M, N}
d (array3d) – three-dimensional array to which output D is written. Overall array dimension is {batch_count, M, N}
alpha (float) – parameter alpha of GEMM
beta (float) – parameter beta of GEMM
allow_tf32x3_arith (bool) – whether to use CUTLASS’s 3xTF32 GEMMs, which enable accuracy similar to FP32 while using Tensor Cores
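To check results on small inputs, the GEMM definition d = alpha * (a @ b) + beta * c can be written out directly. The pure-Python reference below computes both the single and the batched form on nested lists. It uses no CUTLASS and no Tensor Cores, and it is far slower than wp.matmul. The names gemm_reference and batched_gemm_reference are hypothetical.

```python
def gemm_reference(a, b, c, alpha=1.0, beta=0.0):
    """Return d = alpha * (a @ b) + beta * c for M x K, K x N, M x N nested lists."""
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    return [
        [alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
         for j in range(n)]
        for i in range(m)
    ]


def batched_gemm_reference(a, b, c, alpha=1.0, beta=0.0):
    """Apply gemm_reference to each (A, B, C) triple along the batch axis."""
    return [gemm_reference(ai, bi, ci, alpha, beta) for ai, bi, ci in zip(a, b, c)]


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]
print(gemm_reference(A, B, C))                        # [[19.0, 22.0], [43.0, 50.0]]
print(gemm_reference(A, B, C, alpha=2.0, beta=1.0))   # [[39.0, 45.0], [87.0, 101.0]]
```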
Data Types
Scalar Types
The following scalar storage types are supported for array structures:
Type | Description |
---|---|
bool | boolean |
int8 | signed byte |
uint8 | unsigned byte |
int16 | signed short |
uint16 | unsigned short |
int32 | signed integer |
uint32 | unsigned integer |
int64 | signed long integer |
uint64 | unsigned long integer |
float16 | half-precision float |
float32 | single-precision float |
float64 | double-precision float |
Warp supports float and int as aliases for wp.float32 and wp.int32, respectively.
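Element byte sizes matter for APIs such as array.view(), which requires the new dtype to have the same byte size as the original. The scalar types above correspond to standard fixed-width formats. The sketch below checks their sizes with Python's struct module, using standard sizes via the '<' prefix. The mapping from Warp type names to struct format codes is an assumption made for illustration.

```python
import struct

# Assumed mapping from Warp scalar type names to struct format codes.
WARP_SCALAR_FORMATS = {
    "bool": "?", "int8": "b", "uint8": "B",
    "int16": "h", "uint16": "H", "int32": "i", "uint32": "I",
    "int64": "q", "uint64": "Q",
    "float16": "e", "float32": "f", "float64": "d",
}


def scalar_size(name):
    """Byte size of a scalar type using struct's standard (little-endian) sizes."""
    return struct.calcsize("<" + WARP_SCALAR_FORMATS[name])


def can_view(src_dtype, dst_dtype):
    """A zero-copy view between scalar types needs equal byte sizes."""
    return scalar_size(src_dtype) == scalar_size(dst_dtype)


for name in WARP_SCALAR_FORMATS:
    print(f"{name:8s} {scalar_size(name)} byte(s)")

print(can_view("float32", "int32"))    # True
print(can_view("float32", "float64"))  # False
```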
Vectors
Warp provides built-in math and geometry types for common simulation and graphics problems. A full reference for operators and functions for these types is available in the Kernel Reference.
Warp supports vectors of numbers with an arbitrary length/numeric type. The built-in concrete types are as follows:
Type | Description |
---|---|
vec2, vec3, vec4 | 2D, 3D, 4D vector of single-precision floats |
vec2b, vec3b, vec4b | 2D, 3D, 4D vector of signed bytes |
vec2ub, vec3ub, vec4ub | 2D, 3D, 4D vector of unsigned bytes |
vec2s, vec3s, vec4s | 2D, 3D, 4D vector of signed shorts |
vec2us, vec3us, vec4us | 2D, 3D, 4D vector of unsigned shorts |
vec2i, vec3i, vec4i | 2D, 3D, 4D vector of signed integers |
vec2ui, vec3ui, vec4ui | 2D, 3D, 4D vector of unsigned integers |
vec2l, vec3l, vec4l | 2D, 3D, 4D vector of signed long integers |
vec2ul, vec3ul, vec4ul | 2D, 3D, 4D vector of unsigned long integers |
vec2h, vec3h, vec4h | 2D, 3D, 4D vector of half-precision floats |
vec2f, vec3f, vec4f | 2D, 3D, 4D vector of single-precision floats |
vec2d, vec3d, vec4d | 2D, 3D, 4D vector of double-precision floats |
spatial_vector | 6D vector of single-precision floats |
spatial_vectorf | 6D vector of single-precision floats |
spatial_vectord | 6D vector of double-precision floats |
spatial_vectorh | 6D vector of half-precision floats |
Vectors support most standard linear algebra operations, e.g.:
@wp.kernel
def compute( ... ):
    # basis vectors
    a = wp.vec3(1.0, 0.0, 0.0)
    b = wp.vec3(0.0, 1.0, 0.0)
    # take the cross product
    c = wp.cross(a, b)
    # compute the dot product (here, the squared length of c)
    r = wp.dot(c, c)
    ...
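The plain-Python sketch below reproduces the values the kernel above computes. It uses the standard component formulas for the 3D cross product and the dot product. The cross and dot functions here are hypothetical re-implementations for illustration, not Warp's built-ins.

```python
def cross(a, b):
    """3D cross product a x b."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Sum of component-wise products."""
    return sum(x * y for x, y in zip(a, b))


a = (1.0, 0.0, 0.0)
b = (0.0, 1.0, 0.0)
c = cross(a, b)
print(c)          # (0.0, 0.0, 1.0): the z basis vector
print(dot(c, c))  # 1.0, the squared length of c
```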
It’s possible to declare additional vector types with different lengths and data types. This is done outside of kernels, at Python scope, using warp.types.vector(), for example:
# declare a new vector type for holding 5 double precision floats:
vec5d = wp.types.vector(length=5, dtype=wp.float64)
Once declared, the new type can be used when allocating arrays or inside kernels:
# create an array of vec5d
arr = wp.zeros(10, dtype=vec5d)
# use inside a kernel
@wp.kernel
def compute( ... ):
    # zero initialize a custom named vector type
    v = vec5d()
    ...
    # component-wise initialize a named vector type
    v = vec5d(wp.float64(1.0),
              wp.float64(2.0),
              wp.float64(3.0),
              wp.float64(4.0),
              wp.float64(5.0))
    ...
In addition, it’s possible to directly create anonymously typed instances of these vectors without declaring their type in advance. In this case, the type will be inferred from the constructor arguments. For example:
@wp.kernel
def compute( ... ):
    # zero initialize vector of 5 doubles:
    v = wp.vector(dtype=wp.float64, length=5)
    # scalar initialize a vector of 5 doubles to the same value:
    v = wp.vector(wp.float64(1.0), length=5)
    # component-wise initialize a vector of 5 doubles
    v = wp.vector(wp.float64(1.0),
                  wp.float64(2.0),
                  wp.float64(3.0),
                  wp.float64(4.0),
                  wp.float64(5.0))
These can be used with all the standard vector arithmetic operators, e.g. +, -, and scalar multiplication, and can also be transformed using matrices with compatible dimensions, potentially returning vectors with a different length.
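The change in length follows from the shapes: an m x n matrix times an n-component vector gives an m-component vector. Below is a plain-Python sketch using row-major nested lists. mat_vec is a hypothetical helper, not a Warp function.

```python
def mat_vec(m, v):
    """Multiply a row-major matrix (list of rows) by a vector."""
    if any(len(row) != len(v) for row in m):
        raise ValueError("matrix column count must equal vector length")
    return [sum(a * b for a, b in zip(row, v)) for row in m]


m32 = [[1.0, 2.0],
       [3.0, 4.0],
       [5.0, 6.0]]       # 3 rows x 2 columns
v2 = [1.0, 1.0]
print(mat_vec(m32, v2))  # [3.0, 7.0, 11.0]: a 2-vector becomes a 3-vector
```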
Matrices
Matrices with arbitrary shapes/numeric types are also supported. The built-in concrete matrix types are as follows:
Type | Description |
---|---|
mat22, mat33, mat44 | 2x2, 3x3, 4x4 matrix of single-precision floats |
mat22f, mat33f, mat44f | 2x2, 3x3, 4x4 matrix of single-precision floats |
mat22d, mat33d, mat44d | 2x2, 3x3, 4x4 matrix of double-precision floats |
mat22h, mat33h, mat44h | 2x2, 3x3, 4x4 matrix of half-precision floats |
spatial_matrix | 6x6 matrix of single-precision floats |
spatial_matrixf | 6x6 matrix of single-precision floats |
spatial_matrixd | 6x6 matrix of double-precision floats |
spatial_matrixh | 6x6 matrix of half-precision floats |
Matrices are stored in row-major format and support most standard linear algebra operations:
@wp.kernel
def compute( ... ):
    # initialize matrix
    m = wp.mat22(1.0, 2.0,
                 3.0, 4.0)
    # compute inverse
    minv = wp.inverse(m)
    # transform vector
    v = minv * wp.vec2(0.5, 0.3)
    ...
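Row-major storage means the constructor arguments fill the matrix one row at a time, so wp.mat22(1.0, 2.0, 3.0, 4.0) has the first row (1, 2). The plain-Python sketch below mirrors the kernel above. It builds the matrix from a flat row-major list, inverts it with the closed-form 2x2 formula, and applies the inverse to (0.5, 0.3). The helper names are hypothetical.

```python
def mat22_from_row_major(values):
    """Build a 2x2 matrix (list of rows) from 4 values in row-major order."""
    a, b, c, d = values
    return [[a, b], [c, d]]


def inverse22(m):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ZeroDivisionError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def mat_vec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]


m = mat22_from_row_major([1.0, 2.0, 3.0, 4.0])
minv = inverse22(m)
print(m[0])                      # [1.0, 2.0]: the first row holds the first two values
print(minv)                      # [[-2.0, 1.0], [1.5, -0.5]]
v = mat_vec(minv, [0.5, 0.3])
print([round(x, 6) for x in v])  # [-0.7, 0.6]
```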
In a similar manner to vectors, it’s possible to declare new matrix types with arbitrary shapes and data types using wp.types.matrix(), for example:
# declare a new 3x2 half precision float matrix type:
mat32h = wp.types.matrix(shape=(3,2), dtype=wp.float16)
# create an array of this type
a = wp.zeros(10, dtype=mat32h)
These can be used inside a kernel:
@wp.kernel
def compute( ... ):
    ...
    # initialize a mat32h matrix
    m = mat32h(wp.float16(1.0), wp.float16(2.0),
               wp.float16(3.0), wp.float16(4.0),
               wp.float16(5.0), wp.float16(6.0))
    # declare a 2 component half precision vector
    v2 = wp.vec2h(wp.float16(1.0), wp.float16(1.0))
    # multiply by the matrix, returning a 3 component vector:
    v3 = m * v2
    ...
It’s also possible to directly create anonymously typed instances inside kernels, where the type is inferred from the constructor arguments as follows:
@wp.kernel
def compute( ... ):
    ...
    # create a 3x2 half precision matrix from components (row major ordering):
    m = wp.matrix(
        wp.float16(1.0), wp.float16(2.0),
        wp.float16(1.0), wp.float16(2.0),
        wp.float16(1.0), wp.float16(2.0),
        shape=(3,2))
    # zero initialize a 3x2 half precision matrix:
    m = wp.matrix(wp.float16(0.0), shape=(3,2))
    # create a 5x5 double precision identity matrix:
    m = wp.identity(n=5, dtype=wp.float64)
As with vectors, you can do standard matrix arithmetic with these variables, along with multiplying matrices with compatible shapes and potentially returning a matrix with a new shape.
Quaternions
Warp supports quaternions with the layout i, j, k, w, where w is the real part. Here are the built-in concrete quaternion types:
Type | Description |
---|---|
quat | Single-precision floating point quaternion |
quatf | Single-precision floating point quaternion |
quatd | Double-precision floating point quaternion |
quath | Half-precision floating point quaternion |
Quaternions can be used to transform vectors as follows:
@wp.kernel
def compute( ... ):
    ...
    # construct a 30 degree rotation around the x-axis (the angle is given in radians)
    q = wp.quat_from_axis_angle(wp.vec3(1.0, 0.0, 0.0), wp.radians(30.0))
    # rotate an axis by this quaternion
    v = wp.quat_rotate(q, wp.vec3(0.0, 1.0, 0.0))
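The rotation above can be checked in plain Python. This sketch assumes the i, j, k, w layout described above, stored as (x, y, z, w), a unit axis, and an angle in radians. It rotates a vector with the standard product q v q*, where q* is the conjugate. The function names mirror the Warp built-ins for readability, but they are hypothetical re-implementations, not Warp's code.

```python
import math


def quat_from_axis_angle(axis, angle):
    """Unit quaternion (x, y, z, w) for a rotation of `angle` radians about a unit axis."""
    s = math.sin(angle * 0.5)
    return (axis[0] * s, axis[1] * s, axis[2] * s, math.cos(angle * 0.5))


def quat_mul(a, b):
    """Hamilton product of quaternions stored as (x, y, z, w)."""
    ax, ay, az, aw = a
    bx, by, bz, bw = b
    return (aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw,
            aw * bw - ax * bx - ay * by - az * bz)


def quat_rotate(q, v):
    """Rotate vector v by unit quaternion q via q * (v, 0) * conj(q)."""
    conj = (-q[0], -q[1], -q[2], q[3])
    x, y, z, _ = quat_mul(quat_mul(q, (v[0], v[1], v[2], 0.0)), conj)
    return (x, y, z)


q = quat_from_axis_angle((1.0, 0.0, 0.0), math.radians(30.0))
v = quat_rotate(q, (0.0, 1.0, 0.0))
print([round(c, 4) for c in v])  # [0.0, 0.866, 0.5]: (0, cos 30, sin 30)
```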
As with vectors and matrices, you can declare quaternion types with an arbitrary numeric type like so:
quatd = wp.types.quaternion(dtype=wp.float64)
You can also create identity quaternions and anonymously typed instances inside a kernel like so:
@wp.kernel
def compute( ... ):
    ...
    # create a double precision identity quaternion:
    qd = wp.quat_identity(dtype=wp.float64)
    # precision defaults to wp.float32 so this creates a single precision identity quaternion:
    qf = wp.quat_identity()
    # create a half precision quaternion from components, or a vector/scalar:
    qh = wp.quaternion(wp.float16(0.0),
                       wp.float16(0.0),
                       wp.float16(0.0),
                       wp.float16(1.0))
    qh = wp.quaternion(
        wp.vector(wp.float16(0.0), wp.float16(0.0), wp.float16(0.0)),
        wp.float16(1.0))
Transforms
Transforms are 7D vectors of floats representing a spatial rigid body transformation in the format (p, q), where p is a 3D vector and q is a quaternion.
Type | Description |
---|---|
transform | Single-precision floating point transform |
transformf | Single-precision floating point transform |
transformd | Double-precision floating point transform |
transformh | Half-precision floating point transform |
Transforms can be constructed inside kernels from translation and rotation parts:
@wp.kernel
def compute( ... ):
    ...
    # create a transform from a vector/quaternion (the angle is given in radians):
    t = wp.transform(
        wp.vec3(1.0, 2.0, 3.0),
        wp.quat_from_axis_angle(wp.vec3(0.0, 1.0, 0.0), wp.radians(30.0)))
    # transform a point
    p = wp.transform_point(t, wp.vec3(10.0, 0.5, 1.0))
    # transform a vector (ignore translation)
    v = wp.transform_vector(t, wp.vec3(10.0, 0.5, 1.0))
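transform_point and transform_vector differ only in whether the translation p is applied. Below is a self-contained plain-Python sketch, assuming the (p, q) layout described above with q stored as (x, y, z, w). The rotation uses the expanded formula v' = v + w t + u x t, with t = 2 u x v, for a unit quaternion with vector part u. The function names are hypothetical re-implementations, not Warp's code.

```python
import math


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def rotate(q, v):
    """Rotate v by unit quaternion q = (x, y, z, w): v + w*t + u x t, with t = 2 u x v."""
    u, w = q[:3], q[3]
    t = tuple(2.0 * c for c in cross(u, v))
    ut = cross(u, t)
    return tuple(v[i] + w * t[i] + ut[i] for i in range(3))


def transform_point(xform, v):
    p, q = xform
    r = rotate(q, v)
    return tuple(r[i] + p[i] for i in range(3))  # rotation, then translation


def transform_vector(xform, v):
    _, q = xform
    return rotate(q, v)  # rotation only; translation is ignored


half = math.radians(30.0) * 0.5
q = (0.0, math.sin(half), 0.0, math.cos(half))  # 30 degrees about +y
xform = ((1.0, 2.0, 3.0), q)
pt = transform_point(xform, (10.0, 0.5, 1.0))
vec = transform_vector(xform, (10.0, 0.5, 1.0))
print([round(c, 3) for c in vec])                  # [9.16, 0.5, -4.134]
print([round(a - b, 6) for a, b in zip(pt, vec)])  # [1.0, 2.0, 3.0]: the translation
```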
As with vectors and matrices, you can declare transform types with an arbitrary numeric type using wp.types.transformation(), for example:
transformd = wp.types.transformation(dtype=wp.float64)
You can also create identity transforms and anonymously typed instances inside a kernel like so:
@wp.kernel
def compute( ... ):
    # create double precision identity transform:
    td = wp.transform_identity(dtype=wp.float64)
Structs
Users can define custom structure types using the @wp.struct decorator as follows:
@wp.struct
class MyStruct:
    param1: int
    param2: float
    param3: wp.array(dtype=wp.vec3)
Struct attributes must be annotated with their respective type. They can be constructed in Python scope and then passed to kernels as arguments:
@wp.kernel
def compute(args: MyStruct):
    tid = wp.tid()
    print(args.param1)
    print(args.param2)
    print(args.param3[tid])
# construct an instance of the struct in Python
s = MyStruct()
s.param1 = 10
s.param2 = 2.5
s.param3 = wp.zeros(shape=10, dtype=wp.vec3)
# pass to our compute kernel
wp.launch(compute, dim=10, inputs=[s])
An array of structs can be zero-initialized as follows:
a = wp.zeros(shape=10, dtype=MyStruct)
An array of structs can also be initialized from a list of struct objects:
a = wp.array([MyStruct(), MyStruct(), MyStruct()], dtype=MyStruct)
Example: Using a struct in gradient computation
import numpy as np
import warp as wp
wp.init()
@wp.struct
class TestStruct:
    x: wp.vec3
    a: wp.array(dtype=wp.vec3)
    b: wp.array(dtype=wp.vec3)
@wp.kernel
def test_kernel(s: TestStruct):
    tid = wp.tid()
    s.b[tid] = s.a[tid] + s.x
@wp.kernel
def loss_kernel(s: TestStruct, loss: wp.array(dtype=float)):
    tid = wp.tid()
    v = s.b[tid]
    wp.atomic_add(loss, 0, float(tid + 1) * (v[0] + 2.0 * v[1] + 3.0 * v[2]))
# create struct
ts = TestStruct()
# set members
ts.x = wp.vec3(1.0, 2.0, 3.0)
ts.a = wp.array(np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]), dtype=wp.vec3, requires_grad=True)
ts.b = wp.zeros(2, dtype=wp.vec3, requires_grad=True)
loss = wp.zeros(1, dtype=float, requires_grad=True)
tape = wp.Tape()
with tape:
    wp.launch(test_kernel, dim=2, inputs=[ts])
    wp.launch(loss_kernel, dim=2, inputs=[ts, loss])
tape.backward(loss)
print(loss)
# gradient of the loss with respect to the struct member a
print(ts.a.grad)
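The loss in this example is linear in each entry of a, so its gradient can be worked out by hand. With v = a[tid] + x, the loss is the sum over tid of (tid + 1) * (v[0] + 2 v[1] + 3 v[2]), so d loss / d a[tid] = (tid + 1) * (1, 2, 3). The plain-Python sketch below recomputes the loss for the same inputs and compares this analytic gradient with central finite differences. It does not call Warp, so it only checks the math that the tape is expected to reproduce. The function names are hypothetical.

```python
def loss_fn(a, x):
    """Plain-Python replica of test_kernel followed by loss_kernel."""
    total = 0.0
    for tid, row in enumerate(a):
        v = [row[k] + x[k] for k in range(3)]  # s.b[tid] = s.a[tid] + s.x
        total += float(tid + 1) * (v[0] + 2.0 * v[1] + 3.0 * v[2])
    return total


def finite_difference_grad(a, x, eps=1e-3):
    """Central differences of loss_fn with respect to every entry of a."""
    grad = []
    for i in range(len(a)):
        row_grad = []
        for k in range(3):
            plus = [r[:] for r in a]
            minus = [r[:] for r in a]
            plus[i][k] += eps
            minus[i][k] -= eps
            row_grad.append((loss_fn(plus, x) - loss_fn(minus, x)) / (2 * eps))
        grad.append(row_grad)
    return grad


x = [1.0, 2.0, 3.0]
a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
analytic = [[(tid + 1) * c for c in (1.0, 2.0, 3.0)] for tid in range(len(a))]
numeric = finite_difference_grad(a, x)

print(loss_fn(a, x))  # 120.0
print(analytic)       # [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
print(all(abs(n - g) < 1e-6 for nr, gr in zip(numeric, analytic) for n, g in zip(nr, gr)))  # True
```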
Type Conversions
Warp is particularly strict regarding type conversions and does not perform any implicit conversion between numeric types.
The user is responsible for ensuring that the operand types of most arithmetic operators match, e.g. x = float(0.0) + int(4) will result in an error.
This can be surprising for users who are accustomed to C-style conversions, but it avoids a class of common bugs that result from implicit conversions.
Note
Warp does not currently perform implicit type conversions between numeric types. Users should explicitly cast variables to compatible types using constructors like int(), float(), wp.float16(), wp.uint8(), etc.
Constants
In general, Warp kernels cannot access variables in the global Python interpreter state. One exception to this is for compile-time constants, which may be declared globally (or as class attributes) and folded into the kernel definition.
Constants are defined using the wp.constant() function. An example is shown below:
```python
TYPE_SPHERE = wp.constant(0)
TYPE_CUBE = wp.constant(1)
TYPE_CAPSULE = wp.constant(2)

@wp.kernel
def collide(geometry: wp.array(dtype=int)):
    t = geometry[wp.tid()]

    if t == TYPE_SPHERE:
        print("sphere")
    if t == TYPE_CUBE:
        print("cube")
    if t == TYPE_CAPSULE:
        print("capsule")
```
- class warp.constant(x)
Function to declare compile-time constants accessible from Warp kernels.
- Parameters:
x – Compile-time constant value; can be any of the built-in math types.
Predefined Constants
For convenience, Warp has a number of predefined mathematical constants that may be used both inside and outside Warp kernels.
The constants in the following table also have lowercase versions defined, e.g. wp.E and wp.e are equivalent.
| Name | Value |
|---|---|
| wp.E | 2.71828182845904523536 |
| wp.LOG2E | 1.44269504088896340736 |
| wp.LOG10E | 0.43429448190325182765 |
| wp.LN2 | 0.69314718055994530942 |
| wp.LN10 | 2.30258509299404568402 |
| wp.PHI | 1.61803398874989484820 |
| wp.PI | 3.14159265358979323846 |
| wp.HALF_PI | 1.57079632679489661923 |
| wp.TAU | 6.28318530717958647692 |
| wp.INF | math.inf |
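Each finite entry in the table is a standard closed-form quantity. As a quick cross-check, the plain-Python sketch below (not part of Warp) compares the listed values against their usual definitions from Python's math module; the pairing of names with formulas (e.g. LOG2E as log2(e), PHI as the golden ratio) is our reading of the constant names.

```python
import math

# Values copied from the table above (parsed as Python floats)
TABLE = {
    "E": 2.71828182845904523536,
    "LOG2E": 1.44269504088896340736,
    "LOG10E": 0.43429448190325182765,
    "LN2": 0.69314718055994530942,
    "LN10": 2.30258509299404568402,
    "PHI": 1.61803398874989484820,
    "PI": 3.14159265358979323846,
    "HALF_PI": 1.57079632679489661923,
    "TAU": 6.28318530717958647692,
}

# The usual mathematical definition of each constant
DEFINITIONS = {
    "E": math.e,
    "LOG2E": math.log2(math.e),             # log base 2 of e, i.e. 1 / ln 2
    "LOG10E": math.log10(math.e),           # log base 10 of e, i.e. 1 / ln 10
    "LN2": math.log(2.0),
    "LN10": math.log(10.0),
    "PHI": (1.0 + math.sqrt(5.0)) / 2.0,    # golden ratio
    "PI": math.pi,
    "HALF_PI": math.pi / 2.0,
    "TAU": math.tau,                        # 2 * pi
}

mismatches = [name for name, value in TABLE.items()
              if not math.isclose(value, DEFINITIONS[name], rel_tol=1e-12)]
print(mismatches)  # []
```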
The following example shows how positive and negative infinity can be used with floating-point types in Warp using the wp.inf constant:
```python
@wp.kernel
def test_infinity(outputs: wp.array(dtype=wp.float32)):
    outputs[0] = wp.float32(wp.inf)          # inf
    outputs[1] = wp.float32(-wp.inf)         # -inf
    outputs[2] = wp.float32(2.0 * wp.inf)    # inf
    outputs[3] = wp.float32(-2.0 * wp.inf)   # -inf
    outputs[4] = wp.float32(2.0 / 0.0)       # inf
    outputs[5] = wp.float32(-2.0 / 0.0)      # -inf
```
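Since wp.INF is listed as math.inf, multiplying it by a finite value behaves the same way in Python scope as in the kernel above. Division by zero is the exception worth remembering: the example above reports inf for 2.0 / 0.0 inside a kernel, but ordinary CPython raises ZeroDivisionError for the same expression. A minimal plain-Python sketch of that contrast:

```python
import math

# Python-scope arithmetic with infinity follows ordinary float semantics
pos = 2.0 * math.inf
neg = -2.0 * math.inf
print(pos, neg)  # inf -inf

# Unlike inside the kernel, CPython does not return inf for float division by zero
try:
    2.0 / 0.0
    raised = False
except ZeroDivisionError:
    raised = True
print(raised)  # True
```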
Operators
Boolean Operators
| Operator | Description |
|---|---|
| a and b | True if a and b are True |
| a or b | True if a or b is True |
| not a | True if a is False, otherwise False |
Note
Expressions such as if (a and b): currently do not perform short-circuit evaluation. In this case b will also be evaluated even when a is False.
Users should take care to ensure that secondary conditions are safe to evaluate (e.g. do not index out of bounds) in all cases.
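Python's own and does short-circuit, so the sketch below imitates the kernel behavior with a hypothetical helper, eager_and, whose arguments are both evaluated before the call. It shows why a bounds check combined with an index in one condition is unsafe without short-circuiting, and the nested-if pattern that only reads the element after the check has passed; in a kernel, the same nesting would be written with two if statements. The helper names are illustrative, not part of Warp.

```python
def eager_and(a, b):
    # Both operands were already evaluated by the caller, mimicking a
    # non-short-circuiting `a and b`
    return a and b


def unsafe_positive(values, i):
    # values[i] is evaluated even when the bounds check fails
    return eager_and(i < len(values), values[i] > 0.0)


def safe_positive(values, i):
    # Nested ifs: values[i] is only read after the bounds check passes
    if i < len(values):
        if values[i] > 0.0:
            return True
    return False


values = [1.0, -2.0, 3.0]
print(safe_positive(values, 0), safe_positive(values, 1), safe_positive(values, 3))  # True False False

try:
    unsafe_positive(values, 3)
except IndexError:
    print("unsafe_positive(values, 3) raised IndexError")
```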
Comparison Operators
| Operator | Description |
|---|---|
| a > b | True if a is strictly greater than b |
| a < b | True if a is strictly less than b |
| a >= b | True if a is greater than or equal to b |
| a <= b | True if a is less than or equal to b |
| a == b | True if a equals b |
| a != b | True if a is not equal to b |
Arithmetic Operators
| Operator | Description |
|---|---|
| a + b | Addition |
| a - b | Subtraction |
| a * b | Multiplication |
| a / b | Floating point division |
| a // b | Floored division |
| a ** b | Exponentiation |
| a % b | Modulus |
Note
Since implicit conversions are not performed, the argument types of operators should match. Users should use type constructors, e.g. float(), int(), wp.int64(), etc., to cast variables to the correct type. Also note that the multiplication expression a * b is used to represent both scalar multiplication and matrix multiplication. The @ operator is not currently supported.
Graphs
Launching kernels from Python introduces significant additional overhead compared to C++ or native programs. To address this, Warp exposes the concept of CUDA graphs to allow recording large batches of kernels and replaying them with very little CPU overhead.
To record a series of kernel launches, use the wp.capture_begin() and wp.capture_end() API as follows:
```python
# begin capture
wp.capture_begin(device="cuda")

try:
    # record launches (launch dimension assumed here to match the shape of a)
    for i in range(100):
        wp.launch(kernel=compute1, dim=a.shape, inputs=[a, b], device="cuda")
finally:
    # end capture and return a graph object
    graph = wp.capture_end(device="cuda")
```
We strongly recommend using the try-finally pattern when capturing graphs, because the finally statement ensures that wp.capture_end() gets called even if an exception occurs during capture, which would otherwise trap the stream in a capturing state.
Once a graph has been constructed, it can be executed:
```python
wp.capture_launch(graph)
```
The wp.ScopedCapture context manager can be used to simplify the code and ensure that wp.capture_end() is called regardless of exceptions:
```python
with wp.ScopedCapture(device="cuda") as capture:
    # record launches (launch dimension assumed here to match the shape of a)
    for i in range(100):
        wp.launch(kernel=compute1, dim=a.shape, inputs=[a, b], device="cuda")

wp.capture_launch(capture.graph)
```
Note that only launch calls are recorded in the graph; any Python code executed outside of the kernels will not be recorded. Typically, it is only beneficial to use CUDA graphs when the graph will be reused or launched multiple times.
- warp.capture_begin(device=None, stream=None, force_module_load=None, external=False)
Begin capture of a CUDA graph.
Captures all subsequent kernel launches and memory operations on CUDA devices. This can be used to record large numbers of kernels and replay them with low overhead.
If device is specified, the capture will begin on the CUDA stream currently associated with the device. If stream is specified, the capture will begin on the given stream. If both are omitted, the capture will begin on the current stream of the current device.
- Parameters:
device (Device | str | None) – The CUDA device to capture on
stream – The CUDA stream to capture on
force_module_load – Whether or not to force loading of all kernels before capture. In general it is better to use load_module() to selectively load kernels. When running with CUDA drivers that support CUDA 12.3 or newer, this option is not recommended to be set to True, because kernels can be loaded during graph capture on more recent drivers. If this argument is None, then the behavior inherits from wp.config.enable_graph_capture_module_load_by_default if the driver is older than CUDA 12.3.
external – Whether the capture was already started externally
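- warp.capture_end(device=None, stream=None)
Ends the capture of a CUDA graph.
- Parameters:
device (Device | str | None) – The CUDA device where capture began
stream – The CUDA stream where capture began
- Returns:
A Graph object that can be launched with capture_launch()
- Return type:
Graph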
- warp.capture_launch(graph, stream=None)
Launch a previously captured CUDA graph.
- Parameters:
graph (Graph) – A Graph as returned by capture_end()
stream (Stream | None) – A Stream to launch the graph on (optional)
- class warp.ScopedCapture(device=None, stream=None, force_module_load=None, external=False)
Meshes
Warp provides a wp.Mesh class to manage triangle mesh data. To create a mesh, users provide a points array, an indices array, and optionally a velocities array:
```python
mesh = wp.Mesh(points, indices, velocities)
```
Note
Mesh objects maintain references to their input geometry buffers. All buffers should live on the same device.
Meshes can be passed to kernels using their id attribute, a uint64 value that uniquely identifies the mesh.
Once inside a kernel you can perform geometric queries against the mesh such as ray-casts or closest point lookups:
```python
@wp.kernel
def raycast(mesh: wp.uint64,
            ray_origin: wp.array(dtype=wp.vec3),
            ray_dir: wp.array(dtype=wp.vec3),
            ray_hit: wp.array(dtype=wp.vec3)):

    tid = wp.tid()

    t = float(0.0)      # hit distance along ray
    u = float(0.0)      # hit face barycentric u
    v = float(0.0)      # hit face barycentric v
    sign = float(0.0)   # hit face sign
    n = wp.vec3()       # hit face normal
    f = int(0)          # hit face index

    color = wp.vec3()

    # ray cast against the mesh
    if wp.mesh_query_ray(mesh, ray_origin[tid], ray_dir[tid], 1.e+6, t, u, v, sign, n, f):

        # if we got a hit then set color to the face normal
        color = n*0.5 + wp.vec3(0.5, 0.5, 0.5)

    ray_hit[tid] = color
```
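To make the outputs t, u, v and n concrete, the plain-Python sketch below implements Möller–Trumbore, a standard ray/triangle intersection test. It is not a claim about how wp.mesh_query_ray is implemented: it handles a single triangle, does not compute the sign output, and uses the convention hit point = (1 - u - v) * v0 + u * v1 + v * v2, which may differ from Warp's barycentric convention. The function names are hypothetical.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_triangle(origin, direction, v0, v1, v2, max_t, eps=1e-12):
    """Return (t, u, v, unit normal) for a hit with 0 <= t <= max_t, else None."""
    e1 = sub(v1, v0)
    e2 = sub(v2, v0)
    pvec = cross(direction, e2)
    det = dot(e1, pvec)
    if abs(det) < eps:
        return None  # ray is parallel to the triangle's plane
    inv_det = 1.0 / det
    tvec = sub(origin, v0)
    u = dot(tvec, pvec) * inv_det
    if u < 0.0 or u > 1.0:
        return None
    qvec = cross(tvec, e1)
    v = dot(direction, qvec) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, qvec) * inv_det  # distance along the ray
    if t < 0.0 or t > max_t:
        return None
    n = cross(e1, e2)
    length = dot(n, n) ** 0.5
    return t, u, v, (n[0] / length, n[1] / length, n[2] / length)


tri = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
print(ray_triangle((0.25, 0.25, -1.0), (0.0, 0.0, 1.0), *tri, 1.0e6))  # (1.0, 0.25, 0.25, (0.0, 0.0, 1.0))
print(ray_triangle((2.0, 2.0, -1.0), (0.0, 0.0, 1.0), *tri, 1.0e6))    # None
```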
Users may update mesh vertex positions at runtime simply by modifying the points buffer.
After modifying point locations, users should call Mesh.refit() to rebuild the bounding volume hierarchy (BVH) structure and ensure that queries work correctly.
Note
Updating Mesh topology (indices) at runtime is not currently supported. Users should instead create a new Mesh object.
- class warp.Mesh(points=None, indices=None, velocities=None, support_winding_number=False)
Class representing a triangle mesh.
- id
Unique identifier for this mesh object, can be passed to kernels.
- device
Device this object lives on; all buffers must live on the same device.
- Parameters:
points (warp.array) – Array of vertex positions of type warp.vec3
indices (warp.array) – Array of triangle indices of type warp.int32; should be a 1D array with shape (num_tris * 3)
velocities (warp.array) – Array of vertex velocities of type warp.vec3 (optional)
support_winding_number (bool) – If true, the mesh will build additional data structures to support wp.mesh_query_point_sign_winding_number() queries
- refit()
Refit the BVH to points. This should be called after users modify the points data.
Hash Grids
Many particle-based simulation methods, such as the Discrete Element Method (DEM) or Smoothed Particle Hydrodynamics (SPH), involve iterating over spatial neighbors to compute force interactions. Hash grids are a well-established data structure for accelerating these nearest-neighbor queries and are particularly well suited to the GPU.
To support spatial neighbor queries, Warp provides a HashGrid object that may be created as follows:
```python
grid = wp.HashGrid(dim_x=128, dim_y=128, dim_z=128, device="cuda")
grid.build(points=p, radius=r)
```
Here p is an array of wp.vec3 point positions, and r is the radius to use when building the grid.
Neighbors can then be iterated over inside the kernel code using wp.hash_grid_query() and wp.hash_grid_query_next() as follows:
```python
@wp.kernel
def sum(grid: wp.uint64,
        points: wp.array(dtype=wp.vec3),
        output: wp.array(dtype=wp.vec3),
        radius: float):

    tid = wp.tid()

    # query point
    p = points[tid]

    # create grid query around point
    query = wp.hash_grid_query(grid, p, radius)
    index = int(0)

    sum = wp.vec3()

    while(wp.hash_grid_query_next(query, index)):

        neighbor = points[index]

        # compute distance to neighbor point
        dist = wp.length(p-neighbor)
        if (dist <= radius):
            sum += neighbor

    output[tid] = sum
```
Note
The HashGrid query will return all points in cells that fall inside the query radius. Because those cells are not entirely covered by the query radius, and because hash conflicts can map distant cells to the same bucket, some returned points may lie outside the query radius, so users should check the distance themselves inside their kernels. The query does not perform this check for each returned point because kernels commonly compute the distance themselves, so it would be redundant to compute the distance twice.
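The sketch below illustrates why candidates must be filtered. It is plain Python and assumes a simple modulo hash into a fixed-size table of dims cells, with cell edge length equal to the build radius; Warp's actual hashing and traversal may differ, and cell_of, bucket_of, build and query_candidates are hypothetical helpers. Point 3 lies far away but wraps into the same bucket as point 0 (a hash conflict), and point 2 sits in a visited neighboring cell yet outside the radius; only the distance check removes both.

```python
import math


def cell_of(p, cell_size):
    """Integer cell coordinates containing point p."""
    return tuple(int(math.floor(c / cell_size)) for c in p)


def bucket_of(cell, dims):
    """Wrap a (possibly negative or distant) cell into the fixed-size table."""
    return tuple(c % d for c, d in zip(cell, dims))


def build(points, cell_size, dims):
    table = {}
    for index, p in enumerate(points):
        table.setdefault(bucket_of(cell_of(p, cell_size), dims), []).append(index)
    return table


def query_candidates(table, p, radius, cell_size, dims):
    """All points stored in the buckets of cells overlapping the cube around p."""
    lo = cell_of([c - radius for c in p], cell_size)
    hi = cell_of([c + radius for c in p], cell_size)
    found = []
    for x in range(lo[0], hi[0] + 1):
        for y in range(lo[1], hi[1] + 1):
            for z in range(lo[2], hi[2] + 1):
                for index in table.get(bucket_of((x, y, z), dims), []):
                    if index not in found:
                        found.append(index)
    return found


points = [(0.5, 0.5, 0.5), (1.2, 0.5, 0.5), (1.9, 1.9, 1.9), (4.5, 0.5, 0.5)]
dims = (4, 4, 4)
cell_size = 1.0
table = build(points, cell_size, dims)

query = points[0]
candidates = query_candidates(table, query, 1.0, cell_size, dims)
# the same distance check the kernel above performs
neighbors = [i for i in candidates if math.dist(query, points[i]) <= 1.0]
print(sorted(candidates))  # [0, 1, 2, 3]
print(sorted(neighbors))   # [0, 1]
```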
- class warp.HashGrid(dim_x, dim_y, dim_z, device=None)
Class representing a hash grid object for accelerated point queries.
- id
Unique identifier for this hash grid object, can be passed to kernels.
- device
Device this object lives on; all buffers must live on the same device.
- Parameters:
dim_x, dim_y, dim_z (int) – Number of cells along the x, y, and z axes of the hash grid
device – The device to create the hash grid on
- build(points, radius)
Updates the hash grid data structure.
This method rebuilds the underlying data structure and should be called any time the set of points changes.
- Parameters:
points (warp.array) – Array of points of type warp.vec3
radius (float) – The cell size to use for bucketing points; cells are cubes with edges of this width. For best performance, the radius used to construct the grid should closely match the radius used when performing queries.
Volumes
Sparse volumes are incredibly useful for representing grid data over large domains, such as signed distance fields (SDFs) for complex objects, or velocities for large-scale fluid flow. Warp supports reading sparse volumetric grids stored using the NanoVDB standard. Users can access voxels directly or use built-in closest-point or trilinear interpolation to sample grid data from world or local space.
Volume objects can be created directly from Warp arrays containing a NanoVDB grid, from the contents of a standard .nvdb file using load_from_nvdb(), or from a dense 3D NumPy array using load_from_numpy().
Volumes can also be created using allocate() or allocate_by_tiles(). The values for a Volume object can be modified in a Warp kernel using wp.volume_store_f(), wp.volume_store_v(), and wp.volume_store_i().
Note
Warp does not currently support modifying the topology of sparse volumes at runtime.
Below we give an example of creating a Volume object from an existing NanoVDB file:
```python
# open NanoVDB file on disk; the with-statement closes it after loading
with open("mygrid.nvdb", "rb") as file:
    # create Volume object
    volume = wp.Volume.load_from_nvdb(file, device="cpu")
```
Note
Files written by the NanoVDB library, commonly marked by the .nvdb extension, can contain multiple grids with various compression methods, but a Volume object represents a single NanoVDB grid; therefore, only files with a single grid are supported. NanoVDB’s uncompressed and zip-compressed file formats are supported.
To sample the volume inside a kernel we pass a reference to it by ID, and use the built-in sampling modes:
```python
@wp.kernel
def sample_grid(volume: wp.uint64,
                points: wp.array(dtype=wp.vec3),
                samples: wp.array(dtype=float)):

    tid = wp.tid()

    # load sample point in world-space
    p = points[tid]

    # transform position to the volume's local-space
    q = wp.volume_world_to_index(volume, p)

    # sample volume with trilinear interpolation
    f = wp.volume_sample_f(volume, q, wp.Volume.LINEAR)

    # write result
    samples[tid] = f
```
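The two sampling modes, Volume.CLOSEST and Volume.LINEAR, can be illustrated on a small dense grid. The plain-Python sketch below is not Warp's implementation: it assumes voxel values sit at integer index-space coordinates (the kind of coordinates wp.volume_world_to_index produces), ignores sparsity and background values, and only handles points strictly inside the grid. sample_closest and sample_linear are hypothetical helpers. A linear function is used so the trilinear result can be checked exactly.

```python
import math


def sample_closest(grid, q):
    """Nearest-neighbor lookup: round each index-space coordinate to the nearest voxel."""
    i, j, k = (int(math.floor(c + 0.5)) for c in q)
    return grid[i][j][k]


def sample_linear(grid, q):
    """Trilinear interpolation of the 8 voxels surrounding index-space point q."""
    i0, j0, k0 = (int(math.floor(c)) for c in q)
    fx, fy, fz = q[0] - i0, q[1] - j0, q[2] - k0
    total = 0.0
    for di, wx in ((0, 1.0 - fx), (1, fx)):
        for dj, wy in ((0, 1.0 - fy), (1, fy)):
            for dk, wz in ((0, 1.0 - fz), (1, fz)):
                total += wx * wy * wz * grid[i0 + di][j0 + dj][k0 + dk]
    return total


# dense 3x3x3 grid storing f(i, j, k) = i + 2j + 3k
grid = [[[float(i + 2 * j + 3 * k) for k in range(3)] for j in range(3)] for i in range(3)]

print(sample_linear(grid, (0.5, 0.25, 0.75)))  # 3.25
print(sample_closest(grid, (0.4, 0.6, 1.2)))   # 5.0
```

Trilinear interpolation reproduces linear functions exactly, which is why the first sample equals 0.5 + 2 * 0.25 + 3 * 0.75.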
- class warp.Volume(data)
Class representing a sparse grid.
- Parameters:
data (warp.array) – Array of bytes representing the volume in NanoVDB format
- CLOSEST = 0
Enum value to specify nearest-neighbor interpolation during sampling
- LINEAR = 1
Enum value to specify trilinear interpolation during sampling
- classmethod load_from_nvdb(file_or_buffer, device=None)
Creates a Volume object from a NanoVDB file or in-memory buffer.
- Returns:
A warp.Volume object.
- Return type:
Volume
- classmethod load_from_numpy(ndarray, min_world=(0.0, 0.0, 0.0), voxel_size=1.0, bg_value=0.0, device=None)
Creates a Volume object from a dense 3D NumPy array.
This function is only supported for CUDA devices.
- Parameters:
min_world – The 3D coordinate of the lower corner of the volume.
voxel_size – The size of each voxel in spatial coordinates.
bg_value – Background value
device – The CUDA device to create the volume on, e.g.: “cuda” or “cuda:0”.
ndarray (array) – The dense 3D NumPy array to convert into a volume
- Returns:
A warp.Volume object.
- Return type:
Volume
- classmethod allocate(min, max, voxel_size, bg_value=0.0, translation=(0.0, 0.0, 0.0), points_in_world_space=False, device=None)#
Allocate a new Volume based on the bounding box defined by min and max.
This function is only supported for CUDA devices.
Allocate a volume that is large enough to contain voxels [min[0], min[1], min[2]] - [max[0], max[1], max[2]], inclusive. If points_in_world_space is true, then min and max are first converted to index space with the given voxel size and translation, and the volume is allocated with those.
The smallest unit of allocation is a dense tile of 8x8x8 voxels; the requested bounding box is rounded up to tiles, and the resulting tiles will be available in the new volume.
- Parameters:
min (array-like) – Lower 3D coordinates of the bounding box in index space or world space, inclusive.
max (array-like) – Upper 3D coordinates of the bounding box in index space or world space, inclusive.
voxel_size (float) – Voxel size of the new volume.
bg_value (float or array-like) – Value of unallocated voxels of the volume; also defines the volume’s type: a warp.vec3 volume is created if this is array-like, otherwise a float volume is created
translation (array-like) – Translation between the index and world spaces.
device (Devicelike) – The CUDA device to create the volume on, e.g.: “cuda” or “cuda:0”.
- Return type:
Volume
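The tile rounding described for allocate can be sketched in plain Python. This assumes index-space bounds and tiles aligned to multiples of 8 in index space; tiles_for_bbox is a hypothetical helper, and Warp's internal computation may differ in detail. Floor division also rounds negative coordinates down to the tile that contains them.

```python
TILE = 8  # tile edge length in voxels, per the text above


def tiles_for_bbox(lo, hi):
    """Index-space origins of the 8x8x8 tiles covering voxels lo..hi inclusive."""
    t_lo = [c // TILE for c in lo]  # floor division: -1 maps to tile -1
    t_hi = [c // TILE for c in hi]
    return [
        (x * TILE, y * TILE, z * TILE)
        for x in range(t_lo[0], t_hi[0] + 1)
        for y in range(t_lo[1], t_hi[1] + 1)
        for z in range(t_lo[2], t_hi[2] + 1)
    ]


tiles = tiles_for_bbox((0, 0, 0), (15, 7, 8))
print(tiles)                 # [(0, 0, 0), (0, 0, 8), (8, 0, 0), (8, 0, 8)]
print(len(tiles) * TILE**3)  # 2048
```

A box of 16 x 8 x 9 voxels therefore occupies four tiles, or 2048 allocated voxels, because the single extra layer in z requires a whole additional tile.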
- classmethod allocate_by_tiles(tile_points, voxel_size, bg_value=0.0, translation=(0.0, 0.0, 0.0), device=None)
Allocate a new Volume with active tiles for each point in tile_points.
This function is only supported for CUDA devices.
The smallest unit of allocation is a dense tile of 8x8x8 voxels. This is the primary method for allocating sparse volumes. It uses an array of points indicating the tiles that must be allocated.
- Example use cases:
tile_points can mark tiles directly in index space, as is the case when this method is called by allocate.
tile_points can be a list of points used in a simulation that needs to transfer data to a volume.
- Parameters:
tile_points (warp.array) – Array of positions that define the tiles to be allocated. The array can be a 2D, N-by-3 array of warp.int32 values, indicating index space positions, or a 1D array of warp.vec3 values, indicating world space positions. Repeated points per tile are allowed and will be efficiently deduplicated.
voxel_size (float) – Voxel size of the new volume.
bg_value (float or array-like) – Value of unallocated voxels of the volume; also defines the volume’s type: a warp.vec3 volume is created if this is array-like, otherwise a float volume is created
translation (array-like) – Translation between the index and world spaces.
device (Devicelike) – The CUDA device to create the volume on, e.g.: “cuda” or “cuda:0”.
- Return type:
Volume
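The deduplication of repeated points per tile can be pictured as mapping each index-space point to the tile that contains it and keeping the unique results. The plain-Python sketch below assumes integer index-space points and 8-voxel tiles aligned at multiples of 8; world-space vec3 points would first need converting to index space. unique_tiles is a hypothetical helper, not Warp's implementation.

```python
TILE = 8  # tile edge length in voxels


def unique_tiles(points):
    """Deduplicate index-space points into the sorted set of tiles that contain them."""
    return sorted({(x // TILE, y // TILE, z // TILE) for x, y, z in points})


points = [(0, 0, 0), (7, 7, 7), (8, 0, 0), (3, 9, 1), (-1, 0, 0)]
print(unique_tiles(points))  # [(-1, 0, 0), (0, 0, 0), (0, 1, 0), (1, 0, 0)]
```

The first two points fall in the same tile, so five points activate only four tiles.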
See also
Reference for the volume functions available in kernels.
Bounding Volume Hierarchies (BVH)
The wp.Bvh class can be used to create a BVH for a group of bounding volumes. This object can then be traversed to determine which parts are intersected by a ray using bvh_query_ray() and which parts are fully contained within a certain bounding volume using bvh_query_aabb().
The following snippet demonstrates how to create a wp.Bvh object from 100 random bounding volumes:
```python
import numpy as np
import warp as wp

rng = np.random.default_rng(123)

num_bounds = 100
lowers = rng.random(size=(num_bounds, 3)) * 5.0
uppers = lowers + rng.random(size=(num_bounds, 3)) * 5.0

device_lowers = wp.array(lowers, dtype=wp.vec3, device="cuda:0")
device_uppers = wp.array(uppers, dtype=wp.vec3, device="cuda:0")

bvh = wp.Bvh(device_lowers, device_uppers)
```
- class warp.Bvh(lowers, uppers)
Class representing a bounding volume hierarchy.
- id
Unique identifier for this bvh object, can be passed to kernels.
- device
Device this object lives on; all buffers must live on the same device.
- Parameters:
lowers (warp.array) – Array of lower bounds of type warp.vec3
uppers (warp.array) – Array of upper bounds of type warp.vec3
- refit()
Refit the BVH. This should be called after users modify the lowers and uppers arrays.
Example: BVH Ray Traversal
An example of performing a ray traversal on the data structure is as follows:
```python
@wp.kernel
def bvh_query_ray(
    bvh_id: wp.uint64,
    start: wp.vec3,
    dir: wp.vec3,
    bounds_intersected: wp.array(dtype=wp.bool),
):
    query = wp.bvh_query_ray(bvh_id, start, dir)
    bounds_nr = wp.int32(0)

    while wp.bvh_query_next(query, bounds_nr):
        # The ray intersects the volume with index bounds_nr
        bounds_intersected[bounds_nr] = True


bounds_intersected = wp.zeros(shape=(num_bounds), dtype=wp.bool, device="cuda:0")
query_start = wp.vec3(0.0, 0.0, 0.0)
query_dir = wp.normalize(wp.vec3(1.0, 1.0, 1.0))

wp.launch(
    kernel=bvh_query_ray,
    dim=1,
    inputs=[bvh.id, query_start, query_dir, bounds_intersected],
    device="cuda:0",
)
```
The Warp kernel bvh_query_ray is launched with a single thread and is given the unique uint64 identifier of the wp.Bvh object, parameters describing the ray, and an array to store the results.
In bvh_query_ray, wp.bvh_query_ray() is called once to obtain a query object that is stored in the variable query. An integer, bounds_nr, is also allocated to store the volume index during the traversal.
A while statement performs the actual traversal using wp.bvh_query_next(), which returns True as long as there are intersecting bounds.
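For intuition about what the traversal computes, the plain-Python sketch below applies the slab method, a standard ray/axis-aligned-box test, to every box by brute force; a BVH returns the same set of boxes while skipping most of the tests. This is not Warp's implementation, and the sketch only counts hits at t >= 0 (a box containing the origin counts, a box behind it does not); Warp's exact conventions may differ. ray_hits_aabb and brute_force_ray_query are hypothetical helpers.

```python
import math


def ray_hits_aabb(origin, direction, lower, upper):
    """Slab test: True if origin + t * direction meets the box for some t >= 0."""
    t_enter, t_exit = 0.0, math.inf
    for o, d, lo, hi in zip(origin, direction, lower, upper):
        if d == 0.0:
            # ray is parallel to this slab, so it must already lie between the planes
            if o < lo or o > hi:
                return False
            continue
        t0 = (lo - o) / d
        t1 = (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_enter = max(t_enter, t0)
        t_exit = min(t_exit, t1)
        if t_enter > t_exit:
            return False
    return True


def brute_force_ray_query(origin, direction, lowers, uppers):
    """Indices of all boxes hit by the ray, testing every box."""
    return [i for i, (lo, hi) in enumerate(zip(lowers, uppers))
            if ray_hits_aabb(origin, direction, lo, hi)]


s = 1.0 / math.sqrt(3.0)  # normalized (1, 1, 1) direction, as in the example above
lowers = [(1.0, 1.0, 1.0), (3.0, 0.0, 0.0), (-2.0, -2.0, -2.0), (-1.0, -1.0, -1.0)]
uppers = [(2.0, 2.0, 2.0), (4.0, 1.0, 1.0), (-1.0, -1.0, -1.0), (1.0, 1.0, 1.0)]
print(brute_force_ray_query((0.0, 0.0, 0.0), (s, s, s), lowers, uppers))  # [0, 3]
```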
Example: BVH Volume Traversal
Similar to the ray-traversal example, we can perform volume traversal to find the volumes that are fully contained within a specified bounding box.
```python
@wp.kernel
def bvh_query_aabb(
    bvh_id: wp.uint64,
    lower: wp.vec3,
    upper: wp.vec3,
    bounds_intersected: wp.array(dtype=wp.bool),
):
    query = wp.bvh_query_aabb(bvh_id, lower, upper)
    bounds_nr = wp.int32(0)

    while wp.bvh_query_next(query, bounds_nr):
        # The volume with index bounds_nr is fully contained
        # in the (lower,upper) bounding box
        bounds_intersected[bounds_nr] = True


bounds_intersected = wp.zeros(shape=(num_bounds), dtype=wp.bool, device="cuda:0")
query_lower = wp.vec3(4.0, 4.0, 4.0)
query_upper = wp.vec3(6.0, 6.0, 6.0)

wp.launch(
    kernel=bvh_query_aabb,
    dim=1,
    inputs=[bvh.id, query_lower, query_upper, bounds_intersected],
    device="cuda:0",
)
```
The kernel is nearly identical to the ray-traversal example, except that query is obtained using wp.bvh_query_aabb().
Profiling
wp.ScopedTimer objects can be used to gain some basic insight into the performance of Warp applications:
```python
with wp.ScopedTimer("grid build"):
    self.grid.build(self.x, self.point_radius)
```
This results in a printout at runtime to the standard output stream like:
```
grid build took 0.06 ms
```
See Profiling documentation for more information.
- class warp.ScopedTimer(name, active=True, print=True, detailed=False, dict=None, use_nvtx=False, color='rapids', synchronize=False, cuda_filter=0, report_func=None, skip_tape=False)
Context manager object for a timer
- Parameters:
name (str) – Name of timer
active (bool) – Enables this timer
print (bool) – At context manager exit, print elapsed time to sys.stdout
detailed (bool) – Collects additional profiling data using cProfile and calls print_stats() at context exit
dict (dict) – A dictionary of lists to which the elapsed time will be appended using name as a key
use_nvtx (bool) – If true, timing functionality is replaced by an NVTX range
color (int or str) – ARGB value (e.g. 0x00FFFF) or color name (e.g. ‘cyan’) associated with the NVTX range
synchronize (bool) – Synchronize the CPU thread with any outstanding CUDA work to return accurate GPU timings
cuda_filter (int) – Filter flags for CUDA activity timing, e.g. warp.TIMING_KERNEL or warp.TIMING_ALL
report_func (Callable) – A callback function to print the activity report (wp.timing_print() is used by default)
skip_tape (bool) – If true, the timer will not be recorded in the tape
- elapsed
The duration of the with block used with this object
- Type:
float
- timing_results
The list of activity timing results, if collection was requested using cuda_filter
- Type:
list