Tiles#
Block-based programming models such as those in OpenAI Triton have proved to be an effective way of expressing high-performance kernels that can leverage cooperative operations on modern GPUs. With Warp 1.5.0, developers have access to new tile-based programming primitives in Warp kernels. Leveraging cuBLASDx and cuFFTDx, these new tools provide developers with efficient matrix multiplication and Fourier transforms for accelerated simulation and scientific computing.
Requirements#
Tile-based operations are fully supported only in versions of Warp built with CUDA Toolkit 12.6.3 or newer, due to the use
of the MathDx library to back linear-algebra tile operations
like wp.tile_cholesky,
wp.tile_fft, and wp.tile_matmul.
See Building with MathDx for more details on building Warp locally with support for
these linear-algebra tile operations.
Execution Model#
Warp’s execution model allows users to specify a grid of logical threads with up to 4 dimensions for kernel execution at launch time. With the introduction of tile primitives, users can now specify the block size for kernel launches, which partitions the thread grid into smaller sets of threads that are executed on a single compute unit.
Inside kernels, tile operations are executed cooperatively across each block of threads, allowing them to take advantage of efficient memory access, local memory, and dedicated hardware units like Tensor Cores.
In the following example, we launch a grid of threads where each block is responsible for loading a row of data from a 2D array and computing its sum:
import numpy as np
import warp as wp

TILE_SIZE = wp.constant(256)
TILE_THREADS = 64

@wp.kernel
def compute(a: wp.array2d(dtype=float), b: wp.array2d(dtype=float)):
    # obtain our block index
    i = wp.tid()

    # load a row from global memory
    t = wp.tile_load(a[i], TILE_SIZE)

    # cooperatively compute the sum of the tile elements; s is a single-element tile
    s = wp.tile_sum(t)

    # store s in global memory
    wp.tile_store(b[i], s)

N = 10

a_np = np.arange(N).reshape(-1, 1) * np.ones((1, 256), dtype=float)
a = wp.array(a_np, dtype=float)
b = wp.zeros((N, 1), dtype=float)

wp.launch_tiled(compute, dim=[a.shape[0]], inputs=[a, b], block_dim=TILE_THREADS)

print(f"b = {b.numpy()[:, 0]}")
b = [ 0. 256. 512. 768. 1024. 1280. 1536. 1792. 2048. 2304.]
Here, we have used the new warp.launch_tiled() function, which assigns TILE_THREADS threads to each element of the launch grid. Each block of TILE_THREADS threads then cooperatively loads an entire row of 256 values from the global memory array and computes its sum.
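For reference, the same row-sum computation can be written directly in NumPy. This is only a host-side sketch of what each block computes cooperatively on the GPU:

```python
import numpy as np

# NumPy reference for the tiled row-sum above: each block's result is
# simply the sum of one 256-element row of the input array.
N, TILE_SIZE = 10, 256
a_np = np.arange(N).reshape(-1, 1) * np.ones((1, TILE_SIZE), dtype=float)
b_ref = a_np.sum(axis=1)  # row i sums to i * 256
print(b_ref)
```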
Tile Properties#
In Warp, tile objects are arrays of data where the tile elements may be scalars, vectors, matrices, or user-defined structures. Tiles can have up to four dimensions. We can load 2D tiles directly from 2D global memory arrays as follows:
TILE_M = wp.constant(16)
TILE_N = wp.constant(16)
TILE_THREADS = 64

@wp.kernel
def compute(a: wp.array2d(dtype=float)):
    # obtain our 2d block index
    i, j = wp.tid()

    # load a 2d tile from global memory
    t = wp.tile_load(a, shape=(TILE_M, TILE_N), offset=(i * TILE_M, j * TILE_N))
    s = wp.tile_sum(t)
    ...

wp.launch_tiled(compute, dim=[a.shape[0] // TILE_M, a.shape[1] // TILE_N], inputs=[a], block_dim=TILE_THREADS)
Here, we divide the array a into 2D tiles of shape 16 x 16.
Each block cooperatively loads a tile from the input array and computes its sum.
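The tile-to-array mapping above can be sketched in NumPy. This is an illustrative host-side view of which slice of the array block (i, j) loads, not Warp API:

```python
import numpy as np

# NumPy view of the 2D tiling above: a 64x64 array partitioned into 16x16
# tiles, with block (i, j) reducing the slice
# a[i*TILE_M:(i+1)*TILE_M, j*TILE_N:(j+1)*TILE_N].
TILE_M = TILE_N = 16
a = np.arange(64 * 64, dtype=np.float32).reshape(64, 64)

i, j = 1, 2  # example block index
tile = a[i * TILE_M:(i + 1) * TILE_M, j * TILE_N:(j + 1) * TILE_N]
s = tile.sum()  # what wp.tile_sum() would compute for this block
print(tile.shape, s)
```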
Tile Storage#
When tiles are created, they are placed in either register or shared memory. In general, Warp tries to determine the best storage location for tiles. By default, tiles are allocated in register storage, but some operations such as matrix multiplication may migrate data from register to shared as necessary.
Register Tiles#
Values in register tiles are stored across the entire block.
For example, if the block dimension at launch is set to 64, a register tile with shape=(1, 256) will result in each thread storing 4 elements.
Register-based storage is the fastest storage on most hardware, but an individual thread cannot efficiently access data that is assigned to another thread,
because the tile storage is spread across the threads in the block.
For this reason, operations on tiles tend to be expressed as higher-level maps, reductions, and reshaping operations that may transfer values through shared memory.
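The per-thread element count mentioned above is simple arithmetic. The helper below is purely illustrative (not a Warp function) and assumes an even, ceiling-divided distribution:

```python
# Back-of-envelope register-tile sizing: a register tile's elements are
# distributed across the block's threads. `elements_per_thread` is a
# hypothetical helper, not part of the Warp API.
def elements_per_thread(tile_shape, block_dim):
    total = 1
    for d in tile_shape:
        total *= d
    # ceil-divide: some threads may hold one extra element when the
    # tile size is not an exact multiple of the block size
    return -(-total // block_dim)

# A shape=(1, 256) tile on a 64-thread block stores 4 elements per thread.
print(elements_per_thread((1, 256), 64))
```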
Example: General Matrix Multiply (GEMM)#
import numpy as np
import warp as wp

# tile size
TILE_M = wp.constant(8)
TILE_N = wp.constant(4)
TILE_K = wp.constant(8)

# num threads per-tile
TILE_THREADS = 64

@wp.kernel
def tile_gemm(A: wp.array2d(dtype=float), B: wp.array2d(dtype=float), C: wp.array2d(dtype=float)):
    # output tile index
    i, j = wp.tid()

    sum = wp.tile_zeros(shape=(TILE_M, TILE_N), dtype=wp.float32)

    M = A.shape[0]
    N = B.shape[1]
    K = A.shape[1]

    count = int(K / TILE_K)

    for k in range(0, count):
        a = wp.tile_load(A, shape=(TILE_M, TILE_K), offset=(i * TILE_M, k * TILE_K))
        b = wp.tile_load(B, shape=(TILE_K, TILE_N), offset=(k * TILE_K, j * TILE_N))

        # sum += a*b
        wp.tile_matmul(a, b, sum)

    wp.tile_store(C, sum, offset=(i * TILE_M, j * TILE_N))

if __name__ == "__main__":
    # generate some tile-aligned matrix dimensions
    M = TILE_M * 7
    K = TILE_K * 6
    N = TILE_N * 5

    rng = np.random.default_rng(42)
    A = rng.random((M, K), dtype=np.float32)
    B = rng.random((K, N), dtype=np.float32)
    C = np.zeros((M, N), dtype=np.float32)

    A_wp = wp.array(A)
    B_wp = wp.array(B)
    C_wp = wp.array(C)

    with wp.Tape() as tape:
        wp.launch_tiled(
            tile_gemm,
            dim=(int(M / TILE_M), int(N / TILE_N)),
            inputs=[A_wp, B_wp, C_wp],
            block_dim=TILE_THREADS)

    np.testing.assert_allclose(C_wp.numpy(), A @ B, rtol=1e-3)
    print("Example matrix multiplication passed")
Tile Operations#
Construction#
Load/Store#
Maps/Reductions#
Arithmetic#
Tiles support standard Python arithmetic operators for element-wise operations. These operations are cooperative and execute across all threads in the block.
Addition and Subtraction
The + and - operators perform element-wise addition and subtraction between two tiles
of the same shape and dtype:
@wp.kernel
def add_sub_example(arr_a: wp.array(dtype=float), arr_b: wp.array(dtype=float)):
    a = wp.tile_load(arr_a, shape=TILE_SIZE)
    b = wp.tile_load(arr_b, shape=TILE_SIZE)

    c = a + b  # element-wise addition
    d = a - b  # element-wise subtraction
    e = -a     # element-wise negation
Tiles also support the += and -= in-place operators.
Multiplication (*)
The * operator supports three forms:
| Expression | Operand Types | Result |
|---|---|---|
| tile * tile | Both tiles have matching shape; at least one must have scalar dtype | Element-wise multiplication |
| tile * constant | Tile and a scalar, vector, or matrix constant | Multiply each tile element by the constant |
| constant * tile | A scalar, vector, or matrix constant and a tile | Multiply each tile element by the constant |
When one operand is a tile and the other is a constant, the constant is broadcast to all elements. At least one of the tile’s element type or the constant type must be a scalar. The underlying scalar types must match. For example:
@wp.kernel
def mul_example(arr: wp.array(dtype=float)):
    a = wp.tile_load(arr, shape=TILE_SIZE)  # a tile of floats

    # tile * tile (element-wise)
    b = a * a

    # tile * scalar
    c = a * 2.0

    # scalar * tile
    d = 2.0 * a

    # float tile * vec3f constant -> vec3f tile
    e = a * wp.vec3(1.0, 2.0, 3.0)

    # vec3f constant * float tile -> vec3f tile
    f = wp.vec3(1.0, 2.0, 3.0) * a
Division (/)
The / operator supports three forms with the same type rules as multiplication:
| Expression | Operand Types | Result |
|---|---|---|
| tile / tile | Both tiles have matching shape; at least one must have scalar dtype | Element-wise division |
| tile / constant | Tile and a scalar, vector, or matrix constant | Divide each tile element by the constant |
| constant / tile | A scalar, vector, or matrix constant and a tile | Divide the constant by each tile element |
As with multiplication, at least one of the tile’s element type or the constant type must be a scalar, and the underlying scalar types must match. For example:
@wp.kernel
def div_example(arr: wp.array(dtype=float), vec_arr: wp.array(dtype=wp.vec3)):
    a = wp.tile_load(arr, shape=TILE_SIZE)  # a tile of floats

    # tile / tile (element-wise)
    b = a / a

    # tile / scalar
    c = a / 2.0

    # scalar / tile (divides constant by each element)
    d = 1.0 / a

    # vec3f tile / scalar
    v = wp.tile_load(vec_arr, shape=TILE_SIZE)  # a tile of vec3f
    e = v / 2.0

    # float tile / vec3f constant -> vec3f tile
    f = a / wp.vec3(1.0, 2.0, 4.0)

    # vec3f constant / float tile -> vec3f tile
    g = wp.vec3(1.0, 2.0, 4.0) / a
Type Promotion Rules
The following table summarizes the result type for * and / operations between
tiles and constants:
| Tile Element Type | Constant Type | Result Type |
|---|---|---|
| scalar | scalar | scalar tile |
| scalar | vector | vector tile |
| scalar | matrix | matrix tile |
| vector | scalar | vector tile |
| matrix | scalar | matrix tile |
Combinations where both the tile element type and the constant type are non-scalar
(e.g., tile<vec3f> * vec3f) are not supported. Use tile_map
with wp.cw_mul or wp.cw_div for
component-wise vector/matrix operations.
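The promotion rules have a direct NumPy analogy via broadcasting. The sketch below is illustrative only (NumPy, not Warp API): a float tile combined with a vec3 constant behaves like scaling each scalar element into a length-3 vector:

```python
import numpy as np

# NumPy analogy for "float tile * vec3f constant -> vec3f tile":
# each scalar element of the tile is promoted to a vec3 result.
tile = np.arange(4, dtype=np.float32)            # stand-in for a float tile
vec3 = np.array([1.0, 2.0, 3.0], np.float32)     # stand-in for a vec3 constant

result = tile[:, None] * vec3                    # each element becomes a vec3
print(result.shape)
```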
All arithmetic operators support automatic differentiation.
Linear Algebra#
Spatial Queries#
Stack#
Tile Stack#
Tile stacks provide a cooperative, block-scoped stack data structure in shared memory. They enable patterns such as stream compaction and dynamic load balancing where threads within a block need to collectively build and consume a shared work-list.
Most tile stack operations are cooperative — every thread in the block must call them,
even if a particular thread has no data to contribute, because they contain internal
synchronization barriers. The exception is wp.tile_stack_count, which contains no barrier and may be called by a
single thread or from within a divergent branch.
Creating a Stack#
Use wp.tile_stack() to allocate a stack in shared memory.
The capacity must be a compile-time constant and dtype specifies the element type:
CAPACITY = wp.constant(256)

@wp.kernel
def my_kernel(data: wp.array(dtype=float)):
    i, j = wp.tid()
    s = wp.tile_stack(capacity=CAPACITY, dtype=float)
    ...
Push and Pop#
wp.tile_stack_push() conditionally pushes a value
onto the stack. Each thread provides a value and a boolean has_value flag — only threads
with has_value=True write to the stack. The function returns the slot index where the
value was written, or -1 if the thread did not push (either because has_value was
False or because the stack overflowed):
idx = wp.tile_stack_push(s, value, keep)
wp.tile_stack_pop() removes values from the stack.
Each calling thread races for a slot. It returns a tuple (value, slot) where slot
identifies which stack slot was popped (in [0, capacity-1] when non-negative), or
-1 if the stack was empty — consistent with
wp.tile_stack_push() which also returns -1 on failure:
value, slot = wp.tile_stack_pop(s)
if slot != -1:
    out[slot] = value
Clear and Count#
wp.tile_stack_clear() resets the stack count to zero,
allowing it to be reused within the same kernel invocation:
wp.tile_stack_clear(s)
wp.tile_stack_count() returns the current number of
elements. Unlike the other operations it is not cooperative — it contains no
synchronization barrier and may be called by a single thread or from within a divergent
branch. It is safe to call after any push, pop, or clear, all of which end with a barrier
that makes the count visible:
n = wp.tile_stack_count(s)

# may be called by a single thread or from within a branch
if j == 0:
    n = wp.tile_stack_count(s)
Example: Stream Compaction#
A common use case is filtering elements that satisfy a condition. Because
tile_stack is a within-block primitive, the slot returned by
wp.tile_stack_pop() is compact only
within a single tile — for a single-tile launch this maps directly to a compact
output array:
BLOCK_DIM = 256
CAPACITY = wp.constant(BLOCK_DIM)  # at most one output per thread

@wp.kernel
def compact_kernel(data: wp.array(dtype=float), out: wp.array(dtype=float)):
    _i, j = wp.tid()

    val = data[j]
    keep = val > 0.5

    # Cooperative push: only threads with keep=True write to the stack
    s = wp.tile_stack(capacity=CAPACITY, dtype=float)
    wp.tile_stack_push(s, val, keep)

    # Each thread races for a slot. The returned slot is a unique compact
    # index in [0, capacity-1]; use it directly as the output index.
    result, slot = wp.tile_stack_pop(s)
    if slot != -1:
        out[slot] = result

n = BLOCK_DIM
data = wp.array(..., dtype=float)
out = wp.zeros(n, dtype=float)  # sized to CAPACITY: safe for any filter result

wp.launch_tiled(compact_kernel, dim=[1], inputs=[data, out], block_dim=BLOCK_DIM)
Note
In a multi-tile launch, each tile’s pop slots are compact only within that
tile and would overlap across tiles. Multi-tile compaction requires additional
coordination, such as a 2D output array indexed by [tile, slot] or a
global atomic counter to claim output indices.
Note
Tile stacks are allocated in shared memory, which is a limited resource. The capacity should be chosen to avoid exceeding hardware limitations. The stack supports any Warp data type including scalars, vectors, and matrices.
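A quick back-of-envelope estimate helps pick a capacity that fits in shared memory. The helper below is hypothetical (not Warp API) and ignores any per-stack bookkeeping overhead:

```python
# Rough shared-memory budget for a tile stack: capacity * element size.
# `stack_bytes` is an illustrative helper, not part of the Warp API, and
# does not account for Warp's internal per-stack overhead.
def stack_bytes(capacity, elem_bytes):
    return capacity * elem_bytes

print(stack_bytes(256, 4))   # 256 float32 values -> 1024 bytes
print(stack_bytes(256, 36))  # 256 mat33f values  -> 9216 bytes

# Many GPUs default to ~48 KiB of shared memory per block, so large
# capacities or large element types can exhaust the budget quickly.
```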
Note
Tile stack operations are not differentiable and cannot be used in the backward pass of a differentiable kernel.
Tiles and SIMT Code#
Warp kernels have traditionally been written in the SIMT programming model, where each thread executes independently.
Tiles, on the other hand, allow threads to work cooperatively to perform operations.
Warp exposes the warp.tile and warp.untile functions to convert data between per-thread value types and
the equivalent tile representation. For example:
TILE_THREADS = 64

@wp.kernel
def compute():
    i = wp.tid()

    # perform some per-thread computation
    x = i * 2.0 + wp.sin(float(i))

    # tile the value x across the block
    # returns a tile with shape=(1, TILE_THREADS)
    t = wp.tile(x)
    ...

# launch as a regular SIMT kernel
wp.launch(compute, dim=[N], inputs=[], block_dim=TILE_THREADS)
In this example, we have launched a regular SIMT grid with N logical threads using wp.launch().
The kernel performs some per-thread computations and then converts the scalar x value into a tile object using warp.tile.
This function takes a single value as input and returns a tile with the same dimensions as the number of threads in the block (as set by the block_dim argument in wp.launch()),
which implies that each thread in the block is assigned to a particular tile element.
From here, the tile can be used in other regular cooperative operations such as reductions, GEMMs, etc.
Similarly, we can untile tile objects back to their per-thread scalar equivalent values.
Note
All threads in a block must execute tile operations, but code surrounding tile operations may contain arbitrary conditional logic.
Extra consideration is needed when using tiles in SIMT kernels that are meant to run on both the GPU and the CPU.
On the CPU, block_dim is set to 1, which can change the behavior of kernels using wp.tile(). Consider the following example:
import warp as wp

TILE_DIM = 4

@wp.kernel
def tile_reduce_blockwise_simt_kernel(output: wp.array(dtype=int)):
    i = wp.tid()
    t = wp.tile(i)
    s = wp.tile_sum(t)
    wp.tile_store(output, s, offset=i)

N = TILE_DIM * 3
output = wp.zeros(shape=N, dtype=int)

wp.launch(tile_reduce_blockwise_simt_kernel, dim=N, outputs=[output], block_dim=TILE_DIM)
print(output.numpy())
[ 6 0 0 0 22 0 0 0 38 0 0 0]
Here, we launch N=12 logical threads. The tile size is 4, so there are three blocks in total that are created with wp.tile().
The tile reduction operation stores the block’s sum in the first thread of the block, so we see 6, 22, and 38 stored at indices 0, 4, and 8.
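The GPU output above can be checked with a NumPy reference that sums each block's thread indices. This is only a host-side verification of the expected values, not Warp API:

```python
import numpy as np

# NumPy reference for the blockwise result above: summing each 4-thread
# block of the thread indices 0..11 gives 6, 22, and 38.
TILE_DIM, N = 4, 12
block_sums = np.arange(N).reshape(-1, TILE_DIM).sum(axis=1)
print(block_sums)  # [ 6 22 38]
```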
If we instead launch this kernel on the CPU, we get the following output:
[ 0 1 2 3 4 5 6 7 8 9 10 11]
When launching on the CPU, block_dim is set to 1, so the tile generated with wp.tile() has a size of 1, and the reduction of each tile simply returns the value of the tile.
So if you are designing a kernel that is meant to get the same result running on the GPU or the CPU, it should be designed to be independent of the value of block_dim.
For instance, if we want a full array reduction that works consistently across devices, we can use wp.tile_atomic_add() to accumulate results from all blocks:
import warp as wp

TILE_DIM = 4

@wp.kernel
def tile_reduce_simt_kernel(output: wp.array(dtype=int)):
    i = wp.tid()
    t = wp.tile(i)
    s = wp.tile_sum(t)

    # update global sum
    wp.tile_atomic_add(output, s)

N = TILE_DIM * 3
output = wp.zeros(shape=1, dtype=int)

wp.launch(tile_reduce_simt_kernel, dim=N, outputs=[output], block_dim=TILE_DIM)
print(output.numpy())
[66]
Type Preservation#
warp.tile includes the optional parameter preserve_type, which is False by default.
When preserve_type is False, this function expands non-scalar inputs into a multi-dimensional tile.
Vectors are expanded into a 2D tile of scalar values with shape (length(vector), block_dim),
while matrices are expanded into a 3D tile of scalar values with shape (rows, cols, block_dim).
When preserve_type is True, this function outputs a 1D tile of length block_dim with the same
data type as the input value. So if you tile a vector across the block with preserve_type=True, a 1D
tile of vectors will be returned. This is useful for collective operations that operate on the entire
vector or matrix rather than their individual components, as in the following example demonstrating
a matrix tile reduction:
import warp as wp

TILE_DIM = 32

@wp.kernel
def matrix_reduction_kernel(y: wp.array(dtype=wp.mat33)):
    i = wp.tid()

    I = wp.identity(3, dtype=wp.float32)
    m = wp.float32(i) * I

    t = wp.tile(m, preserve_type=True)
    sum = wp.tile_reduce(wp.add, t)

    wp.tile_store(y, sum)

y = wp.zeros(shape=1, dtype=wp.mat33)
wp.launch(matrix_reduction_kernel, dim=TILE_DIM, inputs=[], outputs=[y], block_dim=TILE_DIM)
print(y.numpy()[0])
[[496. 0. 0.]
[ 0. 496. 0.]
[ 0. 0. 496.]]
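The printed diagonal value is easy to verify by hand: the kernel sums i * I for i in 0..31, so each diagonal entry equals sum(range(32)). A NumPy check (host-side only, not Warp API):

```python
import numpy as np

# Verifying the matrix-reduction output above: summing i * I for
# i in 0..31 gives sum(range(32)) = 496 on each diagonal entry.
total = sum(range(32))
ref = total * np.eye(3)
print(ref)
```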
Example: Using tiles to accelerate array-wide reductions#
Prior to the addition of tile support in Warp, array-wide reductions were commonly performed in a single kernel
using a built-in atomic function like wp.atomic_add().
This could be very inefficient when compared to optimized mechanisms like
cub::BlockReduce.
Consider the following sum-of-squares reduction on an array:
import numpy as np
import warp as wp

@wp.kernel
def reduce_array_simt(
    a: wp.array2d(dtype=wp.float64),
    result: wp.array(dtype=wp.float64),
):
    i, j = wp.tid()
    local_val = a[i, j] * a[i, j]
    wp.atomic_add(result, 0, local_val)

rng = np.random.default_rng(42)
data = wp.array(rng.random((4096, 4096), dtype=np.float64), dtype=wp.float64)
result = wp.zeros((1,), dtype=wp.float64)

wp.launch(reduce_array_simt, (4096, 4096), inputs=[data], outputs=[result])
The above kernel in Warp 1.6.1 runs in about 27.5 ms on an RTX 3090 GPU.
We can use tiles to accelerate this reduction by first creating a tile from the scalar local_val
and then using wp.tile_sum() to cooperatively compute
the tile sum using shared memory. We can then accumulate the result of the reduced
tile into global memory using wp.tile_atomic_add():
import numpy as np
import warp as wp

@wp.kernel
def reduce_array_tile(
    a: wp.array2d(dtype=wp.float64),
    result: wp.array(dtype=wp.float64),
):
    i, j = wp.tid()
    local_val = a[i, j] * a[i, j]

    t = wp.tile(local_val)
    s = wp.tile_sum(t)

    wp.tile_atomic_add(result, s)

rng = np.random.default_rng(42)
data = wp.array(rng.random((4096, 4096), dtype=np.float64), dtype=wp.float64)
result = wp.zeros((1,), dtype=wp.float64)

wp.launch(reduce_array_tile, (4096, 4096), inputs=[data], outputs=[result])
The reduction kernel using tiles runs in about 0.528 ms, a 52x improvement over the original kernel.
Further speed improvements could be obtained by experimenting with different block sizes.
If we reduce the block size from the default of 256 threads to 128 threads
in this example by adding block_dim=128 to the wp.launch(),
the kernel only takes about 0.436 ms to complete, while the pure SIMT kernel is
relatively unaffected.
Automatic Differentiation#
Warp can automatically generate the backward version of tile-based programs.
In general, tile programs must obey the same rules for auto-diff as regular Warp programs.
In-place addition and subtraction (+=, -=) on tiles are differentiable;
in-place multiplication and division are not supported for tiles.
Please see the Differentiability section for more details.
Tiles in @wp.func Functions#
Tile parameters in @wp.func functions are passed by reference. This is
a departure from other Warp types (scalars, vectors, matrices, etc.), which are passed by value.
The difference is observable when a function modifies a tile in place:
@wp.func
def add_bias(t: wp.tile[float, TILE_M, TILE_N]):
    t += wp.tile_ones(dtype=float, shape=(TILE_M, TILE_N), storage="register") * 5.0

@wp.kernel(enable_backward=False)
def compute(input: wp.array2d(dtype=float), out: wp.array2d(dtype=float)):
    i = wp.tid()
    t = wp.tile_load(input, shape=(TILE_M, TILE_N), offset=(0, 0), storage="shared")
    add_bias(t)  # modifies t in place; the caller sees the change
    wp.tile_store(out, t)  # t now contains input + 5.0
This behavior applies to both shared-memory and register tiles, and matches Python's semantics for mutable objects. The reasons for pass-by-reference are:

- Shared-memory tiles are handles to a fixed region of shared memory and cannot be copied. Passing by reference lets functions operate on shared tiles directly.
- Register tiles use the same calling convention for consistency, so that @wp.func code works identically regardless of the caller's storage choice.
Note
Simple rebinding (alias = t) inside a @wp.func creates a new C++ variable.
For register tiles this is a full value copy; for shared tiles the new variable is a
non-owning handle to the same shared memory, so element-level writes through either
variable affect the same data. In-place assignments (+=, -=, etc.) on the
original parameter mutate the tile through the reference regardless of storage type.
Vectorized Tile Loads#
When loading shared-memory tiles, Warp automatically attempts a vectorized path
that uses 128-bit (float4) memory transactions instead of per-element loads. This
gives up to 1.3× higher bandwidth because each thread issues fewer, wider loads.
The vectorized path activates when all of the following hold:
- Last-dimension alignment: tile_last_dim × sizeof(element) % 16 == 0. For float32 (4 bytes), the last dimension must be divisible by 4. For vec3 (12 bytes), it must be divisible by 4 (since 4 × 12 = 48, divisible by 16). For float64 (8 bytes), it must be divisible by 2.
- Contiguous array: the source array is densely packed (no gaps between rows).
- 16-byte aligned base address: the start of the tile's data in global memory is aligned to a 16-byte boundary. Warp GPU arrays are 256-byte aligned (CPU arrays are 64-byte aligned); either satisfies the 16-byte requirement for offset-zero tiles. Offset tiles are aligned when the linear byte offset sum(offset[d] * strides[d]) is a multiple of 16.
- Tile fits within array bounds: the tile does not extend past the array dimensions.
- 2D or higher tile: 1D tiles do not use the vectorized float4 path, but may use the coalesced byte-copy path for large element types (sizeof(T) > 16).
- Outer-dimension strides are float4-aligned: all strides except the last dimension must be multiples of 16 bytes. A contiguous float32 array with shape[1]=35 has strides[0]=140 bytes, and 140 % 16 != 0, so vectorization is rejected even if the tile's last dimension is aligned.
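The conditions above can be sketched as a host-side eligibility check. This is purely illustrative (Warp performs the real checks internally at load time, and this sketch omits the contiguity and bounds checks, which need the full array descriptor):

```python
# Illustrative host-side sketch of the vectorized-load eligibility rules.
# `vectorizable` is a hypothetical helper, not part of the Warp API.
def vectorizable(tile_shape, elem_bytes, byte_offset, outer_strides_bytes):
    if len(tile_shape) < 2:                      # 2D or higher tiles only
        return False
    if (tile_shape[-1] * elem_bytes) % 16 != 0:  # last-dimension alignment
        return False
    if byte_offset % 16 != 0:                    # 16-byte aligned base address
        return False
    # all strides except the last dimension must be multiples of 16 bytes
    return all(s % 16 == 0 for s in outer_strides_bytes)

# float32 tile of shape (8, 32) at offset 0 in a contiguous array: eligible
print(vectorizable((8, 32), 4, 0, [32 * 4]))   # True

# last dimension 35 -> 140 bytes per row, not a multiple of 16: rejected
print(vectorizable((8, 35), 4, 0, [35 * 4]))   # False
```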
When any condition fails, Warp tries a coalesced byte-copy path for large element
types (sizeof(T) > 16) such as mat33, mat44, and mat66. This path copies
raw bytes as float* with thread-striped access for perfect memory coalescing, and works
for both 1D and N-D tiles. If that path is also ineligible, Warp falls back to a scalar
load path that handles arbitrary alignment, strides, and partial (out-of-bounds) tiles
with zero-padding.
Padding for vectorized loads: if your data dimensions are not naturally aligned, pad
them to hit the vectorized path. For example, with float32 tiles of width 30 (not
divisible by 4), pad to 32:
# Pad array width from 30 → 32 for vectorized float4 loads
data_np = np.pad(data_np, ((0, 0), (0, 2)), mode='constant')
data = wp.array(data_np, dtype=float, device=device)
# Now tile_load with width=32 hits the fast vectorized path
t = wp.tile_load(data, shape=(8, 32), offset=(i * 8, 0), storage="shared")
The aligned parameter: when you know all conditions above are met, pass
aligned=True to skip the runtime eligibility checks:
# Caller guarantees: contiguous array, 16-byte aligned, tile fits in bounds
t = wp.tile_load(data, shape=(8, 32), offset=(i * 8, 0), storage="shared", aligned=True)
wp.tile_store(out, t, offset=(i * 8, 0), aligned=True)
This eliminates a small amount of per-load overhead from the runtime checks. The
aligned parameter is only meaningful for shared-memory tiles (register tiles do not
use the vectorized path).
Warning
Passing aligned=True when the conditions are not met results in undefined behavior
(incorrect results or GPU faults). Use it only when you can guarantee alignment. The
address alignment check is always active (even in release builds), but bounds and
contiguity checks are debug-only.
Software Pipelining with Register Tiles#
When a kernel iterates over many tiles, the default pattern — load, compute, store, repeat — leaves the GPU’s memory pipeline idle during compute and the ALU idle during loads.
Sequential (baseline): each iteration loads a tile, processes it, stores the result, then moves on. The next load cannot begin until the current store completes:
import time

import numpy as np
import warp as wp

@wp.func
def activation(x: float):
    return wp.sin(x * 1.1 + 0.1)

@wp.func
def normalize(x: float):
    return wp.exp(-wp.abs(x) * 0.5)

@wp.func
def transform(x: float):
    return wp.sqrt(wp.abs(x) + 1.0)

@wp.func
def squash(x: float):
    return wp.tanh(x * 0.7)

TILE_N = wp.constant(256)
N_ROWS = wp.constant(512)

@wp.kernel
def sequential(
    inp: wp.array2d(dtype=float),
    out: wp.array2d(dtype=float),
):
    for i in range(N_ROWS):
        a = wp.tile_load(inp, shape=(1, TILE_N), offset=(i, 0), storage="register")
        a = wp.tile_map(activation, a)
        a = wp.tile_map(normalize, a)
        a = wp.tile_map(transform, a)
        a = wp.tile_map(squash, a)
        wp.tile_store(out, a, offset=(i, 0))

@wp.kernel
def pipelined(
    inp: wp.array2d(dtype=float),
    out: wp.array2d(dtype=float),
):
    # Load first tile
    a = wp.tile_load(inp, shape=(1, TILE_N), offset=(0, 0), storage="register")

    for i in range(1, N_ROWS):
        # Issue next load: the GPU fetches in the background during the compute below
        b = wp.tile_load(inp, shape=(1, TILE_N), offset=(i, 0), storage="register")

        # Heavy compute overlaps with the memory fetch for b
        a = wp.tile_map(activation, a)
        a = wp.tile_map(normalize, a)
        a = wp.tile_map(transform, a)
        a = wp.tile_map(squash, a)
        wp.tile_store(out, a, offset=(i - 1, 0))

        a = b  # by now b's data has arrived

    # Epilogue
    a = wp.tile_map(activation, a)
    a = wp.tile_map(normalize, a)
    a = wp.tile_map(transform, a)
    a = wp.tile_map(squash, a)
    wp.tile_store(out, a, offset=(N_ROWS - 1, 0))

rng = np.random.default_rng(42)
device = wp.get_cuda_device()

inp = wp.array(rng.random((N_ROWS, TILE_N), dtype=np.float32), dtype=float, device=device)
out_seq = wp.zeros_like(inp)
out_pipe = wp.zeros_like(inp)

# Verify correctness
wp.launch_tiled(sequential, dim=[1], inputs=[inp, out_seq], block_dim=TILE_N, device=device)
wp.launch_tiled(pipelined, dim=[1], inputs=[inp, out_pipe], block_dim=TILE_N, device=device)
wp.synchronize_device(device)
np.testing.assert_allclose(out_seq.numpy(), out_pipe.numpy(), rtol=1e-5)

# Benchmark
for _ in range(10):
    wp.launch_tiled(sequential, dim=[1], inputs=[inp, out_seq], block_dim=TILE_N, device=device)
wp.synchronize_device(device)

t0 = time.perf_counter()
for _ in range(100):
    wp.launch_tiled(sequential, dim=[1], inputs=[inp, out_seq], block_dim=TILE_N, device=device)
wp.synchronize_device(device)
seq_ms = (time.perf_counter() - t0) / 100 * 1000

for _ in range(10):
    wp.launch_tiled(pipelined, dim=[1], inputs=[inp, out_pipe], block_dim=TILE_N, device=device)
wp.synchronize_device(device)

t0 = time.perf_counter()
for _ in range(100):
    wp.launch_tiled(pipelined, dim=[1], inputs=[inp, out_pipe], block_dim=TILE_N, device=device)
wp.synchronize_device(device)
pipe_ms = (time.perf_counter() - t0) / 100 * 1000

speedup = seq_ms / pipe_ms
print(f"Sequential: {seq_ms:.3f} ms")
print(f"Pipelined:  {pipe_ms:.3f} ms")
print(f"Speedup:    {speedup:.1f}x")
Sequential: ... ms
Pipelined: ... ms
Speedup: ...x
The key insight: b = wp.tile_load(...) issues memory load instructions that enter the
GPU’s memory pipeline immediately, but the result registers are not read until a = b
on the next iteration. The GPU’s instruction scheduler fills the gap with the ALU-heavy
tile_map calls, effectively hiding the memory latency for free.
No special parameters or API calls are needed — the hardware pipeline handles the overlap
automatically. The benefit scales with the amount of compute per tile: with four
transcendental tile_map calls per iteration (as above), expect 1.2–1.6× speedup
depending on the GPU.
Note
This pattern uses storage="register" (the default). Shared-memory tiles require
thread-block synchronization after each load, which prevents the same overlap. Use
register tiles for pipelined workloads.
Failed to compile LTO Error Message#
Some tile operations invoke MathDx APIs to generate and compile link-time objects (LTOs) at runtime.
If the compilation fails, you may see an error message that mentions Failed to compile LTO.
A common cause of this error is that the tile sizes involved are too large for the current device, as shared memory
is a limited resource.
To get more information about the error, you can set the LIBMATHDX_LOG_LEVEL environment
variable to 5 and rerun the program. Batching the problem into smaller tiles may be required to work around the shared
memory limitations.
In the case of FFT operations, using more threads per block may help (see the
cuFFTDx requirements for more details).
Building with MathDx#
Most tile operations described in Linear Algebra require Warp to be built with the MathDx library. Starting with Warp 1.5.0, PyPI distributions come with out-of-the-box support for tile operations leveraging MathDx APIs.
When building Warp locally using build_lib.py, the script will attempt to automatically download libmathdx
from the cuBLASDx Downloads Page.
A path to an existing libmathdx installation can also be specified using the --libmathdx-path option
when running build_lib.py or by defining the path in the LIBMATHDX_HOME environment variable.
Please note that CUDA Toolkit 12.6.3 or higher is required for full MathDx support when building Warp from source. Warp + MathDx will compile with earlier CUDA 12 versions, but matrix multiplication, triangular solves, Cholesky factorization, and the Cholesky solver will fail at runtime.
Frequently Asked Questions#
launch_tiled vs. launch#
What is the difference between wp.launch_tiled() and wp.launch() ?
The two forms are equivalent. launch_tiled() appends block_dim to the
grid automatically, so the following launch the same kernel grid:
# These are equivalent:
wp.launch_tiled(kernel, dim=[M, N], inputs=[...], block_dim=64)
wp.launch( kernel, dim=[M, N, 64], inputs=[...], block_dim=64)
launch_tiled() reduces boilerplate for tile-heavy kernels. Use plain
launch() with block_dim when you are writing a SIMT kernel that
occasionally drops into tile operations via wp.tile() and
wp.untile().
wp.tid() returns one value per launch dimension. Under
launch_tiled(), you can unpack only what you need:
@wp.kernel
def compute():
    i, j = wp.tid()     # just the tile coordinates
    # or:
    i, j, t = wp.tid()  # also the intra-block thread id
Unpacking the thread index t is useful when you need to mix cooperative tile
operations with per-thread SIMT work. See Tiles and SIMT Code for more on mixing
the two programming models.
Tile operations (wp.tile_load(),
wp.tile_store(),
wp.tile_matmul(), etc.) work with either launch
function. On CPU, Warp always forces block_dim=1; see the next question for
CPU-specific caveats.
CPU vs. GPU behavior differences#
Why does my kernel using tiles behave differently depending on whether I run it on a CPU or GPU?
On CPU, Warp currently forces block_dim=1 regardless of what you specify. This is a
limitation of Warp’s single-threaded CPU execution model. The consequences are:
wp.tile() produces a 1-element tile instead of a block-wide tile of block_dim elements.
Block-wide aggregation operations become effectively no-ops, because the “block” is a single thread, so there is nothing to reduce or cooperate across.
Kernels that use wp.tile() to aggregate values across a
block will produce per-thread results on CPU instead of block-aggregated results.
To write kernels that produce consistent results on both devices, use
wp.tile_atomic_add() to accumulate each block’s
result into a global array rather than relying on
wp.tile_store() writing only from the first thread of
each block. See the blockwise reduction example in
Tiles and SIMT Code.
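The consequence for block-wide reductions can be modeled in plain NumPy (the block_sums helper below is a hypothetical stand-in for wp.tile() followed by wp.tile_sum(), not Warp API):

```python
import numpy as np

# Hypothetical NumPy model of a block-wide sum: each block of
# `block_dim` threads cooperatively reduces its values to one result.
def block_sums(per_thread_values, block_dim):
    return per_thread_values.reshape(-1, block_dim).sum(axis=1)

vals = np.arange(8.0)
print(block_sums(vals, 4))  # GPU-style blocks of 4 -> [ 6. 22.]
print(block_sums(vals, 1))  # CPU forces block_dim=1 -> per-thread values
```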
Tile operations in divergent branches#
Can I use tile operations inside if statements or loops with early exits?
Most tile operations are cooperative: every thread in the block must call them,
because they contain internal synchronization barriers (analogous to CUDA’s
__syncthreads()). Calling a cooperative tile operation from a divergent branch, where
only some threads execute it, will deadlock or produce undefined behavior.
The rule of thumb: tile operations can be surrounded by arbitrary conditional logic, but the condition must evaluate identically for all threads in the block. That is, you can branch around a group of tile operations, but you cannot have some threads skip an individual tile operation that others execute.
A handful of tile operations are explicitly non-cooperative and safe to call from a single thread or a divergent branch. Check an operation’s docstring before relying on this. The default assumption should be that a tile op is cooperative.
Debugging garbage or NaN output#
My kernel produces garbage or NaNs. What are the most common causes?
Uninitialized output tile passed to wp.tile_matmul(). wp.tile_matmul() accumulates into out, computing alpha * a @ b + beta * out. With the default beta=1.0, an uninitialized out tile produces garbage. Always initialize the accumulator with wp.tile_zeros() before calling tile_matmul().
aligned=True with unmet conditions. Passing aligned=True to wp.tile_load() or wp.tile_store() skips runtime eligibility checks for the vectorized load path. Address-alignment violations trap unconditionally (even in release builds), but bounds and contiguity violations are caught only by debug-only asserts; in release builds they cause silent data corruption. If you suspect this is the cause, set wp.config.mode = "debug" to enable the bounds and contiguity asserts. Only use aligned=True when you can guarantee all conditions listed in Vectorized Tile Loads.
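The accumulation pitfall can be seen in a minimal NumPy sketch (tile_matmul_model is a hypothetical helper mimicking the documented out = alpha * a @ b + beta * out semantics, not the Warp API):

```python
import numpy as np

# Hypothetical NumPy model of tile_matmul's accumulation semantics:
# out = alpha * a @ b + beta * out
def tile_matmul_model(a, b, out, alpha=1.0, beta=1.0):
    return alpha * (a @ b) + beta * out

a = np.ones((2, 3))
b = np.ones((3, 2))

clean = tile_matmul_model(a, b, np.zeros((2, 2)))       # zeroed accumulator: all 3.0
dirty = tile_matmul_model(a, b, np.full((2, 2), 1e30))  # "uninitialized" out leaks in
```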
Out-of-bounds handling#
How do wp.tile_load() and wp.tile_store() handle out-of-bounds accesses?
wp.tile_load() handles out-of-bounds accesses
automatically: positions past the array edge are zero-padded, so boundary tiles are
always safe to load without special handling. For
wp.tile_store(), out-of-bounds writes are silently
dropped.
These behaviors apply with the default aligned=False. Passing aligned=True
disables bounds handling, and out-of-bounds access is undefined behavior in release builds
(and will trip an assert in debug builds). See Vectorized Tile Loads for details on
the aligned parameter.
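The default zero-padding behavior can be modeled in NumPy as follows (padded_tile_load is a hypothetical sketch of the semantics, not the Warp API):

```python
import numpy as np

# Hypothetical NumPy sketch of the default wp.tile_load behavior:
# positions past the array edge read as zero.
def padded_tile_load(a, offset, shape):
    tile = np.zeros(shape, dtype=a.dtype)
    r0, c0 = offset
    rows = max(0, min(shape[0], a.shape[0] - r0))
    cols = max(0, min(shape[1], a.shape[1] - c0))
    tile[:rows, :cols] = a[r0:r0 + rows, c0:c0 + cols]
    return tile

a = np.arange(9.0).reshape(3, 3)
t = padded_tile_load(a, offset=(2, 2), shape=(2, 2))
# t == [[8., 0.], [0., 0.]] -- the out-of-bounds positions are zero
```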
Compile-time constant tile shapes#
Do tile shapes have to be compile-time constants?
Yes. Tile shapes passed to wp.tile_load(),
wp.tile_zeros(),
wp.tile_ones(), and other tile construction functions
must be compile-time constants, since they are baked into the generated CUDA/C++ code.
You can define them as module-level Python variables (Warp treats these as implicit
constants), use wp.constant() explicitly, or hard-code the values
directly:
# Both are valid:
TILE_M = 64
TILE_N = wp.constant(32)

@wp.kernel
def compute(a: wp.array2d[float]):
    i, j = wp.tid()
    t = wp.tile_load(a, shape=(TILE_M, TILE_N), offset=(i * TILE_M, j * TILE_N))
    # or:
    t = wp.tile_load(a, shape=(64, 32), offset=(i * 64, j * 32))
The key restriction is that tile shapes cannot be runtime values: kernel arguments,
array dimensions, or values computed inside the kernel cannot be used as tile shapes.
The same rule applies to the capacity argument of
wp.tile_stack().
Reduction results and tile_extract#
Why doesn’t wp.tile_sum() return a scalar I can use directly?
Reductions like wp.tile_sum() return a single-element
tile, not a Python scalar. Use wp.tile_extract()
(or the shorthand tile[0]) to extract the value. The extracted scalar is broadcast
to all threads in the block:
s = wp.tile_sum(t) # s is a single-element tile
val = s[0] # scalar available to all threads (calls tile_extract)
Tile-to-global memory exit paths#
How do I get a tile’s data back into a global array or per-thread values?
Tiles are cooperative, opaque objects that cannot be returned from a kernel. There are three exit routes, each with a different use case:
wp.tile_store(): cooperatively write the entire tile back to a global memory array. This is the most common exit path.
wp.tile_atomic_add(): cooperatively accumulate the tile into a global memory array with atomic additions.
wp.untile(): convert a tile back to per-thread scalar values for continued SIMT work. See Tiles and SIMT Code.
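The atomic-accumulation exit path can be sketched in NumPy (tile_atomic_add_model is a hypothetical single-threaded model of the semantics, not the Warp API):

```python
import numpy as np

# Hypothetical NumPy model of the tile_atomic_add exit path:
# each block accumulates its tile into a region of a global array.
def tile_atomic_add_model(dest, offset, tile):
    r, c = offset
    dest[r:r + tile.shape[0], c:c + tile.shape[1]] += tile

out = np.zeros((4, 4))
tile_atomic_add_model(out, (0, 0), np.ones((2, 2)))
tile_atomic_add_model(out, (0, 0), np.ones((2, 2)))  # a second block adds again
# out[0, 0] == 2.0; untouched entries remain 0.0
```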