warp.get_suggested_block_size#

warp.get_suggested_block_size(kernel, device=None)[source]#

Suggest a CUDA block size that maximizes occupancy for a kernel.

Queries the CUDA driver’s occupancy API (cuOccupancyMaxPotentialBlockSize) to find the block size that maximizes per-SM occupancy and the minimum number of blocks needed to fully utilize all SMs on the device. The kernel’s shared memory requirements are accounted for automatically.

Because this optimizes per-SM occupancy, the suggested block_size tends to be large. For small launch dimensions, a smaller block size may perform better because it distributes more blocks across SMs. Compare your grid size (ceil(N / block_size)) against min_grid_size to check whether the launch is large enough to benefit from the suggestion.
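The grid-size check described above is plain arithmetic and can be sketched without touching the GPU. The `block_size` and `min_grid_size` values here are hypothetical placeholders standing in for what `wp.get_suggested_block_size()` might return:

```python
import math

# Hypothetical values, as might be returned by wp.get_suggested_block_size().
block_size, min_grid_size = 256, 132

n = 1_000_000  # launch dimension
grid_size = math.ceil(n / block_size)

# If the launch produces at least min_grid_size blocks, the suggested
# block size can fully occupy every SM; otherwise a smaller block size
# may spread the work across more SMs.
if grid_size >= min_grid_size:
    print("launch is large enough to benefit from the suggested block size")
else:
    print("launch is small; a smaller block size may perform better")
```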

For background on CUDA occupancy, see https://docs.nvidia.com/cuda/cuda-programming-guide/02-basics/writing-cuda-kernels.html

Example

Querying and using the launch configuration for a SAXPY kernel:

import warp as wp


@wp.kernel
def saxpy(alpha: float, x: wp.array(dtype=float), y: wp.array(dtype=float)):
    i = wp.tid()
    y[i] = alpha * x[i] + y[i]


n = 1000000
x = wp.ones(n, dtype=float)
y = wp.zeros(n, dtype=float)

block_size, min_grid_size = wp.get_suggested_block_size(saxpy)
wp.launch(saxpy, dim=n, inputs=[2.0, x, y], block_dim=block_size)
Parameters:
  • kernel – A warp.Kernel object, created with @warp.kernel or the warp.Kernel constructor.

  • device (Device | str | None) – The target device. If None, uses the current CUDA device. For CPU devices, returns (1, 1).

Returns:

A tuple (block_size, min_grid_size) where block_size is the number of threads per block that maximizes occupancy and min_grid_size is the minimum number of blocks needed to fully utilize all SMs on the device.

Return type:

tuple[int, int]