warp.get_suggested_block_size
- warp.get_suggested_block_size(kernel, device=None)
Suggest a CUDA block size that maximizes occupancy for a kernel.
Queries the CUDA driver’s occupancy API (cuOccupancyMaxPotentialBlockSize) to find the block size that maximizes per-SM occupancy and the minimum number of blocks needed to fully utilize all SMs on the device. The kernel’s shared memory requirements are accounted for automatically.

Because this optimizes per-SM occupancy, the suggested block_size tends to be large. For small launch dimensions, a smaller block size may perform better because it distributes more blocks across SMs. Compare your grid size (ceil(N / block_size)) against min_grid_size to check whether the launch is large enough to benefit from the suggestion.

For background on CUDA occupancy, see https://docs.nvidia.com/cuda/cuda-programming-guide/02-basics/writing-cuda-kernels.html
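The grid-size check described above can be sketched in plain Python. The block_size and min_grid_size values here are hypothetical stand-ins for what wp.get_suggested_block_size might return on a particular device; only the ceil-division comparison is the point:

```python
import math

# Hypothetical values, standing in for a real query result:
# a large occupancy-optimal block size, and the number of blocks
# needed to give every SM at least one block (e.g. a 132-SM device).
block_size = 256
min_grid_size = 132

def launch_is_large_enough(n: int) -> bool:
    # Grid size for a 1-D launch of n threads: ceil(n / block_size).
    grid_size = math.ceil(n / block_size)
    # If the grid has fewer blocks than min_grid_size, some SMs sit
    # idle and a smaller block size may perform better.
    return grid_size >= min_grid_size

launch_is_large_enough(1_000_000)  # large launch: fills the device
launch_is_large_enough(4_096)      # small launch: too few blocks
```

With these stand-in numbers, a launch of 1,000,000 threads produces 3907 blocks (well above 132), while 4,096 threads produce only 16 blocks, so the occupancy-optimal block size would leave most SMs idle.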
Example
Querying and using the launch configuration for a SAXPY kernel:
import warp as wp

@wp.kernel
def saxpy(alpha: float, x: wp.array(dtype=float), y: wp.array(dtype=float)):
    i = wp.tid()
    y[i] = alpha * x[i] + y[i]

n = 1000000
x = wp.ones(n, dtype=float)
y = wp.zeros(n, dtype=float)

block_size, min_grid_size = wp.get_suggested_block_size(saxpy)
wp.launch(saxpy, dim=n, inputs=[2.0, x, y], block_dim=block_size)
- Parameters:
  - kernel – A warp.Kernel object, created with @warp.kernel or the warp.Kernel constructor.
  - device (Device | str | None) – The target device. If None, uses the current CUDA device. For CPU devices, returns (1, 1).
- Returns:
  A tuple (block_size, min_grid_size) where block_size is the number of threads per block that maximizes occupancy and min_grid_size is the minimum number of blocks needed to fully utilize all SMs on the device.
- Raises:
  - TypeError – If kernel is not a Warp kernel.
  - RuntimeError – If the CUDA occupancy query fails.
- Return type: