cuda::make_tma_descriptor
Defined in the <cuda/tma> header.
Function signatures
namespace cuda {

[[nodiscard]] inline
CUtensorMap make_tma_descriptor(
    const DLTensor& tensor,
    cuda::std::span<const int> box_sizes,
    cuda::std::span<const int> elem_strides,
    tma_interleave_layout interleave_layout = tma_interleave_layout::none,
    tma_swizzle swizzle = tma_swizzle::none,
    tma_l2_fetch_size l2_fetch_size = tma_l2_fetch_size::none,
    tma_oob_fill oobfill = tma_oob_fill::none) noexcept;

[[nodiscard]] inline
CUtensorMap make_tma_descriptor(
    const DLTensor& tensor,
    cuda::std::span<const int> box_sizes,
    tma_interleave_layout interleave_layout = tma_interleave_layout::none,
    tma_swizzle swizzle = tma_swizzle::none,
    tma_l2_fetch_size l2_fetch_size = tma_l2_fetch_size::none,
    tma_oob_fill oobfill = tma_oob_fill::none) noexcept;

} // namespace cuda
Enumerators
namespace cuda {

enum class tma_oob_fill { none, nan };

enum class tma_l2_fetch_size { none, bytes64, bytes128, bytes256 };

enum class tma_interleave_layout { none, bytes16, bytes32 };

enum class tma_swizzle {
    none,
    bytes32,
    bytes64,
    bytes128,
    bytes128_atom_32B,         // only CUDA Toolkit 12.8 and later, compute capability >= 10
    bytes128_atom_32B_flip_8B, // only CUDA Toolkit 12.8 and later, compute capability >= 10
    bytes128_atom_64B          // only CUDA Toolkit 12.8 and later, compute capability >= 10
};

} // namespace cuda
These functions construct a CUDA Tensor Memory Accelerator (TMA) descriptor from a DLTensor. The resulting CUtensorMap can be bound to TMA-based copy instructions to efficiently stage multi-dimensional tiles in shared memory on Compute Capability 9.0 and newer GPUs.
Note
DLPack assumes a row-major convention for sizes and strides, where the fastest-changing dimension is the last one (rank - 1). cuTensorMap assumes a column-major convention for sizes and strides, where the fastest-changing dimension is the first one (0). box_sizes and elem_strides are expected to be in the same order as the input tensor's dimensions provided by DLPack, namely row-major.
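As an illustration, for a 2D tensor all arguments stay in DLPack's row-major order, which implies that make_tma_descriptor performs the reversal into cuTensorMap's column-major order internally. A minimal sketch with illustrative extents:

int64_t shape[2]   = {128, 64}; // {rows, cols}: shape[1] is the fastest-changing dimension
int64_t strides[2] = {64, 1};   // element strides, also row-major
int box_sizes[2]   = {16, 32};  // same row-major order: {box_rows, box_cols}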
Parameters
tensor: The DLPack tensor describing the logical layout in device memory.
box_sizes: Extent of the shared memory tile, one entry per tensor dimension.
elem_strides: Stride, in elements, between consecutive accesses inside the shared memory tile. The second overload assumes a stride of 1 for every dimension, with the special meaning of contiguous memory.
Optional parameters:
interleave_layout: Interleaving applied to the underlying memory.
swizzle: Swizzle pattern matching the chosen interleave layout.
l2_fetch_size: L2 cache promotion for TMA transfers.
oobfill: Out-of-bounds fill policy for floating-point tensors.
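For instance, a call that requests a 128-byte swizzle and a 256-byte L2 fetch granularity could look like the following sketch (with tensor and box_sizes prepared as in the Example at the end of this page):

CUtensorMap desc = cuda::make_tma_descriptor(
    tensor, box_sizes,
    cuda::tma_interleave_layout::none,
    cuda::tma_swizzle::bytes128,
    cuda::tma_l2_fetch_size::bytes256,
    cuda::tma_oob_fill::none);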
Return value
A CUtensorMap encoding all metadata required to launch TMA transfers.
Preconditions
General preconditions:
Compute Capability 9.0 or newer is required.
dlpack/dlpack.h (DLPack v1) must be discoverable at compile time, namely available in the include path.
DLPack preconditions:
tensor.device.device_type:
Must be kDLCUDA or kDLCUDAManaged.
tensor.device.device_id:
Must be a valid GPU device ordinal.
The selected device must have Compute Capability 9.0 or newer.
tensor.ndim (rank):
Must be greater than 0 and less than or equal to 5.
Must be greater than or equal to 3 when an interleaved layout is requested.
tensor.dtype:
kDLUInt:
bits == 4, lanes == 16, namely U4 x 16. Additionally, the innermost dimension must be a multiple of 2 when only 16-byte alignment is available. Requires CUDA Toolkit 12.8 and later, and compute capability >= 10.
bits == 8, lanes == 1, namely uint8_t.
bits == 16, lanes == 1, namely uint16_t.
bits == 32, lanes == 1, namely uint32_t.
bits == 64, lanes == 1, namely uint64_t.
kDLInt:
bits == 32, lanes == 1, namely int32_t.
bits == 64, lanes == 1, namely int64_t.
kDLFloat:
bits == 16, lanes == 1, namely __half.
bits == 32, lanes == 1, namely float.
bits == 64, lanes == 1, namely double.
kDLBfloat:
bits == 16, lanes == 1, namely __nv_bfloat16.
kDLFloat4_e2m1fn:
bits == 4, lanes == 16, mapped to U4 x 16. See kDLUInt for additional requirements.
kDLBool, kDLFloat8_e3m4, kDLFloat8_e4m3, kDLFloat8_e4m3b11fnuz, kDLFloat8_e4m3fn, kDLFloat8_e4m3fnuz, kDLFloat8_e5m2, kDLFloat8_e5m2fnuz, kDLFloat8_e8m0fnu:
mapped to uint8_t.
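As an illustration of the mapping above, a DLDataType describing a __half tensor would be filled as follows (a sketch; field and enum names are from the DLPack v1 header):

DLDataType dtype{};
dtype.code  = static_cast<uint8_t>(kDLFloat); // 16-bit kDLFloat maps to __half
dtype.bits  = 16;
dtype.lanes = 1;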
tensor.data (pointer):
Must be a valid GPU global address.
Must be aligned to at least 16 bytes. Must be aligned to 32 bytes when interleave_layout is bytes32.
tensor.shape:
Must be greater than 0 and must not exceed 2^32 elements per dimension.
The innermost dimension must be a multiple of 2 when kDLFloat4_e2m1fn or U4 x 16 are used.
tensor.strides:
Each stride in bytes, namely tensor.strides[i] * element_size, must be greater than 0 and must not exceed 2^40 bytes per dimension.
The tensor mapping must be unique, namely tensor.strides[i] must be greater than or equal to tensor.shape[i - 1] * strides[i - 1], or equal to 0.
Each stride in bytes must be a multiple of the 16-byte alignment when interleave_layout is none or bytes16, and a multiple of 32 bytes when interleave_layout is bytes32.
tensor.strides can be nullptr to indicate that the tensor is contiguous in memory.
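These rules also admit padded (pitched) layouts. A minimal sketch, assuming a 64 x 60 float tensor whose rows are padded to a pitch of 64 elements:

int64_t shape[2]   = {64, 60}; // 64 rows, 60 columns
int64_t strides[2] = {64, 1};  // row pitch of 64 elements: rows never alias (64 >= 60 * 1),
                               // and 64 * sizeof(float) = 256 bytes is a multiple of 16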
User parameter preconditions:
box_sizes and elem_strides must each have exactly tensor.ndim entries, one per tensor dimension.
box_sizes:
Must be positive and must not exceed 256 elements per dimension.
box_sizes[i] must be less than or equal to tensor.shape[i].
The full size of box_sizes must fit in shared memory.
If the interleave_layout is tma_interleave_layout::none, the inner dimension in bytes, computed as box_sizes[rank - 1] * element_size, has the following additional requirements:
It must be a multiple of 16 bytes.
It must not exceed the byte-width of the selected swizzle pattern (32, 64, or 128 bytes).
elem_strides:
Must be positive and must not exceed 8 elements per dimension.
elem_strides[i] must be less than or equal to tensor.shape[i].
If the interleave_layout is tma_interleave_layout::none, the inner dimension (elem_strides[rank - 1]) is ignored.
oobfill:
Must be tma_oob_fill::none for all integer data types.
interleave_layout:
If interleave_layout is tma_interleave_layout::bytes32, swizzle must be tma_swizzle::bytes32.
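As a quick worked check of the box_sizes rules, take the configuration from the Example below: a float tensor, box_sizes = {8, 8}, and tma_interleave_layout::none. The inner dimension in bytes is box_sizes[rank - 1] * sizeof(float) = 8 * 4 = 32, which is a multiple of 16 bytes and does not exceed even the smallest swizzle byte-width (32 bytes), so every swizzle pattern is admissible for that tile.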
References
DLPack C API documentation.
CUDA Tensor Memory Accelerator (TMA) documentation.
cuTensorMapEncodeTiled() CUDA driver API documentation.
Example
#include <cuda/tma>
#include <cuda/std/cstdint>
#include <dlpack/dlpack.h>
CUtensorMap create_2d_tile_descriptor(float* device_ptr) {
    // DLPack tensor descriptor, commonly provided externally by the user, a library, or a framework.
    int64_t shape_storage[2]   = {64, 64};
    int64_t strides_storage[2] = {64, 1};
    DLTensor tensor{};
    tensor.data        = device_ptr;
    tensor.device      = {kDLCUDA, 0};
    tensor.ndim        = 2;
    tensor.dtype.code  = static_cast<uint8_t>(kDLFloat);
    tensor.dtype.bits  = 32;
    tensor.dtype.lanes = 1;
    tensor.shape       = shape_storage;
    tensor.strides     = strides_storage;
    tensor.byte_offset = 0;
    // Shared memory box sizes and element strides, in the tensor's (row-major) dimension order.
    constexpr int BoxSizeX = 8; // rows
    constexpr int BoxSizeY = 8; // columns
    int box_sizes_storage[2]    = {BoxSizeX, BoxSizeY};
    int elem_strides_storage[2] = {BoxSizeY, 1}; // {1, ..., 1} is also valid to specify contiguous memory
    return cuda::make_tma_descriptor(tensor, box_sizes_storage, elem_strides_storage);
}
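If the tile is accessed with unit strides in every dimension, the second overload avoids passing elem_strides altogether. A sketch of such a call, reusing the names from the example above:

// Second overload: elem_strides defaults to 1 in every dimension (contiguous accesses).
CUtensorMap desc = cuda::make_tma_descriptor(tensor, box_sizes_storage);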