3.5. Tensor Memory Tensor

A tensor memory tensor (TMemoryTensor) is a tensor stored in the Tensor Memory (TMEM) of Blackwell GPUs (sm_100+). Tensor Memory is dedicated on-chip memory used by the 5th-generation Tensor Cores (tcgen05).

Figure: Tensor Memory layout: 128 lanes x 512 columns, each cell 32 bits.

A tensor memory tensor has the following attributes:

  • dtype: the data type of the tensor elements.

  • shape: the shape of the tensor. The second-to-last dimension (shape[-2]) must be 32, 64, or 128.
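
The constraint on shape[-2] can be checked before an allocation is attempted. The following is a minimal sketch in plain Python, not part of the library: the helper name check_tmem_shape is hypothetical, and the requirement of at least two dimensions is inferred from the fact that shape[-2] must exist.

```python
# Allowed values for shape[-2], as stated in the attribute description.
VALID_ROW_COUNTS = (32, 64, 128)


def check_tmem_shape(shape):
    """Return `shape` as a tuple, or raise ValueError if it breaks the rule.

    Hypothetical helper for illustration; not a library function.
    """
    if len(shape) < 2:
        # Inferred: shape[-2] only exists for tensors with >= 2 dimensions.
        raise ValueError("shape needs at least two dimensions")
    if shape[-2] not in VALID_ROW_COUNTS:
        raise ValueError(
            f"shape[-2] must be one of {VALID_ROW_COUNTS}, got {shape[-2]}"
        )
    return tuple(shape)
```

For example, check_tmem_shape([128, 64]) passes, while check_tmem_shape([100, 64]) raises a ValueError.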

Tensor Memory is organized as a 2D structure of 128 rows (called lanes) and 512 columns per CTA, with each cell being 32 bits. Memory is allocated in units of 32 columns.
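
These numbers fix the capacity per CTA and the granularity of an allocation. The sketch below does the arithmetic in plain Python. The helper names are hypothetical, and it assumes that the last tensor dimension runs along the columns and that elements narrower than 32 bits are packed densely into 32-bit cells. Neither assumption is stated in this section, so treat the column counts as illustrative.

```python
import math

LANES = 128       # rows (lanes) of Tensor Memory per CTA
COLUMNS = 512     # columns of Tensor Memory per CTA
CELL_BITS = 32    # width of one cell in bits
ALLOC_UNIT = 32   # allocation granularity in columns


def tmem_capacity_bytes() -> int:
    """Total Tensor Memory per CTA implied by the layout."""
    return LANES * COLUMNS * CELL_BITS // 8


def columns_to_allocate(elem_bits: int, row_length: int) -> int:
    """Columns reserved for rows of `row_length` elements of `elem_bits` bits.

    Assumption (not from the source): elements are packed densely into
    32-bit cells along the column direction.
    """
    cells = math.ceil(row_length * elem_bits / CELL_BITS)
    # Round up to the 32-column allocation unit.
    cols = math.ceil(cells / ALLOC_UNIT) * ALLOC_UNIT
    if cols > COLUMNS:
        raise ValueError(f"{cols} columns exceed the {COLUMNS} available")
    return cols
```

Under these assumptions the capacity is 128 x 512 x 4 bytes = 256 KiB per CTA, and a row of 40 32-bit elements reserves 64 columns because 40 is rounded up to the next multiple of 32.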

3.5.1. Tensor Memory Instructions

alloc(dtype, shape[, cta_group])

Allocate a tensor in tensor memory (TMEM).

dealloc(tensor)

Deallocate a tensor memory tensor.

slice(tensor, offsets, dims, shape)

Create a sliced view of a tensor memory tensor.

view(tensor, dtype, shape)

Reinterpret a tensor memory tensor with a different dtype and shape.

relinquish_alloc_permit(cta_group)

Relinquish the tensor memory allocation permit.

load(tensor)

Load data from tensor memory into registers.

store(tensor, src)

Store data from registers into tensor memory.

wait_load()

Wait for all pending tensor memory load operations to complete.

wait_store()

Wait for all pending tensor memory store operations to complete.

copy(src, dst)

Copy data from shared memory to tensor memory.

commit(mbarrier[, cta_group, multicast_mask])

Commit pending tcgen05 async operations and signal an mbarrier.

mma(a, b, d, enable_input_d[, cta_group])

Perform a Tensor Core matrix multiply-accumulate with a TMEM accumulator.

All tensor memory allocated in a kernel must be explicitly deallocated before the kernel exits.
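
The pairing rule can be pictured with a toy bookkeeping model. This is a sketch in plain Python that only mirrors the names alloc and dealloc. The class TmemLedger and its check_exit method are hypothetical and do not reflect how the library or the hardware tracks allocations.

```python
class TmemLedger:
    """Toy model of the rule: every allocation needs a matching deallocation.

    Hypothetical illustration only; not the library API.
    """

    def __init__(self):
        self._live = {}      # handle -> number of columns still allocated
        self._next_id = 0

    def alloc(self, columns: int) -> int:
        """Record an allocation and return an opaque handle."""
        handle = self._next_id
        self._next_id += 1
        self._live[handle] = columns
        return handle

    def dealloc(self, handle: int) -> None:
        """Release an allocation; reject unknown or already freed handles."""
        if handle not in self._live:
            raise KeyError(f"handle {handle} is not allocated")
        del self._live[handle]

    def check_exit(self) -> None:
        """Raise if anything is still allocated at kernel exit."""
        if self._live:
            raise RuntimeError(f"{len(self._live)} allocation(s) not deallocated")
```

In this model, calling check_exit() after an alloc() without the matching dealloc() raises a RuntimeError, which is the situation the rule above forbids.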

For more details, see Script.tcgen05.