Script.tcgen05

Script.tcgen05

Tensor Core Generation 05 (tcgen05) instructions for Blackwell GPUs.

Blackwell introduces tensor memory (TMEM), a high-bandwidth on-chip memory space dedicated to the tensor core. Unlike registers (which are per-thread), TMEM is a shared accumulator space that persists across loop iterations without the cost of register spilling.

The tcgen05 instruction group manages the full lifecycle of TMEM tensors:

  • Allocation: alloc() / dealloc() manage TMEM capacity. relinquish_alloc_permit() yields allocation rights to a peer CTA when using cta_group=2.

  • Views: slice() and view() create sub-region or reinterpreted views without copying.

  • Data movement: load() / store() transfer between TMEM and registers. copy() transfers from shared memory to TMEM. All are async and require wait_load() / wait_store() or commit() for synchronization.

  • Compute: mma() performs matrix multiply-accumulate with the accumulator in TMEM, supporting both shared-memory and TMEM operands for the A matrix.

  • Synchronization: commit() signals an mbarrier when pending async operations complete.

With cta_group=2, two CTAs in the same cluster collaborate: each CTA provides half the data (split along the M dimension) and holds half the accumulator, enabling larger tile sizes.

../../_images/tmem_layout.svg

Tensor Memory layout: 128 lanes x 512 columns, each cell 32 bits.

Instructions

alloc(dtype, shape[, cta_group])

Allocate a tensor in tensor memory (TMEM).

dealloc(tensor)

Deallocate a tensor memory tensor.

slice(tensor, offsets, dims, shape)

Create a sliced view of a tensor memory tensor.

view(tensor, dtype, shape)

Reinterpret a tensor memory tensor with a different dtype and shape.

relinquish_alloc_permit(cta_group)

Relinquish the tensor memory allocation permit.

load(tensor)

Load data from tensor memory into registers.

store(tensor, src)

Store data from registers into tensor memory.

wait_load()

Wait for all pending tensor memory load operations to complete.

wait_store()

Wait for all pending tensor memory store operations to complete.

commit(mbarrier[, cta_group, multicast_mask])

Commit pending tcgen05 async operations and signal an mbarrier.

copy(src, dst)

Copy data from shared memory to tensor memory.

mma(a, b, d, enable_input_d[, cta_group])

Perform tensor core matrix multiply-accumulate with TMEM accumulator.