Script.tcgen05¶

Tensor Core Generation 05 (tcgen05) instructions for Blackwell GPUs.

Blackwell introduces tensor memory (TMEM), a high-bandwidth on-chip memory space dedicated to the tensor core. Unlike registers (which are per-thread), TMEM is a shared accumulator space that persists across loop iterations without the cost of register spilling.

The tcgen05 instruction group manages the full lifecycle of TMEM tensors:

Allocation: alloc() / dealloc() manage TMEM capacity. relinquish_alloc_permit() yields allocation rights to a peer CTA when using cta_group=2.
Views: slice() and view() create sub-region or reinterpreted views without copying.
Data movement: load() / store() transfer between TMEM and registers. copy() transfers from shared memory to TMEM. All are async and require wait_load() / wait_store() or commit() for synchronization.
Compute: mma() performs matrix multiply-accumulate with the accumulator in TMEM, supporting both shared-memory and TMEM operands for the A matrix.
Synchronization: commit() signals an mbarrier when pending async operations complete.

With cta_group=2, two CTAs in the same cluster collaborate: each CTA provides half the data (split along the M dimension) and holds half the accumulator, enabling larger tile sizes.

../../_images/tmem_layout.svg — Tensor Memory layout: 128 lanes x 512 columns, each cell 32 bits.¶

Instructions

`alloc`(dtype, shape[, cta_group])	Allocate a tensor in tensor memory (TMEM).
`dealloc`(tensor)	Deallocate a tensor memory tensor.
`slice`(tensor, offsets, dims, shape)	Create a sliced view of a tensor memory tensor.
`view`(tensor, dtype, shape)	Reinterpret a tensor memory tensor with a different dtype and shape.
`relinquish_alloc_permit`(cta_group)	Relinquish the tensor memory allocation permit.
`load`(tensor)	Load data from tensor memory into registers.
`store`(tensor, src)	Store data from registers into tensor memory.
`wait_load`()	Wait for all pending tensor memory load operations to complete.
`wait_store`()	Wait for all pending tensor memory store operations to complete.
`commit`(mbarrier[, cta_group, multicast_mask])	Commit pending tcgen05 async operations and signal an mbarrier.
`copy`(src, dst)	Copy data from shared memory to tensor memory.
`mma`(a, b, d, enable_input_d[, cta_group])	Perform tensor core matrix multiply-accumulate with TMEM accumulator.

Script.tcgen05

Contents

Script.tcgen05¶