Script.tcgen05¶
Tensor Core Generation 05 (tcgen05) instructions for Blackwell GPUs.
Blackwell introduces tensor memory (TMEM), a high-bandwidth on-chip memory space dedicated to the tensor core. Unlike registers (which are per-thread), TMEM is a shared accumulator space that persists across loop iterations without the cost of register spilling.
The tcgen05 instruction group manages the full lifecycle of TMEM tensors:
Allocation:
alloc()/dealloc()manage TMEM capacity.relinquish_alloc_permit()yields allocation rights to a peer CTA when usingcta_group=2.Views:
slice()andview()create sub-region or reinterpreted views without copying.Data movement:
load()/store()transfer between TMEM and registers.copy()transfers from shared memory to TMEM. All are async and requirewait_load()/wait_store()orcommit()for synchronization.Compute:
mma()performs matrix multiply-accumulate with the accumulator in TMEM, supporting both shared-memory and TMEM operands for the A matrix.Synchronization:
commit()signals an mbarrier when pending async operations complete.
With cta_group=2, two CTAs in the same cluster collaborate: each CTA provides half the
data (split along the M dimension) and holds half the accumulator, enabling larger tile sizes.
Tensor Memory layout: 128 lanes x 512 columns, each cell 32 bits.¶
Instructions
|
Allocate a tensor in tensor memory (TMEM). |
|
Deallocate a tensor memory tensor. |
|
Create a sliced view of a tensor memory tensor. |
|
Reinterpret a tensor memory tensor with a different dtype and shape. |
|
Relinquish the tensor memory allocation permit. |
|
Load data from tensor memory into registers. |
|
Store data from registers into tensor memory. |
Wait for all pending tensor memory load operations to complete. |
|
Wait for all pending tensor memory store operations to complete. |
|
|
Commit pending tcgen05 async operations and signal an mbarrier. |
|
Copy data from shared memory to tensor memory. |
|
Perform tensor core matrix multiply-accumulate with TMEM accumulator. |