3.5. Tensor Memory Tensor
A tensor memory tensor (i.e., TMemoryTensor) is a tensor stored in the Tensor Memory (TMEM)
of Blackwell GPUs (sm_100+). Tensor Memory is a dedicated on-chip memory specialized for use by the 5th-generation
Tensor Cores (tcgen05).
Figure: Tensor Memory layout: 128 lanes x 512 columns, each cell 32 bits.
A tensor memory tensor has the following attributes:
dtype: the data type of the tensor elements.
shape: the shape of the tensor. The second-to-last dimension (shape[-2]) must be 32, 64, or 128.
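The shape constraint above can be checked before any allocation is attempted. The following is a minimal sketch in plain Python, not part of the library: the helper name `check_tmem_shape` and the constant `VALID_LANE_EXTENTS` are hypothetical, and the requirement of at least two dimensions is inferred from the reference to a second-to-last dimension.

```python
# Allowed values for shape[-2], taken from the attribute description above.
VALID_LANE_EXTENTS = (32, 64, 128)


def check_tmem_shape(shape):
    """Return the shape as a tuple if shape[-2] is 32, 64, or 128.

    Hypothetical helper for illustration only; raises ValueError otherwise.
    """
    shape = tuple(shape)
    # Assumption: a second-to-last dimension only exists for rank >= 2.
    if len(shape) < 2:
        raise ValueError(f"expected at least 2 dimensions, got shape {shape}")
    if shape[-2] not in VALID_LANE_EXTENTS:
        raise ValueError(
            f"shape[-2] must be one of {VALID_LANE_EXTENTS}, got {shape[-2]}"
        )
    return shape
```

For example, `check_tmem_shape([128, 64])` returns `(128, 64)`, while `check_tmem_shape([100, 64])` raises `ValueError`.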
Tensor Memory is organized as a 2D structure of 128 rows (called lanes) and 512 columns per CTA, with each cell being 32 bits. Memory is allocated in units of 32 columns.
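The arithmetic implied by this layout can be sketched in plain Python. The constants come from the paragraph above; the function names `tmem_capacity_bytes` and `columns_to_allocate` are hypothetical, and the rounding rule only models the stated 32-column granularity. The hardware or the library may impose further constraints that are not described here.

```python
LANES = 128       # rows (lanes) per CTA
COLUMNS = 512     # columns per CTA
CELL_BITS = 32    # bits per cell
ALLOC_UNIT = 32   # allocation granularity, in columns


def tmem_capacity_bytes():
    """Total Tensor Memory per CTA in bytes: lanes x columns x bytes per cell."""
    return LANES * COLUMNS * CELL_BITS // 8


def columns_to_allocate(requested_columns):
    """Round a column request up to the next multiple of ALLOC_UNIT.

    Hypothetical helper: raises ValueError for non-positive requests or
    requests that exceed the 512 available columns after rounding.
    """
    if requested_columns <= 0:
        raise ValueError("requested_columns must be positive")
    # Ceiling division, then scale back up to whole allocation units.
    rounded = -(-requested_columns // ALLOC_UNIT) * ALLOC_UNIT
    if rounded > COLUMNS:
        raise ValueError(
            f"{requested_columns} columns exceed the {COLUMNS} available"
        )
    return rounded
```

Under these assumptions the capacity is 128 x 512 x 4 bytes = 262144 bytes (256 KiB) per CTA, and a request for 40 columns occupies 64.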
3.5.1. Tensor Memory Instructions
Allocate a tensor in tensor memory (TMEM).
Deallocate a tensor memory tensor.
Create a sliced view of a tensor memory tensor.
Reinterpret a tensor memory tensor with a different dtype and shape.
Relinquish the tensor memory allocation permit.
Load data from tensor memory into registers.
Store data from registers into tensor memory.
Wait for all pending tensor memory load operations to complete.
Wait for all pending tensor memory store operations to complete.
Copy data from shared memory to tensor memory.
Commit pending tcgen05 async operations and signal an mbarrier.
Perform tensor core matrix multiply-accumulate with TMEM accumulator.
All tensor memory allocated in a kernel must be explicitly deallocated before the kernel exits.
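The deallocation rule can be pictured as bookkeeping in which every allocation must be matched by a deallocation before the kernel ends. The toy model below is plain Python written for illustration only: the class `TMemAllocationTracker` and its methods are hypothetical and do not correspond to any library API.

```python
class TMemAllocationTracker:
    """Toy model of the rule: every allocation is released before exit.

    Hypothetical class for illustration; it tracks handles, not real memory.
    """

    def __init__(self):
        self._live = {}     # handle -> number of columns still allocated
        self._next_id = 0

    def alloc(self, columns):
        """Record an allocation and return an integer handle for it."""
        handle = self._next_id
        self._next_id += 1
        self._live[handle] = columns
        return handle

    def dealloc(self, handle):
        """Release an allocation; raise KeyError if it is not live."""
        if handle not in self._live:
            raise KeyError(f"handle {handle} is not a live allocation")
        del self._live[handle]

    def check_exit(self):
        """Raise RuntimeError if any allocation is still live at kernel exit."""
        if self._live:
            raise RuntimeError(
                f"{len(self._live)} allocation(s) not deallocated before exit"
            )
```

In this model, calling `check_exit()` while any handle is still live raises `RuntimeError`, which mirrors the requirement that a kernel release all of its tensor memory explicitly.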
For more details, see Script.tcgen05.