Script.tma
Tensor Memory Accelerator (TMA) instructions for asynchronous bulk data transfers.
TMA is a dedicated hardware engine on Hopper+ GPUs that asynchronously copies multi-dimensional tiles between global memory and shared memory, without occupying SM compute resources.
TMA transfers are asynchronous: the issuing thread returns immediately while the TMA engine performs the copy in the background. Completion is tracked through mbarriers:
1. Issue `global_to_shared()` or `shared_to_global()` — the mbarrier's tx-count is automatically increased.
2. The TMA engine performs the copy and, on completion, decrements the mbarrier's tx-count.
3. Consumers call `mbarrier.wait()` to block until all transfers for a phase are done.
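The three steps above can be modeled in plain Python. The sketch below is a conceptual toy, not the hardware mbarrier or this library's API: issuing a copy raises the pending tx-count, the (simulated) TMA engine lowers it on completion, and waiters unblock once it reaches zero. The class and method names are illustrative assumptions.

```python
import threading

class MBarrier:
    """Toy model of an mbarrier's transaction-count (tx-count) bookkeeping.

    Conceptual sketch only: a real mbarrier lives in shared memory and also
    tracks arrival counts and phases, which are omitted here.
    """
    def __init__(self):
        self._tx = 0
        self._cv = threading.Condition()

    def expect_tx(self, nbytes):
        # Step 1: issuing a TMA copy raises the pending tx-count.
        with self._cv:
            self._tx += nbytes

    def complete_tx(self, nbytes):
        # Step 2: the TMA engine decrements the tx-count when data lands.
        with self._cv:
            self._tx -= nbytes
            if self._tx == 0:
                self._cv.notify_all()

    def wait(self):
        # Step 3: consumers block until all transfers for the phase are done.
        with self._cv:
            self._cv.wait_for(lambda: self._tx == 0)
```

A consumer thread calling `wait()` sleeps until every issued byte has been accounted for, which is the property the real tx-count mechanism provides without any SM polling.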
For legacy (non-mbarrier) async copies, use `commit_group()` and `wait_group()` to group and synchronize transfers.
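The group mechanism can also be sketched as a toy model. Assuming the usual commit-group semantics — copies issued since the last commit form the next group, and waiting drains the oldest groups until at most `n` remain outstanding — a minimal Python illustration (all names hypothetical) looks like this:

```python
from collections import deque

class AsyncCopyQueue:
    """Toy model of commit-group tracking for legacy async copies.

    Sketches the bookkeeping only; completion is simulated rather than
    performed by a real copy engine.
    """
    def __init__(self):
        self._pending = []      # copies issued but not yet committed
        self._groups = deque()  # committed, still-outstanding groups

    def issue_copy(self, desc):
        self._pending.append(desc)

    def commit_group(self):
        # Bundle everything issued since the last commit into one group.
        self._groups.append(self._pending)
        self._pending = []

    def wait_group(self, n=0):
        # Simulate waiting: oldest groups "complete" until <= n remain.
        done = []
        while len(self._groups) > n:
            done.extend(self._groups.popleft())
        return done
```

Calling `wait_group(0)` corresponds to waiting for every committed transfer, while `wait_group(n)` lets up to `n` groups stay in flight for pipelining.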
TMA supports multicast (`multicast_mask`) to deliver the same global tile to the shared memory of multiple CTAs in a cluster, and CTA groups (`cta_group=2`) for coordinated two-CTA operations where the mbarrier can reside on a peer CTA.
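A multicast mask is conventionally a bitmask over CTA ranks in the cluster: bit *i* set means CTA rank *i* receives a copy of the tile in its shared memory. A small illustrative helper (the function name is an assumption, not part of this API):

```python
def build_multicast_mask(cta_ranks):
    """Build a bitmask selecting destination CTAs within a cluster.

    Illustrative only: bit i set => CTA rank i receives the tile.
    """
    mask = 0
    for r in cta_ranks:
        mask |= 1 << r
    return mask

# Deliver one global tile to CTA ranks 0-3 of a cluster:
assert build_multicast_mask(range(4)) == 0b1111
```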
Instructions
| Instruction | Description |
|---|---|
| `global_to_shared` | Asynchronously copy a tile from global memory to shared memory via TMA. |
| `shared_to_global` | Asynchronously copy a tile from shared memory to global memory via TMA. |
| `commit_group` | Commit pending TMA async copy operations into a group. |
| `wait_group` | Wait for TMA async copy commit groups to complete. |