Script.tma

Tensor Memory Accelerator (TMA) instructions for asynchronous bulk data transfers.

TMA is a dedicated hardware engine on Hopper+ GPUs that asynchronously copies multi-dimensional tiles between global memory and shared memory, without occupying SM compute resources.

TMA transfers are asynchronous: the issuing thread returns immediately while the TMA engine performs the copy in the background. Completion is tracked through mbarriers:

  1. Issue global_to_shared() or shared_to_global() — the mbarrier’s tx-count is automatically increased by the number of bytes to be transferred.

  2. The TMA engine completes the transfer and decrements the mbarrier’s tx-count by the bytes delivered.

  3. Consumers call mbarrier.wait() to block until the tx-count reaches zero, i.e. until all transfers for the phase have completed.
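The steps above can be sketched as a producer/consumer pair. Only global_to_shared() and the mbarrier wait come from this page; the barrier construction, tile handles, and compute() helper are illustrative assumptions:

```python
# Hedged sketch of the mbarrier-tracked TMA flow. `mbarrier()`,
# `g_tensor`, `s_tile`, and `compute()` are assumed names, not part
# of the documented API.
barrier = mbarrier()  # assumed constructor for a shared-memory barrier

# Producer: issue the copy; the barrier's tx-count is bumped
# automatically by the size of the transfer.
global_to_shared(src=g_tensor, dst=s_tile, offsets=[block_i, block_j])

# Consumer: block until the TMA engine has delivered the tile and the
# tx-count has drained back to zero.
barrier.wait()

compute(s_tile)  # the shared-memory tile is now valid
```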

For legacy (non-mbarrier) async copies, use commit_group() and wait_group() to group and synchronize transfers.
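The legacy path can be sketched as follows; apart from shared_to_global(), commit_group(), and wait_group(), the surrounding names are illustrative:

```python
# Store a result tile back to global memory without an mbarrier.
shared_to_global(s_tile, g_tensor, offsets=[block_i, block_j])

# Fold all pending async copies issued so far into one commit group.
commit_group()

# Wait until at most 0 groups remain in flight, i.e. until every copy
# committed above has completed.
wait_group(0)
```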

TMA supports multicast (via multicast_mask), which delivers the same global tile to the shared memory of multiple CTAs in a cluster, and CTA groups (cta_group=2) for coordinated two-CTA operations in which the mbarrier can reside on a peer CTA.
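As a hedged illustration, a multicast load to the first four CTAs of a cluster might look like the following; the mask encoding (bit i selects CTA rank i) and the cluster setup are assumptions, not documented here:

```python
# Deliver one global tile into the shared memory of CTAs 0-3 of the
# cluster in a single TMA operation. The bitmask encoding is an
# assumption about multicast_mask, not confirmed by this page.
global_to_shared(
    src=g_tensor,
    dst=s_tile,
    offsets=[block_i, block_j],
    multicast_mask=0b1111,
)
```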

Instructions

global_to_shared(*, src, dst, offsets[, ...])

Asynchronously copy a tile from global memory to shared memory via TMA.

shared_to_global(src, dst, offsets[, dims, ...])

Asynchronously copy a tile from shared memory to global memory via TMA.

commit_group()

Commit pending TMA async copy operations into a group.

wait_group(n[, read])

Wait for TMA async copy commit groups to complete.