Script.tma

Tensor Memory Accelerator (TMA) instructions for asynchronous bulk data transfers.

TMA is a dedicated hardware engine on Hopper+ GPUs that asynchronously copies multi-dimensional tiles between global memory and shared memory, without occupying SM compute resources.

TMA transfers are asynchronous: the issuing thread returns immediately while the TMA engine performs the copy in the background. Completion is tracked through mbarriers:

  1. Issue global_to_shared() or shared_to_global() — the mbarrier’s tx-count is automatically increased by the number of bytes to be transferred.

  2. The TMA engine completes the transfer and decrements the mbarrier’s tx-count by the bytes delivered.

  3. Consumers call mbarrier.wait() to block until the tx-count reaches zero, i.e. until all transfers for the phase have completed.
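The steps above can be sketched as a producer/consumer pair. Only global_to_shared() and the mbarrier wait come from this page; the barrier construction, tile handles, and compute() helper are illustrative assumptions:

```python
# Hedged sketch of the mbarrier-tracked TMA flow. `mbarrier()`,
# `g_tensor`, `s_tile`, and `compute()` are assumed names, not part
# of the documented API.
barrier = mbarrier()  # assumed constructor for a shared-memory barrier

# Producer: issue the copy; the barrier's tx-count is bumped
# automatically by the size of the transfer.
global_to_shared(src=g_tensor, dst=s_tile, offsets=[block_i, block_j])

# Consumer: block until the TMA engine has delivered the tile and the
# tx-count has drained back to zero.
barrier.wait()

compute(s_tile)  # the shared-memory tile is now valid
```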

For legacy (non-mbarrier) async copies, use commit_group() and wait_group() to group and synchronize transfers.
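The legacy path can be sketched as follows; apart from shared_to_global(), commit_group(), and wait_group(), the surrounding names are illustrative:

```python
# Store a result tile back to global memory without an mbarrier.
shared_to_global(s_tile, g_tensor, offsets=[block_i, block_j])

# Fold all pending async copies issued so far into one commit group.
commit_group()

# Wait until at most 0 groups remain in flight, i.e. until every copy
# committed above has completed.
wait_group(0)
```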

TMA supports multicast (via multicast_mask), which delivers the same global tile to the shared memory of multiple CTAs in a cluster, and CTA groups (cta_group=2) for coordinated two-CTA operations in which the mbarrier can reside on a peer CTA.
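As a hedged illustration, a multicast load to the first four CTAs of a cluster might look like the following; the mask encoding (bit i selects CTA rank i) and the cluster setup are assumptions, not documented here:

```python
# Deliver one global tile into the shared memory of CTAs 0-3 of the
# cluster in a single TMA operation. The bitmask encoding is an
# assumption about multicast_mask, not confirmed by this page.
global_to_shared(
    src=g_tensor,
    dst=s_tile,
    offsets=[block_i, block_j],
    multicast_mask=0b1111,
)
```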

Instructions

global_to_shared(*, src, dst, offsets[, ...])

Asynchronously copy a tile from global memory to shared memory via TMA.

shared_to_global(src, dst, offsets[, dims, ...])

Asynchronously copy a tile from shared memory to global memory via TMA.

commit_group()

Commit pending TMA async copy operations into a group.

wait_group(n[, read])

Wait for TMA async copy commit groups to complete.